
Data Modeling: One Big Table vs Kimball vs Relational for data engineers

By Data with Zach

Summary

Topics Covered

  • Relational Excels at Production Integrity
  • Snapshots Unlock Deleted Data Analytics
  • SCD Type 2 Crushes Snapshot Waste
  • One Big Table Eliminates Shuffle

Full Transcript

Welcome, everyone, to one big table versus Kimball versus relational data modeling. This is a really meaty topic and we have a lot of ground to cover. These are essentially the three main ways you can model data. There are other options, like Data Vault, that I'm not covering in this presentation because I don't think they're as important. So let's dig into this.

So what will we cover today? First, how does data modeling generally happen? There's some mystique and mystery around how it actually happens, especially since data modeling is part art and part science, and I think a lot of people don't really get that. We'll also cover why we need to model data in so many different ways. Why is there one big table and Kimball and relational and Data Vault? Why are there so many options when we're just modeling rows and columns? We'll cover exactly why, and then go through an in-depth example comparing these three techniques so you can see the benefits and drawbacks of Kimball, of one big table, and of relational. We'll also talk about when the Kimball data model fails.

Okay, why are there so many data modeling techniques? The main reason is that each one has strengths and weaknesses. That's what it really comes down to, and you'll notice that as we get into the nuts and bolts of everything.

Let's start with relational data modeling. This is probably the most common one. Another term I like to use for it is production data modeling, because almost every production application models data this way. It's really good when you're filtering the data down to a small number of records — think one user, or a dozen events. It's also really good at maintaining integrity, because you have foreign keys, primary keys, deduplication, and all the other ways most relational database systems enforce data integrity. If you model your data this way, you get to leverage a lot of those quality guarantees, and that's what production databases are all about. Almost every company — Netflix, Facebook, Airbnb — uses some sort of relational database for its production data. Facebook uses a sharded MySQL database, and so does Airbnb; there are a lot of different relational databases out there.

So what are the cons of relational data modeling? Generally speaking, there are two big ones. The first is that answering an analytical question might require joining something like 30 tables together. I remember when I was working at Airbnb, trying to figure out all of the pricing settings you needed to gather to understand what impacts price — that was a 25-table join. Imagine if every time an analyst needed to understand what impacts price, they had to do a 25-table join. That's where you don't want to use relational. Generally speaking, relational data modeling does very poorly once you branch into analytics or historical data. The other big con is that relational data modeling generally only holds the latest state of the data — it doesn't have historical data.

Okay, Kimball — Kimball, or dimensional data modeling. This is the poster child. It's probably the one that, as data engineers, you should know the most; it's the most important one to know. If you've ever had a data engineering interview, you know dimensional data modeling gets asked: you're given some source data and asked what the dimension tables are, what the fact tables are, what the aggregate tables are, and you have to come up with those fancy diagrams. I've been there — I've had to draw those diagrams many times in my interview processes at Facebook, Netflix, and Airbnb.

Kimball is really great for analytical processes, especially when you're converting to OLAP — when you're trying to get total aggregates and things like that. It's also pretty good for large amounts of data and for historical data. I've found that if you do this the right way, dimensional data modeling — dimensions and facts — covers maybe 80 to 90% of your use cases as a data engineer. Then there's another 10% or so that gets covered by one big table, and relational sits kind of outside of that, for when you're doing application data modeling.

So what are the drawbacks of dimensional data modeling? Transactions should be modeled relationally — transactional updates don't really happen with dimensions and facts. Instead, you want to be doing something like a daily pull of the data. We're going to look at an architecture diagram in just a second to see what I mean by that.

The last one: one big table data modeling. This sits on top of Kimball — you can think of it as almost an extension of Kimball. It's taking Kimball and denormalizing even more. That's the main takeaway: one big table and Kimball aren't really that different, just a little bit, and we'll get into where. What one big table is really great for is answering questions while minimizing joins and shuffle. A lot of the time you can think of one big table as a table within a table, and the way that works is with complex data types — in this case, array and struct. Often the table within a table is literally an array of structs: you put a table inside a column. It sounds kind of crazy, but it lets you run aggregations and sums without a GROUP BY — you can just sum over the array. We'll go over some examples in a bit if that doesn't quite make sense yet. This is really great for long-term analysis: if you're working with a lot of data over a long period of time, it minimizes shuffle and makes your queries more performant, and long-term analyses usually mean a lot of data.

It's bad because complex data types suck. I don't know if y'all have ever tried to query array or struct columns, or tried to teach people how to query them — people go, "I don't want to learn this, this is not what I'm about." That's probably the biggest con. Another very important con is that you essentially suck all of the fact data into the dimensional space, so you end up copying a lot of the fact data over again — denormalizing and duplicating data. That's the whole idea with one big table: you're trading storage for compute. You save a lot on compute by not having to do joins and shuffle, but you pay the penalty in storage, because you end up with another copy of the data. That can still be a very good trade, because storage — especially in the data lake — is really cheap, and compute is a bit more expensive. We'll go into more detail on that later.

Okay, so in big tech — and really everywhere — if you're at a company that has any sort of advanced data architecture at all, this is the architecture almost every single company uses. It's kind of weird how consistent it is across the board. On the left you have an application — maybe that's Facebook, maybe Netflix, maybe Airbnb, maybe Uber or Google or whatever — and then you have you or me, interacting with the production database through that application. Maybe we're booking an Airbnb or adding a friend on Facebook, and we're updating and changing our production data. There's a lot going on in that one arrow between the app and the production database, but that's not our job as data engineers — that's the software engineer's job: front end, back end, full stack. That arrow has a lot of complexity that I'm not trying to minimize; I'm just trying to show what else happens around it. When you click buttons on an application or a website, it also generates data — fact data — like "Zach clicked this button on February 29th at 8:00 p.m."

That data gets dumped to Kafka. You can think of Kafka as a bunch of data waiting in a line — that's all Kafka is — and we use Kafka to capture all of the event data. But the event data isn't enough, because we also need the dimension data; we need both facts and dimensions to do dimensional data modeling. So while you're playing around with the production data, on the back end — usually around 5:00 p.m. Pacific, i.e., midnight UTC — there are jobs that look at the production database and literally do a SELECT * from all the tables in production, so they know what the state of the data was at that moment in time: at midnight UTC, this is what was in production. They take that snapshot and dump it to a data lake, and it's partitioned — the production snapshot for each table lands in the lake partitioned by snapshot date. At Facebook, Netflix, and Airbnb they actually call that partition ds, which stands for "date stamp"; you can think of it as "as of this moment in time."
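As a very rough sketch, one of those snapshot jobs is often little more than the following (table names like prod.users and lake.users_snapshot are illustrative, not from the talk):

-- hypothetical nightly snapshot job: copy the current state of a production table
-- into a ds-partitioned table in the data lake
INSERT INTO lake.users_snapshot
SELECT
    u.*,
    current_date AS ds   -- the "date stamp" partition: state as of this snapshot
FROM prod.users u;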

Along with the snapshotted dimensions that get dumped into the lake, you have the fact data that gets replicated from Kafka. There are a bunch of different ways that happens: at Facebook there's an hourly Spark job that dumps it into an hourly partitioned table in the data lake; some places do it daily. Whatever the cadence, the whole idea is that we move the data out of Kafka and into the lake, and now we're finally at the point where we can start doing our dimensional data modeling, because all of the data is in the data lake: the fact data from Kafka and the daily snapshotted data from production — generally MySQL, but it can be other databases as well.

After that, we go on to do our data modeling: master data, OLAP cubes, and metrics, generally speaking. Obviously there are other things here too — feature engineering, feature stores, machine learning stuff — if you want to do other types of modeling as well, but I'm trying to stick to the more fundamental stuff in this presentation. I'll probably refer back to this diagram at various points, so if that was a lot to digest in the last two minutes, don't worry — we'll keep coming back to it.

Okay, so we're going to cover an example here: user growth and engagement. Imagine we're Facebook. We have essentially two tables in production: the users table and the posts table. Then we have a logging table — something that gets logged to Kafka — called post_actions, which has the post ID, the user ID, the action type, and the action timestamp of when it happened. The action type might be comment, share, whatever — just an action taken on a post. Then you have the users table — ID, name, country, all that — and the posts table. This is great so far. From a relational perspective, this is modeled right, because the posts table has a foreign key on the user ID column, which means there's a one-to-many relationship between users and posts: a post can only have one user, but a user can have many posts.
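To make the shapes concrete, here is roughly what those tables look like — a sketch only, since the talk describes the columns verbally, so names and types are approximate:

-- production (e.g. MySQL) tables, modeled relationally
CREATE TABLE users (
    user_id  BIGINT PRIMARY KEY,
    name     VARCHAR(255),
    email    VARCHAR(255),
    country  VARCHAR(64)
);

CREATE TABLE posts (
    post_id    BIGINT PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    post_text  TEXT,
    created_at TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users (user_id)  -- one user, many posts
);

-- event data logged to Kafka (one record per action on a post):
-- post_actions: post_id, user_id, action_type ('like', 'comment', 'share', ...), action_timestamp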

That's why it's already modeled relationally, and generally speaking, as data engineers we don't have to do very much relational modeling ourselves. Unless you're at a very small company — then you'll have to worry about this stuff as well. For example, at DataExpert.io I do this modeling myself: I do both the relational modeling and the dimensional modeling for my company. And then you have the post_actions data we talked about that gets dumped to Kafka. The good news here is that this data is already pretty much relationally correct — we have that one-to-many relationship, and luckily for us, the software engineer who modeled these tables did a good job, so we don't have to be frustrated with them.

Keep in mind that if these tables are not modeled correctly in production, it's going to mess up your analytics. If you snapshot bad data, you just get bad data on top of bad data, and now you have a problem. That's why solving these problems in the relational layer is important. It's something Joe Reis has talked about in his data modeling work: you have to normalize your data sets first in order to then denormalize them and do a lot of analytics on top, and I totally agree with that perspective. If your upstream production data is not modeled correctly relationally, you're going to have a pain-in-the-ass time dealing with it — whether you choose one big table or Kimball, it's still going to be a massive pain. So when you're looking at the production data, look at the keys. Maybe they even have an ER diagram that models out the one-to-many and many-to-many relationships, so you can use that downstream when you do your dimensional modeling or your one big table modeling. A very important part of this data modeling process is talking with the people who own the data, especially the upstream data owned by the software engineers — whether that's the logging going to Kafka, the dimensional snapshots, or the production data itself.

I want to give a quick aside here. When I worked at Facebook, we had a table called notifications_delivery, and it logged every time a notification event happened — a notification was generated, it was sent, it was delivered, it was clicked — all the different events in the notification funnel. One of the things I didn't realize until I had been working there for about a year was that they were logging every single experiment a user was in, at the notification grain. There was this big array of all the experiments the user was in, and they were logging it 50 billion times a day. After about a year working in notifications I discovered that, and I said: we should delete that. We don't need it — we can do a join between the user experiment table and the notification table, and then we don't have to have that big array replicated 50 billion times. When we actually deleted that column, we saved three petabytes of space in a month. Three petabytes — from deleting one column and doing a join instead.

That's a big thing I think data engineers don't do enough of. They think they have to solve everything in the data pipeline layer — with Spark, with Airflow — that that's where they need to solve their problems. But a lot of the time you need to go upstream, have conversations with people, and get them to fix the logging or fix the snapshots. That will often be more fruitful: that one conversation saved Facebook three petabytes. If you do this right, you can find all sorts of problems, because software engineers aren't that well versed in cloud efficiency or in these dimensional data modeling concerns — they don't think about it as much; they think about how to make the product work. And software engineers are going to make more errors in logging than they do in the dimensional and production snapshot space, mainly because the logging doesn't impact their job. Logging, to them, is like sending events off to be handled by somebody else — kind of like sending your kid to military school, or kicking them out of the house. It's throwing it over the fence: it's someone else's problem now, and that someone else is the data engineer. So a lot of the time you want to go back to them and say, hey, can you fix this? This is a mess. That generally happens more often with logging and fact data than with dimensional data, because if they mess up the dimensional data, their production app breaks, their app doesn't run, and you probably don't even get the data at all because the app is broken. That's just something to keep in mind when you're going through these data modeling exercises.

Okay: what if we wanted to count the number of deleted posts that contain "SQL sucks"? That's what we want to do with this relational data store. If we go back, you can see we have this post_text column — that's probably what we'd want to use — but if we're counting deleted posts and we only look at production, we can't find them, because they're deleted; they've been removed from production. So we can't solve this question straight from the relational production data, because the data is literally gone. The other thing to think about is that we could maybe figure it out from the logged data, because the logs would have a post action saying a user deleted that post. So maybe we could find it there — but the problem is we need to know the post contents, and the logged data will only have the post identifier, not the post text, because duplicating the post contents into the logs would be bad and we don't want to do that. So how are we going to do this, if a deleted post disappears from production the moment it's deleted? It turns out this deleted-post question is actually pretty simple: it's solved with snapshots.

Like I was saying before, every day we take a snapshot of production at midnight UTC and dump it into the data lake. That means we have access to all the data for analytical purposes — whether it's been deleted in production or not, we still have a copy of it in the data lake. Now, generally speaking, this is one of those sensitive areas, because how it works depends on your company's privacy policy. At Facebook, for deleted data like this, you don't get a user ID — you get an anonymized user ID — so there's no way to link the user back to the post. The analytical data can still be used because it's anonymized and doesn't have any PII in it. But that depends on the company and the policy, so for the rest of this presentation just assume the privacy policies at this company are fairly lax.

Remember that these database snapshots happen once a day, so we don't actually capture all of the deleted posts. If you create a post and then delete it five minutes later, the snapshotting process is not going to capture it. You will be able to see it in the logs — the logs will capture that you created and deleted a post — but we won't have any information about the post itself. That's a very common trade-off: at a lot of companies, if something is created and deleted within the same day, they just don't care about it; it's the day-over-day change that companies care about. That's why these daily snapshots are so useful. They don't capture the full nuance of everything, but they give you a pretty solid analytical picture of what's going on in your application.

Okay, so let's think about it this way. Say we're doing our snapshots, and now we have a new set of tables: the snapshotted tables plus the table from Kafka. The users table now has a snapshot date, the posts table has a snapshot date, and those are partitions in our data lake — we need folders for this data, because we don't want to query all the snapshots together; that doesn't make sense, since there's going to be a lot of duplicated data in there (we'll talk more about the benefits and risks of that in a bit). And then obviously the post_actions table has an action date partition as well. So those are the three tables now sitting in our data lake, ready for dimensional data modeling and one big table data modeling, and we're going to talk about the two different ways we could go about this.

Okay, so remember, we're trying to get the deleted posts. To get them, you can look at two snapshot dates and do a full outer join between today and yesterday, and then filter to where today's post_id is null — because if it's null today and not null yesterday, that means the post was deleted sometime between yesterday and today. This is a very common pattern; it's part of what's called cumulative table design. If you go to my GitHub you can find my cumulative table design write-up — a much longer treatment of that pattern specifically. It's a very powerful pattern that I highly recommend you check out.
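A sketch of that pattern in Trino (assuming a ds-partitioned posts snapshot table; names and dates are illustrative):

-- posts present in yesterday's snapshot but missing from today's => deleted in between
WITH yesterday AS (
    SELECT * FROM posts_snapshot WHERE ds = DATE '2024-01-01'
),
today AS (
    SELECT * FROM posts_snapshot WHERE ds = DATE '2024-01-02'
)
SELECT y.post_id
FROM yesterday y
FULL OUTER JOIN today t
    ON y.post_id = t.post_id
WHERE t.post_id IS NULL;

(For this one direction a LEFT JOIN would do; the full outer join is what the cumulative pattern uses, since it also picks up brand-new rows on the other side.)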

One of the things you'll notice about this snapshot approach: if we have a daily snapshot of the users table, how often do your name, email, and country actually change? For me, my country has never changed, my email on Facebook has never changed, and my name changed once — I went from Zachary to Zach — so that's one change ever on those columns. But we're taking a snapshot of my data every single day even though it isn't changing, and that's kind of a waste. So one of the things we need to do is notice that these values aren't changing very much — they're changing, dare I say, very slowly — and if they're barely changing while we keep snapshotting them, we end up with a lot of data in the cloud that we just don't need, that's mostly copies of the same thing. What we can do is convert these daily snapshotted dimensions into slowly changing dimensions. That's a very important part of dimensional data modeling, and it's also a big part of one big table, which we'll talk about in a bit as well.

So what can we do in that case? Imagine we have two tables: the users SCD table and the posts SCD table. With the users table, the columns that can change are name, email, and country — you can change your name, change your email, move to a new country — so those are the columns we model as a slowly changing dimension. What this does is take those daily snapshots and collapse them. Imagine a year of snapshots: there would be 365 records for me, because there are 365 daily snapshots, but I only changed my name once, and my email and country never changed. If you model the SCD table, those 365 records should collapse into two records: one record for when my name was Zachary, and another for when it was Zach. That's what's really cool about modeling things as an SCD — going from 365 records to two records is a massive performance gain in terms of storage.

You can imagine the same thing for posts. In that case there are really only two columns that change: is_deleted, which would be modeled as an SCD, and post_text, which would also be modeled as an SCD because you can edit a post. Everything else stays the same. This modeling, with a start date, an end date, and an is_current flag, is what's called SCD Type 2 — slowly changing dimension type 2 — which, according to Airbnb, is the only way to model slowly changing dimensions, because it's the only way that captures all of history and the only way you can correctly backfill SCDs. So just remember: SCD Type 2 is the gold standard. There are other types — type 1, type 3, type 4, type 5, type 6 — but type 2 is really the only one you need to know, and you can essentially derive all the other types from type 2, which I find pretty interesting. Anyway, those are the two SCD tables we could model if we were going about this from a dimensional data modeling perspective.
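As a sketch, the posts SCD table being described would look something like this (illustrative DDL; the talk doesn't show exact column names):

CREATE TABLE posts_scd (
    post_id     BIGINT,
    user_id     BIGINT,
    post_text   VARCHAR,   -- slowly changing: posts can be edited
    is_deleted  BOOLEAN,   -- slowly changing: posts can be deleted
    start_date  DATE,      -- first snapshot date this version was observed
    end_date    DATE,      -- last snapshot date this version was valid
    is_current  BOOLEAN    -- true only for the latest version of each post
);

-- and the name-change example collapses a year of daily user snapshots into two rows, e.g.:
-- user_id | name    | start_date | end_date   | is_current
-- 123     | Zachary | 2023-01-01 | 2023-06-14 | false
-- 123     | Zach    | 2023-06-15 | 9999-12-31 | true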

Let's dig a little deeper. Now it's easy to count the number of deleted posts that contain "SQL sucks": we can just select a count(*) from posts_scd where post_text is like "SQL sucks", is_deleted is true, and the partition date is today's date, and that will give us the current number of deleted posts that contain those words.
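Reconstructed, that query is roughly the following (Trino; the exact filter may differ from the slide):

SELECT COUNT(*) AS deleted_sql_sucks_posts
FROM posts_scd
WHERE post_text LIKE '%SQL sucks%'
  AND is_deleted
  AND is_current;   -- the slide filters on the partition date of the latest SCD build instead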

That's a very powerful way to model things. You can see how it's a much better and simpler way to solve this problem than the relational world, where we don't have access to that data at all. We could also do it from the daily snapshot data, but the problem there is that we'd have to read in a lot of partitions — a lot of extra data we don't actually need — and that's where a slowly changing dimension becomes a very powerful way to solve this problem. That's the main thing I wanted to show in terms of dimensional data modeling.

Okay, so what if we wanted to ask something more complicated: how many likes does a post containing "SQL sucks" get before it gets deleted? This is a really solid question, because in this case we need to do a join — between posts_scd and post_actions — since that's the only way to know. But there's a big problem with that, and this is where Kimball kind of falls apart a little bit.

Assume post_actions is huge — we're at Facebook scale, and the number of likes, comments, and shares on posts is very, very high. Say you want all of the posts containing "SQL sucks" that were deleted, and you want to join them with post_actions to understand how many likes each post got before it got deleted. To do that, you'd need to figure out the minimum and maximum created dates for posts that contain "SQL sucks", and that could be a massive window — one post from 2009, another from 2018. You know how many people hate SQL; it's a very consistent stream of hate, so there are going to be posts about it every single day. You could do the join with post_actions to figure this out, but you'd have to scan essentially all of history for post_actions to get it right. The other way — and this is where you can kind of get away with it — is to look at a specific time frame: only the posts containing "SQL sucks" from the last week, the last month, the last six months, whatever window you pick. But then you also have to pick the other end — for how long you count likes — because a post can get a like at any time; I still get likes on LinkedIn posts from three years ago. So you need a forward-looking cap as well: how many likes did the "SQL sucks" posts from this six-month window get over the next year, or the next two years? You have to put a cap on it. That's about as efficient as you can get solving this with Kimball data modeling.

And keep in mind, post_actions is already modeled the way you're supposed to model Kimball fact data. Go back and look at that post_actions table: it has the post ID on it, which is the foreign key — the link to the dimension — and that's exactly the correct way to model Kimball fact data. So this is already a properly modeled fact table. But if you think about it this way, that join is an inefficient join, and that's where Kimball is going to kind of fall apart.

So what if we minimize the number of table scans by incrementally building up arrays at the post grain? Imagine a schema with a post ID, an actions column, and a partition date, where actions holds every action that has happened on the post. We'll call this new table post_obt — one big table for posts. It has the post ID and every action ever taken on that post. The way you build it is: every day you look at the new actions that came in and add them to the array — you just do an array concat between the historical data and the data that came in today — so you incrementally build up the array day by day, and that way you have access to all of the data. Done this way, it's much easier to be confident you actually have all your data, and you don't have to worry about time frames nearly as much, because this is a cumulative table: it has every post and every action. Keep in mind, again, that this is where it gets a little inefficient — we're sucking all of post_actions into post_obt; the action type and action timestamp get pulled into this table. But then we can do some really powerful things.
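A sketch of that schema and the daily incremental build in Trino — names are illustrative, since the talk describes this verbally:

CREATE TABLE post_obt (
    post_id        BIGINT,
    actions        ARRAY(ROW(action_type VARCHAR, action_timestamp TIMESTAMP)),
    partition_date DATE
);

-- each day: concatenate yesterday's accumulated array with the actions that arrived today
INSERT INTO post_obt
SELECT
    COALESCE(y.post_id, t.post_id) AS post_id,
    CASE
        WHEN y.actions IS NULL THEN t.new_actions
        WHEN t.new_actions IS NULL THEN y.actions
        ELSE y.actions || t.new_actions
    END AS actions,
    DATE '2024-01-02' AS partition_date
FROM (SELECT * FROM post_obt WHERE partition_date = DATE '2024-01-01') y
FULL OUTER JOIN (
    SELECT
        post_id,
        ARRAY_AGG(CAST(ROW(action_type, action_timestamp)
                  AS ROW(action_type VARCHAR, action_timestamp TIMESTAMP))) AS new_actions
    FROM post_actions
    WHERE action_date = DATE '2024-01-02'
    GROUP BY post_id
) t ON y.post_id = t.post_id;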

Let's go into some queries that I think y'all will find pretty interesting. post_obt just has every single event from post_actions packed into an array of structs, and we can reduce that array to get the current like count. Let me show you how that's done — we're using Trino (Presto) for this presentation. We query post_obt filtered to a partition date, and we get the like count by taking that array and starting at zero, then doing a CASE WHEN on the action type inside the struct: if the action type is "like", add one; otherwise leave the running value alone. That's what the reduce function does, and it's how you can get a like count without using GROUP BY.
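Reconstructed, the query being walked through looks roughly like this (Trino/Presto, following the post_obt sketch above):

SELECT
    post_id,
    REDUCE(
        actions,
        0,                                                                 -- start at zero
        (s, x) -> s + CASE WHEN x.action_type = 'like' THEN 1 ELSE 0 END,  -- +1 per like
        s -> s
    ) AS like_count
FROM post_obt
WHERE partition_date = DATE '2024-01-02';   -- note: no GROUP BY anywhere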

One of the things I really want you to notice about this query is that there is no GROUP BY. All of the events are already grouped — they're in that array; we've already done that work. And since there's no GROUP BY, this query is what I'd call infinitely scalable, because it doesn't need any shuffle. No shuffle is necessary here: this is the same as a simple select, and a simple select is going to be a very, very fast query. I leveraged this concept at both Facebook and Airbnb for various things, because it lets you look at very large time scales and very long time periods without using much compute at all — you don't have to shuffle, and shuffle is where you eat most of the pain and suffering in these long-term analyses.

Let's go to the next page for a little more detail. Here's the final example of how to put it together: we still use our properly modeled posts SCD table, plus a little CTE that computes the like count, and then we join posts_scd with that count and add the other conditions. That gives us everything we're looking for — that's exactly the way we'd want to do this.
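Putting it together, the final query is roughly this shape (a sketch following the slide's structure):

WITH like_counts AS (
    SELECT
        post_id,
        REDUCE(actions, 0,
               (s, x) -> s + CASE WHEN x.action_type = 'like' THEN 1 ELSE 0 END,
               s -> s) AS like_count
    FROM post_obt
    WHERE partition_date = DATE '2024-01-02'
)
SELECT p.post_id, l.like_count
FROM posts_scd p
JOIN like_counts l ON p.post_id = l.post_id
WHERE p.post_text LIKE '%SQL sucks%'
  AND p.is_deleted
  AND p.is_current;
-- (the reduce lambda could also compare x.action_timestamp against the SCD dates
--  to count only the likes that happened before the deletion)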

This is a very efficient way to solve the problem. But what I'm trying to say is that one big table is only really useful for certain cases, so let's dig a little deeper — because some of y'all are probably thinking of more efficient ways to do this without arrays, and you're totally right. There's another way to do this that isn't crazy. Instead of sucking all the event data up into the post grain, you could keep a cumulated daily count. Imagine you created a table called post_likes_cumulative, with a post ID and a like count, and every day that count gets added to: the pipeline looks at the new incoming data and updates the like count. Then we can just use this table — bada bing, bada boom, done — no crazy reduce, no arrays, no structs.
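A sketch of that simpler cumulative table's daily update (illustrative names):

-- post_likes_cumulative: post_id, like_count, partition_date
INSERT INTO post_likes_cumulative
SELECT
    COALESCE(y.post_id, t.post_id)                        AS post_id,
    COALESCE(y.like_count, 0) + COALESCE(t.new_likes, 0)  AS like_count,
    DATE '2024-01-02'                                      AS partition_date
FROM (SELECT * FROM post_likes_cumulative WHERE partition_date = DATE '2024-01-01') y
FULL OUTER JOIN (
    SELECT post_id, COUNT(*) AS new_likes
    FROM post_actions
    WHERE action_date = DATE '2024-01-02'
      AND action_type = 'like'
    GROUP BY post_id
) t ON y.post_id = t.post_id;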

And with that, we can solve that very narrow problem. This seems like a much simpler solution than everything I just showed you, and you're probably thinking, what are you smoking — why are you showing me one big table? And honestly, this cumulative approach is the better way to go if you only care about one metric, or a couple of metrics — especially metrics that don't have any other dimensional context you care about. So let's go to the next slide to see what I mean by dimensional context.

So imagine we went with this other design, the cumulative design, but then the data scientist comes back to us and says: now give me all the likes that happened between 4:20 p.m. and 4:25 p.m. each day, to catch the stoned people liking the "SQL sucks" posts. Now it's more complicated, because if we have to keep the timestamp of when the likes happened, this post_likes_cumulative way of doing things no longer works. It just isn't as flexible: it only gives you the total for the day, and you lose that fact-layer context when you aggregate up like this and make it simpler. I'm not saying this approach is wrong — in a lot of cases it's going to be 100% right and the correct way to do it. The key thing to remember is that this stuff is nuanced: sometimes you want the simple cumulative table, and other times you want one big table so you can keep the dimensional context.

So think about it: now we have someone who wants just this five-minute window each day. If we go back to that reduce example, all we have to do is add a little bit to it: CASE WHEN x.action_type = 'like' AND the hour and minute of x.action_timestamp fall between 4:20 and 4:25. That's a very small addition to the query, and then you have that context without writing another pipeline — you just use this same table again, throw the condition into the reduce function, and you're good to go. You don't have to build another pipeline.
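Concretely, the only change from the earlier query is the extra condition on the timestamp (a sketch; 4:20-4:25 p.m. is hour 16, minutes 20-25):

SELECT
    post_id,
    REDUCE(
        actions,
        0,
        (s, x) -> s + CASE
                          WHEN x.action_type = 'like'
                           AND HOUR(x.action_timestamp) = 16
                           AND MINUTE(x.action_timestamp) BETWEEN 20 AND 25
                          THEN 1 ELSE 0
                      END,
        s -> s
    ) AS late_afternoon_like_count
FROM post_obt
WHERE partition_date = DATE '2024-01-02';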

The big thing this buys you is that it keeps the fact-level data: you can still see each individual like and each individual thing that's happening, be aware of those details, and work them into your analysis. And that's essentially what I had for today in terms of one big table versus dimensional modeling. One big table preserves the other columns from the fact table that you often end up needing — a lot of the time those are timestamps, maybe devices, all sorts of other columns you might want to bring in. That's where it's a much better way to do things than the two alternatives. One alternative is to keep it as normal fact data, and the problem there is that you'll have to scan hundreds of partitions of data to join it effectively, which is not very fun. The other is the cumulation with Kimball, and the problem there is that you lose some of the event-level context. So if you have a data scientist who wants a very solid analytical architecture to work with, this is the way to go — one big table is the way to go. I want to give one more example.

When I worked at Facebook, in 2017, Facebook saw the first drop in growth in the entire history of the company. What happened was we wanted to dig through all of the data and understand all the slow-burning, long-term trends, and what I did was set up a one big table architecture so people could use these array functions and map them to dimensions — so they could read ten years of data very, very efficiently without having to shuffle.
