How to pick between Kimball, One Big Table, and Relational Modeling as a data engineer

แชร์
ฝัง
  • เผยแพร่เมื่อ 2 ต.ค. 2024
  • We'll be covering:
    When to use One Big Table modeling vs Kimball
    How to use Struct and Array and Array of Struct to get what you want
    Make sure you have a DataExpert.io account to get the most out of this session.
    Join www.DataExpert... to get the queries!

ความคิดเห็น • 112

  • @meryplays8952
    @meryplays8952 วันที่ผ่านมา

    His mention of fixing normalization in the relational layer is very crucial.

  • @sf-spark129
    @sf-spark129 5 หลายเดือนก่อน +33

    Literally this is the ONLY video on TH-cam on how to data model with a hands-on, real-world example. All other videos talk about concepts, which sounds very abstract.
    BTW, this guy seems high but how come he can be this sharp at the same time? amazing

    • @chandanjha3205
      @chandanjha3205 5 หลายเดือนก่อน +3

      When he is high, he chooses right data model. When he is not, he simply dumps anything in a DB and ask analytics to figure out

  • @scottbrwnng
    @scottbrwnng 6 หลายเดือนก่อน +21

    Please upload more of these long form videos that go in depth

  • @TechyBuild
    @TechyBuild 7 หลายเดือนก่อน +4

    Once again you killed it Zach! You just made my data engineering journey easy.

  • @Tyf13
    @Tyf13 4 หลายเดือนก่อน +1

    Wow thank you this was super digestible and really helped me get a deeper understanding of modeling. This is really going to help me on the data science side of things.

  • @nileshk611
    @nileshk611 5 หลายเดือนก่อน +18

    41 mins of absolute data modeling concepts explained.

  • @NeumsFor9
    @NeumsFor9 7 หลายเดือนก่อน +8

    Bro, when prod data isn't modeled correctly, asking a data engineer to address all data quality issues is like asking your plumber to get you clean water. That all said, the right integration of ETL Process, Data Quality, and Business metadata (and competent boss that can support you in doing so) can help solve many DQ problems. Populating that metadata repo requires lots of end user interaction and, if your company is mature, data governance.

  • @shraddhathapa9446
    @shraddhathapa9446 5 หลายเดือนก่อน +7

    Hi Zach, please keep making these, your videos are incredibly insightful.

  • @ManishJindalmanisism
    @ManishJindalmanisism 7 หลายเดือนก่อน +6

    hi Zach, I have been working on SCD sfrom long, but I didnt understand partition_date column in your example. Is it the date of data ingestion in SCD table or sth else?
    For finding deleted number of posts in your example why would someone use partition_date where clause. Once a post is deleted it cant be undeleted. So why not simply Select count(*) from table where text = 'Sql_sucks' and is_deleted= True.
    And if i want to know how many posts where deleted on a specific day why would I again use partition_date, I will leverage start_date and end_date columns.
    Select count(*) from table where text = 'Sql_sucks' and is_deleted= True and start_date = current_date. That is the purpose of keeping these startdate and enddate columns in first place.

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน

      Data lakes pull the data forward though.
      Partition_date here still has all the data since SCD type 2 is cumulative.

    • @mikekenneth4773
      @mikekenneth4773 6 หลายเดือนก่อน +2

      ​@@EcZachly_Thanks, this clarifies my concern.So, the SCD will still de-duplicate the Data but hold the last partition_date of that record.

  • @shubhamsharma-ne2ke
    @shubhamsharma-ne2ke 6 หลายเดือนก่อน +17

    Wow. What a video. Randomly popped in feed and never completed a random 40 min video in one go.

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน +2

      Glad you liked it!

    • @shubhamsharma-ne2ke
      @shubhamsharma-ne2ke 6 หลายเดือนก่อน

      @@EcZachly_ loved it. Planning to take your course soon. Need to brush up the prerequisites:)

  • @WiktorJurek
    @WiktorJurek 26 วันที่ผ่านมา +1

    Love the video. Hits some important concepts. I wonder how you consider your dimensional models when using a semantic layer like LookML?

  • @farahiyahsyarafina2183
    @farahiyahsyarafina2183 6 หลายเดือนก่อน +2

    thanks for this, zach

  • @rachelryan5231
    @rachelryan5231 2 หลายเดือนก่อน +1

    Yes, i agree data scientists do ask sometimes "how many stoned people liked SQL Suck posts" at 4 am

  • @bradj229
    @bradj229 หลายเดือนก่อน +1

    Awesome explanations. Thanks!

  • @tadbronson
    @tadbronson 7 หลายเดือนก่อน +1

    Thanks for this video! It was very informative and the examples really illustrated the different use cases. Can you post a follow up or link to help clarify why the partitions are needed on SCD and OBT tables?

    • @mikekenneth4773
      @mikekenneth4773 6 หลายเดือนก่อน

      I had same question. Because with partition_date in SCD, the number of records doesn't really reduce

  • @matthewd4972
    @matthewd4972 7 หลายเดือนก่อน +5

    This "sql sucks" example hits hard bruv.

  • @AsifKhan-jq8zg
    @AsifKhan-jq8zg 4 หลายเดือนก่อน +1

    Why do you have field partition_date in user_scd and posts_scd?

  • @Joey-vy1iu
    @Joey-vy1iu 7 หลายเดือนก่อน +1

    Thanks for the informative video on such a important (more so now than ever before) topic Zach! Question: why not just only keep the posts_obt table and upsert that when there's updates? Then you don't need the join

  • @anandahs6078
    @anandahs6078 7 หลายเดือนก่อน +4

    😮 amazed by this OBT concept. I am definitely going to use this. Thanks Zach for this. You are an amazing engineer who teaches us real world scenarios 😍

  • @arpittapwal4651
    @arpittapwal4651 5 หลายเดือนก่อน +3

    Thank you very much Zach for explaining these concepts with real life examples. This content is gold mine and extremely helpful. Please keep making more and more such informative and insightful videos 😃

  • @Vaibhav19921
    @Vaibhav19921 3 หลายเดือนก่อน +1

    I liked the video very much. It gave me complete understanding of a denormalised table. I have one query, if lets say I have to create a big denormalised table(not being a fact table), which contains some 1000s of columns which gets created by joining 80 tables(30 large tables and 50 very small lookups ), What is the best way to make it real-time ?

    • @EcZachly_
      @EcZachly_  3 หลายเดือนก่อน

      If these aren’t facts, you should use CDC to make the mutations real time. The lookup tables are easy side inputs

  • @michelabifadel1474
    @michelabifadel1474 2 หลายเดือนก่อน +2

    Love your videos man ! Can you please give me your insights / point of view regarding Event Driven Architecture?

  • @mayravaldes89
    @mayravaldes89 15 วันที่ผ่านมา +2

    Zach, I'm learning so much. Your videos are amazing, please add more for us, the ones that can not afford to pay your course 🙏🙏

    • @EcZachly_
      @EcZachly_  15 วันที่ผ่านมา

      My course is

    • @mayravaldes89
      @mayravaldes89 15 วันที่ผ่านมา

      @@EcZachly_ Sorry Zach, I admire your content, but not eveybody lives in the US and has a US dollar salary. The first thing I see in the course is a one time payment of $2000 which is impossible for people that have a non US salary and has kids and house to pay. Then there is the $125/month (billed annually) which is still $1500 which is pretty impossible too. Then it comes the $200 per month which is more affordable but there's just no way to consume all that content within a month when you have a full time job, and kids and a life, so it's $200 per maybe 3 months = $600.. which is.. well.. hard to pay.

    • @EcZachly_
      @EcZachly_  14 วันที่ผ่านมา

      ​@@mayravaldes89 If you use code EARLYBIRD you can make it 30% off that $200. So it'd be $140/month.

  • @TheEjaay
    @TheEjaay 4 หลายเดือนก่อน +1

    Couldn't all 1 to 1 relations to scd in reality just reside in the OBT? For example here. All posts table data could really just be another column of arrays.

    • @EcZachly_
      @EcZachly_  4 หลายเดือนก่อน

      Glad this got you thinking

  • @2account134
    @2account134 24 วันที่ผ่านมา +1

    it's like this guy really likes to say like... he's definitely high on something, why else would his voice be so obnoxious and annoying

    • @EcZachly_
      @EcZachly_  24 วันที่ผ่านมา

      Like, I’m just built different okay

  • @raphaeldayan
    @raphaeldayan 4 หลายเดือนก่อน +2

    This video is priceless, thank you for sharing

  • @oyindamolavictor
    @oyindamolavictor 6 หลายเดือนก่อน +2

    Amazinggg…. Very in-depth, kept rewinding for parts I didn’t understand…. Got to know about this from your Twitter page…Please more or this🚀

  • @Enzo-jj5te
    @Enzo-jj5te 7 หลายเดือนก่อน +2

    Hey ! Great content as always, thank you very much ❤
    So you and the community use the term "relational data model" to refer implicitly to "at least 3NF-compliant relational data model", right ? (strictly speaking, both dimensional models and OBTs are relational models too, right ?)

  • @muhammadraza3290
    @muhammadraza3290 6 หลายเดือนก่อน +2

    Very insightful content! Especially the examples from your work like Facebook really help drill in the use of the various functions. Really appreciate the content :)

  • @medon910
    @medon910 5 หลายเดือนก่อน +2

    This is the exact video I am looking for! kudos for sharing great content.

  • @fernandostahelin2972
    @fernandostahelin2972 2 หลายเดือนก่อน +2

    Amazing class, thank you Zach

  • @pcl1923
    @pcl1923 6 หลายเดือนก่อน +2

    This is one of the best dimensional data modelling primers out there. Using it for my Meta DE onsite, and it has helped me to clearify some concepts, that I thought I "kind of" understood but with this video now I clearly see the reasonign and real applications. Thank you!

  • @Victor-yn6lv
    @Victor-yn6lv หลายเดือนก่อน +1

    Thank you so much for sharing such excellent video!!!

  • @AnnChu-tb4hp
    @AnnChu-tb4hp 4 หลายเดือนก่อน +1

    Zach, what is your opinion on Bitemporal Modelling?

  • @manuelalejandropereztrujil2391
    @manuelalejandropereztrujil2391 7 หลายเดือนก่อน +3

    Wonderful content, Zach! God bless you!

  • @jacobli2676
    @jacobli2676 14 ชั่วโมงที่ผ่านมา

    Sorry, I might have a different point of view. In terms of Kimball's method, it is admittedly might not be the one-fit-all solution. But in this case, it seems you are making a mistake specified in Kimball's book in his book chapter 6, header/line pattern to avoid, which is, using a header/line (in this case, the post_id) as a dimension table with such a high cardinality. Please correct me if I am wrong.

  • @Milhouse77BS
    @Milhouse77BS 7 หลายเดือนก่อน +6

    Glad you were able to get a version out with sound. Thanks.

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน

      Thanks! I felt so bad that the stream was silent

    • @murugesanrajasekaran5032
      @murugesanrajasekaran5032 7 หลายเดือนก่อน +5

      @@EcZachly_no worries. You’re doing a wonderful job for the DE community. Much appreciated

  • @NeumsFor9
    @NeumsFor9 7 หลายเดือนก่อน +5

    I like Kimball a lot, but I thought it was pretty funny when the group published "facing the re-keying crisis" near the time of his retirement and NOT a long time before that. That all said, I thought that group was the most humble and accepting of other methodologies.

  • @rishadm1771
    @rishadm1771 5 หลายเดือนก่อน +2

    An absolute gem of a video!

  • @rachelannballesteros3055
    @rachelannballesteros3055 23 วันที่ผ่านมา +1

    No one can explain it better than you. Great work 🥳

  • @strawhatSEO
    @strawhatSEO 7 หลายเดือนก่อน +1

    Great content thank you Zach! Is there a reason why it cuts out at the end seemingly abruptly?

  • @Sfgarcia
    @Sfgarcia 4 หลายเดือนก่อน +1

    Hey zach! Great video, I really enjoyed it! In your experience, what's the best way to overcome the limitation of the maximum array size in some SQL engines? For example, in Amazon Athena (which uses Trino) the maximum array size is 254 (due to a Java limitation). In your example, a post could probably contain more thans these number of actions. One simple approach would be to make a new array column every time you hit the limit, but I don't think that is very scalable. What's your opinion?

    • @EcZachly_
      @EcZachly_  4 หลายเดือนก่อน +1

      Trino doesn’t have a 254 limit. I used trino at Facebook. The limit is 65k I thought? I definitely used Trino and 365 item arrays at Facebook.
      Maybe it’s an Athena thing?

  • @muhammadraza3290
    @muhammadraza3290 6 หลายเดือนก่อน +1

    Where does the is_deleted column come from? How is it generated? Is there a custom logic behind it or some tooling that takes care of it?

  • @srinubathina7191
    @srinubathina7191 7 หลายเดือนก่อน +2

    Thanks Zach for amazing content

  • @Levy957
    @Levy957 7 หลายเดือนก่อน +1

    thank you for the content, bring more Kimball please!

  • @babuganesh2000
    @babuganesh2000 7 หลายเดือนก่อน +1

    Can you explain more and put a youtube shorts or a video about the enrollment and what it covers it will benefit whom e.t.c
    Full-access Live Boot Camp starting May 6th ($2000)
    Analytics-Focused Live Boot Camp starting May 6th ($1650)
    Self-paced Data Engineering Course V4 Combined ($1750)
    Infrastructure-Focused Live Boot Camp starting May 6th ($1650)

    • @babuganesh2000
      @babuganesh2000 7 หลายเดือนก่อน

      I am amazed about your youtube content, couldn't stop watching your shots
      FACEPAD - Funnel, Aggregatioon, Clustering, Experimentation,Prediction,Accumulation, Derivative - you are an amazing teacher mentor

  • @krzysztoflinke3870
    @krzysztoflinke3870 6 หลายเดือนก่อน +1

    Hi Zach great content! I am wondering how are you appending data in OBT. Are you using UPSERT or APPEND mode in your table? Is it on daily on hourly basis?

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน

      Depends on the use case

  • @tonyyeung9481
    @tonyyeung9481 6 หลายเดือนก่อน +1

    Hi Zach, clear and practical sharing! May I know about transaction consistency across the snapshot tables? I wonder if the snapshots are in an inconsistent state when they are being selected & loaded to the data lake. For example, selecting "A" table at 5:00am, and selecting "B" table at 5:05 am with new updates. A join of 'A' & 'B' table would cause inconsistent result.

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน +1

      Big tech generally doesn’t care about that. The way to avoid is minimize the time in between related table snapshots by dumping them in a single run

  • @fzrbigman
    @fzrbigman 7 หลายเดือนก่อน +2

    thanks for the great content Zach!

  • @princebansal7500
    @princebansal7500 7 หลายเดือนก่อน +1

    Great Content Zach. Thanks. one question is to why we can't or shouldn't do CDC on source relational tables(May be by changed datetimestamp) to bring only new and updated source records rather than daily snapshots to build SCD type2 in DW layer?

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน

      Every big tech company does daily snapshots. It’s duplicative but much simpler than CDC

  • @mihirit7137
    @mihirit7137 3 หลายเดือนก่อน +1

    so basically in the end are you saying that the interactions of millions of users on apps such as facebook or instagram , i am repeating again millions of users are stored in one big table ? imagine the number of rows on that table 😧😧

    • @EcZachly_
      @EcZachly_  3 หลายเดือนก่อน

      Export it to excel and watch your computer melt before your eyes

  • @jameshizon4861
    @jameshizon4861 7 หลายเดือนก่อน +1

    Oo great topic

  • @davidoh0905
    @davidoh0905 2 หลายเดือนก่อน

    @36:28 is it a typo to have s.is_current = TRUE and s.is_deleted = TRUE?

    • @EcZachly_
      @EcZachly_  2 หลายเดือนก่อน

      Nah. Is_current means the most recent state. Is_deleted being the most recent state.

  • @kentmaxwell1976
    @kentmaxwell1976 7 หลายเดือนก่อน +1

    This is fantastic content. Do you cover OBT in your data engineering boot camp? Where can I find the lab and lab demonstration you referred to in the video?

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน +1

      We spend a week on SCDs and OBT yep!

  • @keremen
    @keremen 7 หลายเดือนก่อน +1

    20:55 what if the post contains PII? Does Facebook anonymise post content as well?

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน +1

      Yep scrubs it of PII

  • @KPYCHIK
    @KPYCHIK 7 หลายเดือนก่อน +1

    Great talk! However I have couple questions. Even though partition_scd will be comparebly small, but wouldn’t its partitions grow infinitely? Or there going to be some rules to retain posts for the last n periods? And also I was wondering what are drawbacks of modelling posts as dim table? Probably I’m missing something but It more reminds me factless fact table.

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน

      Yeah you only keep the last week of partitions around. Correct!

    • @matiasmicenmacher1564
      @matiasmicenmacher1564 3 หลายเดือนก่อน

      Hey Zack … general speaking with the snapshot_ts on daily dimensions, Do we need to have the same snapshot_ts for the facts tables as well? But keep all the partitions no more than a week ? Thanks!!

  • @mridulbhardwaj3116
    @mridulbhardwaj3116 7 หลายเดือนก่อน +1

    Really appreciate your content man

  • @datbooii
    @datbooii 4 หลายเดือนก่อน +1

    This is good shit

  • @Milhouse77BS
    @Milhouse77BS 7 หลายเดือนก่อน +1

    3:55: Kimball

  • @mitchtherighteous
    @mitchtherighteous 7 หลายเดือนก่อน +1

    appreciate the content Zach but the feed just spontaneously cuts out.

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน

      This was a live that I learned had no audio.

  • @sanooosai
    @sanooosai หลายเดือนก่อน

    thank you sir

  • @thiffanyp
    @thiffanyp 5 หลายเดือนก่อน +1

    superrrr thanks!

  • @srinubathina7191
    @srinubathina7191 7 หลายเดือนก่อน +1

    Hi Zach i couldn't find your git hub things can you please provide that i will learn the things from that as you mentioned in the video

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน +1

      The GitHub I use is www.github.com/DataEngineer-io
      There is alot of content on www.DataEngineer.io that you can watch that isn’t on TH-cam as well if you make a free account.

    • @srinubathina7191
      @srinubathina7191 6 หลายเดือนก่อน +1

      Sure Zach Thanks for your support

  • @YannickSacherer
    @YannickSacherer 6 หลายเดือนก่อน +1

    Hey zach thanks for your guidance it was super interesting and now I fully understand the advantage of obt's. do you have a tip how to translate this query into snowflake?

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน

      They have similar array functions in snowflake. ChatGPT is your friend

    • @YannickSacherer
      @YannickSacherer 6 หลายเดือนก่อน

      @@EcZachly_ haha :D
      chatgpt hates me and always produces grouped by queries but I will proceed :)

    • @tobelesa
      @tobelesa 3 หลายเดือนก่อน

      ​@@EcZachly_ yo, data engineer working with snowflake here. And it seems that snowflake does not handle array map and reduce very well. It does not have built-in array functions for array map and reduce

  • @berellevy2
    @berellevy2 6 หลายเดือนก่อน +1

    23:30 instead of snapshots why not do change data capture?

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน +1

      CDC isn't as reliable

    • @berellevy2
      @berellevy2 6 หลายเดือนก่อน

      @@EcZachly_ and it sounds like most companies dont need the added granularity…

    • @TheEjaay
      @TheEjaay 4 หลายเดือนก่อน

      @@EcZachly_ ? care to explain

  • @chandanjha3205
    @chandanjha3205 5 หลายเดือนก่อน

    Zach- one of the best tutorials I have seen giving a real world example.
    Quick question: 1) What does the partition_date for the posts_obt signify? Is it post creation date?
    2) At 33:48 why do we have a filter in the query for partition_date. Here we are simply trying to find number of likes on a given post(using post_id). I am not able to think the role of a date column (same column in my first question)

    • @EcZachly_
      @EcZachly_  5 หลายเดือนก่อน +1

      1. No. Partition date is the date of the run of the pipeline
      2. You need to pick a version of your OBT. Is it todays or yesterdays? You don’t overwrite here since that’s risky. Keeping 5-6 days of OBT data is advised.

    • @Escrotazo66
      @Escrotazo66 4 หลายเดือนก่อน

      What is the point of saving partition date in kimpball? It is going to be equal either the start date if it is not true or the end date it it is true

  • @arashriazi9597
    @arashriazi9597 6 หลายเดือนก่อน +1

    this is was super helpful, learned a lot

    • @EcZachly_
      @EcZachly_  6 หลายเดือนก่อน +1

      Glad you liked it! Any other topics you think I could cover?

    • @arashriazi9597
      @arashriazi9597 6 หลายเดือนก่อน

      @@EcZachly_ a video like this around big data would be very helpful as there are so many tools and approaches out there which makes it a bit difficult to enter to the big data space

  • @jameshizon4861
    @jameshizon4861 7 หลายเดือนก่อน +1

    "Give me all the likes that happened between 4:20PM & 4:25PM each day to catch the stoned people..."

    • @EcZachly_
      @EcZachly_  7 หลายเดือนก่อน +2

      One big table followed by one big joint

    • @jameshizon4861
      @jameshizon4861 7 หลายเดือนก่อน +1

      @@EcZachly_ lol

  • @mx953
    @mx953 5 หลายเดือนก่อน +1

    there are people who hate SQL?👀

    • @abacusyou
      @abacusyou 4 หลายเดือนก่อน +1

      With a passion!