Kimball in the context of the modern data warehouse: what's worth keeping, and what's not

แชร์
ฝัง
  • เผยแพร่เมื่อ 13 ธ.ค. 2024

ความคิดเห็น • 81

  • @chrism3790
    @chrism3790 3 ปีที่แล้ว +132

    I don't quite agree here.
    I think that star/snowflake schemas still have their place today. You're right that the storage cost argument is largely moot today due to cheap, cloud-based columnar data warehouses, but I think there's more to Kimball's point about understandability than I think you gave him credit for.
    Their biggest selling point, in my opinion, is still conformity. It's the same old "source of truth" problem, which is why data warehouses get built in the first place.
    Modeling wide tables directly from source data may work for small warehouses, where only a few such tables get exposed to end users. But if you work on a bigger project, by using source data directly you will soon find yourself having real trouble with data consistency. That's where a dimensional model comes in, offering easily managed conformity.
    When data gets really big, like in the hundreds of TB or a few PB, the key idea is not to forget everything Kimball taught us, but to slightly adjust the order of operations. You still employ a star schema, but instead of joining at runtime, you pre-join the data once (and only once) - storing the whole thing. You're essentially embedding your dimension into every fact table. You do store a lot more data, but it's cheap and very compression-friendly in columnar databases. In return, you get very good performance.
    My point is that wide fact tables are nice, but only if a good dimensional model supports them, which will still look like a star schema. I think that's what he meant by the Operational/Staging schemas around 14:45, he's just referring to it with a different name.

    • @MrBoomBoom225
      @MrBoomBoom225 3 ปีที่แล้ว +13

      I feel the same way. Isn't the benefit of dimensional tables to help with change? If we have 5 fact tables that utilize the same dimension table all we need to do is update the single dimension table, but now with the wide tables we have the update each and every single one of them.

    • @LuisEstebanGomezAlduncin
      @LuisEstebanGomezAlduncin 2 ปีที่แล้ว +13

      @@MrBoomBoom225 Yes, the ideal model for me is to treat the staging schemas as dimensions and fact tables, and to make use of wide tables for analytics and visualization!

    • @raughle3066
      @raughle3066 2 ปีที่แล้ว +15

      By saying that the top two benefits of a dimensional model are cost and performance, it feels like Mr. Fowler is building a straw man just so he can knock it down and claim victory. Sure, cost and performance were much bigger considerations back in the day (good riddance to the constraints of expensive on-prem databases!), but the biggest benefit to the star schema has always always always been consistency, conformity, and understandability.
      Here's the rub: Building a big wide table and just loading up (denormalizing) all your customer/product/etc values right onto it does NOT mean you no longer have a dimensional model. It just means your dimension relationships now exist implicitly, INVISIBLY, and probably inconsistently across a bunch of tables in a way that's easy to query but gives you no vision into how or when or why they changed, and no way to enforce that they are consistent from one business measurement to the next. To make it worse, the dimensions are changing slowly, but those changes cannot be recreated later when the business questions inevitably change and you need to go back and restructure and rebuild your big wide tables.
      Taking the time to model your business and build out dimension tables that explicitly track changes gives you that consistency. After you do that, hey, go and denormalize that stuff into giant thousand-column tables to your heart's content and make all your users happy by saving them from writing the joins. Don't forget to call it a Modern Data Stack.

    • @adambrown6224
      @adambrown6224 2 ปีที่แล้ว +3

      @@LuisEstebanGomezAlduncin Totally agree with this. His staging tables sound very much like facts and dimensions. Then you put a view on top of those for your wide tables or tables of consumable data that's easy to understand, self-service is key. Also, dimensions change, new hierarchies come along, it's much easier to manage them if they are in their own table. So I disagree, The Data Warehouse Toolkit is the perfect primer for data modelling and then it's experience of what works best.

    • @MrVitaminP
      @MrVitaminP 2 ปีที่แล้ว +5

      I'd also point out that the storage cost argument is a bit of a strawman, that is not really what Kimball was arguing back in the day. This can easily be seen if you look at the arguments of other contemporaries like Inmon. Their normalized models were designed to reduce redundancy of data and overall reduce the needed storage footprint. As such, the Kimball design was often attacked for being wasteful. So, what is the cost aspect referring to? It was mainly referring to two things: the project cost associated with a top down approach and the compute cost associated with the increased join complexity in normalized relational models. While cloud storage today is comparatively cheap cloud compute is not. Balancing the compute that you bring to a data process is a continuous challenge in today's cloud environments. I'm not arguing that a dimension model is cheaper in terms of cloud compute (the etl need is often quite costly) but to frame the cost issue as being related to storage is just wrong.

  • @jeremybeard9797
    @jeremybeard9797 3 ปีที่แล้ว +31

    Few unaddressed points about fully denormalized vs dimensional: how to efficiently return distinct values of a dimension column when it's only on the facts? How to provide users with dimension values that don't yet have at least one associated fact (e.g. customers that have not yet made an order)? How to efficiently update a dimension value when it is replicated across potentially millions/billions of facts over multiple tables? How to provide users with the history of a value, like SCD type 2 or bi-temporal, without resorting to horrendous aggregation queries over the facts?
    Reality is that fully denormalized tables can solve performance problems but can create performance problems too, and they are less expressive (see unused value point). Star schemas aim to strike a balance between the pros and cons of the normalized and denormalized extremes for analytics workloads.

    • @BertBrijs
      @BertBrijs 2 ปีที่แล้ว +4

      It's even worse Jeremy: you can implement star schemas on key value stores, column stores (for the record Dave Fowler, Ralph founded Red Brick Warehouse in 1990, a coluimn store) in other words: the star schema is more about business and IT architecture than merely a patch for technical problems in the 90's. I dedicated an entire blog article but Dave keeps removing it from the replies. Let's see if this sticks. I'll make a screenshot for eternity :-)

    • @morinho96
      @morinho96 ปีที่แล้ว

      Would love to see an answer to these questions...

    • @nlopedebarrios
      @nlopedebarrios 11 หลายเดือนก่อน

      I couldn't agree more. It's not the first time I hear about this ("benefits of Kimball approach are no longer valid"). You were the only one who mentioned SCD type 2, how would you implement that with OBTs? On another note, the one thing that I found really confusing in the Data Warehouse Toolkit were the ETL chapters, which I think modern data architectures come to facilitate.

  • @RaystyleIII
    @RaystyleIII 2 ปีที่แล้ว +32

    One of the most valuable approaches that Kimball proposes is to dissect the business processes into facts, that’s the whole idea of it. When you have a one big table you basically lose visibility about the business process that you are representing, unless you create big tables for everything, which makes it extremely hard to track consistency and is a waste of space.

  • @yasirbam
    @yasirbam 3 ปีที่แล้ว +29

    Thank you, this is a subject that I have been thinking about for years and I have come to a conclusion that wide tables cannot replace a good data model. You glossed over the Data Warehouse layer, which is where I would put my data model (star schema, data vault or any other) to provide a consistent, single version of truth for the enterprise data. Views/ materialized views can be build on top of the model to expose data as wide tables to the BI/ end users if you want to avoid end users writing joins but this is not a replacement for a good model. I agree wide tables are easier for end users/ BI and with the advances in columnar store data warehouse there is no longer limitations of performance/ cost.

    • @RandomForestGump
      @RandomForestGump 3 ปีที่แล้ว +1

      Obsolete and old ways will die. You can’t avoid it. That book harms.

  • @danishamjad5751
    @danishamjad5751 3 ปีที่แล้ว +13

    What about slowly changing dimensions?

  • @robwozniak9912
    @robwozniak9912 2 ปีที่แล้ว +10

    Kimball's methodology is still relevant when Master Data Management and Slowly Changing Dimensions come into play.

  • @1988YUVAL
    @1988YUVAL 2 ปีที่แล้ว +6

    The comments here are amazing, I'm learning a lot. Such an interesting topic.

  • @dlb8685
    @dlb8685 2 ปีที่แล้ว +8

    I don’t think Kimball is out of sate, but that cheap storage lets you add a final step where the star schema is concerted to a wide table for easier usage.
    From the end user’s perspective, you just have wide tables that are intuitive to less technical users, but behind the scenes you still have testable staging tables and engineering best practices in place.

  • @soy2471
    @soy2471 2 ปีที่แล้ว +12

    Kimball and DV is all about standardisation and prevention of tech debt through thoughtful modelling and means of resolving keys, temporality and conforming. So while I agree with the three points highlighted here, a warehouse is More than just a data mart in the presentation layer. It is the enterprise history, modelled and prepped in a standardised way. Speed and storage are technical considerations. While valid, is only one part to consider. But that said.... Much respect to you for speaking openly and being Willing to push the norm. Once again it all depends on the context and use case.

    • @robeaston4063
      @robeaston4063 2 ปีที่แล้ว

      Well said.

    • @decoyslois
      @decoyslois 2 ปีที่แล้ว

      Hear hear.

    • @fishsauce7497
      @fishsauce7497 6 หลายเดือนก่อน

      "a warehouse is More than just a data mart in the presentation layer" ---- True but many data modellers follow the easy route of preparing data model from the reports, with audacity to claim (or push that claim) it is feasible across the whole data life cycle in data warehouse!

  • @Frickenchies
    @Frickenchies 2 ปีที่แล้ว +14

    Ok but what about conformed dimensions? That is one of the biggest benefits of Kimball. In our data warehouse, the concept of a 'customer' comes from a variety of sources. We have a customer dimension because it is conformed, and can be used with any other fact table that needs to know something about the customer. We never have to repeat ourselves about what a customer is, because it has been established in the customer_dim. If all you have are huge wide tables, you would have to repeat that same logic defining a customer over and over, leading to data inconsistencies. I kept waiting for him to mention something about data conformity but didn't see it. If you create a table in your staging layer called 'customer' that is meant to be reused for all these other wide tables, you have essentially just created a customer_dim. You haven't abandoned Kimball, you just aren't calling it a dim anymore.
    What we do is create the dimensional model with facts and dims, then create flattened tables on top of it for analysis. It should just be a fact table joined to the dims, and that's it. If you find yourself needing to make changes in that layer, you should update the underlying facts or dims and make it a part of your actual model.

    • @amarkhaliq5020
      @amarkhaliq5020 11 หลายเดือนก่อน

      Would like to ask a question as it sounds like you know your stuff.
      I have been looking into conformed dimensions lately so say i have made a conformed dimension customer table, i have to make two reports with two different fact tables, would I bring the dimensions customer table and further reduce the rows based on what is in the fact table for each report or would I keep the tables as it is?

  • @BosonCollider
    @BosonCollider ปีที่แล้ว +5

    The thing with the star schema in Kimball is that you don't have a single star though. The purpose of the dimensions is to have a bus of conformed dimensions, so that new fact tables can easily be added or refactored. And a fairly common usecase is questions like "find all users who satisfy queries using fact tables A, B, and (C or D)", or "correlate these two facts along dims X and Y".
    For that kind of queries, wide tables aren't nearly as useful. Similarly, for small to medium sized companies they are very useful in OHLP databases (i.e. the trend of Postgres + citus/timescaledb), where you aren't really using a dedicated analytics DB, and where the dimensions might be materialized views that depend on the operational schema.

  • @abdeo777
    @abdeo777 ปีที่แล้ว +2

    Most data swamps occur today when people move away from modeling their data and start reporting on flat tables. End of the data warehouse is subject oriented which means you are modeling your business into a data model. Without you can build a reporting solution but in long run you will end up creating technical debt without a proper model which provides source of truth, reference data , quality rules etc

  • @aeimei
    @aeimei 3 ปีที่แล้ว +7

    For large data volume, creating a big fat tables in semantic layer, is not mistake at all, it's SIN!!

  • @phamnguyen3083
    @phamnguyen3083 ปีที่แล้ว +2

    What is grain of a wide table? How to use SCD in a wide table?

  • @Milhouse77BS
    @Milhouse77BS 2 ปีที่แล้ว +1

    3:43 1st edition of Kimball Data Warehouse Toolkit came out in 1996. I’ve a used version coming my way.

  • @haribharadwaj1
    @haribharadwaj1 2 ปีที่แล้ว +9

    David, a rudimentary understanding of how Power BI or even an older (and quite widely used) platform like Microstrategy expect their models will highlihgt how ciritcal star schemas are. The analytical provess of DAX on a tabular model alone is worth a well structured star schema. I get a sense that you aren't that familiar with how some of these BI tools work.

    • @nickmcdermaid3015
      @nickmcdermaid3015 6 หลายเดือนก่อน

      .. like so many people expounding on the internet today

  • @alanclift2507
    @alanclift2507 ปีที่แล้ว +2

    I stopped watching at the example of the city dimension since it was so far off on the ease of use benefit of a Star Model, not space savings. Space savings applies more to the transactional model, where city would be stored in a single place. Although the bigger benefit for that approach is ease of updating, because it's in one place. (Normalized). If the ease of use in the Data Warehouse is to have city in multiple rows, then that is what is best.

  • @AlexLobaev
    @AlexLobaev ปีที่แล้ว +2

    Do you understand that your approach is a simple Hybrid approach of Kimpball & Inmon's models together (second 16:26)? Where "WIDE TABLES" is simply a datamart? And "STAGING SCHEMAS" seems just a redundant part?

  • @martininakilazatin7390
    @martininakilazatin7390 2 ปีที่แล้ว +7

    Building these wide tables had burned me so many times. What would have been an update to a few rows in the dimensions had turned into a reprocessing of months of data in the wide table.
    Wide tables had costed me so many man-hours.
    I also had to manage my business users while our team was reprocessing the wide table.
    I would use wide tables very sparingly, because they are just not good design for 99% of your use case. I would use wide tables for very special use cases only. They are an exception, not the norm.

  • @invictusOne
    @invictusOne 2 ปีที่แล้ว +2

    I don't quite agree either. I just came across a table which has nearly 500 columns. I have to find the SME that put it together just to understand its potential dimensions. The next question is to understand if there is conformity across all tables.
    I agree with some of the other posts in which the conformity and the understandability of dimensional design is an advantage to a new analyst or non-technical executives.

  • @FrostSpike
    @FrostSpike 2 ปีที่แล้ว +3

    In your "Product 'Wide' Table" what does "Total Sales" represent? Total of that product sold globally? Total in the city? Total for all people with the same firstname? Are you saying that the wide table is essentially a cross-join of all possible combinations of "interesting" attributes, with a bunch of other aggregations thrown in at all of the levels of hierarchies of other attributes? It's striking me as being time consuming to generate and liable to misinterpretation when end-users are trying to see numbers at intersections of the other attributes e.g. by city, by person, by product, by date range. Certainly, "wide" tables can be used for some reporting purposes to allow easy filtering and grouping within the the bounds that the table provides, but there's still a need to be able to capture and store the underlying data in an unambiguous form. You talk about having "Staging Schemas" that seem to work a bit like a Linstedt Data Vault. You're also not covering the massive complexity there is on the "data integration" aspects in providing that "clean source of data" across a myriad of operational systems.

  • @tiagoamarquesmusic
    @tiagoamarquesmusic 3 ปีที่แล้ว +7

    I am not very sure I understood wide tables in this context. Are they replacements to "Data Marts"? Also, how do you keep track of mutations, if you ditch dimensions, and by consequence slow changing dimensions?

    • @lattelover7186
      @lattelover7186 3 ปีที่แล้ว +5

      It looks like wide table is just another name for Fact Table with all the dimensions treated as Degenerated Dimensions, so we no longer need separated dimension tables. The mutations can still be tracked by its column keys.

    • @keim3548
      @keim3548 3 ปีที่แล้ว +1

      Re data marts, I don't think that's really it. .When people talk about about analytics apart from data marts they're often talking about doing analytics on a data lake, or some other variant like data vault. Regarding handling slow changing dimensions, they still exist, but you'd probably be more precise to call it slowly changing data, but I expect it will continue to be referred to as SCD.

  • @martingeschke4984
    @martingeschke4984 2 ปีที่แล้ว +4

    I also disagree with this. As other comments before mentioned, this doesn't pick up the complex requirements of businesses to combine several fact tables. The results of this approach are more or less on the level of a report you can get directly from the ERP system. This is not the idea of business intelligence as I understand it. Not talking about slowly changing dimension concepts and so on...

  • @wexwexexort
    @wexwexexort 3 ปีที่แล้ว +7

    One cannot simply coalesce Kimball

  • @davidnaden7202
    @davidnaden7202 2 ปีที่แล้ว +2

    Kimball's first DW Toolkit was published in 1996

  • @MrSungE
    @MrSungE ปีที่แล้ว +1

    Engineer perspective, I agree with you but data modeler perspective, I cannot.
    I think you also need to provide a solution how to manage the duplicated data in the wide table against changing requests from the business.
    For example, dbt provide a dependency graph to help you to know where you need to change. But... you still need to go everywhere and change all queries. It's more complicated and inefficient than the dimensional model.
    If the BI analyst finds a different value for the same column name, it's getting a bigger problem.
    We know building the new DWH is easier than changing the existing DWH :)

  • @kalasharshad
    @kalasharshad 5 หลายเดือนก่อน

    for enterprise, each wide table for each business process and connect these larger table to each other,. similar to Data Vault.. what do you recommended for health care organizations for data warehousing...

  • @douglasbrito2921
    @douglasbrito2921 2 ปีที่แล้ว +1

    how wide table is scalable on tools like Power Bi. how to take advantage from DAX with a single wide table

  • @cubsat
    @cubsat 3 ปีที่แล้ว +5

    Honestly, there's no clear winner here, though Big tables have their own 'good' use case. If changing dimensions is a rarity, big tables would rule the roost. However, when changing dimensions start becoming more frequent, big tables would become a nightmare to manage. Also, I would love to debate with people who think writing massive rambling views to transition data through Bronze/Silver/Gold zones is any easier to maintain as compared to keeping a set of fact-dimension tables.

  • @nivantha
    @nivantha 5 หลายเดือนก่อน

    Wide tables might serve the last mile. But to get there you need to have the dimensional model or data vault model built in the warehouse.

  • @MrBetainc
    @MrBetainc 3 ปีที่แล้ว +3

    Great presentation, one area that I can see will become more important over time is the area you call "staging" (we know it as the integration layer where many data sources come together). I think this layer will need to achieve conformity, codification and standardization in a way which enables re-usable single/consistent code path sets of data for guaranteed down stream data consistency.

  • @NikZavada
    @NikZavada 2 ปีที่แล้ว +3

    Having the wide tables sounds like a lot of pain if you use Looker for example. Cannot imagine recreating all the dimensions in every explore we have and when you add one more precalculates customer flag you need to go and upgrade all of them again. It's a nightmare :)
    I would say that the fact table's flags and attributes should stay in the fact table. I don't see any reason to create all those only fact-related dimensions and especially junk dimensions anymore. Just keep it in the fact table and make your life easier without unnecessary joints.

  • @Banana_Boo_
    @Banana_Boo_ 2 ปีที่แล้ว +1

    This video sent me 'the Earth is flat' vibe.

  • @shawn-jung
    @shawn-jung 2 ปีที่แล้ว

    The presentation resonates with me, Dave. I like to create an operational table that I call BFT (Big Freaking Table), and it aligns with the Operational SST concept. I agree with you that Mr. Kimball is a legend, but we will need something else now.

  • @hallkbrdz
    @hallkbrdz 6 หลายเดือนก่อน

    I rather agree with your Operational SST model - pre doing anything else with the data. Hybrid databases that can hold or consume the 3nf transactional database data, but present it as wide-table (columnar) are a big win for simplification. Oracle with In-Memory is a great example of leveraging modern technology to bridge the use cases between operations and analytics / reporting. No need to store redundant data due to columnar compression, but still access it as if it was a fully fleshed out wide table as you call it.

  • @MrKane101111
    @MrKane101111 หลายเดือนก่อน

    Great video, your strong opinion helped me clarify understanding a lot.

  • @ЕвгенийПермяков-щ8п
    @ЕвгенийПермяков-щ8п 18 วันที่ผ่านมา

    wide tables get very messy if you need to track changes

  • @raymondcui3932
    @raymondcui3932 2 ปีที่แล้ว +2

    It is not reasonable to waste storage since it becomes cheaper. DBT is great tool to built Transformation in ELT. But it is not methodology to build DW, like Star schema and DV 2.0. Do not think the wide table could work without appropriate methodologies for big or large data warehouses.

  • @00EagleEye00
    @00EagleEye00 3 ปีที่แล้ว +4

    Good day sir.
    I would like to ask on how to include a dimension on a hierarchical design (recursion) to a fact?
    Kindly set an example and provide explanation if possible.
    Thank you.

  • @floreknayt
    @floreknayt 2 ปีที่แล้ว +4

    Disturbing video, this just looks like an attempt to promote the book on the basis of Kimball's popularity. Also the title is misleading - you're saying to forget about it and just put everything into one giant table, without any constrains etc, doesn't matter - it's gonna be better. I'd expect somebody who cares about Kimball's legacy would actually make the effort of pointing the valuable points from the book, of which there are many. Even the basics, like naming, department specific tables etc.

  • @obaysch3774
    @obaysch3774 4 ปีที่แล้ว +4

    like the (very interesting) video says, kimball published tool*kit* books and the most recent edition of the data warehouse toolkit came out in 2013, not 15 years ago.

  • @alexd.1091
    @alexd.1091 2 ปีที่แล้ว

    Is star db somewhat a relational db

  • @montollo1098
    @montollo1098 ปีที่แล้ว

    Well, this qualifies as a eureka moment for me listening to your explanation about the differences between star schemas and the wide table approach. The company I work for has struggled to understand why star schemas are no longer beneficial from a cost, performance, or understandability perspective. You've given enough information to conduct more research and perhaps understand how wide tables can help us achieve what we're looking for. Thanks!

  • @top_1_percent
    @top_1_percent ปีที่แล้ว

    Excellent video. Learnt a lot from this one video. Thank you very much.

  • @bobhoover1066
    @bobhoover1066 ปีที่แล้ว +2

    Think you took a swing at the king and missed here. Very brave for doing so perhaps but you should back it up better than this. As the other comments mention there are tons of reasons why your position doesn't hold up.

  • @machtmann2881
    @machtmann2881 3 ปีที่แล้ว +2

    Man, it is so easy to have a big, wide table these days. I don't think everything should be a wide table but we are no longer constrained to follow Kimball's dimensional modeling with today's cloud technologies either. Much like any other tool, use case will always depend. It's nice to have the organizational structure of star schema. But it's also nice to just look at a user table and have everything important in it be in the same row without having to join to like 7+ different tables.

  • @sergeysukharev4939
    @sergeysukharev4939 ปีที่แล้ว

    Date like is an extra expensive and unnecessary element in the architecture of data warehouse. Should Load data from data sources directly into the stage area

  • @thamalupiyadigama1216
    @thamalupiyadigama1216 5 หลายเดือนก่อน

    I can't agree. If you develop a data warehouse and it is going to stay the same ever, then OBT is nice. But in real world we have to add new fields, fix errors (some may have been inherited from sources), do changes, build slowly changing dimensions. Then OBT is a nightmare.

  • @corymaklin7864
    @corymaklin7864 2 ปีที่แล้ว

    Good stuff!

  • @venkatramachandran6345
    @venkatramachandran6345 3 ปีที่แล้ว

    Great video. I read the latest version of 'Dimensional Modeling in Depth' in 2014. At that time itself, found it as obsolete design.

  • @joyo2122
    @joyo2122 3 ปีที่แล้ว +3

    not correct

  • @far_n
    @far_n ปีที่แล้ว

    Wow
    Thatvwas amazing

  • @nickmcdermaid3015
    @nickmcdermaid3015 6 หลายเดือนก่อน

    I'm reading the comments here and I can see this is not worth 20 minutes of my time. Especially if you lead with "column compression" as some kind of differentiator. Column compression has been around for decades, long before data lake or cloud concepts.

  • @obiyobi3200
    @obiyobi3200 ปีที่แล้ว

    Poor arguments. Dimensional modelling still relevant in data lakehouses

  • @bellanewler6611
    @bellanewler6611 3 ปีที่แล้ว

    haha. have you got a yourself a good bankruptcy plan? Sell them to the people going with your approach of a single dataframe. btw, are you that crying baby people just hand you food, the single table?