How to Create a Data Modeling Pipeline (3 Layer Approach)
- Published 25 Jul 2024
- A data warehouse acts as the main hub for most data teams, yet it often becomes a mess.
While there are many different strategies to handle this, in this video I want to share the approach I follow.
It's based around a simple 3-layered design to take raw source data into meaningful data marts ready for analytics.
Enjoy!
►► The Starter Guide for Modern Data (Free PDF)→ bit.ly/starter-mds
Simplify “modern” architectures + better understand common tools & components
Timestamps:
0:00 - Intro
0:28 - High Level Overview
1:04 - Staging Layer
1:37 - Warehouse Layer
1:54 - Marts Layer
3:13 - Example: Staging
6:05 - Example: Warehouse
7:35 - Example: Marts
8:23 - Importance of Modeling
Title & Tags:
How to Create a Data Modeling Pipeline (3 Layer Approach)
#kahandatasolutions #dataengineering #datamodeling
This is a three-layer approach that I like. The weakness, I believe, is the integration and data loading performance of a star schema, which is more complex than a more normalized model like the Data Vault, for example. With the flat, wide DM tables you ensure performance when extracting data from the DWH in any case, so why not use a normalized model instead of a star schema, to ensure loading performance and agile integration of the model, and then use a star schema or flat tables in the DMs to ensure extraction performance? Thanks for your video; I like the format and the way you present these data modeling topics. And luckily we still have someone who strongly believes in data modeling!
Thanks for the comment! Star Schema/Dimensional Modeling is just one of many approaches you could take in that middle "warehouse" layer. It's personally what I'm most comfortable with but you could certainly use other techniques.
Straight to the point, with clear examples. It helps me a lot. Thanks Kahan
Thanks for watching!
Really, really clearly explained. I like the idea of separating out a staging view for trivial rename/cleaning vs. business logic in the warehouse layer loads. I push for a very similar architecture, but I'm going to integrate that detail going forward.
Glad it was helpful!
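As a minimal sketch of that staging/warehouse split, a dbt staging model might hold only renames, casts, and light cleaning (all table and column names below are hypothetical, not from the video):

```sql
-- models/staging/stg_orders.sql -- hypothetical sketch.
-- Staging: trivial renames, casts, and light cleaning only.
-- Business logic belongs in the warehouse layer, not here.
{{ config(materialized='view') }}

select
    id                          as order_id,
    cust_id                     as customer_id,
    cast(amt as numeric(12, 2)) as order_amount,
    upper(trim(status))         as order_status,
    cast(created_at as timestamp) as ordered_at
from {{ source('shop', 'raw_orders') }}
```

Warehouse-layer models would then build facts and dimensions (with joins and business rules) on top of these views via `{{ ref('stg_orders') }}`.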
Simple and great, thanks for this. From my point of view, though, the staging layer holds the raw data tables (incremental, one-to-one from sources), and enforcing reads through a VIEW is perfect, with a 2-3 day retention for staging.
What do you think?
Thanks a lot! I watch all your vids and I've subscribed. Could you share the layers for DEV and CI, please?
Check out this video - th-cam.com/video/a7H9Rkz2kqo/w-d-xo.htmlsi=e-qIKGRHuo34dvwu
@@KahanDataSolutions Thanks a bunch!
Hi @KahanDataSolution. How did you merge this organization with the dbt folder structure recommendations, especially for the intermediate and marts layers/folders? Do you advise creating a new folder named warehouse? Do you advise not letting end users have access to facts and dimensions? You're speaking about OBT; is this the way to go?
I tend to break out warehouse as a separate directory. In dbt it'll look something like this:
models/
  marts
  staging
  warehouse
For intermediate, these are more like "helper" models. So you could organize them in a separate directory right under models/, or nest them under the particular layer they're supporting. For example:
models/
  marts
  staging
  warehouse
  intermediate
As mentioned in the video, I prefer to keep end-user access limited to Marts but some organizations allow direct fact/dimension access. It really depends on the user base, their understanding & familiarity with those concepts.
Overall, there is no single "right" approach; it will be a mixture based on your particular company. But this 3 layer design has worked well for me and others as a great starting point.
For more on the different "warehouse" models, check out this video - th-cam.com/video/IdCmMkQLvGA/w-d-xo.htmlsi=ZyfIVFSII7qznVBI
Hope this helps!
This is an idempotent pipeline. Can you do an incremental one via dbt?
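One way to make a model incremental in dbt is the `incremental` materialization, which processes only new rows after the first full build. A minimal sketch with hypothetical model and column names:

```sql
-- Hypothetical incremental dbt model (names are made up).
-- First run: builds the full table. Subsequent runs: only rows
-- newer than what's already loaded are processed and merged in.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_amount,
    ordered_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- {{ this }} refers to the already-built target table
where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```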
So where in this example would you implement the dbt snapshotting?
At the client I'm working for atm, we have the raw source data coming into the landing schema. Then in the staging schema the history is built (dbt snapshot, with a select * from landing.table).
The snapshotted tables are then used in the models, where (among other things, ofc) we rename columns and clean some data.
I'd probably look to do something like this:
source > snapshot > staging > warehouse > mart
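As a sketch, a dbt snapshot sitting between the raw source and staging might look like this (schema, table, and column names are hypothetical):

```sql
-- snapshots/orders_snapshot.sql -- hypothetical sketch.
-- Captures row history (SCD Type 2) between the raw source
-- and the staging layer.
{% snapshot orders_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('shop', 'raw_orders') }}

{% endsnapshot %}
```

Staging models would then select from `{{ ref('orders_snapshot') }}` instead of reading the source directly.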
What’s the benefit of setting up staging as view instead of a regular table?
To avoid duplicate storage, since typically there are no other joins involved. Think of it like a glorified select statement. However, if performance becomes an issue you may look to deploy it as a table.
Views will always grab the latest data upstream, so they'll always be fresh relative to the source data.
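In dbt, flipping a staging model between the two is a one-line materialization change; a hypothetical sketch:

```sql
-- Hypothetical staging model: start as a view (no duplicate storage,
-- always fresh relative to the source), and switch to
-- materialized='table' only if query performance demands it.
{{ config(materialized='view') }}

select
    id    as customer_id,
    email as customer_email
from {{ source('app', 'raw_customers') }}
```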
Where do you feel Entity Resolution fits into this? For example, I have multiple data sources pertaining to the same data type; does ER fit between staging and the DW in this 3 Layered Approach?
Sounds like that would be handled in the Warehouse layer. Each table in the different sources would have its own Staging view, then combined in the Warehouse to create the single entity.
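A sketch of what that warehouse-layer combination could look like, with hypothetical staging models for two sources and a deliberately naive match on email:

```sql
-- Hypothetical warehouse model resolving one "customer" entity
-- from two source systems, each with its own staging view.
with crm as (
    select email, full_name, 'crm' as source_system
    from {{ ref('stg_crm_customers') }}
),

billing as (
    select email, full_name, 'billing' as source_system
    from {{ ref('stg_billing_customers') }}
),

combined as (
    select * from crm
    union all
    select * from billing
)

-- Naive resolution: one row per email, preferring CRM attributes.
-- Real entity resolution usually needs fuzzier matching rules.
select
    email as customer_email,
    coalesce(
        max(case when source_system = 'crm' then full_name end),
        max(full_name)
    ) as full_name
from combined
group by email
```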
Thanks a lot for the practical POV!!
Do you still suggest reading Kimball, or are there better, more modern and practical books on modeling?
Love your content man
I still like to follow a star schema for the core "warehouse" model for clarity & organizational purposes. Then create wider user-facing "marts" on top of those to leverage the capabilities of more modern cloud DBs.
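A sketch of that pattern: a wide, user-facing mart (OBT-style) denormalized on top of star-schema warehouse models. All model and column names here are hypothetical:

```sql
-- models/marts/orders_obt.sql -- hypothetical sketch.
-- Wide "one big table" mart built on top of the star schema,
-- so end users query one flat table instead of joining facts/dims.
select
    f.order_id,
    f.order_amount,
    c.customer_name,
    c.customer_region,
    d.calendar_date,
    d.fiscal_quarter
from {{ ref('fct_orders') }}    as f
join {{ ref('dim_customers') }} as c on f.customer_key = c.customer_key
join {{ ref('dim_dates') }}     as d on f.date_key     = d.date_key
```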
@@KahanDataSolutions Thanks for your insight.
I've heard Lakehouse modeling is hyped, but it only happens at larger companies. So is it important to get thoroughly in touch with the basics before heading toward such complex topics?
@@chiragpramod308 Learning the basics before moving to complex topics is always a good strategy.