Probably one of the best examples of creating a framework using DLT. I was looking for something like this for some time. Thanks!
Exactly! I’m not seeing anything in the documentation on how to do this.
Love how you present your thought process and how you iteratively improve it and also that you don't take out things that didn't work in the first place. Keep up the great work so we mere mortals can learn as well :-)
I will keep making mistakes as I go, don't you worry :)
Great work mate!!! I am very thankful for your video content and all the work you are putting into it
Hi Simon, Many thanks. You are the first to demo a metadata-driven framework using DLT. Could you share the code used in this demo?
Beautiful approach.. really liked!
Thank you so much for sharing !! Great presentation
Great video. I got 2 questions -
1. If there is a cycle in the dependencies a->b->c->a does databricks detect and flag that?
2. How does Restartability work? can we restart from the point of failure?
I would be curious to know more on these points.
Honestly, haven't tried circular references but I'd hope it throws an error during dependency processing before the pipeline commences. From a restart point of view, you can restart workflows, but I can't recall if you can restart a DLT pipeline.
Time for a quick experiment I think!
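For anyone curious what that kind of check looks like, here is a minimal sketch of cycle detection over a dependency map. It is plain Python and makes no claim about how DLT resolves its graph internally; `find_cycle` and the table names are made up for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each table to the tables it reads from (hypothetical structure).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for upstream in deps.get(node, []):
            seen = state.get(upstream, WHITE)
            if seen == GREY:
                # upstream is already on the current path, so we looped back
                return path[path.index(upstream):] + [upstream]
            if seen == WHITE:
                found = visit(upstream)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for table in deps:
        if state.get(table, WHITE) == WHITE:
            cycle = visit(table)
            if cycle:
                return cycle
    return None


cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
acyclic = {"gold": ["silver"], "silver": ["bronze"], "bronze": []}

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

A check like this can run before any data is read, which is the behaviour hoped for above.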
Hi Simon, I love your videos. They're straight to the point and cover the points I am looking for pointers on (like the missing features :)). Anyway, with regard to this DLT framework that you have built, let's say I want to load the curated data and skip the dependency on the silver tables refresh, basically executing just the gold notebook component in the DLT pipeline. Is that possible?
Great demo, thanks - hoping (guessing you will one day) you'll delve into 2 key DLT topics: 1) DLT testing strategies and patterns 2) CI/CD and topics like rollback (e.g. can we time travel table data & metadata alongside code?)
…and considering SQL too! :)
Certainly some elements we'll dig into - the tables created are just delta tables so you can use all the usual delta tricks. For CI/CD & Rollback, it's interesting as they're mainly just databricks notebooks, which can be deployed using repos, devops pipelines etc, but I've not seen a formal CI/CD story around them!
@AdvancingAnalytics great, thanks for the reply. Defo need to dig into operational behaviours (e.g. when updating a pipeline, will DB let all batches being processed in different pipeline steps finish first and then change it? Manual replays of quarantined data, rollback). The mechanics of CI/CD are probably easy enough to figure out, it's the behaviours I'm a little unclear on ATM. Also emerging patterns / best practice around DLT testing, this feels super important. Love your work, thanks heaps for the great content!!!!
Hi Simon. DLT features look great but my concern is about the balance between features and costs. DLT is something like 3x more expensive in terms of DBUs. What's the best approach for cost-efficient designs? DLT or classic workflows?
Great demo Simon. Will you help with integrating this gold/silver data into a visualization tool like Superset or Plotly?
If you have covered this already, please share the URL/tutorial.
You rock! I have a problem with switching from streaming (appending) to batch, where I have fact tables and I need to use historical data and update. What is the best practice for transition from stream to batch?
We generally stream into our "Base" layer (call it silver, clean, ODS, whatever) and then batch read from there into our "Curated" (gold, model etc). If we are worried about historical records in our curated layer, we will often store type 2 or type 4 data in Base and ensure our Curated is built in such a way to apply those changes and rebuild the history en-masse.
There's no current way to do that using DLT, but I'd hope to see things like merges coming in the future so we can apply that kind of pattern!
Simon
Does it run the tables in parallel (concurrently) or in sequence (one at a time)?
Had the same question, did you figure this out Jordan?
The actual loading happens concurrently, unless there is a dependency between two different loads. DLT will understand this and load the dependency first.
Hi, it is a great video. Could you also let us know how to drop a Delta Live Table?
Great demo / example.
How do dependencies on individual tables in the pipeline work for a downstream pipeline that depends on them?
They need to be in the same "DLT Pipeline" for the dependencies to figure themselves out. But if you run it as an entirely separate job based on the hive tables created by DLT, it should still work if it's scheduled to run after the initial one.
Hi and thanks for all the hard work you put into these presentations.
A Delta Live Table will load all new data that arrives in the source (for instance, the landing zone defined), but will the data already loaded need to stay present in the source in the future, or can these records be removed or moved after the initial load to the bronze layer? Thanks!
If you ever need to rerun the pipeline with full refresh, that is the only time that the historical data will be retrieved. Let's say you delete the old data, if you re run the pipeline with full refresh, only the data that is present will be loaded. So yes, the raw csv files should be present for as long as you intend to keep the history on the tables for.
Another question: at big scale, with hundreds of tables, how do you recover a broken stream and fix the staging tables?
With most of the streaming patterns, the unfortunate answer is to delete the checkpoint and reprocess, which can sometimes lead to HUGE reprocessing time/cost. If you're working with standard streaming you can include the delta transaction starting version, but with DLT you don't have any option if you want to restart, it's incremental loading or a complete rebuild!
Wanted to know what CI/CD would look like. Deploying notebooks should be easy enough, but how would the schedule (trigger) be deployed? Is there any way to call the workflow notebook from Data Factory and use triggers, and also pass configs from there to either the workflow or the notebook directly?
Is it possible to use writeStream inside Delta Live Tables?
Hi Simon, could you please provide a link to the code used in this demo?
We are considering DLT, I suspect for a file ingest + validation pipeline. So very interested in the templating, but we need more exploration on the exceptions and corresponding logging as we have a number of validations and some are complex. Also, we logically split the data flow into valid and invalid data sets (both end up stored), so the question is how that would work here.
The expectations are great, provided you can express it as a SQL statement, but if you want to retain & split, I'd instead use similar logic to add a calculated column to the data frame, then define two table destinations (aka - do it the same as you would in a normal spark batch). There's possibly something you could do that's a little more baked into the process, but there are certainly approaches!
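As a rough illustration of the "flag, then split" idea, here is a sketch in plain Python where dictionaries stand in for DataFrame rows. The `amount` rule, the `is_valid` column name and both function names are assumptions made up for the example; in a real pipeline the same logic would be a calculated column plus two filtered table definitions.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def flag_rows(rows: List[Row]) -> List[Row]:
    """Add an is_valid column instead of dropping the rows that fail the rule."""
    flagged = []
    for row in rows:
        amount = row.get("amount")
        # Hypothetical validation rule: amount must be present and non-negative
        is_valid = amount is not None and amount >= 0
        flagged.append({**row, "is_valid": is_valid})
    return flagged


def split_rows(flagged: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Route flagged rows to two destinations so both sets are retained."""
    valid = [r for r in flagged if r["is_valid"]]
    invalid = [r for r in flagged if not r["is_valid"]]
    return valid, invalid


rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": 3, "amount": None},
]
valid, invalid = split_rows(flag_rows(rows))
print([r["id"] for r in valid])
print([r["id"] for r in invalid])
```

Keeping the flag as a column means the invalid destination still carries the full record, which helps when the validations are complex and need investigating later.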
Can you use a metadata based framework with delta live tables? If so, how would you do that?
Thanks, great demo Simon. I like the for loop you created in the first notebook (bronze to silver) to make it a generic pipeline. Would the loop run in parallel or sequentially? Thinking about whether a failure in one of the DLT tables would have any impact on the other tables.
The definition of the table is performed sequentially, but that doesn't have much relevance. The actual loading of data will be performed in parallel (as much as your cluster sizing allows for), as it works out dependencies & table definitions before kicking any actual spark jobs off.
Failure is an interesting one - if one of the definitions failed, I assume the whole pipeline will fail to resolve. If there is a failure whilst loading a specific table, it would be interesting to see how the rest of the dependencies are managed! I assume running jobs will complete but no more jobs will be kicked off, but I haven't tested it!
Simon
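To picture the "work out the dependencies first, then load in parallel" behaviour, here is a small sketch that groups tables into waves, where everything in a wave could in principle load at the same time. It uses the standard library `graphlib` module (Python 3.9+); the table names and `load_waves` are hypothetical, and this is not a description of DLT's actual scheduler.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def load_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tables into waves; each wave only depends on earlier waves.

    deps maps a table to the set of tables it reads from.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tables with no unfinished upstreams
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as loaded
    return waves


deps = {
    "silver_customers": {"bronze_customers"},
    "silver_orders": {"bronze_orders"},
    "gold_sales": {"silver_customers", "silver_orders"},
}

for number, wave in enumerate(load_waves(deps), start=1):
    print(number, wave)
```

The order in which the definitions were written does not affect the waves, which matches the point above that sequential definition has little relevance to how the loading runs.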
How can we merge incoming data to silver layer using delta live tables?
1. Does it mean that, to have visibility of all end-to-end lineage, we need to include all notebooks in a single pipeline?
2. When is it going to support updates into a table?
Yep, to see all lineages together, it would have to be in a single pipeline - although the demo slides around unity catalog showed similar lineage, so that might bridge that gap? Maybe?
Regarding updates I have no idea of timescales, but Databricks know there's a lot of use cases for people needing merge functionality!
Simon I love the tutorials! If data contains GDPR fields or there was a need to add surrogate keys for Dim tables at the Gold level, I'm not sure if this approach would be realistic. Seems a lot more metadata and complexity would be required? Let me know if I am being short-sighted on this. Maybe those would require different patterns?
Oh for sure, the examples I'm running through are the most basic "I need a simple data model" patterns while I kick the tyres on DLT. As consultants, Advancing Analytics use a home-grown set of Python libraries, metadata stores and custom notebooks to automate the complexity you're talking about, and you can't get that inside DLT yet. It's heading in a promising direction, but it's still the "easy path" solution for now!
Simon
I tried a similar pattern 2 weeks ago but the "second layer" table was unable to refer to the "first layer" table since they weren't created in the same function call. I was also unable to refer from one live table to another live table created in a different notebook (although part of the same pipeline). Got a weird "out of graph" or something error..
Also interested to know any info on update instead of append/overwrite?
Appreciate your contributions Simon!
Hrm, weird - were you referring to the tables using their full hive name, or using live. as the "database"? Seemed to work fine as long as everything was live.tablename, whether they were in different functions, notebooks or whatever, as long as they're all in the same pipeline!
No news on updates, the databricks team know it's a big customer ask, but I'll share when there's news!
@AdvancingAnalytics I used live.tablename. It worked fine if the DLT table was declared in absolute terms (i.e. not parametrized inside a func). I gotta retry then..
Hi Simon, do those different bronze to silver extractions happen in parallel or in sequence? If it's a sequence, would an ADF-based parameterization, where you can call multiple bronze to silver ingestions in parallel, perhaps be a better approach?
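Not a diagnosis of the error above, but one general Python pitfall worth ruling out when table definitions are generated in a loop: a function body reads the loop variable when it runs, not when it is defined. The sketch below uses a made-up `table` decorator and `registry` dict as stand-ins, so it runs anywhere and is not the DLT API.

```python
from typing import Callable, Dict

registry: Dict[str, Callable[[], str]] = {}


def table(name: str):
    """Hypothetical stand-in for a table decorator: stores the function by name."""
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        registry[name] = fn
        return fn
    return register


# Version 1: defined directly in the loop. The name is fixed at definition
# time, but the body looks up t later, after the loop has finished.
for t in ["customers", "orders"]:
    @table(name=f"silver_{t}")
    def _load() -> str:
        return f"read live.bronze_{t}"

late = {name: fn() for name, fn in registry.items()}
registry.clear()


# Version 2: a factory function gives each definition its own copy of t.
def define_silver(t: str) -> None:
    @table(name=f"silver_{t}")
    def _load() -> str:
        return f"read live.bronze_{t}"


for t in ["customers", "orders"]:
    define_silver(t)

bound = {name: fn() for name, fn in registry.items()}

print(late)
print(bound)
```

In version 1 both tables end up reading from the last source in the list, which would produce a confusing graph; wrapping the definition in a function, as in version 2, avoids that.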
Bronze to silver will happen in parallel if there is no dependency. If there is a dependency, that will be honoured and the processing will happen in sequence.
Can we create 4 different delta live tables in a single same pipeline ?
Yep, you can create as many tables as you want, across separate notebooks even
Firstly Simon, thanks a lot for tackling & demonstrating such a framework; much appreciated. Just curious if you've played around with passing in parameters to a DLT pipeline, such as via ADF, and wondering what your thoughts are on it; i.e.:
1) pass a JSON output from an ADF activity as a string to the 'configuration' property of the DLT's pipeline JSON via REST API Web activity
2) retrieve the parameters passed in within the DLT notebooks (parse the 'JSON' string as you see fit/relevant)
3) essentially your 'tableList' is derived from the passed in parameters
Keen to know your experiments/approach in implementing a parameterised DLT from an external caller. Thanks in advance
Hi Luke, did you get anywhere with passing JSON as a parameter into DLT from an external source?
Thanks
@Ayush Mangal I did it via REST API call. Not sure if that answers your question?
You can very well do this. But you need 2 API calls. One is to update the pipeline with the new configuration, and the other is the actual call to start the pipeline. The REST API to start the pipeline does not have any ability to pass parameters to it. Also, only 1 instance of a DLT pipeline can execute at a time. You cannot treat this as a workflow where you can have the same workflow triggered multiple times with different parameters.
@anupamchand3690 thanks for your reply. I believe that's how I implemented it previously. Cheers.
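A rough sketch of the two-call approach described in this thread. It only builds the request descriptions and does not send anything. The endpoint paths reflect my understanding of the Databricks Pipelines REST API (edit a pipeline, then start an update) and should be checked against the current docs; the host, pipeline id, `build_calls` and `parse_table_list` are made up for illustration.

```python
import json
from typing import Any, Dict, List


def build_calls(host: str, pipeline_id: str,
                current_spec: Dict[str, Any],
                table_list: List[str]) -> List[Dict[str, Any]]:
    """Describe the two requests: edit the pipeline settings, then start it."""
    new_spec = dict(current_spec)
    config = dict(new_spec.get("configuration", {}))
    # Configuration values are strings, so the list travels as a JSON string
    config["tableList"] = json.dumps(table_list)
    new_spec["configuration"] = config

    edit = {
        "method": "PUT",
        "url": f"{host}/api/2.0/pipelines/{pipeline_id}",
        "json": new_spec,
    }
    start = {
        "method": "POST",
        "url": f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
        "json": {"full_refresh": False},  # the start call carries no parameters
    }
    return [edit, start]


def parse_table_list(raw: str) -> List[str]:
    """Inside the notebook: turn the configuration string back into a list.

    Assumption: in a pipeline notebook the raw value would be read with
    something like spark.conf.get("tableList").
    """
    return json.loads(raw)


spec = {"name": "demo_pipeline", "configuration": {"env": "dev"}}
calls = build_calls("https://example.cloud.databricks.com", "1234",
                    spec, ["customers", "orders"])
for call in calls:
    print(call["method"], call["url"])
```

Because the configuration lives on the pipeline itself and only one update runs at a time, two callers passing different table lists would overwrite each other, which is the limitation pointed out above.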
Great content as always Simon! One question on adding a notebook list (aka libraries) in the pipeline configuration: is the notebook execution carried out only sequentially? Is there a way that my 2 gold notebooks, which are not dependent on each other, can run in parallel after completing the first (bronze + silver) notebook?
Also, if my gold table is dependent on only 3-4 silver tables, will it still wait for the first notebook (bronze + silver) to complete?
DLT is intelligent enough to work out what the sequence is, if there is one at all. If there is no dependency, the notebooks will be executed together, provided there is enough computing power to do so.