Advancing Spark - Databricks Delta Live Tables First Look

  • Published 14 Jul 2021
  • From the initial Spark Summit talks about "engineering pipelines" we've been super excited to see where Databricks will go with automated engineering. Earlier this year we saw Delta Live Tables announced... but what do they actually do?
    In this first-look video, Simon digs into the DLT Quickstart, picking apart what the code is actually doing, highlighting a few misconceptions and getting you started with your first Delta Live Table Pipeline!
    As a reminder, at this time it is still a public preview feature, so you might not have access just yet, but you can still explore the docs and read up about how it might help you here: docs.microsoft.com/en-us/azur...
    As always - don't forget to Like & Subscribe, and let us know what you think!
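
    For readers who want the gist before watching: below is a minimal Python sketch of the kind of pipeline the DLT Quickstart defines, assuming the public-preview dlt API. The path and column names are hypothetical placeholders, not the exact Quickstart code.

      import dlt
      from pyspark.sql.functions import col

      # `spark` is provided by the Databricks notebook runtime

      # Bronze: ingest raw JSON from storage (path is a placeholder)
      @dlt.table(comment="Raw clickstream data ingested from cloud storage")
      def clickstream_raw():
          return spark.read.json("/mnt/landing/clickstream/")

      # Silver: clean the raw table; rows failing the expectation are dropped
      @dlt.table(comment="Cleaned clickstream data")
      @dlt.expect_or_drop("valid_click_count", "click_count > 0")
      def clickstream_cleaned():
          return (dlt.read("clickstream_raw")
                  .select(col("curr_title").alias("page"),
                          col("n").alias("click_count")))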

Comments • 55

  • @MegaSb360
    @MegaSb360 1 year ago

    WoW !!! Thank you so much. For the last couple of months, I have been struggling to understand DLT. I wish I had known sooner that a ~30-minute video would do the trick.

  • @Sriramiyer1992
    @Sriramiyer1992 11 months ago

    The way the code was explained was outstanding!

  • @ashrafrcet
    @ashrafrcet 2 years ago

    Thank you Simon for covering DLT in a few minutes.. Very helpful as always..

  • @danielperico2806
    @danielperico2806 2 years ago +1

    Wow, that's amazing. Thank you Simon!

  • @Simondoubt4446
    @Simondoubt4446 2 years ago +2

    Love these videos. Thank you Simon!

  • @user-ui1oh5zf6t
    @user-ui1oh5zf6t 3 months ago

    Very good!! Explained perfectly!

  • @JianZhouVA
    @JianZhouVA 2 years ago +2

    Oh crap. I wrote my own Delta Live Table-like implementation (not as fancy, of course) early this year. Now I need to make a choice... Need to read the docs and get on a call with the Databricks folks. Got a lot of questions. Thanks for the video!

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      I think there are a lot of people in the same place! Build/maintain your own framework with all the flexibility, or take the out-of-the-box option for simplicity. It will be interesting to see, as it matures, how feasible the latter is!

    • @bobj8690
      @bobj8690 2 years ago

      haha me too! I have also been working on my own implementation that aims to populate a pipeline of Delta Lake tables. My biggest challenge is figuring out 'which part of a downstream table needs to be updated because the corresponding part of the upstream table was updated'. Somehow I think the 'inode' concept in OS file systems might help...
      Would be interesting to see DLT's approach!

  • @amateurvisser
    @amateurvisser 2 years ago

    Great explanation. Thanks. Wondering how to do incremental loading, reprocessing, watermarks and all that good stuff.

  • @suresh.suthar.24
    @suresh.suthar.24 1 year ago

    best explanation

  • @biancairis93
    @biancairis93 2 years ago

    this looks really cool indeed. The expectation checks are neat, I wonder what else they will introduce to make DQ/testing of the pipelines easier.

  • @dmitryanoshin8004
    @dmitryanoshin8004 2 years ago

    Nice work! Not sure where to use it now, but looks cool!

  • @deenquotes786
    @deenquotes786 2 years ago

    good work Man :)

  • @NeumsFor9
    @NeumsFor9 1 year ago

    For anyone coming from visual ETL, just think check constraints + the SSIS error path captured in metadata when check constraints or other constraints are violated + data quality output process metadata + the ability to define your own hardware.....minus the overhead of the RDBMS transaction log.
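
    (For anyone mapping that analogy onto code: DLT expectations come in three flavours that roughly mirror those error-path behaviours. A hedged Python sketch; the table and rule names are hypothetical.)

      import dlt

      @dlt.table(comment="Orders with data quality rules applied")
      @dlt.expect("has_name", "customer_name IS NOT NULL")     # log violations, keep rows
      @dlt.expect_or_drop("valid_amount", "amount > 0")        # drop violating rows
      @dlt.expect_or_fail("valid_id", "order_id IS NOT NULL")  # abort the update on violation
      def orders_validated():
          return dlt.read("orders_raw")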

  • @vikashmishra2759
    @vikashmishra2759 2 years ago

    Hi Simon....That was a really nice video and I loved it. All my doubts are cleared. Do you have a video on merging Delta tables using Z-ordering & multilevel partitioning to optimize incremental loads? If yes, please share the link.

  • @krishnakoirala2088
    @krishnakoirala2088 1 year ago +1

    Thanks as always for those awesome videos you create. A question: how can we make the 3 tables created appear in the Data tab under a schema (a.k.a. database)? And what happens to the 3 data folders created in storage if we don't specify the location/path while configuring; where do they go?

  • @devanshsharma7929
    @devanshsharma7929 2 years ago +1

    Hi, thanks for the awesome video. I would like to know whether DLT can read data from Kafka or not? At our company, we wish to read data from Kafka, transform it and then load it to Cosmos DB. Want to know whether this is possible using DLT.

  • @lackshubalasubramaniam7311
    @lackshubalasubramaniam7311 2 years ago +2

    Delta Live Tables is an odd name. However the workflow concept is really cool. Been playing with it and like the expectations bit. The dependencies appearing as a diagram are cool too...somewhat of a lineage concept. Prefer Python over SQL. Find the SQL bit limiting. Probably works for the Data Analyst role as you mentioned.

  • @sankhachakraborty5801
    @sankhachakraborty5801 2 years ago

    Thanks for the video Simon. Have enjoyed watching it as always. I have got a quick question. Can we execute Delta Live Table pipelines from orchestrators such as Data Factory or Apache Airflow?

    • @alexischicoine2072
      @alexischicoine2072 2 years ago +1

      I'm not Simon, but what I've been doing is creating a job that runs the pipeline and then starting the job using the REST API. If you need to wait for the job to finish, you can write a while loop that polls the job's status via the API. You can make the calls from your tool, or do it in a notebook on a tiny machine if that's easier.
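
      (A rough sketch of that pattern against the Databricks Jobs REST API, for illustration; the workspace URL, token and job_id are hypothetical, and error handling is omitted.)

        import time
        import requests

        HOST = "https://<workspace>.azuredatabricks.net"  # hypothetical workspace URL
        HEADERS = {"Authorization": "Bearer <personal-access-token>"}

        # Trigger the job that wraps the DLT pipeline
        run = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                            headers=HEADERS, json={"job_id": 123}).json()

        # Poll until the run reaches a terminal state
        while True:
            state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                                 headers=HEADERS,
                                 params={"run_id": run["run_id"]}).json()["state"]
            if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
                break
            time.sleep(30)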

  • @gardnmi
    @gardnmi 2 years ago +1

    It almost seems like they asked a Scala developer to write some Python code and he took some creative liberties. That is the craziest-looking code I've ever seen from a professional company trying to sell a product.

  • @AVGMachine
    @AVGMachine 2 years ago +2

    Great video!
    Databricks has decided to grow the number of services it provides over time. However, it's getting a bit confusing since we now seem to have services that are competing within the same ecosystem.
    When would you recommend using ADF instead of DLT?

    • @limitlesslife7536
      @limitlesslife7536 2 years ago +2

      When you have data processing steps that are not encapsulated in the Databricks environment, use ADF. If all your ETL steps are in Databricks, use DLT.

  • @kuldipjoshi1406
    @kuldipjoshi1406 2 years ago

    Is it an incremental load or a full load? What does the behind-the-scenes write statement look like? Can we partition or bucket while writing?
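
    (On the partitioning part: the Python API exposes partition columns on the table decorator, though bucketing doesn't appear to be supported. A hedged sketch with hypothetical names.)

      import dlt

      # partition_cols controls the physical partitioning of the written Delta table
      @dlt.table(partition_cols=["event_date"],
                 comment="Events partitioned by date")
      def events_partitioned():
          return dlt.read("events_raw")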

  • @pranesh1213
    @pranesh1213 1 year ago

    Can we do model scoring within the Delta Live Table definition script? Like pick a model from the registry, load it as a UDF and apply it live?
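
    (One plausible pattern, untested here: load a registered MLflow model as a Spark UDF inside the table function. All names are hypothetical.)

      import dlt
      import mlflow.pyfunc
      from pyspark.sql.functions import struct

      @dlt.table(comment="Events scored with a registered model")
      def scored_events():
          # Load the registry model as a Spark UDF (model name/stage hypothetical)
          predict = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/Production")
          df = dlt.read("events_cleaned")
          return df.withColumn("prediction", predict(struct(*df.columns)))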

  • @advanceddataengineering3784
    @advanceddataengineering3784 2 years ago

    Is it possible to use some other source like a JDBC database or Azure Event Hubs instead of cloud files?
    BTW, I watch your videos regularly. Great Work!! Thanks.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Yep, anything that has a Spark dataframe reader! I'm sure there's a little bit of nuance with the weirder ones like Event Hubs, but it's just spinning up a Spark job, so it should be doable with most things Spark can read!
      Simon
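
      (To make that concrete, a hedged sketch of a DLT table wrapping a streaming Kafka read; broker, topic and table names are hypothetical, and a JDBC batch read would follow the same shape.)

        import dlt

        @dlt.table(comment="Raw events streamed from Kafka")
        def kafka_events_raw():
            return (spark.readStream.format("kafka")
                    .option("kafka.bootstrap.servers", "<broker>:9092")
                    .option("subscribe", "events")
                    .load()
                    .selectExpr("CAST(key AS STRING) AS key",
                                "CAST(value AS STRING) AS value",
                                "timestamp"))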

  • @alexischicoine2072
    @alexischicoine2072 2 years ago

    Does anyone know how to start a full refresh without clicking the button in the UI? I've only been able to set up a regular refresh, not a full one.
    I have a use case where I run a streaming query, but periodically I save the output and restart with a new streaming input so it doesn't grow too large for steps that run over the whole dataset and aren't streamed. Right now I send myself an email reminder to go and do it, which isn't ideal. Otherwise the refresh throws an error, as expected, after I modify the streaming input.
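
    (For anyone else stuck on this: the DLT Pipelines REST API appears to let you start an update with a full refresh directly, so it shouldn't need the UI button. A hedged sketch; host, token and pipeline id are hypothetical.)

      import requests

      HOST = "https://<workspace>.azuredatabricks.net"
      HEADERS = {"Authorization": "Bearer <personal-access-token>"}

      # Start a pipeline update, asking for a full refresh rather than incremental
      requests.post(f"{HOST}/api/2.0/pipelines/<pipeline-id>/updates",
                    headers=HEADERS, json={"full_refresh": True})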

  • @nithishreddy752
    @nithishreddy752 11 months ago

    Can we implement CDC capture and column name changes or transformations in a single layer in DLT?

  • @neelbanerjee7875
    @neelbanerjee7875 1 year ago

    Thanks for this awesome video.. However a quick question -
    Using Python we can apply multiple additional functionalities to a dataframe inside the Delta Live Table function, like custom functions, UDFs and also multi-step programming (as you showed).. but I don't think we can do all of those using SQL in a Delta Live Table.. Could you please correct me if I'm wrong?

    • @AdvancingAnalytics
      @AdvancingAnalytics  1 year ago +1

      Nope, you don't have the same iterative power as you would in PySpark, but you can certainly achieve a lot. I've not tested whether Databricks SQL functions work inside DLT, but if they do, that's most of the functionality you list covered!

  • @alexischicoine2072
    @alexischicoine2072 2 years ago

    Love your channel.
    Regarding the import dlt, I found it annoying as well. I think it might be possible to hack together a fake import dlt so the function definitions at least run, and to provide some autocomplete with docstrings. I'm going to look into it if I can grab some source code at runtime (see the sketch below).
    Have you had a chance to look at the new multi-task jobs / orchestration? For example, I have a use case where I run a merge from a parquet source into a Delta table that I then use as the first source for streaming in my Delta Live Tables pipeline. With this new feature they can be run one after the other in the same job.
    Keep up the good work; your videos are really clear and you have an engaging presentation.
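
    (The fake module idea might look something like this: no-op stand-ins so DLT notebook code parses and runs locally. Purely a hypothetical sketch, not an official shim.)

      # dlt_stub.py - no-op placeholders for local editing only
      def table(*args, **kwargs):
          """Stand-in for @dlt.table; returns the function unchanged."""
          if args and callable(args[0]):
              return args[0]            # used as a bare @table
          def wrap(func):
              return func
          return wrap

      # Expectations behave the same way: decorate and pass through
      expect = expect_or_drop = expect_or_fail = table

      def read(name):
          raise NotImplementedError("dlt.read only works inside a real pipeline")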

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      I'm holding off on mocking up the dlt library for now - I'm hoping that it'll be baked into future DBX runtimes (at least for autocomplete etc as you say), and it's just the preview nature that means it uses a custom runtime... but we'll see what it looks like as it moves towards general availability!
      Haven't looked at multi-task jobs yet - I checked yesterday and my workspace isn't enabled yet. I'll have a check early next week and put together a quick vid!
      Simon

  • @jespermartinsson8331
    @jespermartinsson8331 1 year ago

    How do you actually specify the storage location for the Delta Live Tables pipeline on Azure Data Lake without mounting it to DBFS?

  • @morrolan
    @morrolan 11 months ago

    I always develop/test my notebook code locally, and then as a final step deploy to Databricks. With DLT, I feel the costs will skyrocket with those clusters needing to be running, and it's also very slow. I am really hesitant to use this in its current state.

  • @tiagorente2860
    @tiagorente2860 2 years ago

    We write our notebooks in Scala, but looking at your video the supported languages are Python and SQL.
    Do you know if Scala will be a possible language to use in DLT?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      Honestly don't know! Some of the more abstracted Databricks elements (table access control, AAD passthrough etc) are Python/SQL only, so it may be a similar limitation? No idea what the future plan is inside Databricks!
      Simon

  • @ezequielchurches5916
    @ezequielchurches5916 1 month ago

    Is clickstream_raw mapped to the BRONZE layer? Is clickstream_cleaned mapped to the SILVER layer? How can I map each Delta table to the medallion layers?

  • @joyyoung3288
    @joyyoung3288 1 year ago

    Please could you help with the storage location? I've bumped into some problems.

  • @thomasadams6860
    @thomasadams6860 2 years ago +1

    Thoughts on dbt compared to this? Seems very similar.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      Yeah, seems to be aiming at a similar space but has a lot less polish so far. Honestly I've only dabbled with DBT so can't comment much further!

  • @dmitryanoshin8004
    @dmitryanoshin8004 2 years ago +1

    Does it replace Azure Data Factory to some extent?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Potentially - it certainly does some of the orchestration elements, but it isn't as good at other workflow elements, copying data into the platform etc!

  • @denisgodunov6157
    @denisgodunov6157 4 months ago

    The transformations are very similar to what we have in dbt.

  • @prashantthakur4324
    @prashantthakur4324 2 years ago

    Would it be possible to use Delta Live Tables for some temporary jobs, like giving a user decrypted data, where a decryption job runs and once the use is over we delete the data from the workspace?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      You certainly could - seems like a lot of setup if it's a throwaway bit of data. Probably easier to manage that via a custom notebook?

    • @prashantthakur4324
      @prashantthakur4324 2 years ago

      @@AdvancingAnalytics We would like to expose this to other clients like Tableau, so within Databricks a custom notebook is fine, but for external clients we wanted to use this option.

  • @PersonOfBook
    @PersonOfBook 2 years ago

    How is this different from a SQL view? Also, can I do upserts and deletes on a Delta table using this?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +1

      The view objects literally are just SQL views; the only difference is the extra wrapping that lets you materialise the data back to physical Delta tables.
      Updates & deletes aren't currently supported, but I'm hoping we'll see them in there eventually!

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +2

      Speak of the devil - just announced: databricks.com/blog/2022/02/10/databricks-delta-live-tables-announces-support-for-simplified-change-data-capture.html
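
      (From that announcement, the Python side centres on dlt.apply_changes; a minimal hedged sketch, with table and column names hypothetical.)

        import dlt
        from pyspark.sql.functions import col, expr

        # Declare the target streaming table, then apply CDC records into it
        dlt.create_streaming_table("customers")

        dlt.apply_changes(
            target="customers",
            source="customers_cdc",        # upstream feed of change records
            keys=["customer_id"],          # key used to match rows
            sequence_by=col("event_ts"),   # ordering for late/duplicate events
            apply_as_deletes=expr("operation = 'DELETE'"),
        )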

    • @PersonOfBook
      @PersonOfBook 2 years ago

      @@AdvancingAnalytics That is amazing. Thanks for sharing the link :)

  • @GuillaumeBerthier
    @GuillaumeBerthier 2 years ago +1

    Thanks Simon for this new video, I love your channel!
    I'm not very familiar with Databricks, but I just did some practice on Azure Synapse (based on your video th-cam.com/video/lpBM4Yn2k3U/w-d-xo.html ) and after watching this new video on Delta Live Tables I was wondering if the same "outcome" couldn't be achieved with a Synapse (scheduled) Pipeline and Data Flows (with a Delta table in the sink)
    ....Or did I completely miss the point (which is entirely possible :p )

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago +4

      Hey! Yep, you could build a pipeline loading a delta table using Synapse pipelines & data flows that would achieve the same loading for the quick batch example - the deeper point of DLT is around incremental loading, the data quality elements and (hopefully) building some reusable transformation functions, which you wouldn't be able to do to the same level in a Synapse data flow!
      Simon

  • @NeumsFor9
    @NeumsFor9 1 year ago

    Expectations...... Assert transforms....... check constraints.......same stuff, different day.....