Advancing Spark - Automated Data Quality with Lakehouse Monitoring

  • Published Jun 30, 2024
  • As data engineers, we've built countless ETL scripts that track data quality, over and over again. Wouldn't it be lovely if our systems just regularly polled the data and checked the DQ for us? Wouldn't it be great if we could apply a whole set of quality metrics across our tables as standard? Well, that's exactly what Databricks Lakehouse Monitoring is!
    In this video, Simon takes a quick look at Lakehouse Monitoring, enables it for a sample table, and runs through the quality metrics that are captured (a minimal code sketch of enabling a monitor follows below). If you're not already monitoring the quality of data in your Lakehouse... why not start now?
    For more details on Lakehouse Monitoring, check out: learn.microsoft.com/en-us/azu...
    If you're after some deep-dive, hands-on Spark training for the festive period, why not check out: advancinganalytics.teachable.com
    And if you're embarking on a Lakehouse journey, and want it to deliver serious value, why not give Advancing Analytics a call?
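For reference, here is a minimal sketch of enabling a snapshot monitor programmatically with the Databricks Python SDK, rather than through the UI as in the video. The catalog, schema, table, and assets-dir names are placeholders, and the SDK surface has shifted between releases (older versions expose this as lakehouse_monitors rather than quality_monitors), so treat this as indicative, not definitive:

```python
# Hedged sketch: enable a snapshot-profile monitor on a Unity Catalog table.
# All names below (catalog, schema, table, assets dir) are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

w.quality_monitors.create(
    table_name="main.demo.sample_table",        # fully qualified UC table to monitor
    assets_dir="/Workspace/Users/me/monitoring",  # where the generated dashboard assets live
    output_schema_name="main.demo",             # schema that receives the metric tables
    snapshot=MonitorSnapshot(),                 # profile the whole table on each refresh
)
```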

Comments • 8

  • @saugatmukherjee8119
    6 months ago +1

    Your videos are really helpful, Simon! I read about this last week, and these videos are a great primer before I try things out myself. Thanks! Right now, I have a custom framework where people can put in a YAML config describing a query and an alert definition based on that query; upon checking it into the repo, the pipeline creates a saved query, an alert, and, if the alert has a schedule, a job with a sql_task.
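For illustration only, here is a rough sketch of what such a config-driven pipeline might look like; every field name and the dispatch logic below are invented, not taken from the commenter's actual framework:

```python
# Hypothetical sketch of a YAML-driven query/alert pipeline like the one
# described above. All field names are invented for illustration.
import yaml

config_text = """
query:
  name: orders_null_rate
  sql: |
    SELECT count_if(order_id IS NULL) / count(*) AS null_rate
    FROM main.sales.orders
alert:
  condition: null_rate > 0.01
  schedule: "0 6 * * *"   # optional cron; presence implies a scheduled sql_task
"""

config = yaml.safe_load(config_text)

# On check-in, a CI job could branch on the parsed config:
if config["alert"].get("schedule"):
    print("create saved query, alert, and a job with a sql_task on the schedule")
else:
    print("create saved query and alert only")
```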

  • @ParkerWoodson-ow3ol
    4 months ago

    This is fantastic stuff that, like you said, should be done as a practice as part of the lifecycle management of your data. This could be especially helpful if you don't know where to start on implementing data profiling and testing. It's definitely helpful for determining the more specific what, where, and when of data testing and monitoring. The generic "turn the Databricks quality monitoring switch on" approach is only going to get you so far: it'll be excessive in some areas and not enough in others. To make it really useful, and not unnecessarily blow out your costs, fine-tuning this process is necessary IMO. I'm sure the feature will mature and hopefully allow finer control and extensibility, so I'll be watching. Thanks for always keeping us up to date and covering really useful topics in a "What does this mean for my everyday data job?" context.

  • @alexischicoine2072
    4 months ago

    I think it’s important to have an action plan when setting this up. If you don’t have a plan to either work with data producers or decide on cancelling a source, then I wouldn’t do it. We previously had monitoring of null percentages that we retired, because investigating the causes took too much time and we have hundreds of data producers.

  • @NeumsFor9
    6 months ago

    It's an OK start. However, this would be way more useful if we could monitor the data quality associated with each load batch as well, and if it utilized caching and optimization so that the profiling didn't take this long.
    We built a process to scan files, write all the data profiling results to a metadata repo, query those metrics and drifts as part of the ETL process, and take actions, all based on metadata. What you are showing isn't a bad complement to that... but I would prefer to see something more actionable. It is a good start though.
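As a rough illustration of that metadata-driven pattern (not the commenter's actual implementation; the table names and threshold are made up, and `spark` is the session a Databricks notebook provides):

```python
# Illustrative sketch: profile one load batch, persist the metrics to a
# metadata table, then gate the load on them. Names/thresholds are invented.
from pyspark.sql import functions as F

batch = spark.read.table("main.staging.orders_batch")

profile = batch.agg(
    F.count("*").alias("row_count"),
    F.sum(F.col("order_id").isNull().cast("int")).alias("null_order_ids"),
).withColumn("batch_id", F.lit("2024-06-30"))

# Append alongside earlier batches so drift can be queried later.
profile.write.mode("append").saveAsTable("main.metadata.dq_profiles")

# Act on the metrics as part of the ETL itself.
row = profile.first()
if row["null_order_ids"] / row["row_count"] > 0.01:
    raise ValueError("batch failed DQ gate: too many null order_ids")
```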

  • @fb-gu2er
    5 months ago

    It would be good to mention the cost. Do we get charged in DBUs for the work being done under the hood?

  • @alexischicoine2072
    4 months ago

    Like any background serverless task you get billed for, it sounds like it could get real expensive real fast if it’s taking six minutes for 60k rows. Probably not something I would try on the personal account I pay for :).

  • @yatharthm22
    4 months ago

    In my case, the two metric tables are not getting created automatically. Am I doing something wrong?
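One quick check for anyone hitting this: as far as I understand, the monitor writes two tables, suffixed _profile_metrics and _drift_metrics, into the output schema chosen at creation, and they typically appear only after the first refresh completes. The schema name below is a placeholder:

```python
# Hedged check: list the monitor's output schema for the two metric tables.
# They should show up only once the first monitor refresh has finished.
spark.sql("SHOW TABLES IN main.demo LIKE '*_metrics'").show(truncate=False)
```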

  • @alexischicoine2072
    4 months ago

    Lazy managed table? I also used to create tables as external, but now that Unity Catalog has UNDROP and brings extra functionality to managed tables, the decision changes.
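For context on the UNDROP point, a minimal sketch (the table name is a placeholder, and the recovery window for dropped managed tables is time-limited):

```python
# Unity Catalog managed tables can be recovered after a drop, within a
# retention window, which is part of the managed-vs-external trade-off above.
spark.sql("DROP TABLE main.demo.sample_table")
spark.sql("UNDROP TABLE main.demo.sample_table")  # restores the dropped managed table
```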