Advancing Spark - Automated Data Quality with Lakehouse Monitoring

  • Published on 24 Jan 2025

Comments • 8

  • @saugatmukherjee8119
    @saugatmukherjee8119 a year ago +1

    Your videos are really helpful, Simon! I read about this last week, but these videos are so helpful before I've tried things out myself. Thanks! Right now I have a custom framework where people put in some YAML configs describing a query and an alert definition based on that query. When the config is checked into the repo, the pipeline creates a saved query, an alert, and, if the alert has a schedule, a job with a sql_task.
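
    The config-to-resources flow described in this comment could be sketched roughly as follows. This is a minimal illustration, not the commenter's framework: the `AlertConfig` fields, the `plan_resources` function, and the shape of the resource dicts are all hypothetical, and the config is shown as an already-parsed object rather than raw YAML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertConfig:
    """Hypothetical shape of one checked-in config entry, after YAML parsing."""
    name: str
    query: str
    condition: str                  # e.g. "null_pct > 5" (illustrative syntax)
    schedule: Optional[str] = None  # cron expression; None means no job


def plan_resources(cfg: AlertConfig) -> list[dict]:
    """List the resources a deployment pipeline would create for one config.

    A saved query and an alert are always planned; a job with a sql_task
    is planned only when the alert carries a schedule.
    """
    alert_name = f"{cfg.name}_alert"
    resources = [
        {"type": "saved_query", "name": cfg.name, "sql": cfg.query},
        {"type": "alert", "name": alert_name, "query": cfg.name,
         "condition": cfg.condition},
    ]
    if cfg.schedule is not None:
        resources.append({
            "type": "job",
            "name": f"{cfg.name}_job",
            "schedule": cfg.schedule,
            # Illustrative task payload, not a real API request body.
            "tasks": [{"sql_task": {"alert": alert_name}}],
        })
    return resources
```

    Returning a plan instead of calling an API directly keeps the scheduling rule easy to test before anything is deployed.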

  • @ParkerWoodson-ow3ol
    @ParkerWoodson-ow3ol 11 months ago

    This is fantastic stuff that, like you said, should be done as a practice as part of the lifecycle management of your data. It could be especially helpful if you don't know where to start on implementing data profiling and testing, and it definitely helps in determining the more specific what, where, and when of data testing and monitoring. Generically turning the Databricks quality monitoring switch on is only going to get you so far: it'll be excessive in some areas and not enough in others. To make it really useful without unnecessarily blowing out your costs, fine-tuning this process is necessary, IMO. I'm sure the feature will mature and hopefully allow finer control and extensibility, so I'll be watching. Thanks for always keeping us up to date and covering really useful topics in a "What does this mean to my everyday data job?" context.

  • @yatharthm22
    @yatharthm22 11 months ago +1

    In my case, the two metric tables are not getting created automatically. Am I doing something wrong?

  • @fb-gu2er
    @fb-gu2er a year ago

    It would be good to mention the cost. Do we get charged by the work being done under the hood in DBUs?

  • @alexischicoine2072
    @alexischicoine2072 11 months ago

    I think it’s important to have an action plan when setting this up. If you don’t have a plan to either work with data producers or decide on canceling a source, then I wouldn’t do it. We previously had monitoring of null percentage that we retired because investigating the cause took too much time and we have hundreds of data producers.

  • @NeumsFor9
    @NeumsFor9 a year ago

    It's an OK start. However, this would be way more useful if we could also monitor the data quality associated with each load batch. It would also help if it utilized caching and optimization so that the profiling would not take this long.
    We built a process to scan files, write all data profiling to a metadata repo, integrate those results with the metadata repo, and query those repos, metrics, and drifts as part of the ETL process, and take actions all based on metadata. What you are showing isn't a bad complement to that... but I would prefer to see something more actionable. It is a good start though.
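
    A per-batch profile with a drift check, as this comment describes in outline, might look something like the sketch below. It assumes batches arrive as lists of row dicts and uses the null rate as the only metric; the function names and the 5-point tolerance are illustrative choices, not details of the commenter's process or of Lakehouse Monitoring.

```python
def null_rates(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Fraction of rows in one load batch where each column is None or missing."""
    total = len(rows)
    if total == 0:
        return {column: 0.0 for column in columns}
    return {
        column: sum(1 for row in rows if row.get(column) is None) / total
        for column in columns
    }


def drifted_columns(previous: dict[str, float],
                    current: dict[str, float],
                    tolerance: float = 0.05) -> list[str]:
    """Columns whose null rate moved by more than `tolerance` between batches."""
    return sorted(
        column for column in current
        if column in previous
        and abs(current[column] - previous[column]) > tolerance
    )


# Profile two consecutive batches, then act on the comparison in the ETL step.
batch_1 = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
batch_2 = [{"id": 3, "email": None}, {"id": 4, "email": "c@example.com"}]

profile_1 = null_rates(batch_1, ["id", "email"])
profile_2 = null_rates(batch_2, ["id", "email"])
flagged = drifted_columns(profile_1, profile_2)
```

    In a real pipeline the profiles would be written to a metadata store keyed by batch, so the ETL step can look up the previous profile and decide whether to continue, quarantine, or alert.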

  • @alexischicoine2072
    @alexischicoine2072 11 months ago

    Like any background serverless task you get billed for, it sounds like it could get real expensive real fast if it’s taking six minutes for 60k rows. Probably not something I would try on my personal account that I pay for :).

  • @alexischicoine2072
    @alexischicoine2072 11 months ago

    Lazy managed table? I also used to create tables as external, but now that Unity Catalog has undrop and brings extra functionality to managed tables, the decision changes.