How And Why Data Engineers Need To Care About Data Quality Now - And How To Implement It

  • Published on Jul 31, 2024

Comments • 25

  • @SeattleDataGuy
    @SeattleDataGuy  6 months ago +1

    If you need help with your data analytics strategy or are having problems with your data quality, feel free to set up some time with me! calendly.com/ben-rogojan/consultation

  • @richardgui2934
    @richardgui2934 6 months ago +20

    # Short Summary
    ## Types of data quality checks:
    - Range check: checks for outliers
    - Category check: works like an "enum" in programming
    - Data freshness check: fails if there is no new data, or only very little
    - Data volume check
    - Null check: allow no nulls, or allow a percentage of fields to be null (a rough sketch of these checks follows after this summary)
    ## How to create a system to perform checks for you
    It is nice to have:
    - sending alert notifications if checks fail!
    - having a "Data quality" dashboard -- that contains "freshness", "volume", "null" checks, etc.
    - tracking change of volume, freshness, null checks over time
    - abstraction layers so that setting up test cases is a breeze
    ## Platforms
    Data quality/lineage tools exist. You can either use those or write your own tool -- project requirements will help you choose.
    There are data quality checks in DBT as well. There are built-in ones, and the great expectations library contains many more. You can also use a unit testing library to test your data transformations in DBT.
    ---
    Thank you for the video. I love your content!
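
    A minimal pandas-based sketch of the check types summarised above; the table, column names, and thresholds are illustrative assumptions rather than anything from the video:

    ```python
    import pandas as pd

    def run_quality_checks(df: pd.DataFrame) -> dict:
        """Range, category, freshness, volume, and null checks on a hypothetical orders table."""
        checks = {}

        # Range check: flag outliers outside an assumed plausible interval
        checks["range_order_amount"] = df["order_amount"].between(0, 10_000).all()

        # Category check: the column behaves like an enum with a fixed set of allowed values
        allowed_status = {"pending", "shipped", "delivered", "cancelled"}
        checks["category_status"] = df["status"].isin(allowed_status).all()

        # Freshness check: fail if the newest record is older than an assumed 24-hour SLA
        newest = pd.to_datetime(df["created_at"], utc=True).max()
        checks["freshness_created_at"] = newest >= pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=24)

        # Volume check: fail if the latest load is suspiciously small
        checks["volume_rows"] = len(df) >= 1_000

        # Null check: allow at most 1% nulls in a column that is usually populated
        checks["null_customer_email"] = df["customer_email"].isna().mean() <= 0.01

        return checks

    def alert_on_failures(checks: dict) -> None:
        # Alerting is stubbed out; in practice this would post to Slack, PagerDuty, etc.
        failed = [name for name, passed in checks.items() if not passed]
        if failed:
            print(f"DATA QUALITY ALERT: failed checks: {failed}")
    ```

    Tracking the raw freshness, volume, and null percentages over time (rather than only pass/fail) is what feeds the kind of data quality dashboard mentioned above.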

    • @SeattleDataGuy
      @SeattleDataGuy  6 months ago

      Thanks for the summary!

    • @Supersheep19
      @Supersheep19 a month ago

      Thank you so much!! It saves me the time of summarising the video, which is what I had planned to do. Glad I checked the comments section before doing it.

  • @PyGorka
    @PyGorka 6 months ago +8

    Great talk. We are implementing more checks like this in our systems and they are nice. One check we like to do in Snowflake is to try to load a file into a check table which has the same schema as the final table. We then capture any errors from that check table, store the data in a blob, and put metadata there to record it. We use this to see whether a file can be loaded into the table or not. If a file can be loaded but one record is bad (e.g. missing columns), we just exclude that one row and put it in a reject table.
    I'll have to look into the data operators; I wonder how well those run. This topic is so big, and you could go so deep into explaining how to handle problems.
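
    A rough sketch of that check-table pattern with the Snowflake Python connector; the stage, file, and table names are placeholders, and the ON_ERROR / VALIDATE combination shown is one assumed way to capture the rejected rows:

    ```python
    import snowflake.connector

    # Connection details are placeholders
    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="...",
        warehouse="load_wh", database="raw", schema="staging",
    )
    cur = conn.cursor()

    # Try the file against a check table that mirrors the final table's schema,
    # letting Snowflake skip bad rows instead of aborting the whole load.
    cur.execute("""
        COPY INTO staging.orders_check
        FROM @landing_stage/orders_2024_07_31.csv
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)

    # Pull the rows rejected by the last COPY so they can be written to a
    # reject table or blob storage together with load metadata.
    cur.execute("SELECT * FROM TABLE(VALIDATE(staging.orders_check, JOB_ID => '_last'))")
    rejected_rows = cur.fetchall()
    ```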

    • @SeattleDataGuy
      @SeattleDataGuy  5 months ago +2

      Thanks for sharing how your team is implementing some data quality checks, it's super helpful for everyone else!!!

  • @thndesmondsaid
    @thndesmondsaid 12 days ago

    Such a good video! Data quality checks are simple/common sense but many organizations don't take the time to implement them!

  • @jzthegreat
    @jzthegreat 6 months ago +1

    Your video quality has gotten a lot better my guy. I like the different zooms of focus

  • @heljava
    @heljava 6 months ago +1

    Thank you. Those are really great tips and as always the examples are great!

    • @SeattleDataGuy
      @SeattleDataGuy  5 months ago

      Glad you found this video helpful!

  • @andrejhucko6823
    @andrejhucko6823 5 months ago

    Good video, I liked the editing and explanations. I'm using mostly GX (great-expectations) for quality checks.
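
    For reference, a minimal GX sketch using the legacy pandas-backed API (pre-1.0; newer releases use a different entry point), with made-up column names and bounds:

    ```python
    import great_expectations as gx
    import pandas as pd

    df = pd.DataFrame({"user_id": [1, 2, None], "age": [25, 40, 130]})
    gdf = gx.from_pandas(df)  # legacy PandasDataset wrapper

    # Null check: no missing user ids allowed
    null_result = gdf.expect_column_values_to_not_be_null("user_id")

    # Range check: ages should fall inside a plausible interval
    range_result = gdf.expect_column_values_to_be_between("age", min_value=0, max_value=120)

    print(null_result.success, range_result.success)
    ```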

  • @JAYRROD
    @JAYRROD 6 months ago +2

    Great topic - appreciate the practical examples!

    • @SeattleDataGuy
      @SeattleDataGuy  6 months ago +1

      Glad you liked it!

  • @nishijain7993
    @nishijain7993 3 months ago +1

    Insightful!

  • @wilsonroberto3817
    @wilsonroberto3817 5 months ago

    Hello
    man, really nice video!
    Please, I'm in doubt about which AWS certification I should take.
    Solutions Architect, or wait for the Data Engineer certification which starts in March?
    I work as a DE and I already have the Cloud Practitioner and AZ-900 certifications!

  • @sanjayplays5010
    @sanjayplays5010 3 months ago

    Thanks for the video Ben, using this to implement some DQ checks now. How do you reckon something like Deequ fits in here? Would you run a Deequ job prior to each ETL job?
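
    For context, a pre-ETL Deequ check of the kind being asked about might look roughly like this via PyDeequ; the Spark setup, input path, and column names are assumptions:

    ```python
    from pyspark.sql import SparkSession
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    # Spark needs the Deequ jar on its classpath for PyDeequ to work
    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    df = spark.read.parquet("s3://my-bucket/raw/orders/")  # placeholder path

    check = Check(spark, CheckLevel.Error, "pre-ETL checks")
    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check.hasSize(lambda n: n > 0)        # volume check
                             .isComplete("order_id")          # null check
                             .isNonNegative("order_amount"))  # range-style check
              .run())

    # One row per constraint with its status, useful for alerting or a dashboard
    VerificationResult.checkResultsAsDataFrame(spark, result).show()
    ```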

  • @daegrun
    @daegrun 6 months ago +1

    If data quality checks are done at this level, then why do I hear that a data analyst has to do a lot of data cleaning and data quality checks as well?
    Or is the amount of failures allowed, which you mentioned, the reason why?

    • @SeattleDataGuy
      @SeattleDataGuy  6 months ago +1

      There are a few reasons why: not everyone implements checks, data sources can still be wrong, and sometimes, due to the level of integration, different analysts might pull the same data from different sources (some from the data warehouse, some from 3-4 different source systems), plus a few other reasons...

  • @alecryan8220
    @alecryan8220 5 months ago

    Are these videos AI generated? The editing is weird lol

    • @jorgeperez7742
      @jorgeperez7742 5 months ago

      🫵😹🫵😹🫵😹

  • @andydataguy
    @andydataguy 5 months ago

    Great to see a video talking about the trade-offs! The sign of a good architect 🙌🏾🫡