How And Why Data Engineers Need To Care About Data Quality Now - And How To Implement It

  • Published: 28 Sep 2024

Comments • 26

  • @SeattleDataGuy
    @SeattleDataGuy  8 months ago +1

    If you need help with your data analytics strategy or are having problems with your data quality, feel free to set up some time with me! calendly.com/ben-rogojan/consultation

  • @richardgui2934
    @richardgui2934 8 months ago +22

    # Short Summary
    ## Types of data quality checks:
    - Range check: checks for outliers
    - Category check: works like an "enum" in programming
    - Data freshness check: fails if there is no new data, or too little
    - Data volume check: fails if row counts deviate from what is expected
    - Null check -- allow no nulls, or allow a % of fields to be null
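    The five check types above can be sketched as plain Python helpers (hypothetical function names and thresholds, not from the video):

    ```python
    from datetime import datetime, timedelta

    def range_check(values, low, high):
        """Return outliers that fall outside the expected range."""
        return [v for v in values if not (low <= v <= high)]

    def category_check(values, allowed):
        """Return values outside an enum-like allowed set."""
        return [v for v in values if v not in allowed]

    def freshness_check(latest_load, max_age_hours=24):
        """Pass only if new data arrived within the freshness window."""
        return datetime.utcnow() - latest_load <= timedelta(hours=max_age_hours)

    def volume_check(row_count, expected, tolerance=0.2):
        """Pass only if the row count is within `tolerance` of expectations."""
        return abs(row_count - expected) <= expected * tolerance

    def null_check(values, max_null_pct=0.0):
        """Allow at most a given fraction of nulls (0.0 = no nulls)."""
        nulls = sum(1 for v in values if v is None)
        return nulls / len(values) <= max_null_pct
    ```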
    ## How to create a system to perform checks for you
    It is nice to have:
    - sending alert notifications if checks fail!
    - having a "Data quality" dashboard -- that contains "freshness", "volume", "null" checks, etc.
    - tracking change of volume, freshness, null checks over time
    - abstraction layers so that setting up test cases is a breeze
    ## Platforms
    Data Quality/Lineage tools exist. You can either use those or write your own tool -- project requirements will help you choose.
    There are data quality checks in dbt as well. There are built-in ones, and the Great Expectations library contains many more. You can also use dbt's unit tests to test your data transformations.
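    For reference, dbt's built-in generic tests are declared in a model's YAML file; a minimal sketch with a hypothetical `orders` model:

    ```yaml
    # schema.yml (hypothetical model and columns)
    version: 2
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']
    ```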
    ---
    Thank you for the video. I love your content!

    • @SeattleDataGuy
      @SeattleDataGuy  8 months ago

      Thanks for the summary!

    • @Supersheep19
      @Supersheep19 3 months ago

      Thank you so much!! It saves me the time of summarising the video, which is what I planned to do. Glad I checked the comments section before doing it.

  • @jzthegreat
    @jzthegreat 7 months ago +1

    Your video quality has gotten a lot better, my guy. I like the different zooms of focus.

  • @PyGorka
    @PyGorka 7 months ago +8

    Great talk. We are implementing more checks like this in our systems and they are nice. One check we like to do in Snowflake is trying to load a file into a check table that has the same schema as the final table. We then capture any errors in that check table, store the data in a blob, and add metadata to record it. We use this to see whether a file can be loaded into the table or not. If a file can be loaded but one record is bad (e.g., missing columns), we just exclude that one row into a reject table.
    I'll have to look into the data operators; I wonder how well those run. This topic is so big, and you could go so deep into explaining how to handle problems.
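    The load-then-reject pattern described above can be sketched in Python (hypothetical schema and helper names; the Snowflake check table, blob storage, and metadata columns are omitted):

    ```python
    # Hypothetical sketch: rows matching the target schema are accepted,
    # rows with missing columns go to a reject list with reason metadata.
    EXPECTED_COLUMNS = {"id", "amount", "created_at"}  # assumed target schema

    def validate_rows(rows):
        """Split rows into (loadable, rejected) based on the expected schema."""
        loadable, rejected = [], []
        for row in rows:
            missing = EXPECTED_COLUMNS - row.keys()
            if missing:
                # record why the row was rejected, like the check-table metadata
                rejected.append({"row": row, "missing": sorted(missing)})
            else:
                loadable.append(row)
        return loadable, rejected
    ```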

    • @SeattleDataGuy
      @SeattleDataGuy  7 months ago +2

      Thanks for sharing how your team is implementing some data quality checks, it's super helpful for everyone else!!!

  • @andrejhucko6823
    @andrejhucko6823 7 months ago

    Good video, I liked the editing and explanations. I'm using mostly GX (great-expectations) for quality checks.

  • @sanjayplays5010
    @sanjayplays5010 5 months ago

    Thanks for the video Ben, using this to implement some DQ checks now. How do you reckon something like Deequ fits in here? Would you run a Deequ job prior to each ETL job?

  • @nishijain7993
    @nishijain7993 5 months ago +1

    Insightful!

  • @daegrun
    @daegrun 8 months ago +1

    If data quality checks are done at this level, then why do I hear that a data analyst still has to do a lot of data cleaning and data quality checks as well?
    Is the allowed amount of failures you mentioned the reason why?

    • @SeattleDataGuy
      @SeattleDataGuy  8 months ago +1

      There are a few reasons why: not everyone implements checks, data sources can still be wrong, and sometimes, due to the level of integration, different analysts might pull the same data from different sources (some from the data warehouse, some from 3-4 different source systems), among a few other reasons...

  • @wilsonroberto3817
    @wilsonroberto3817 7 months ago

    Hello man, really nice video!
    Please, I'm in doubt about which AWS certification I should take:
    Solutions Architect, or wait for the Data Engineer certification which starts in March?
    I work as a DE and I already have the Cloud Practitioner and AZ-900 certifications!

  • @alecryan8220
    @alecryan8220 7 months ago

    Are these videos AI generated? The editing is weird lol

    • @gamerjg777
      @gamerjg777 7 months ago

      🫵😹🫵😹🫵😹

  • @nategraham3980
    @nategraham3980 9 days ago

    Thank you! Appreciate the info!

  • @JAYRROD
    @JAYRROD 8 months ago +2

    Great topic - appreciate the practical examples!

  • @thndesmondsaid
    @thndesmondsaid 2 months ago

    Such a good video! Data quality checks are simple/common sense but many organizations don't take the time to implement them!

  • @heljava
    @heljava 7 months ago +1

    Thank you. Those are really great tips and as always the examples are great!

    • @SeattleDataGuy
      @SeattleDataGuy  7 months ago

      Glad you found this video helpful!

  • @andydataguy
    @andydataguy 7 months ago

    Great to see a video talking about the trade-offs! The sign of a good architect 🙌🏾🫡