Why data engineers should care about data quality (and how to do it right)

  • Published: Jul 3, 2024
  • Today we’re going to be talking about data quality.
    We’re going to cover:
    Why is data quality important?
    - Data powers so many decisions nowadays
    -- Low-quality data = low-quality decisions
    --- Whether that is big expensive decisions by CEOs
    --- Data scientists making incorrect decisions about A/B tests
    --- Or machine learning models making millions of low-quality decisions every day
    What are the different types of data quality issues?
    Incorrectness
    Common errors in this class:
    Duplicates
    NULLs
    Inconsistency in reporting
    Incompleteness
    Common errors in this class:
    Missing important dimensions
    A data model that is not robust enough to answer the questions you want to ask
    Design problems
    Common errors in this class:
    Answering your questions is prohibitively expensive
    What causes data quality issues?
    Logging bugs
    Duplicates entering production databases
    Third-party APIs breaking contract
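As a rough illustration of catching a third-party API that breaks its contract, the sketch below compares an incoming payload against the fields and types you expect. The field names, the `EXPECTED_FIELDS` mapping, and the `contract_violations` helper are all hypothetical and not taken from the video.

```python
# Hypothetical contract: the fields and Python types we expect from the API.
EXPECTED_FIELDS = {"order_id": str, "amount": float, "currency": str}


def contract_violations(payload, expected=EXPECTED_FIELDS):
    """Return a list of human-readable problems found in one payload."""
    problems = []
    for field, expected_type in expected.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif payload[field] is None:
            problems.append(f"null field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems


# A payload where the provider started sending amounts as strings
# and dropped the currency field.
broken = {"order_id": "o1", "amount": "9.99"}
violations = contract_violations(broken)
```

An empty list means the payload matches the assumed contract; anything else can be logged or used to reject the record before it reaches downstream tables.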
    How do you automate checking for common data quality errors?
    The most common way to do this is using the write-audit-publish pattern
    Write to a staging table
    Run your audit queries that check for things like NULLs and duplicates
    If the audits pass, publish the staging table data to production
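The three steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not the implementation from the video: the `events` and `events_staging` tables, their columns, and the `write_audit_publish` helper are hypothetical names.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage a batch, audit it, and publish only if every audit passes."""
    cur = conn.cursor()

    # Write: load the batch into a staging table, never straight to production.
    cur.execute("DROP TABLE IF EXISTS events_staging")
    cur.execute("CREATE TABLE events_staging (event_id TEXT, user_id TEXT)")
    cur.executemany("INSERT INTO events_staging VALUES (?, ?)", rows)
    conn.commit()

    # Audit: count rows with NULLs and event_ids that appear more than once.
    null_count = cur.execute(
        "SELECT COUNT(*) FROM events_staging "
        "WHERE event_id IS NULL OR user_id IS NULL"
    ).fetchone()[0]
    duplicate_count = cur.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT event_id FROM events_staging "
        "GROUP BY event_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if null_count or duplicate_count:
        # Leave the staging table in place so the bad batch can be inspected.
        return False

    # Publish: copy the audited rows into the production table.
    cur.execute("INSERT INTO events SELECT event_id, user_id FROM events_staging")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, user_id TEXT)")  # "production"
good = write_audit_publish(conn, [("e1", "u1"), ("e2", "u2")])
bad = write_audit_publish(conn, [("e3", "u1"), ("e3", "u1"), ("e4", None)])
published = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Here the first batch is published and the second is held back in staging because it contains both a duplicate `event_id` and a NULL `user_id`. In a real warehouse the publish step is often a partition swap or table rename rather than an `INSERT ... SELECT`.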
    What are some tools to check out to accomplish these things?
    If you’re using Apache Spark, check out Amazon Deequ
    If you’re streaming data with Kafka and Flink, check out Apache Griffin
    For everybody else, check out Great Expectations
  • Science

Comments • 21

  • @neosmith009
    @neosmith009 2 years ago +14

    I'd recommend Soda SQL instead of Great Expectations, just because of how difficult it is to bootstrap GE and the sheer boilerplate involved if you're running Spark jobs.
    You could potentially also integrate Soda SQL with DataHub and showcase the data quality checks on a dataset when anyone in the org searches for metadata.

  • @eugeniosp3
    @eugeniosp3 9 months ago +4

    This guy is pure quality material. Zach you're a G.

  • @ermansahintatar8296
    @ermansahintatar8296 2 years ago +7

    Very helpful! Thanks Zach! Can you also talk about designing ETL pipelines and API architecture "end to end"? Like a system design talk, if possible! Thanks!

  • @__toby__
    @__toby__ 2 years ago +5

    Great points, especially the write-audit-publish. Definitely something to consider implementing!

  • @ABronfin
    @ABronfin 2 years ago +3

    I love the dog sleeping in the background :)

  • @beyzadelen367
    @beyzadelen367 2 years ago +3

    Love your videos and your personality!

  • @mohammedyamin2639
    @mohammedyamin2639 2 years ago +2

    Even though I wasn't able to grab a lot from the video, I hope one day I will understand each and every aspect of the things you spoke about in it, as I am an aspiring data engineer. And loving the new look, Zach👌

  • @paularesende502
    @paularesende502 2 years ago +2

    Thanks very much for the material!

  • @zahscr
    @zahscr 2 years ago +2

    Your videos are really inspiring! Always looking forward to the next one!

  • @alexandergreen6079
    @alexandergreen6079 2 years ago +2

    Great stuff Zach

  • @lucifieramit1
    @lucifieramit1 2 years ago +2

    Data quality is one of the issues that cause many projects to fail. It needs to be taken more seriously than it is now; most teams do not have a data quality engineer as a job title. I am currently looking at Great Expectations to write checks on the ingestion side. I have also used pandas and pytest to build a framework that checks data quality daily.

  • @jithendrayenugula7137
    @jithendrayenugula7137 2 years ago +2

    Great tips, Zach 🔥🔥
    I work at a consulting firm, so I cannot focus on our client's data quality, but these are really helpful.

  • @paulowiz
    @paulowiz 2 years ago +2

    Interesting!

  • @aashishraina2831
    @aashishraina2831 2 years ago +2

    Data quality is important. Thanks

  • @DataSavvyTV
    @DataSavvyTV 2 years ago +2

    Great tips, Zach 🔥
    -Vanessa

  • @sudheerthulluri
    @sudheerthulluri 2 years ago +2

    Please make a video on designing efficient tables. It would be very useful.

  • @mysticalnights6820
    @mysticalnights6820 2 years ago +2

    🔥🔥🔥

  • @vilw4739
    @vilw4739 1 year ago +2

    I had a doubt regarding partitionBy with Parquet files. Can you please help me?

  • @SiyaMedia
    @SiyaMedia 1 year ago +2

    Pooch sleep quality more important, say hi to the sleeping doggie

  • @lalagafarova3840
    @lalagafarova3840 7 months ago +2

    Data quality is important!!! It is correct🗣