Why data engineers should care about data quality (and how to do it right)
- Published: Jul 3, 2024
- Today we’re going to be talking about data quality.
We’re going to cover:
Why is data quality important?
- Data powers so many decisions nowadays
-- Low-quality data = low-quality decisions
--- Whether that is big expensive decisions by CEOs
--- Data scientists making incorrect decisions about AB tests
--- Or machine learning models making millions of low-quality decisions every day
What are the different types of data quality issues?
Incorrectness
Common errors in this class:
Duplicates
NULLs
Inconsistency in reporting
Incompleteness
Common errors in this class:
Missing an important dimension
Not a robust enough data model to answer the questions you want
Design problems
Common errors in this class:
Answering your questions is prohibitively expensive
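The Incorrectness errors listed above (duplicates and NULLs) can be checked with very little code. Here is a minimal sketch in plain Python, assuming rows arrive as dictionaries. The `events` sample, the column names, and both helper functions are hypothetical and not from the video:

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def count_nulls(rows, column):
    """Count rows where the column is missing or set to None."""
    return sum(1 for row in rows if row.get(column) is None)


# Hypothetical sample data for illustration only
events = [
    {"event_id": 1, "user_id": "a"},
    {"event_id": 2, "user_id": None},
    {"event_id": 2, "user_id": "b"},
]

duplicate_ids = find_duplicate_keys(events, "event_id")
null_users = count_nulls(events, "user_id")
```

In a real pipeline the same two checks would more likely run as SQL or Spark queries against the table itself.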
What causes data quality issues?
Logging bugs
Duplicates entering production databases
Third-party APIs breaking contract
How do you automate checking for common data quality errors?
The most common way to do this is using the write-audit-publish pattern
Write to a staging table
Run your audit queries that check for things like NULLs and duplicates
If the audits pass, publish the staging table data to production
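The three steps above can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the pattern, not the setup shown in the video: the table names `events_staging` and `events`, the columns, and the `write_audit_publish` function are all assumptions made for the example.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Load rows into staging, audit them, and publish only if the audits pass."""
    cur = conn.cursor()

    # 1. Write: replace the staging table contents with the new batch
    cur.execute("DELETE FROM events_staging")
    cur.executemany(
        "INSERT INTO events_staging (event_id, user_id) VALUES (?, ?)", rows
    )

    # 2. Audit: count NULLs and duplicated event_ids in staging
    null_count = cur.execute(
        "SELECT COUNT(*) FROM events_staging "
        "WHERE event_id IS NULL OR user_id IS NULL"
    ).fetchone()[0]
    dup_count = cur.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT event_id FROM events_staging "
        "GROUP BY event_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]

    if null_count or dup_count:
        # Audits failed: undo the staging write, production stays untouched
        conn.rollback()
        return False

    # 3. Publish: copy the audited batch into the production table
    cur.execute("INSERT INTO events SELECT event_id, user_id FROM events_staging")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_staging (event_id INTEGER, user_id TEXT)")
conn.execute("CREATE TABLE events (event_id INTEGER, user_id TEXT)")

good_batch = [(1, "a"), (2, "b")]
bad_batch = [(3, "c"), (3, None)]  # duplicate event_id and a NULL user_id

published_good = write_audit_publish(conn, good_batch)
published_bad = write_audit_publish(conn, bad_batch)
production_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The point of the pattern is that the bad batch never reaches the production table, so downstream consumers only ever see data that passed the audits.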
What are some tools to check out to accomplish these things?
If you’re using Apache Spark, check out Amazon Deequ
If you’re streaming data with Kafka and Flink, check out Apache Griffin
For everybody else, check out Great Expectations
I'd recommend Soda SQL instead of Great Expectations, just because of how difficult it is to bootstrap GE and the sheer boilerplate involved if you're running Spark jobs.
You could potentially also integrate Soda SQL with DataHub and showcase the data quality checks on a dataset when anyone in the org searches for metadata.
This guy is pure quality material. Zach you're a G.
Very helpful! Thanks Zach! Can you also talk about designing ETL pipelines "end to end" and API architecture? Like a system design talk, if it is possible! Thanks!
Great points, especially the write-audit-publish. Definitely something to consider implementing!
I love the dog sleeping in the background :)
Love your videos and your personality!
Even though I wasn't able to grab a lot from the video, I hope one day I will understand each and every aspect of the things you spoke about in this video, as I am an aspiring data engineer. And loving the new look, Zach 👌
Thanks very much for the material!
Your videos are really inspiring! Always looking forward to the next one!
Great stuff Zach
Data quality is one issue due to which many projects fail. It needs to be taken more seriously than it is now; most teams do not have a data quality engineer as a job title. I am currently looking at Great Expectations to write checks on the ingestion side. I have also used pandas and pytest to come up with a framework which checks data quality daily.
Great tips, Zach 🔥🔥
I work in a consulting firm, so I cannot focus on the data quality of our clients. But these are really helpful.
Interesting!
Data quality is important. Thanks
Great tips, Zach 🔥
-Vanessa
Please make a video on designing efficient tables. It will be very useful.
🔥🔥🔥
I had a doubt regarding partitionBy with Parquet files. Can you please help me?
Pooch sleep quality more important, say hi to the sleeping doggie
Data quality is important!!! It is correct🗣