Data modeling a 100 TB data lake into 5 TBs with STRUCT and Array - DataExpert.io Bootcamp preview
- Published: May 1, 2024
- This is the first lecture of my 40+ hour boot camp materials. This is connected to this lesson here: dataexpert.io/lesson/dimensio... where you can get the slides.
At Airbnb, I worked a lot with dimensional data, Parquet and compression techniques. I was able to get the pricing and availability data to be much much smaller by leveraging the right processes!
We are still accepting applications for May 6th boot camp for a few more days! You can get a discount here: dataexpert.io/zach15
Message support@eczachly.com for any other questions y'all might have!
Don't forget to subscribe to my free newsletter: blog.dataengineer.io
I don't see many people having data engineering discussions at this level. I think this is really good.
Came here to compliment Zach's jawline. Decent tutorial too, I guess
holy legendary video, exactly what i was looking for
"Who is the user" is a great point, the higher up the company hierarchy the simpler it has to be. We deliver power over information to our company, they have to feel in control.
Hey Zach, really awesome. Please keep up the good work. Truly a fan of your data content.
This was pure gold!
Great video Zach!
Thanks Zach 😊, your content is pure gold
This is really really good. I just sent this video to someone I just mentored. Keep sharing more stuff like this!
Loved this, thank you Zach!
This is great. I am really enjoying these videos.
great content - thanks zach for sharing!
Thank you Zach!
You had me at “dimensional data modeling”. And as an engineer, I’m familiar with the other type of “dimensional modeling”.
I recently checked out the old lecture video posted on your channel, and compared to this one, this is a very big jump in terms of explanation, concepts, editing, etc. It feels very premium. Good to see you growing.
Superb content. Keep it up.
Great video!
The content is fantastic, but your eating while explaining is a little annoying and makes it hard to stay focused. But as always, brilliant.
The Zuck email was the ultimate flex
9:44 I feel you, I'm literally living the same exact feeling in my current job :( sad but true
I've observed that snacking while in a meeting or presentation is a total power move, our VP of product used to snack while arguing with our CTO and it was a devastating technique.
It drives me crazy lol. Just hearing how it affects your speech, *shudders*. Definitely a power move, also borderline disrespectful.
It's a dick move. Show respect for the topic and the audience by engaging completely with the topic at hand.
I plan to build an on-premise data warehouse with a fast OLAP engine, like "BigQuery fast," but I'm struggling to settle on one tool stack. I thought about mainly using ClickHouse but I'm not too sure. Do you have a suggestion? Maybe also a good video idea in general: "Choosing the right tools" or something.
Hello from Brazil Zach!
Amazing content, really!
I have a question about cumulative table design. I'm searching for some references about it, like books, academic papers, or anything. Can you recommend some, please?
I am new to D.E. but I thought the section about cumulative table design was really cool.
Thanks for this! Just as you showed the sample schema of the 2B dataset, can you also share the sample of the equivalent array schema?
Sure
```
CREATE TABLE listing_availability (
  listing_id BIGINT,
  -- element type is my assumption; the original reply left it unspecified
  availability_next_365d ARRAY<BOOLEAN>
)
PARTITIONED BY (ds STRING)
```
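To see why the array schema shrinks the table, here is a minimal sketch (not from the video) of collapsing one-row-per-listing-per-day data into one row per listing carrying an availability array. The field names `listing_id` and `available` and the 3-day horizon are illustrative stand-ins for the 365-day case.

```python
# Collapse daily availability rows into {listing_id: [flags by day offset]},
# mirroring the ARRAY column in the schema above.
from collections import defaultdict

daily_rows = [
    # (listing_id, day_offset, available)
    (101, 0, True),
    (101, 1, False),
    (101, 2, True),
    (202, 0, True),
    (202, 1, True),
]

def to_array_schema(rows, horizon=3):
    """Turn per-day rows into one array-valued row per listing."""
    arrays = defaultdict(lambda: [None] * horizon)
    for listing_id, day_offset, available in rows:
        arrays[listing_id][day_offset] = available
    return dict(arrays)

collapsed = to_array_schema(daily_rows)
# Five daily rows become two listing rows; the per-row listing_id
# overhead is paid once instead of once per day.
print(collapsed)  # {101: [True, False, True], 202: [True, True, None]}
```

With 365 day offsets per listing, the same collapse turns billions of daily rows into millions of array rows, which is where most of the size win comes from before compression even kicks in.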
Hey Zach, love your content, mate! Particularly the intricate yet simple details you just don't think of, such as 30-day arrays and 1am London time.
I really hope you build a self-paced course one day that focuses on your prerequisites:
-Proficiency in Python and SQL, at least 6 months of experience in both
-Basic understanding of Docker, Flink, and Kafka
-Basic understanding of SQL Window Functions
Whilst you can find basic stuff online, like CS50, I can see the benefit of investing in someone who teaches you the right way from the start, without cutting corners or covering a vast range of irrelevant content. Applying it to your data projects as you learn would be a bonus.
Just feels like that area is extremely saturated and not the best use of my time. That’s why I’ve almost done it a few times but backed out.
Every time I try to make basic content, my heart just isn’t in it
Hi Zach great lecture!
Regarding the compression problem of Parquet when you shuffle:
Does that mean some pipelines do an ORDER BY at the end to increase Parquet compression?
If I'm not mistaken, ordering in a distributed environment can be really expensive. If you have to shuffle, is there any other solution?
sort vs sortWithinPartitions are two very different methods in Spark. The latter is what I’m referring to. Global sorting sucks bad, you’re right!
@@EcZachly_ I guess I learned something today, thanks! I will keep it in mind next time I write some Parquet files with Spark :)
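The intuition behind the exchange above can be shown without Spark at all: the same values compress very differently depending on row order. This toy uses zlib on a low-cardinality column in shuffled vs. sorted order; the same effect is what Parquet's run-length and dictionary encodings exploit after a `sortWithinPartitions` (the country codes and sizes here are made up for illustration).

```python
# Why sorted data compresses better: identical values end up in long runs.
import random
import zlib

random.seed(42)
# A low-cardinality column, like a country dimension
values = [random.choice(["US", "UK", "BR", "IN"]) for _ in range(10_000)]

shuffled_bytes = "".join(values).encode()
sorted_bytes = "".join(sorted(values)).encode()

shuffled_size = len(zlib.compress(shuffled_bytes))
sorted_size = len(zlib.compress(sorted_bytes))

# Sorting groups repeats together, so the compressor wins big
print(f"shuffled: {shuffled_size} bytes, sorted: {sorted_size} bytes")
```

The key point from the reply stands: you only need runs *within* each file, so a per-partition sort captures most of the benefit without the expensive global shuffle an ORDER BY implies.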
Hey Zach, it's a great video. Can you please make a roadmap of data engineering for absolute beginners and further levels?
blog.dataengineer.io has the roadmap
I really enjoy your videos and learn a lot from them!
I do have a small question. Do you think it is still important to keep data types as small as possible even if you are using Delta Lake with PySpark, for example? I wonder, because all the underlying data is in Parquet and optimized.
I talked about Parquet quite a bit in this video. You should aim to make your data as small as possible. Just because something is Parquet doesn't mean you should just do whatever. It's a file format, and Delta Lake doesn't optimize the way you think it does.
Volume and tradeoffs need to be considered here for sure
Thanks a lot for your answer. I recently started as a data engineer and told my team that we could reduce a lot of the data type sizes. However, they just said it is not important and that Parquet fixes the issue. I will do some analysis on our data in terms of table size and query speed and show the results to the team.
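A quick back-of-envelope check for that analysis (stdlib only, not Delta-specific): the raw footprint of the same column as 64-bit vs. 32-bit integers. Parquet's encodings can shrink small values regardless, but declaring the narrower logical type bounds the worst case and matters to engines reading the data back.

```python
# Compare the raw byte footprint of the same values at two integer widths.
from array import array

values = list(range(100_000))  # every value fits comfortably in 32 bits

as_int64 = array("q", values)  # 'q' = signed 64-bit, 8 bytes per value
as_int32 = array("i", values)  # 'i' = C int, 4 bytes on mainstream platforms

print(as_int64.itemsize * len(values))  # bytes as BIGINT-style storage
print(as_int32.itemsize * len(values))  # half that as INT-style storage
```

Pairing numbers like these with before/after table sizes from your actual Parquet files should make a persuasive case either way.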
Interesting. If I were to sign up for the self-paced version in July, would I have full access to everything over the span of two months to complete the bootcamp?
You have access to all paid data engineering content I create for a year
Hi Zach, thanks for sharing. I have a query: I am currently working as a data engineer with 3 years of experience. My current company is asking me to do data science tasks as well, like OpenCV, GenAI, etc. But in big tech companies there are dedicated teams for data science and data engineering, right? So will doing both be helpful for my career?
Life is long. Learning technical skills almost always improves your career.
If you're a data science-minded data engineer, you're better.
If you're a data scientist who can write pipelines, you're amazing.
Being able to unblock yourself is also important in big tech!
@@EcZachly_ thanks for the advice ❤
Thanks!
Live or Self-paced bootcamp - where can I get info to see which is best for me?
DataExpert.io has details. If you’re down to watch live classes at 6-8 PM Pacific, live is probably best. Otherwise self-paced.
I’m sold. Kinda wanna transition into data engineering from analytics. Do you think I’d make a good candidate for your course?
Potentially! Do you know SQL and Python to some extent?
@@EcZachly_ yes sir. Preparing for “aws data engineering associate” certificate as of now
@@rajdeepjinegar84 Cool my course would deepen a lot of your expertise I bet!
Hello Zach, I couldn't understand one thing about cumulative table design.
You said that FB used this table design to keep only 30 days of data at a time. Does it mean the cumulative table process started on the first day of every month and carried over to the end of the month, so that the "last day of the month" partition has all the required data in the array?
If so, why would data retention of anything >30 days be a problem, especially for deleted users? Come the 1st of the next month, we won't be considering deleted users from the previous month at all, right? Or am I missing something here?
I have been scratching my head on this. Couldn't get over it!
It does a rolling 30 days
@@EcZachly_ ahh! Thanks. So "yesterday" would be 29 days of aggregated data at any point in time.
Even then, data retention of deleted users would not be more than 30 days, right?
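My reading of the rolling-window answer above, as a hedged sketch (not code from the lecture): each day, yesterday's cumulative row is merged with today's activity, the oldest slot is dropped, today's flag is appended, and users inactive for the whole window age out naturally, which is how the >30-day retention concern resolves itself. The dict-of-lists representation and the 3-day window are simplifications of the 30-day array partition.

```python
# Rolling cumulative merge: one merge per day, fixed-length history per user.
WINDOW = 3  # 30 in the lecture; 3 here to keep the example readable

def merge_day(yesterday, today):
    """yesterday: {user_id: [activity flags, oldest first]}; today: {user_id: True}."""
    merged = {}
    for user_id in yesterday.keys() | today.keys():
        history = yesterday.get(user_id, [False] * WINDOW)
        row = history[1:] + [today.get(user_id, False)]  # slide the window
        if any(row):  # fully-inactive (or deleted) users age out of the table
            merged[user_id] = row
    return merged

state = {"alice": [True, True, False], "bob": [False, False, True]}
state = merge_day(state, {"alice": True})  # bob inactive today
state = merge_day(state, {"alice": True})
state = merge_day(state, {"alice": True})  # bob now all-False, so pruned
print(state)  # {'alice': [True, True, True]}
```

Note that nothing here is month-aligned: every daily partition carries its own trailing window, so "yesterday's" partition always has the previous WINDOW days ready to merge.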
Almost forgot to say: you've been a great inspiration, and I am one of your many silent followers on LinkedIn 😀
Is the next video still coming?
This didn’t convert as well as I thought it would so probably not
Let’s go! You kick ass!
I didn't understand what backfilling is
Sometimes records are initialized in storage with partial state. They are composed from data sources that are async by nature, or are extensions of business processes, like an amend/rollback of a record that existed and affected an existing metric.
In simple words:
You have a report on a ship traveling through the ocean. Each day you get a report from some of its sensors, and based on that data you build a metric.
But one day a sensor lags, the data is wrong, and the metric is wrong. The captain sends an email with the correct data, because they have analogue backups. You have to update the info.
So you backfill, or create a new version and employ a data retention strategy to manage the old versions.
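The ship story above can be sketched in a few lines. This is an illustrative toy, not anyone's pipeline: the dates, the sentinel value, and the `daily_metric` stand-in are all made up; the point is that a backfill recomputes only the affected partition from the corrected source.

```python
# A daily metric built from sensor readings, with one bad partition.
readings = {"2024-05-01": 10.0, "2024-05-02": -999.0, "2024-05-03": 12.0}

def daily_metric(reading):
    return reading * 2  # stand-in for a real aggregation

metrics = {day: daily_metric(r) for day, r in readings.items()}

# The captain's corrected figure arrives for May 2nd
readings["2024-05-02"] = 11.0

# Backfill: recompute only the affected partition; the rest stays untouched
metrics["2024-05-02"] = daily_metric(readings["2024-05-02"])
print(metrics["2024-05-02"])  # 22.0
```

In a real warehouse the "partition" would be a date-partitioned table and the recompute would be a rerun of the pipeline for those dates, but the shape of the fix is the same.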
@@Dmytro-kt3fr
Thanks!
How do I get an internship in a data engineering role?
So you're telling me that even in those companies you'll find people who don't wanna learn? People who should have the knowledge to just keep gathering new abilities? Like, they're not my 45 coworkers who never liked CS stuff and struggle with SQL; they're data analysts and can't manage to do an UNNEST?!
great video, but your eating while explaining is distracting
Seriously? The 50 minute lecture is when you had to eat?
The carbs kept me going
Well, it's better to make organised content so that someone can stick with it.
the apple stole the show... but great content. Thank you @_@