I don't find a lot of people I see having data engineering discussions at this level, I think this is really good.
Zach, your style is so intriguing. Like, seriously, I am setting out to be a data analyst, and I actually planned to just have a glimpse of this video, and oh boy!!! You're tempting me to be a data engineer now 🥵.
9:50-10:15 is so relatable 😢. Something you learn after a few years for sure…
I recently checked out your old lecture video posted on your channel, and compared to this one it's a very big jump in terms of explanation, concepts, editing, etc. It feels very premium. Good to see you growing!
"Who is the user" is a great point: the higher up the company hierarchy, the simpler it has to be. We deliver power over information to our company; they have to feel in control.
You had me at “dimensional data modeling”. And as an engineer I’m familiar with the other type of “dimensional modeling”.
I've observed that snacking while in a meeting or presentation is a total power move, our VP of product used to snack while arguing with our CTO and it was a devastating technique.
It drives me crazy lol. Just hearing how it affects your speech *shudders*. Definitely a power move, also borderline disrespectful.
It's a dick move. Give respect to the topic and the audience by engaging completely with the topic at hand.
The content is fantastic, but you eating while explaining is a little bit annoying and makes it hard to stay focused... but as always, brilliant.
Honestly that doesn’t bother me at all. People need food to stay passionate.
Came here to compliment Zach's jawline. Decent tutorial too, I guess
Most of the concepts are explained to industry standard ❤
9:44 I feel you, I'm literally living the same exact feeling in my current job :( Sad but true.
Loved it! Very practical, insightful, and high value! I guess I should do the course.
This is really really good. I just sent this video to someone I just mentored. Keep sharing more stuff like this!
Hey Zach, really awesome. Please keep up the good work. Really a data fan of yours.
The Zuck email was the ultimate flex
I am new to D.E. but I thought the section about cumulative table design was really cool.
Thanks Zach 😊, your content is pure gold
Hey Zach, love your content, mate! Particularly the intricate yet simple details you just don't think of, such as 30-day arrays and 1am London time.
I really hope you build a self-paced course one day that focuses on your prerequisites:
- Proficiency in Python and SQL, at least 6 months of experience in both
- Basic understanding of Docker, Flink, and Kafka
- Basic understanding of SQL window functions
Whilst you can find basic stuff online, like CS50, I can see the benefits of investing in someone who teaches you the right way from the start, without cutting corners or a vast range of irrelevant content. Applying it to your data projects as you learn would be a bonus.
Just feels like that area is extremely saturated and not the best use of my time. That’s why I’ve almost done it a few times but backed out.
Every time I try to make basic content, my heart just isn’t in it
Bro is a hero!
Great talk, a lot of practical discussion.
Loved this, thank you Zach!
Holy legendary video, exactly what I was looking for.
Great content, thanks Zach for sharing!
I plan to build an on-premises data warehouse with a fast OLAP engine, like "BigQuery fast", but I'm struggling to decide on one tool stack. I thought about mainly using ClickHouse but I'm not too sure. Do you have a suggestion? Also, maybe a good video idea in general: "Choosing the right tools" or something.
This is great. I am really enjoying these videos.
Great video Zach!
Thanks for the epic content Zach! But any idea how to handle late-arriving data? The arrays would need to be updated accordingly, right?
Thank you Zach!
This was pure gold!
Great video!
Thanks for this! Just as you showed the sample schema of the 2B dataset, can you also share the sample of the equivalent array schema?
Sure
```
CREATE TABLE listing_availability (
    listing_id BIGINT,
    -- the element type was left untyped in the original reply;
    -- BOOLEAN (one availability flag per day) is an assumption
    availability_next_365d ARRAY<BOOLEAN>
)
PARTITIONED BY (ds STRING)
```
Superb content. Keep it up.
Hello from Brazil Zach!
Amazing content, really!
I have a question about cumulative table design. I'm searching for some references about it, like books, academic papers, or anything. Can you recommend some, please?
Back story, Zach was fired from every role
Have you ever worked with SAP HANA? As a DB, how efficient is it compared to others?
I really enjoy your videos and learn a lot from them!
I do have a small question. Do you think it still is important to have the data types as small as possible even if you are using delta lake with pyspark for example? I wonder, because all the underlying data is in parquet and optimized.
I talked about Parquet quite a bit in this video. You should aim to make your data as small as possible. Just because something is parquet doesn't mean you should just do whatever. It's a file format, and Delta Lake doesn't optimize like you think it does.
Volume and tradeoffs need to be considered here for sure.
Thanks a lot for your answer. I recently started as a data engineer and told my team that we could reduce a lot of the data type sizes. However, they just said it's not important and that parquet fixes the issue. I will do some analysis on our data in terms of table size and query speed and show the results to the team.
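The size difference this thread is debating is easy to measure with plain Python. A minimal sketch using the stdlib `array` module as a rough stand-in for column storage (assumption: real parquet adds encodings and compression on top of the raw widths, so the on-disk gap varies):

```python
from array import array

# Small integers that would comfortably fit in 32 bits, e.g. enum-ish IDs.
values = list(range(100_000))

wide = array('q', values)    # 8-byte signed ints, like BIGINT
narrow = array('i', values)  # 4-byte signed ints on common platforms, like INT

# The wide representation costs twice the raw bytes for the same values.
print(len(wide.tobytes()), len(narrow.tobytes()))
```

The same measurement idea (materialize both ways, compare sizes) is what the commenter can run against real tables to convince the team.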
Hi Zach great lecture!
Regarding the compression problem of parquet when you shuffle:
Does that mean some pipelines do an ORDER BY at the end to increase parquet compression?
If I'm not mistaken, ordering in a distributed environment can be really expensive. If you have to shuffle, is there any other solution than that?
sort vs sortWithinPartitions are two very different methods in Spark. The latter is what I'm referring to. Global sorting sucks bad, you're right!
@@EcZachly_ I guess I learned something today, thanks! I will keep it in mind next time I write some parquet files with Spark :)
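The intuition behind this thread, that shuffled rows compress worse than sorted ones, can be demonstrated with the stdlib `zlib` as a crude stand-in for parquet's run-length and dictionary encodings (assumption: no actual Spark or parquet here, just the compression effect of ordering):

```python
import random
import zlib

# A low-cardinality column: many repeated IDs, as in a typical event table.
rows = [f"user_{i % 50}".encode() for i in range(20_000)]

random.seed(42)
shuffled = rows[:]
random.shuffle(shuffled)      # roughly what a shuffle/exchange leaves behind
ordered = sorted(rows)        # roughly what sortWithinPartitions restores

size_shuffled = len(zlib.compress(b"".join(shuffled)))
size_ordered = len(zlib.compress(b"".join(ordered)))

# Long runs of identical values compress far better than interleaved ones.
print(size_shuffled, size_ordered)
```

This is also why a cheap `sortWithinPartitions` (no global shuffle) recovers most of the benefit: compression codecs only see one file at a time, so local order is what matters.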
Interesting. If I were to sign up for the self-paced version in July, would I have full access to everything over the span of two months to complete the bootcamp?
You have access to all paid data engineering content I create for a year
I’m sold. Kinda wanna transition into data engineering from analytics. Do you think I’d make a good candidate for your course?
Potentially! Do you know SQL and python to some extent?
@@EcZachly_ Yes sir. Preparing for the “AWS Data Engineering Associate” certificate as of now.
@@rajdeepjinegar84 Cool my course would deepen a lot of your expertise I bet!
I've never heard of 'shuffle'. So the points you make based on that feel vague to me.
Nice! His script that likes every comment just popped off.
Hi Zach, thanks for sharing. I have a query: I am currently working as a data engineer with 3 years of experience. My current company is asking me to do data science tasks as well, like OpenCV, GenAI, etc. But in big tech companies there are particular teams assigned for data science and data engineering, right? So will doing both be helpful in my career?
Life is long. Learning technical skills almost always improves your career.
If you're a data science-minded data engineer, you're better.
If you're a data scientist who can write pipelines, you're amazing.
Being able to unblock yourself is also important in big tech!
@@EcZachly_thanks for the advice❤
Hello sir,
I need help please.
I wanted a website to calculate salary after taxes.
Can you give me a website to calculate the tax value? I checked several sites, and each site gives me a different result from the others.
Is the next video still coming?
This didn’t convert as well as I thought it would so probably not
Let’s go! You kick ass!
Hello Zach, I couldn't understand one thing about cumulative table design.
You said that FB used this table design to keep only 30 days of data at a time. Does it mean that the cumulative table process started on the first day of every month and carried over to the end of the month only, so that the "last day of the month" partition has all the required data in the array?
If so, why would data retention of anything >30 days be a problem, especially for deleted users? Come the 1st of the next month, we won't be considering deleted users from the previous month at all, right? Or am I missing something here?
I have been scratching my head on this. Couldn't get over it!
It does a rolling 30 days
@@EcZachly_ ahh! Thanks. So "yesterday" would be 29 days of aggregated data at any point in time.
Even then, data retention of deleted users would not be more than 30 days, right?
Almost forgot to say: you've been a great inspiration and I am one of your many silent followers on LinkedIn 😀
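For anyone else puzzling over the rolling window in this thread, a toy sketch of one daily cumulative step in Python (assumption: the function name, dict-of-arrays state, and the drop rule are illustrative, not FB's actual schema):

```python
WINDOW = 30  # keep at most 30 daily activity flags per user

def cumulate(yesterday: dict, today_active: set) -> dict:
    """One daily step: full outer join of yesterday's state with today's activity."""
    out = {}
    for user in yesterday.keys() | today_active:
        prior = yesterday.get(user, [])
        # Prepend today's flag and truncate, so nothing older than WINDOW survives.
        flags = ([user in today_active] + prior)[:WINDOW]
        if any(flags):  # a user inactive (or deleted) for WINDOW days falls out entirely
            out[user] = flags
    return out

state = {}
state = cumulate(state, {"a"})   # day 1: only "a" is active
state = cumulate(state, {"b"})   # day 2: only "b" is active
print(state)
```

This is why retention never exceeds the window: each day's run only reads yesterday's partition plus today's events, and a deleted user's flags age out of the array within 30 daily runs, regardless of month boundaries.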
great video, but your eating while explaining is distracting
Thanks!
Live or Self-paced bootcamp - where can I get info to see which is best for me?
DataExpert.io has details. If you’re down to watch live classes at 6-8 PM pacific, live is probably best. Otherwise self-paced
the apple stole the show... but great content. Thank you @_@
So you're telling me that even in those companies you'll find people who don't want to learn? People who should have the knowledge to just keep gathering new abilities? Like, they're not my 45-year-old coworker who never liked CS stuff and struggles with SQL; they're data analysts and can't manage to do an UNNEST?!
Idk why I read the title as "100TB data leak into 5TBs".
Seriously? The 50 minute lecture is when you had to eat?
The carbs kept me going
Why do you eat an apple while recording a screencast? It is fairly distracting.
Carbs help me continue to blab on
It shouldn't, to be honest.
How do I get an internship in a data engineering role?
Well, it's better to make organised content so that someone can stick to it.
He looks like the footballer Chiellini.