This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D
How do you know that? Did you get a job from what you learned on this channel? Are you actually a data engineer?
@@jay_wright_thats_right Yes, I'm actually a data engineer. I've been a data engineer for many years now, so no, I didn't get a job because of the channel. But I can confirm that he teaches important concepts that are very useful and sometimes not readily available to a beginner engineer.
@@jay_wright_thats_right 100% I'm going to pitch some of these practices ... this is gold
Amazing video! Nice that you didn't cut the video when the unsorted and sorted sizes weren't what you expected!
Zach! We just started our project where we will be transferring our data to Data Lake in parquet! This is a very timely video. Awesome job, as always!
Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!
Best data engineer in the world 🚀
Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, the issues it causes, and ways to fix it?
Zach, I watched this on my way to the office and I loved it, learnt a hell of a lot of things. Thanks for it!
Great and insightful lessons Zach, just high quality content! Your community of loyal DEs is growing :) Keep it up!
Awesome video man! Just discovered your channel and excited to see more like this
If you want to sort from lowest cardinality to highest cardinality, then to get an estimate of the cardinality of each of the table columns we can look at the following values:
- The most steals in a game is 11
- The number of teams in the NBA is 30
- The most assists ever in a game is 30
- The most rebounds in a game is 55
- The most points in a game is 100
- The number of players in the NBA is 500-600
This would imply that sorting by player name is actually one of the worst options. Or is it better to sort by player name rather than by points, because compressing the strings saves more bytes than compressing the numeric values?
Also, considering that it’s unlikely to have two players with the exact same name on different teams, wouldn’t you get even better compression by sorting first by team and then by player? This way, you’d maintain the same level of compression for the player column while also improving compression for the team column.
Wow - I learned so much from this video - Amazing! Thank you for sharing.
amazing class Zach! keep going, thxxx
Nice video Zach ... learnt something new. Thanks 👍
Excellent lesson! Thank you Zach!
I love your videos! So useful for the day-to-day job
Need more of these videos, beginner friendly 💡
It's a great video Zach, thoroughly enjoyed it
Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg treat that insert as its own 'exclusive' set of files and compress and store it on its own, or will it also look at common columns in existing data files and then compress across them?
Thanks for the hands on lab!
Wow Amazing content Zach
Thank you so much
@zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high-cardinality ones for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally try to filter data on high-cardinality columns and hence use those columns to sort the data, so that we read less data and predicate pushdown really helps in reading a very small subset. These two goals contradict each other: on one side we get smaller data, but on the other we are more concerned with sorting on high-cardinality columns.
Yep it’s an art! It all depends on what columns are the most likely to be filtered on!
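A minimal sketch of that read-side tradeoff, assuming pyarrow and a synthetic player table (my own illustration, not the setup from the video): when a file is sorted on the column you filter by, each row group's min/max statistics for that column cover a narrow range, so a reader can skip most row groups from the statistics alone; when the data is unsorted, every row group spans roughly the whole domain and nothing can be skipped.

```python
# Sketch (pyarrow assumed, synthetic data): compare per-row-group min/max
# statistics for a filter column when the file is sorted on that column
# versus left unsorted.
import random

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(42)
rows = 100_000
players = [f"player_{i:03d}" for i in range(500)]   # high-cardinality filter column
table = pa.table({
    "player_name": [random.choice(players) for _ in range(rows)],
    "points": [random.randint(0, 60) for _ in range(rows)],
})

def row_group_ranges(tbl, path, column):
    """Write with small row groups and return each group's (min, max) for `column`."""
    pq.write_table(tbl, path, row_group_size=10_000)
    meta = pq.ParquetFile(path).metadata
    idx = tbl.column_names.index(column)            # flat schema: arrow index == parquet index
    return [
        (meta.row_group(i).column(idx).statistics.min,
         meta.row_group(i).column(idx).statistics.max)
        for i in range(meta.num_row_groups)
    ]

# Unsorted: every row group covers roughly player_000..player_499, so a
# predicate like player_name = 'player_123' still has to read all of them.
print(row_group_ranges(table, "/tmp/unsorted.parquet", "player_name")[:3])

# Sorted on the filter column: ranges are narrow and disjoint, so all but
# one or two row groups can be skipped using the statistics alone.
sorted_tbl = table.sort_by([("player_name", "ascending")])
print(row_group_ranges(sorted_tbl, "/tmp/sorted.parquet", "player_name")[:3])
```

Whether that skipping is worth giving up some of the RLE compression from a low-to-high-cardinality sort is exactly the judgment call described above.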
This guy knows how to explain
Casually ending the gender debate 😂 good video sir! Very informative
Wow the way people push vc is creative now good video.
Hey! Super late to this video but glad I found it. One thing you didn’t touch on is the tradeoff in compute for sorting the data on write. I’m wondering if you found this to be negligible in your past experience. Thanks again for the great content!
Great as always 🎉
What SQL syntax is that? (So I can Google it to research more about what options are available to create a table.)
Trino, which has nearly identical syntax to Postgres
Wow this is amazing!
I have a question: during the whole video you've been dealing with historical data and moving it, but what about newly received data? How do you deal with it? Do you insert it into some random table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?
Collect a daily batch in Kafka, then dump it to Iceberg
Apart from reducing the size, does sorted_by help reads in any other way? Are ORDER BY queries more efficient with sorted_by?
You get skipping and stuff like that too when you filter on the sorted_by columns, so it's more efficient there too
Hello Zach, thanks for the content. After May, when is the next bootcamp?
There's a 100% chance May is the last one where I'm teaching a majority (~75%) of the content.
September/October would be the next one. I'll be teaching like… 30-40%
Wanna join!
The tables you are using for your sources... Are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.
Yep
Is it necessary to sort the dataset? Say, what if the compute and time for sorting cost far more than the storage saved? Even if the storage is very large, it's cheap, right? What is a good tipping point here?
Depends on downstream consumption and volume.
If the data set isn’t used a ton, sorting probably isn’t worth it
This is amazing. You are a fabulous teacher. I had a question on replication: is the replication factor not a requirement anymore in modern cloud data lakes?
Nope. Hadoop is dead fam
Another heat vid
Amazing video! + 1 follower :D
I don't understand the reason for Parquet. So you are saying 'Parquet is amazing, it partitions the data, you don't have to select all columns.' Well, why not just select the needed columns instead of SELECT *? This is SQL 101...
It has to do with how the data is laid out on disk, and the fact that reading from disk happens in contiguous chunks. As an example, take a table that has 3 fields, A, B, and C, that are all 1 byte in size, and say you want to query only field A. Also, suppose your disk transfers happen in 4 byte chunks.
With a row oriented format, 4 rows of data are laid out like:
A B C A B C A B C A B C
To perform the query, you would need to read all 12 bytes from disk. The 3x4 byte chunks being:
A B C A
B C A B
C A B C
That took 3 transfers, and since you’re only interested in field A, you end up only using 4 of the 12 bytes (33.3% efficiency).
With a column oriented format the data is laid out like:
A A A A B B B B C C C C
This time you only need a single 4 byte transfer of the first chunk to get the data you want. That’s a third of the transfers and 100% efficiency
Oftentimes, there will be an additional benefit to having all the field A data stored contiguously in memory, because your CPU also transfers contiguous chunks from RAM to the CPU cache. If you were reading from a row-oriented format, you could rearrange the data into a column-oriented in-memory layout, like Arrow, but that rearrangement would be costly
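A small pyarrow sketch of the same point (my own illustration; the column names and sizes are made up): selecting only the columns you need is good SQL either way, but on a row-oriented file the storage layer still has to read every row in full. With a columnar file like Parquet, the projection is pushed down to the reader, which only touches the byte ranges of the requested columns.

```python
# Sketch (pyarrow assumed): column pruning means the Parquet reader only
# touches the byte ranges for the columns you ask for.
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "a": list(range(n)),                          # the only column the query needs
    "b": [f"payload_{i:07d}" for i in range(n)],  # wide payload we'd like to avoid reading
    "c": [3.14] * n,
})
pq.write_table(table, "/tmp/abc.parquet")

# Per-column bytes on disk: projecting only "a" skips the chunks for "b" and "c".
meta = pq.ParquetFile("/tmp/abc.parquet").metadata
for i, name in enumerate(table.column_names):
    size = sum(meta.row_group(rg).column(i).total_compressed_size
               for rg in range(meta.num_row_groups))
    print(name, size, "bytes on disk")

only_a = pq.read_table("/tmp/abc.parquet", columns=["a"])  # column projection
print(only_a.schema)   # just "a"; the bytes for "b" and "c" were never decoded
```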
Love you you beautiful engineer
Your face is taking up the full screen; we are not interested in your face >>
If you want to sort from lowest cardinality to highest cardinality, then to get an estimate of the cardinality of each of the table columns we can look at the following values:
- The most steals in a game is 11
- The number of teams in the NBA is 30
- The most assists ever in a game is 30
- The most rebounds in a game is 55
- The most points in a game is 100
- The number of players in the NBA is 500-600
This would imply that sorting by player name is actually one of the worst options. Or is it better to sort by player name rather than by points, since compressing the strings saves more bytes than compressing the numeric values?
Also, since having two players on different teams with the same name is very unlikely, wouldn’t you get even better compression by sorting first by team and then by player? This way, you’d maintain the same level of compression for the player column while also improving compression for the team column.
You’re totally right
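A rough sketch of that experiment, assuming pyarrow and synthetic NBA-shaped data (the exact ratio depends on the encoder and the real distribution): write the same table once unsorted and once sorted low-to-high cardinality, team then player then points, and compare the resulting Parquet file sizes.

```python
# Sketch (pyarrow assumed, synthetic data): the same table written unsorted
# and sorted low -> high cardinality, then file sizes compared.
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(0)
teams = [f"team_{i:02d}" for i in range(30)]          # ~30 distinct values
players = [f"player_{i:03d}" for i in range(550)]     # ~550 distinct values
rows = 500_000
table = pa.table({
    "team": [random.choice(teams) for _ in range(rows)],
    "player_name": [random.choice(players) for _ in range(rows)],
    "points": [random.randint(0, 60) for _ in range(rows)],
})

pq.write_table(table, "/tmp/unsorted.parquet")

# Low -> high cardinality: long runs of identical team values, then runs of
# identical player values inside each team, which dictionary/RLE encoding
# collapses into very few bytes.
sorted_tbl = table.sort_by([
    ("team", "ascending"),
    ("player_name", "ascending"),
    ("points", "ascending"),
])
pq.write_table(sorted_tbl, "/tmp/sorted_low_to_high.parquet")

for path in ("/tmp/unsorted.parquet", "/tmp/sorted_low_to_high.parquet"):
    print(path, os.path.getsize(path), "bytes")
```

In a run like this the sorted file should come out noticeably smaller, since the low-cardinality prefix of the sort key creates the long runs that RLE and dictionary encoding can collapse.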