Data Lake Fundamentals, Apache Iceberg and Parquet in 60 minutes on DataExpert.io

  • Published: 12 Jan 2025

Comments • 58

  • @alonzo_go
    @alonzo_go 9 months ago +31

    This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D

    • @jay_wright_thats_right
      @jay_wright_thats_right 7 months ago

      How do you know that? Did you get a job from what you learned on this channel? Are you actually a data engineer?

    • @alonzo_go
      @alonzo_go 7 months ago +7

      @@jay_wright_thats_right Yes, I'm actually a data engineer. I've been a data engineer for many years now, so no, I didn't get a job because of the channel. But I can confirm that he teaches important concepts that are very useful and sometimes not readily available to a beginner engineer.

    • @SheeceGardazi
      @SheeceGardazi 2 months ago

      @@jay_wright_thats_right 100%! I'm going to pitch some of these practices ... this is gold

  • @stiffer_do
    @stiffer_do 23 hours ago +1

    Amazing video! Nice that you didn't cut the video when the unsorted and sorted sizes weren't what you expected!

  • @nobodyinparticula100
    @nobodyinparticula100 11 months ago +6

    Zach! We just started our project where we will be transferring our data to a data lake in Parquet! This is a very timely video. Awesome job, as always!

  • @justinwilkinson6300
    @justinwilkinson6300 11 months ago +4

    Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!

  • @onzurkthewarrior2822
    @onzurkthewarrior2822 7 days ago +1

    Best data engineer in the world 🚀

  • @rohitdeshmukh197
    @rohitdeshmukh197 11 months ago +6

    Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, common shuffling issues, and ways to fix them?

  • @vivekjha9952
    @vivekjha9952 10 months ago +2

    Zach, I watched this on my way to the office, and I loved it. Learnt a hell of a lot of things. Thanks for it!

  • @theloniusmonkey5138
    @theloniusmonkey5138 11 months ago +2

    Great and insightful lessons Zach, just high-quality content! Your community of loyal DEs is growing :) Keep it up!

  • @andydataguy
    @andydataguy 11 months ago +2

    Awesome video man! Just discovered your channel and excited to see more like this

  • @qculryq43
    @qculryq43 9 months ago +1

    Wow - I learned so much from this video - Amazing! Thank you for sharing.

  • @murilloandradef
    @murilloandradef 11 months ago +2

    Amazing class Zach! Keep going, thxxx

  • @princegaurav
    @princegaurav 2 months ago +1

    Nice video Zach ... learnt something new. Thanks 👍

  • @fbokovikov
    @fbokovikov 4 months ago +1

    Excellent lesson! Thank you Zach!

  • @rocioradu8636
    @rocioradu8636 2 months ago +1

    I love your videos! So useful for the day-to-day job

  • @muhammadzakiahmad8069
    @muhammadzakiahmad8069 11 months ago +2

    Need more of these videos, beginner-friendly 💡

  • @vivekjha9952
    @vivekjha9952 10 months ago +1

    It's a great video Zach, thoroughly enjoyed it

  • @ManishJindalmanisism
    @ManishJindalmanisism 11 months ago +2

    Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg treat each insert as its own unit and compress and store it on its own, or will it look at common columns in existing data files as well and then compress across them?

  • @SheeceGardazi
    @SheeceGardazi 2 months ago +1

    Thanks for the hands on lab!

  • @srinubathina7191
    @srinubathina7191 9 months ago +1

    Wow Amazing content Zach
    Thank you so much

  • @atifiu
    @atifiu 10 months ago +2

    @zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high-cardinality ones for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally try to filter data on high-cardinality columns, and hence would use those columns when sorting the data so that we read less data; predicate pushdown really helps in reading a very small subset of the data. These two settings contradict each other: on one side we get smaller data, but on the other side we are more concerned with sorting on high-cardinality columns.

    • @EcZachly_
      @EcZachly_  10 months ago

      Yep it’s an art! It all depends on what columns are the most likely to be filtered on!
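
      A minimal pyarrow sketch of the stats behind that tradeoff (the file name and columns are made up for illustration): each Parquet row group records min/max statistics per column, and a reader can skip a row group entirely when the filter value falls outside that range, which only pays off if the data is sorted on the filter column.

        import pyarrow as pa
        import pyarrow.parquet as pq

        # Write a small table sorted on the column we expect to filter on.
        table = pa.table({
            "points": list(range(1000)),        # already sorted ascending
            "team": ["A"] * 500 + ["B"] * 500,
        })
        pq.write_table(table, "games.parquet", row_group_size=250)

        # Each row group exposes the min/max stats an engine uses for skipping.
        meta = pq.ParquetFile("games.parquet").metadata
        for i in range(meta.num_row_groups):
            stats = meta.row_group(i).column(0).statistics
            print(f"row group {i}: points in [{stats.min}, {stats.max}]")
        # A filter like points > 900 can skip every row group whose max is below 900.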

  • @adolfo1981
    @adolfo1981 4 months ago +1

    This guy knows how to explain

  • @zwartepeat3552
    @zwartepeat3552 9 months ago +2

    Casually ending the gender debate 😂 good video sir! Very informative

  • @JP-zz6ql
    @JP-zz6ql 11 months ago +1

    Wow, the way people push VC is creative now. Good video.

  • @alecryan8220
    @alecryan8220 2 months ago

    Hey! Super late to this video but glad I found it. One thing you didn’t touch on is the tradeoff in compute for sorting the data on write. I’m wondering if you found this to be negligible in your past experience. Thanks again for the great content!

  • @papalaplace
    @papalaplace 11 months ago +1

    Great as always 🎉

  • @YEM_
    @YEM_ 11 months ago +1

    What SQL syntax is that? (So I can Google it to research more about what options are available for CREATE TABLE.)

    • @EcZachly_
      @EcZachly_  11 months ago +1

      Trino, which has nearly identical syntax to Postgres

  • @anthonyanalytics
    @anthonyanalytics 6 months ago +1

    Wow this is amazing!

  • @LMGaming0
    @LMGaming0 7 months ago +1

    I have a question: during the whole video you've been dealing with historical data and moving it. What about newly received data, how do you deal with it? Do you insert it into some other table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?

    • @EcZachly_
      @EcZachly_  7 months ago +2

      Collect a daily batch in Kafka, then dump it to Iceberg
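
      A hedged sketch of that pattern in PySpark (the broker address, topic, and table name are placeholders, and it assumes a Spark session already configured with the Kafka connector and an Iceberg catalog): read the day's batch from Kafka, then append it to the Iceberg table.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.getOrCreate()

        # Bounded batch read of the topic (not a streaming query).
        batch = (spark.read.format("kafka")
                 .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
                 .option("subscribe", "events")                     # placeholder topic
                 .option("startingOffsets", "earliest")
                 .load())

        # Kafka values arrive as bytes; cast, then append to the Iceberg table.
        (batch.select(F.col("value").cast("string").alias("raw_event"))
              .writeTo("lake.db.events")  # placeholder Iceberg table
              .append())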

  • @amankapoor3563
    @amankapoor3563 11 months ago +1

    Apart from reducing the size, does sorted_by help reads in any other way? Are ORDER BY queries more efficient with sorted_by?

    • @EcZachly_
      @EcZachly_  11 months ago

      You get skipping and stuff like that too when you filter on the sorted_by columns, so it's more efficient there too
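
      To make that skipping concrete, here is a small pyarrow sketch (reusing the hypothetical file from the earlier example): passing filters lets the reader drop whole row groups using the min/max stats written at write time, so a filter on the sort column touches only a fraction of the file.

        import pyarrow.parquet as pq

        # Only row groups whose [min, max] range can contain points > 900 are
        # read; on data sorted by points, that is typically a single row group.
        high_scores = pq.read_table(
            "games.parquet",
            columns=["points", "team"],
            filters=[("points", ">", 900)],
        )
        print(high_scores.num_rows)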

  • @pauladataanalyst
    @pauladataanalyst 10 months ago +1

    Hello Zach, thanks for the content, after May, when is the next bootcamp?

    • @EcZachly_
      @EcZachly_  10 months ago

      There's a 100% chance May is the last one where I'm teaching a majority (~75%) of the content.
      September/October would be the next one. I'll be teaching like… 30-40%

    • @MichaelCheung-z2v
      @MichaelCheung-z2v 5 months ago

      Wanna join!

  • @YEM_
    @YEM_ 11 months ago +1

    The tables you are using for your sources... Are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.

  • @sreesanjeev84
    @sreesanjeev84 11 months ago +1

    Is it necessary to sort the dataset? Say, what if the compute and time for sorting >>>> the storage saved? Even if the storage is very large, it is cheap, right? What is the tipping point here?

    • @EcZachly_
      @EcZachly_  11 months ago

      Depends on downstream consumption and volume.
      If the data set isn't used a ton, sorting probably isn't worth it

  • @thoughtfulsd
    @thoughtfulsd 8 months ago

    This is amazing. You are a fabulous teacher. Had a question on replication: is the replication factor not a requirement anymore in modern cloud data lakes?

    • @EcZachly_
      @EcZachly_  8 months ago

      Nope. Hadoop is dead fam

  • @ondacharts
    @ondacharts 7 months ago +1

    Another heat vid

  • @LMGaming0
    @LMGaming0 9 months ago +1

    Amazing video! + 1 follower :D

  • @-es2bf
    @-es2bf 4 months ago +1

    I don't understand the reason for Parquet?? So you are saying 'Parquet is amazing, it partitions the data, you don't have to select all columns'. Well, why not just SELECT the needed columns instead of SELECT *? This is SQL 101...

    • @TheBigWazowski
      @TheBigWazowski 4 months ago +1

      It has to do with how the data is laid out on disk, and the fact that reading from disk happens in contiguous chunks. As an example, take a table with 3 fields, A, B, and C, that are all 1 byte in size, and say you want to query only field A. Also, suppose your disk transfers happen in 4-byte chunks.
      With a row-oriented format, 4 rows of data are laid out like:
      A B C A B C A B C A B C
      To perform the query, you would need to read all 12 bytes from disk, the 3 x 4-byte chunks being:
      A B C A
      B C A B
      C A B C
      That took 3 transfers, and since you're only interested in field A, you end up using only 4 of the 12 bytes (33.3% efficiency).
      With a column-oriented format, the data is laid out like:
      A A A A B B B B C C C C
      This time you only need a single 4-byte transfer of the first chunk to get the data you want. That's a third of the transfers and 100% efficiency.

    • @TheBigWazowski
      @TheBigWazowski 4 months ago +1

      Oftentimes, there will be an additional benefit to having all the field A data stored contiguously in memory, because your CPU also transfers contiguous chunks from RAM to the CPU cache. If you were reading from a row-oriented format, you could rearrange the data to use a column-oriented in-memory layout, like Arrow, but that rearrangement would be costly.
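
      A tiny Python sketch of that layout difference in practice (the file name is made up): with Parquet's columnar layout, selecting one field reads only that column's chunks from the file, matching the A/B/C example above.

        import pyarrow as pa
        import pyarrow.parquet as pq

        table = pa.table({
            "A": [1, 2, 3, 4],
            "B": [5, 6, 7, 8],
            "C": [9, 10, 11, 12],
        })
        pq.write_table(table, "abc.parquet")

        # Only column A's chunks are fetched; B and C are never read.
        only_a = pq.read_table("abc.parquet", columns=["A"])
        print(only_a)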

  • @eduardogaitanescalante9698
    @eduardogaitanescalante9698 4 months ago +1

    Love you, you beautiful engineer

  •  2 months ago +1

    Your face getting full screen, we are not interested in your face >>

  • @FabianJEvans
    @FabianJEvans 24 days ago +1

    If you want to sort from lowest cardinality to highest cardinality, then to get an estimate of the cardinality of each of the table columns we can look at the following values:
    - The most steals in a game is 11
    - The number of teams in the NBA is 30
    - The most assists ever in a game is 30
    - The most rebounds in a game is 55
    - The most points in a game is 100
    - The number of players in the NBA is 500-600
    This would imply that sorting by player name is actually one of the worst options. Is it better to sort by player name rather than points because compressing the strings reduces more bytes than compressing the numeric values?
    Also, since having two players on different teams with the same name is very unlikely, wouldn't you get even better compression by sorting first by team and then by player? This way, you'd maintain the same level of compression for the player column while also improving compression for the team column.

    • @EcZachly_
      @EcZachly_  24 days ago

      You’re totally right
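
      A rough way to test that intuition with pyarrow (the data is synthetic and the column names are made up): write the same rows twice, once unsorted and once sorted by team then player as the comment suggests, and compare file sizes; run-length and dictionary encoding do better when repeated values line up in long runs.

        import os
        import random

        import pyarrow as pa
        import pyarrow.parquet as pq

        random.seed(0)
        teams = [f"team_{i}" for i in range(30)]
        players = [f"player_{i}" for i in range(500)]
        rows = [(random.choice(teams), random.choice(players), random.randint(0, 100))
                for _ in range(100_000)]

        def write(batch, path):
            team, player, points = zip(*batch)
            table = pa.table({"team": list(team),
                              "player": list(player),
                              "points": list(points)})
            pq.write_table(table, path)

        write(rows, "unsorted.parquet")
        write(sorted(rows), "sorted.parquet")  # tuple sort: team, then player

        for path in ("unsorted.parquet", "sorted.parquet"):
            print(path, os.path.getsize(path), "bytes")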