Data Lake Modeling: 100 TBs into 5 TBs at Airbnb with Parquet + Run Length Encoding - DataExpert.io

  • Published: Nov 24, 2024

Comments •

  • @GugiMandini
    @GugiMandini 5 months ago +32

    I don't see a lot of people having data engineering discussions at this level. I think this is really good.

  • @okinedobenedict2453
    @okinedobenedict2453 2 months ago +4

    Zach, your style is so intriguing. Seriously, I am setting up to be a data analyst and I actually planned to just have a glimpse of this video, and oh boy!!! You're tempting me to be a data engineer now 🥵.

  • @brendankerr148
    @brendankerr148 2 months ago +2

    9:50-10:15 is so relatable 😢. Something you learn after a few years for sure…

  • @visioncraftstudio
    @visioncraftstudio 6 months ago +16

    I recently checked out your old lecture video that is posted on your channel, and when I compare it to this one, this is a very big jump in terms of explanation, concept, editing, etc. It feels very premium. Good to see you growing.

  • @GugiMandini
    @GugiMandini 5 months ago +4

    "Who is the user" is a great point, the higher up the company hierarchy the simpler it has to be. We deliver power over information to our company, they have to feel in control.

  • @Milhouse77BS
    @Milhouse77BS 5 months ago +4

    You had me at “dimensional data modeling”. And as an engineer I’m familiar with the other type of “dimensional modeling”.

  • @bfors8498
    @bfors8498 6 months ago +29

    I've observed that snacking while in a meeting or presentation is a total power move, our VP of product used to snack while arguing with our CTO and it was a devastating technique.

    • @tiotito31
      @tiotito31 5 months ago +2

      It drives me crazy lol. Just hearing how it affects your speech *shudders. Definitely power move, also borderline disrespectful.

    • @mohitranka9840
      @mohitranka9840 5 months ago +4

      It's a dick move. Give respect to the topic and the audience by engaging completely with the topic at hand.

  • @dutsi170
    @dutsi170 6 months ago +90

    The content is fantastic, but you eating while explaining is a little bit annoying and makes it hard to stay focused... but as always, brilliant

    • @StartDataLate
      @StartDataLate 2 months ago +1

      Honestly that doesn’t bother me at all. People need food to stay passionate

  • @nietzsche.official
    @nietzsche.official 4 months ago +4

    Came here to compliment Zach's jawline. Decent tutorial too, I guess

  • @MohamedFazan-w2y
    @MohamedFazan-w2y 2 months ago +1

    Most of the concepts are explained to industry standards ❤

  • @LMGaming0
    @LMGaming0 6 months ago +2

    9:44 I feel you, I'm literally living the same exact feeling in my current job :( sad but true

  • @dualfluidreactor
    @dualfluidreactor 3 months ago +1

    loved it! very practical and insightful and high value! I guess I should do the course

  • @sungwonchung
    @sungwonchung 6 months ago +2

    This is really really good. I just sent this video to someone I just mentored. Keep sharing more stuff like this!

  • @DeepakUday-sf3wu
    @DeepakUday-sf3wu 6 months ago +4

    Hey Zach, really awesome. Please keep up the good work. Really a data fan of yours.

  • @fuscone
    @fuscone 6 months ago +17

    The Zuck email was the ultimate flex

  • @tomczubat
    @tomczubat 6 months ago +1

    I am new to D.E. but I thought the section about cumulative table design was really cool.

  • @amandeepsingh7648
    @amandeepsingh7648 5 months ago +1

    Thanks Zach 😊, your content is pure gold

  • @Steven-M-89
    @Steven-M-89 6 months ago +3

    Hey Zach, love your content, mate! Particularly the intricate yet simple details you just don't think of, such as 30-day arrays and 1am London time.
    I really hope you build a self-paced course one day that focuses on your prerequisites:
    -Proficiency in Python and SQL, at least 6 months of experience in both
    -Basic understanding of Docker, Flink, and Kafka
    -Basic understanding of SQL Window Functions
    Whilst you can find basic stuff online, like CS50, I can see the benefits of investing in someone that teaches you the right way from the start, not cutting corners or a vast range of irrelevant content. Applying it to your data projects as you learn would be a bonus.

    • @EcZachly_
      @EcZachly_ 6 months ago +8

      Just feels like that area is extremely saturated and not the best use of my time. That’s why I’ve almost done it a few times but backed out.
      Every time I try to make basic content, my heart just isn’t in it

  • @sitrakaforler8696
    @sitrakaforler8696 1 month ago +1

    Bro is a hero !

  • @rembautimes8808
    @rembautimes8808 2 months ago +1

    Great talk, a lot of practical discussions

  • @shifraisaacs2711
    @shifraisaacs2711 5 months ago +1

    Loved this, thank you Zach!

  • @runix953
    @runix953 6 months ago +2

    holy legendary video, exactly what i was looking for

  • @gizmoffm
    @gizmoffm 6 months ago +4

    great content - thanks zach for sharing!

  • @gizmoffm
    @gizmoffm 6 months ago +3

    I plan to build an on-premise data warehouse with a fast OLAP engine, like "BigQuery fast", but I struggle to decide on one tool stack. I thought about mainly using ClickHouse but I'm not too sure. Do you have a suggestion? Also, maybe a good video in general would be "Choosing the right tools" or something.

  • @tomczubat
    @tomczubat 6 months ago +1

    This is great. I am really enjoying these videos.

  • @hugoelizabeth4548
    @hugoelizabeth4548 6 months ago +3

    Great video Zach!

  • @ansonnn_
    @ansonnn_ 1 month ago +1

    Thanks for the epic content Zach! But any idea how to handle late arriving data? The arrays would need to be updated accordingly right?

  • @vachadave1706
    @vachadave1706 6 months ago +1

    Thank you Zach!

  • @databasemadness
    @databasemadness 6 months ago +1

    This was pure gold!

  • @spikeydude114
    @spikeydude114 5 months ago +1

    Great video!

  • @amankapoor3563
    @amankapoor3563 6 months ago +3

    Thanks for this! Just as you showed the sample schema of the 2B dataset, can you also share the sample of the equivalent array schema?

    • @EcZachly_
      @EcZachly_ 6 months ago +1

      Sure
      ```
      CREATE TABLE listing_availability (
        listing_id BIGINT,
        -- one element per day for the next 365 days
        availability_next_365d ARRAY
      )
      PARTITIONED BY (ds STRING)
      ```
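
To make the array schema above concrete, here is a minimal Python sketch of how one-row-per-listing-per-night data might be collapsed into one array per listing. The row shape `(listing_id, night, is_available)`, the boolean element type, and the function name `to_availability_arrays` are assumptions for illustration; the reply does not specify them.

```python
from collections import defaultdict
from datetime import date, timedelta


def to_availability_arrays(rows, start, days=365):
    """Collapse (listing_id, night, is_available) rows into one array per listing.

    Index 0 of each array is `start`, index 1 is the next day, and so on.
    Nights with no row default to False (an assumption of this sketch).
    """
    arrays = defaultdict(lambda: [False] * days)
    for listing_id, night, is_available in rows:
        offset = (night - start).days
        if 0 <= offset < days:  # ignore nights outside the window
            arrays[listing_id][offset] = is_available
    return dict(arrays)


# Hypothetical sample data: four daily rows become two array rows.
start = date(2024, 11, 24)
rows = [
    (1, start, True),
    (1, start + timedelta(days=1), False),
    (1, start + timedelta(days=2), True),
    (2, start + timedelta(days=1), True),
]
arrays = to_availability_arrays(rows, start, days=3)
print(arrays)  # {1: [True, False, True], 2: [False, True, False]}
```

The row count drops from one per listing per night to one per listing, and the date is implied by the array position instead of being stored on every row.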

  • @Nalaka-Wanniarachchi
    @Nalaka-Wanniarachchi 6 months ago +2

    Superb content. Keep up.

  • @SuperLano98
    @SuperLano98 6 months ago +1

    Hello from Brazil Zach!
    Amazing content, really!
    I have a question about cumulative table design. I'm searching for some references about it, like books, academic papers, or anything. Can you recommend some, please?

  • @Rut-n1y
    @Rut-n1y 3 months ago +1

    Back story, Zach was fired from every role

  • @FlightOverRio
    @FlightOverRio 4 months ago

    Have you ever worked with SAP HANA? As a DB, how efficient is it compared to others?

  • @ByteNinja-YT
    @ByteNinja-YT 6 months ago +2

    I really enjoy your videos and learn a lot from them!
    I do have a small question. Do you think it still is important to have the data types as small as possible even if you are using delta lake with pyspark for example? I wonder, because all the underlying data is in parquet and optimized.

    • @EcZachly_
      @EcZachly_ 6 months ago +1

      I talked about Parquet quite a bit in this video. You should aim to make your data as small as possible. Just because something is Parquet doesn’t mean you should just do whatever. It’s a file format, and Delta Lake doesn’t optimize like you think it does.
      Volume and tradeoffs need to be considered here for sure

    • @ByteNinja-YT
      @ByteNinja-YT 6 months ago

      Thanks a lot for your answer. I recently started as a data engineer and told my team that we could reduce a lot of the data types sizes. However, they just said it is not important and that parquet fixes the issues. I will do some analysis on our data in terms of table size and query speed and show the results to the team.
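
As a rough illustration of the data type point in this thread, the sketch below compares the raw width of the same values stored as 64-bit and 8-bit integers using Python's standard `array` module. The `ratings` column is hypothetical, and this only shows the size before any encoding; how much of the gap survives Parquet's own encodings depends on the data, so measuring real tables as the commenter plans is still the right test.

```python
from array import array

# Hypothetical column: one million ratings between 1 and 5.
ratings = [1, 5, 3, 4, 2] * 200_000

as_int64 = array("q", ratings)  # signed 64-bit integers
as_int8 = array("b", ratings)   # signed 8-bit integers

bytes_int64 = len(as_int64) * as_int64.itemsize
bytes_int8 = len(as_int8) * as_int8.itemsize

print(f"int64: {bytes_int64:,} bytes")
print(f"int8:  {bytes_int8:,} bytes")
print(f"ratio: {bytes_int64 // bytes_int8}x")
```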

  • @MacGlourson
    @MacGlourson 6 months ago +2

    Hi Zach, great lecture!
    Regarding the compression problem of Parquet when you shuffle:
    Does that mean some pipelines do an ORDER BY at the end to increase Parquet compression?
    If I'm not mistaken, ordering in a distributed environment can be really expensive. If you have to shuffle, is there any other solution than that?

    • @EcZachly_
      @EcZachly_ 6 months ago +2

      sort and sortWithinPartitions are two very different methods in Spark. The latter is what I’m referring to. Global sorting sucks bad, you’re right!

    • @MacGlourson
      @MacGlourson 6 months ago

      @EcZachly_ I guess I learned something today, thanks! I will keep it in mind next time I write some Parquet files with Spark :)
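
The effect discussed in this thread can be sketched without Spark. The plain-Python example below run-length encodes a shuffled column split across three partitions, then sorts each partition locally (the idea behind `sortWithinPartitions`, with no global ordering across partitions) and encodes again. The column values, partition count, and helper names are made up for illustration; this is not Parquet's actual encoder.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Return (value, run_length) pairs for consecutive equal values."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def sort_within_partitions(partitions):
    """Sort each partition on its own; no data moves between partitions."""
    return [sorted(partition) for partition in partitions]


random.seed(0)
# Hypothetical low-cardinality column, scrambled as it would be after a shuffle.
column = [country for country in ("BR", "GB", "US") for _ in range(1000)]
random.shuffle(column)
partitions = [column[i::3] for i in range(3)]

runs_before = sum(len(run_length_encode(p)) for p in partitions)
runs_after = sum(
    len(run_length_encode(p)) for p in sort_within_partitions(partitions)
)

print("runs after shuffle:", runs_before)
print("runs after local sort:", runs_after)  # at most 3 runs per partition
```

Each partition is written as its own file, so a local sort is enough to produce long runs; a global sort would add a second expensive shuffle for no extra benefit to the encoding.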

  • @rabago85
    @rabago85 6 months ago +1

    Interesting. If I were to sign up for the self-paced version in July, would I have full access to everything over the span of two months to complete the bootcamp?

    • @EcZachly_
      @EcZachly_ 6 months ago

      You have access to all paid data engineering content I create for a year

  • @rajdeepjinegar84
    @rajdeepjinegar84 5 months ago

    I’m sold. Kinda wanna transition into data engineering from analytics. Do you think I’d make a good candidate for your course?

    • @EcZachly_
      @EcZachly_ 5 months ago

      Potentially! Do you know SQL and python to some extent?

    • @rajdeepjinegar84
      @rajdeepjinegar84 5 months ago

      @EcZachly_ Yes sir. Preparing for the “AWS data engineering associate” certificate as of now

    • @EcZachly_
      @EcZachly_ 5 months ago

      @rajdeepjinegar84 Cool, my course would deepen a lot of your expertise, I bet!

  • @dima13693
    @dima13693 4 months ago +1

    I've never heard of 'shuffle'. So the points you make based on that feel vague to me.

    • @dima13693
      @dima13693 3 months ago

      nice! his script that likes every comment just popped off.

  • @sairevanthgudivada886
    @sairevanthgudivada886 6 months ago +2

    Hi Zach, thanks for sharing. I have a query: I am currently working as a data engineer with 3 years of experience. My current company is asking me to do data science tasks as well, like OpenCV, GenAI, etc. But in big tech companies there are separate teams for data science and data engineering, right? So will doing both be helpful in my career?

    • @EcZachly_
      @EcZachly_ 6 months ago +8

      Life is long. Learning technical skills almost always improves your career.
      If you're a data science-minded data engineer, you're better.
      If you're a data scientist who can write pipelines, you're amazing.
      Being able to unblock yourself is also important in big tech!

    • @sairevanthgudivada886
      @sairevanthgudivada886 6 months ago +1

      @EcZachly_ thanks for the advice❤

  • @thesea3018
    @thesea3018 24 days ago

    Hello sir
    I need help please
    I wanted a website to calculate the salary after taxes
    Can you give me a website to calculate the tax value, because I checked several sites. Each site gives me a different result than the other

  • @matthiaswarlop2316
    @matthiaswarlop2316 5 months ago

    Is the next video still coming?

    • @EcZachly_
      @EcZachly_ 5 months ago +1

      This didn’t convert as well as I thought it would so probably not

  • @imafatass9123
    @imafatass9123 5 months ago +1

    Let’s go! You kick ass!

  • @narasimhanmb4703
    @narasimhanmb4703 6 months ago

    Hello Zach, I couldn't understand one thing about cumulative table design.
    You said that, FB used this table design to keep only 30 days of data at a time. Does it mean that the cumulative table process started first day of every month and carried over to end of the month only so that the "last day of the month" partition has all required data in array?
    If so, why would the data retention of anything >30 days be a problem especially for deleted users? Come 1st of next month, we'll not be considering deleted user of previous month at all, right? Or am I missing something here?
    I have been scratching my head on this. Couldn't get over it!

    • @EcZachly_
      @EcZachly_ 6 months ago

      It does a rolling 30 days

    • @narasimhanmb4703
      @narasimhanmb4703 6 months ago

      @EcZachly_ Ahh! Thanks. So "yesterday" would be 29 days of aggregated data at any point in time.
      Even then, data retention of deleted users would not be more than 30 days, right?
      Almost forgot to say: you've been a great inspiration and I am one of your many silent followers on LinkedIn 😀
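
The "rolling 30 days" answer in this thread can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline from the video: each user carries a most-recent-first list of daily activity flags, today's partition is built only from yesterday's partition plus today's events, and users whose whole window is inactive are dropped. The function name `roll_forward`, the dict layout, and the drop rule are all hypothetical.

```python
def roll_forward(yesterday, active_today, window=30):
    """Build today's cumulative rows from yesterday's rows and today's activity.

    yesterday:    dict of user_id -> list of bools, most recent day first
    active_today: set of user_ids seen today
    """
    today = {}
    for user_id in set(yesterday) | set(active_today):
        history = yesterday.get(user_id, [])
        # Prepend today's flag and truncate, so the oldest day falls off.
        updated = ([user_id in active_today] + history)[:window]
        if any(updated):  # assumption: users with an empty window are dropped
            today[user_id] = updated
    return today


# Four days with a 3-day window, to keep the output short.
state = {}
state = roll_forward(state, {"a", "b"}, window=3)  # day 1
state = roll_forward(state, {"a"}, window=3)       # day 2
state = roll_forward(state, set(), window=3)       # day 3
state = roll_forward(state, set(), window=3)       # day 4
print(state)  # {'a': [False, False, True]}
```

In this sketch, user "b" was last active on day 1, so by day 4 their window is empty and the row disappears from the newest partition. Older partitions still contain that user until they are deleted, which is why retention of past partitions still matters.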

  • @imbw80
    @imbw80 6 months ago +4

    great video, but your eating while explaining is distracting

  • @DataMan90210
    @DataMan90210 6 months ago +1

    Thanks!
    Live or Self-paced bootcamp - where can I get info to see which is best for me?

    • @EcZachly_
      @EcZachly_ 6 months ago +1

      DataExpert.io has details. If you’re down to watch live classes at 6-8 PM pacific, live is probably best. Otherwise self-paced

  • @mx953
    @mx953 6 months ago +3

    the apple stole the show... but great content. Thank you @_@

  • @simonegalli5453
    @simonegalli5453 5 months ago +2

    So you're telling me that even in those companies you'll find people who don't want to learn? People who should have the knowledge to just keep gathering new abilities? They are not like my 45 co-worker who never liked CS stuff and struggles with SQL; they are data analysts and can't manage to do an unnest?!

  • @kareemyoussef2304
    @kareemyoussef2304 2 months ago +1

    Idk why I read the title as "100 TB data leak into 5 TBs"

  • @vook777
    @vook777 6 months ago +6

    Seriously? The 50 minute lecture is when you had to eat?

    • @EcZachly_
      @EcZachly_ 6 months ago +3

      The carbs kept me going

  • @allixender
    @allixender 2 months ago +2

    Why do you eat an apple while recording a screencast? It is fairly distracting.

    • @EcZachly_
      @EcZachly_ 2 months ago

      Carbs help me continue to blab on

    • @Vijayhub2
      @Vijayhub2 2 months ago

      it shouldnt for you to be honest.

  • @UjjwalSingh-x1v
    @UjjwalSingh-x1v 6 months ago +2

    How do I get an internship in a data engineering role?

  • @malikmoinawan6942
    @malikmoinawan6942 6 months ago +1

    Well, it's better to make organised content so that someone can stick to it.

  • @MachineLearning-f6e
    @MachineLearning-f6e 2 months ago

    He looks like the footballer Chiellini

  • @aso0om
    @aso0om 3 months ago +1

    test comment