Data modeling a 100 TB data lake into 5 TBs with STRUCT and Array - DataExpert.io Bootcamp preview

  • Published: May 1, 2024
  • This is the first lecture of my 40+ hour boot camp materials. This is connected to this lesson here: dataexpert.io/lesson/dimensio... where you can get the slides.
    At Airbnb, I worked a lot with dimensional data, Parquet and compression techniques. I was able to get the pricing and availability data to be much much smaller by leveraging the right processes!
    We are still accepting applications for May 6th boot camp for a few more days! You can get a discount here: dataexpert.io/zach15
    Message support@eczachly.com for any other questions y'all might have!
    Don't forget to subscribe to my free newsletter: blog.dataengineer.io
  • Science

Comments • 67

  • @GugiMandini
    @GugiMandini 26 days ago +10

    I don't find a lot of people I see having data engineering discussions at this level, I think this is really good.

  • @nietzsche.official
    @nietzsche.official 8 days ago +2

    Came here to compliment Zach's jawline. Decent tutorial too, I guess

  • @runix953
    @runix953 2 months ago +2

    holy legendary video, exactly what i was looking for

  • @GugiMandini
    @GugiMandini 26 days ago +1

    "Who is the user" is a great point, the higher up the company hierarchy the simpler it has to be. We deliver power over information to our company, they have to feel in control.

  • @DeepakUday-sf3wu
    @DeepakUday-sf3wu 2 months ago +3

    Hey Zach, really awesome. Please keep up the good work. Really a data fan of yours.

  • @mr.daniish
    @mr.daniish 2 months ago +1

    This was pure gold!

  • @hugoelizabeth4548
    @hugoelizabeth4548 2 months ago +3

    Great video Zach!

  • @amandeepsingh7648
    @amandeepsingh7648 1 month ago +1

    Thanks Zach 😊, your content is pure gold

  • @sungwonchung
    @sungwonchung 2 months ago +2

    This is really really good. I just sent this video to someone I just mentored. Keep sharing more stuff like this!

  • @shifraisaacs2711
    @shifraisaacs2711 22 days ago +1

    Loved this, thank you Zach!

  • @tomczubat
    @tomczubat 2 months ago +1

    This is great. I am really enjoying these videos.

  • @gizmoffm
    @gizmoffm 2 months ago +4

    great content - thanks zach for sharing!

  • @vachadave1706
    @vachadave1706 2 months ago +1

    Thank you Zach!

  • @Milhouse77BS
    @Milhouse77BS 1 month ago +2

    You had me at “dimensional data modeling”. And as an engineer I’m familiar with the other type of “dimensional modeling”.

  • @visioncraftstudio
    @visioncraftstudio 2 months ago +9

    I recently checked out the old lecture video posted on your channel, and compared to that one, this is a very big jump in terms of explanation, concept, editing, etc. It feels very premium. Good to see you growing.

  • @Nalaka-Wanniarachchi
    @Nalaka-Wanniarachchi 2 months ago +2

    Superb content. Keep up.

  • @spikeydude114
    @spikeydude114 17 days ago +1

    Great video!

  • @dutsi170
    @dutsi170 2 months ago +63

    The content is fantastic, but you eating while explaining is a little bit annoying and makes it hard to stay focused... but as always, brilliant.

  • @fuscone
    @fuscone 2 months ago +16

    The Zuck email was the ultimate flex

  • @LMGaming0
    @LMGaming0 2 months ago +2

    9:44 I feel you, I'm literally living the same exact feeling in my current job :( sad but true

  • @bfors8498
    @bfors8498 2 months ago +21

    I've observed that snacking while in a meeting or presentation is a total power move, our VP of product used to snack while arguing with our CTO and it was a devastating technique.

    • @tiotito31
      @tiotito31 28 days ago

      It drives me crazy lol. Just hearing how it affects your speech *shudders. Definitely power move, also borderline disrespectful.

    • @mohitranka9840
      @mohitranka9840 22 days ago

      It's a dick move. Give respect to the topic and the audience by engaging completely with the topic at hand.

  • @gizmoffm
    @gizmoffm 2 months ago +2

    I plan to build an on-premise data warehouse with fast OLAP, like "BigQuery fast", but I'm struggling to decide on one tool stack. I thought about mainly using ClickHouse but I'm not too sure. Do you have a suggestion? Also, maybe a good video in general would be "Choosing the right tools" or something.

  • @SuperLano98
    @SuperLano98 1 month ago +1

    Hello from Brazil Zach!
    Amazing content, really!
    I have a question about cumulative table design. I'm searching for some references about it, like books, academic papers, or anything. Can you recommend some, please?

  • @tomczubat
    @tomczubat 2 months ago +1

    I am new to D.E. but I thought the section about cumulative table design was really cool.

  • @amankapoor3563
    @amankapoor3563 2 months ago +3

    Thanks for this! Just as you showed the sample schema of the 2B dataset, can you also share the sample of the equivalent array schema?

    • @EcZachly_
      @EcZachly_ 2 months ago +1

      Sure
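      ```
      -- One row per listing per snapshot date (ds), instead of one row per night.
      -- The array's element type was lost from the original comment; BOOLEAN is an assumption.
      CREATE TABLE listing_availability (
          listing_id BIGINT,
          availability_next_365d ARRAY<BOOLEAN>
      )
      PARTITIONED BY (ds STRING)
      ```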
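
For readers who want to see the idea behind the array schema above, here is a minimal Python sketch, not the author's pipeline. It assumes a row-per-night layout of `(listing_id, day_offset, is_available)`; the sample rows, the 3-day horizon, and the helper names `to_array_rows` and `unnest` are all hypothetical. It collapses the nightly rows into one array per listing and then expands them back.

```python
from collections import defaultdict

# Hypothetical row-per-night records: (listing_id, day_offset, is_available).
nightly_rows = [
    (1, 0, True), (1, 1, False), (1, 2, True),
    (2, 0, False), (2, 1, False), (2, 2, True),
]

def to_array_rows(rows, horizon):
    """Collapse one row per (listing, night) into one row per listing.

    Position i of each array holds the availability i days after the
    snapshot date, so the night no longer needs its own column.
    """
    arrays = defaultdict(lambda: [None] * horizon)
    for listing_id, day_offset, is_available in rows:
        arrays[listing_id][day_offset] = is_available
    return dict(arrays)

def unnest(array_rows):
    """Expand the arrays back into one row per (listing, night)."""
    return [
        (listing_id, day_offset, value)
        for listing_id, values in sorted(array_rows.items())
        for day_offset, value in enumerate(values)
    ]

array_rows = to_array_rows(nightly_rows, horizon=3)
print(array_rows)
print(unnest(array_rows) == sorted(nightly_rows))
```

The array form stores each `listing_id` once rather than once per night, and the night is implied by the array position, which is where most of the row-count reduction comes from.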

  • @Steven-M-89
    @Steven-M-89 2 months ago +3

    Hey Zach, love your content, mate! Particularly, the intricate yet simple details you just don't think of such as 30day arrays and 1am London time.
    I really hope you build a self-paced course one day that focuses on your prerequisites:
    -Proficiency in Python and SQL, at least 6 months of experience in both
    -Basic understanding of Docker, Flink, and Kafka
    -Basic understanding of SQL Window Functions
    Whilst you can find basic stuff online, like CS50, I can see the benefits of investing in someone that teaches you the right way from the start, not cutting corners or a vast range of irrelevant content. Applying it to your data projects as you learn would be a bonus.

    • @EcZachly_
      @EcZachly_ 2 months ago +8

      Just feels like that area is extremely saturated and not the best use of my time. That’s why I’ve almost done it a few times but backed out.
      Every time I try to make basic content, my heart just isn’t in it

  • @MacGlourson
    @MacGlourson 2 months ago +1

    Hi Zach, great lecture!
    Regarding the Parquet compression problem when you shuffle:
    Does that mean some pipelines do an ORDER BY at the end to improve Parquet compression?
    If I'm not mistaken, ordering in a distributed environment can be really expensive. If you have to shuffle, is there any other solution?

    • @EcZachly_
      @EcZachly_ 2 months ago +1

      sort and sortWithinPartitions are two very different methods in Spark. The latter is what I’m referring to. Global sorting sucks bad, you’re right!

    • @MacGlourson
      @MacGlourson 2 months ago

      @EcZachly_ I guess I learned something today, thanks! I will keep it in mind next time I write some Parquet files with Spark :)
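
To illustrate why sorting inside each partition helps, here is a toy Python sketch. It is an assumption-laden simplification: it counts plain run-length-encoding runs on a hypothetical low-cardinality column, whereas Parquet's real encodings are more involved. In Spark itself, `DataFrame.sortWithinPartitions` sorts each partition locally and avoids the extra shuffle that a global `sort` needs.

```python
import random
from itertools import groupby

def run_length_encode(values):
    """Return (value, run_length) pairs for consecutive equal values."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical low-cardinality column inside a single partition.
random.seed(0)
column = [random.choice(["US", "BR", "DE", "IN"]) for _ in range(10_000)]

shuffled_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

print(len(shuffled_runs), "runs when unsorted")
print(len(sorted_runs), "runs when sorted within the partition")
```

Fewer, longer runs mean fewer (value, count) pairs to store, which is the intuition behind sorting before writing.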

  • @Faire-rs7ph
    @Faire-rs7ph 2 months ago +2

    Hey Zach, it's a great video. Can you please make a data engineering roadmap for absolute beginners and further levels?

    • @EcZachly_
      @EcZachly_ 2 months ago

      Blog.dataengineer.io has the roadmap

  • @CreativePuppyYT
    @CreativePuppyYT 2 months ago +1

    I really enjoy your videos and learn a lot from them!
    I do have a small question. Do you think it is still important to keep data types as small as possible even if you are using Delta Lake with PySpark, for example? I wonder, because all the underlying data is in Parquet and optimized.

    • @EcZachly_
      @EcZachly_ 2 months ago +1

      I talked about Parquet quite a bit in this video. You should aim to make your data as small as possible. Just because something is Parquet doesn’t mean you should just do whatever. It’s a file format, and Delta Lake doesn’t optimize like you think it does.
      Volume and tradeoffs need to be considered here for sure.

    • @CreativePuppyYT
      @CreativePuppyYT 2 months ago

      Thanks a lot for your answer. I recently started as a data engineer and told my team that we could reduce the size of a lot of our data types. However, they just said it is not important and that Parquet fixes the issue. I will do some analysis on our data in terms of table size and query speed and show the results to the team.
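
As a rough illustration of what "smaller data types" means before any file format gets involved, this Python sketch stores the same hypothetical column at two integer widths using the standard `array` module. It only measures raw width; how much of the gap survives encoding and compression in a real table is exactly what the analysis planned in the reply above would show.

```python
from array import array

# Hypothetical column: one small integer per row, always between 0 and 100.
values = [i % 101 for i in range(1_000_000)]

as_int64 = array("q", values)  # 8 bytes per value, comparable to a BIGINT
as_int8 = array("b", values)   # 1 byte per value, comparable to a TINYINT

print(len(as_int64.tobytes()), "bytes as 64-bit integers")
print(len(as_int8.tobytes()), "bytes as 8-bit integers")
```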

  • @rabago85
    @rabago85 2 months ago +1

    Interesting. If I were to sign up for the self-paced version in July, would I have full access to everything over the span of two months to complete the bootcamp?

    • @EcZachly_
      @EcZachly_ 2 months ago

      You have access to all paid data engineering content I create for a year

  • @sairevanthgudivada886
    @sairevanthgudivada886 2 months ago +2

    Hi Zach, thanks for sharing. I have a question: I am currently working as a data engineer with 3 years of experience. My current company is asking me to do data science tasks as well, like OpenCV, GenAI, etc. But in big tech companies there are dedicated teams for data science and data engineering, right? So will doing both be helpful in my career?

    • @EcZachly_
      @EcZachly_ 2 months ago +8

      Life is long. Learning technical skills almost always improves your career.
      If you're a data science-minded data engineer, you're better.
      If you're a data scientist who can write pipelines, you're amazing.
      Being able to unblock yourself is also important in big tech!

    • @sairevanthgudivada886
      @sairevanthgudivada886 2 months ago +1

      @EcZachly_ thanks for the advice ❤

  • @DataMan90210
    @DataMan90210 2 months ago +1

    Thanks!
    Live or Self-paced bootcamp - where can I get info to see which is best for me?

    • @EcZachly_
      @EcZachly_ 2 months ago +1

      DataExpert.io has details. If you’re down to watch live classes at 6-8 PM Pacific, live is probably best. Otherwise, self-paced.

  • @rajdeepjinegar84
    @rajdeepjinegar84 22 days ago

    I’m sold. Kinda wanna transition into data engineering from analytics. Do you think I’d make a good candidate for your course?

    • @EcZachly_
      @EcZachly_ 22 days ago

      Potentially! Do you know SQL and Python to some extent?

    • @rajdeepjinegar84
      @rajdeepjinegar84 22 days ago

      @EcZachly_ Yes sir. Preparing for the “AWS data engineering associate” certificate as of now.

    • @EcZachly_
      @EcZachly_ 22 days ago

      @rajdeepjinegar84 Cool, my course would deepen a lot of your expertise, I bet!

  • @narasimhanmb4703
    @narasimhanmb4703 2 months ago

    Hello Zach, I couldn't understand one thing about cumulative table design.
    You said that FB used this table design to keep only 30 days of data at a time. Does it mean that the cumulative table process started on the first day of every month and carried over to the end of the month only, so that the "last day of the month" partition has all the required data in the array?
    If so, why would retention of data older than 30 days be a problem, especially for deleted users? Come the 1st of next month, we won't be considering the previous month's deleted users at all, right? Or am I missing something here?
    I have been scratching my head over this. Couldn't get past it!

    • @EcZachly_
      @EcZachly_ 2 months ago

      It does a rolling 30 days

    • @narasimhanmb4703
      @narasimhanmb4703 2 months ago

      @EcZachly_ Ahh! Thanks. So "yesterday" would hold 29 days of aggregated data at any point in time.
      Even then, data for deleted users would not be retained for more than 30 days, right?
      Almost forgot to say: you've been a great inspiration, and I am one of your many silent followers on LinkedIn 😀
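
The rolling 30-day behavior discussed in this thread can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual Facebook design: the `cumulate` function, the 0/1 activity flags, the user IDs, and the rule that drops users with an all-zero window are hypothetical.

```python
WINDOW_DAYS = 30  # rolling window length mentioned in the reply

def cumulate(yesterday, active_today):
    """Build today's cumulative rows from yesterday's rows plus today's activity.

    `yesterday` maps user_id -> list of 0/1 flags, most recent day first.
    `active_today` is the set of user_ids seen today.
    """
    today = {}
    for user_id in yesterday.keys() | active_today:
        history = yesterday.get(user_id, [])
        flags = [1 if user_id in active_today else 0] + history
        flags = flags[:WINDOW_DAYS]  # drop anything older than the window
        if any(flags):               # assumption: fully idle users fall out
            today[user_id] = flags
    return today

state = {}
state = cumulate(state, {"alice", "bob"})  # day 1
state = cumulate(state, {"alice"})         # day 2
print(state)
```

Each day reads only yesterday's partition and today's events, and anything older than the window is truncated away, so under these assumptions no user's history outlives 30 days.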

  • @matthiaswarlop2316
    @matthiaswarlop2316 1 month ago

    Is the next video still coming?

    • @EcZachly_
      @EcZachly_ 1 month ago

      This didn’t convert as well as I thought it would, so probably not.

  • @imafatass9123
    @imafatass9123 1 month ago +1

    Let’s go! You kick ass!

  • @iyadahmed3773
    @iyadahmed3773 2 months ago +1

    I didn't understand what backfilling is.

    • @Dmytro-kt3fr
      @Dmytro-kt3fr 2 months ago +1

      Sometimes records are first written to storage with partial state. They are built from data sources that are asynchronous by nature, or they reflect business processes such as amending or rolling back a record that already existed and fed an existing metric.
      In simple words:
      Say you have a report on a ship traveling across the ocean. Each day you get readings from some of its sensors, and you build a metric from that data.
      But one day a sensor lags, so the data is wrong and the metric is wrong. The captain emails the correct data, because the ship has analog backups. You have to update the info.
      So you backfill, or you create a new version and use a data retention policy to manage the old versions.

    • @iyadahmed3773
      @iyadahmed3773 2 months ago +1

      @Dmytro-kt3fr
      Thanks!
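
The ship example in this thread can be turned into a small Python sketch of a backfill. Everything here is hypothetical: the dates, the readings, the averaging metric, and the `backfill` helper exist only to show the shape of the operation, which is to replace one date's input and recompute only that date's output.

```python
from statistics import mean

# Hypothetical raw sensor readings, partitioned by date.
raw = {
    "2024-05-01": [20.1, 20.3, 19.9],
    "2024-05-02": [95.0, 96.2, 94.8],  # the day the sensor misbehaved
    "2024-05-03": [20.4, 20.0, 20.2],
}

def build_metric(readings):
    """Daily metric: average reading, rounded to one decimal place."""
    return round(mean(readings), 1)

metrics = {ds: build_metric(values) for ds, values in raw.items()}

def backfill(ds, corrected_readings):
    """Replace one date's raw data and recompute only that date's metric."""
    raw[ds] = corrected_readings
    metrics[ds] = build_metric(corrected_readings)

backfill("2024-05-02", [20.2, 20.1, 20.3])
print(metrics)
```

Keeping data partitioned by date is what makes this cheap: only the corrected partition is rewritten, and the other days are left untouched.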

  • @user-wk2xy2vo6w
    @user-wk2xy2vo6w 2 months ago +2

    How do I get an internship in a data engineering role?

  • @simonegalli5453
    @simonegalli5453 22 days ago +2

    So you're telling me that even in those companies you'll find people who don't want to learn? People who should have the knowledge to just keep picking up new abilities? They are not like my 45 co-worker who never liked CS stuff and struggles with SQL; they are data analysts and can't manage to do an UNNEST?!

  • @imbw80
    @imbw80 2 months ago +3

    Great video, but your eating while explaining is distracting.

  • @vook777
    @vook777 1 month ago +2

    Seriously? The 50 minute lecture is when you had to eat?

    • @EcZachly_
      @EcZachly_ 1 month ago +2

      The carbs kept me going

  • @malikmoinawan6942
    @malikmoinawan6942 2 months ago +1

    Well, it's better to make organised content so that people can stick with it.

  • @mx953
    @mx953 2 months ago +2

    the apple stole the show... but great content. Thank you @_@