The Parquet Format and Performance Optimization Opportunities - Boudewijn Braams (Databricks)

  • Published: 27 Dec 2024

Comments • 59

  • @prabhumaganur
    @prabhumaganur 4 years ago +41

    The best representation of the Parquet file structure!! Simply awesome!!

  • @robinjamwal1
    @robinjamwal1 3 years ago +10

    Great talk, great teacher, excellent tutor! One of the best presentations I have ever viewed and listened to.

  • @ДаниилИмани
    @ДаниилИмани 13 days ago

    00:30 - data processing and analytics pipeline
    01:11 - outline of the talk
    01:29 - data sources and format (in terms of structuredness)
    03:09 - physical storage layout models
    04:12 - different workloads (OLTP and OLAP)
    06:27 - row-wise vs columnar storage
    10:22 - hybrid model
    11:01 - apache parquet format; data organization
    13:03 - encoding schemes
    16:51 - dictionary encoding
    18:39 - inspecting parquet files using the parquet-tools utility (see the sketch after this list)
    19:11 - page compression
    20:35 - predicate pushdown
    24:32 - partitioning
    25:38 - tip: avoid many small files; manual compaction
    27:41 - tip: avoid few huge files
    30:25 - Delta Lake; automated repartitioning
    33:09 - conclusion
    34:43 - Q&A
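
    A minimal pyarrow sketch of the kind of inspection the 18:39 item refers to (the talk uses the parquet-tools CLI; the file name here is an assumption):

        import pyarrow.parquet as pq

        pf = pq.ParquetFile("events.parquet")   # assumed local file
        print(pf.schema_arrow)                  # logical schema
        meta = pf.metadata
        print(meta.num_rows, meta.num_row_groups)

        # Per-column details of the first row group: encodings, codec, stats.
        rg = meta.row_group(0)
        for i in range(rg.num_columns):
            col = rg.column(i)
            print(col.path_in_schema, col.encodings, col.compression)
            if col.statistics is not None and col.statistics.has_min_max:
                print("  min/max:", col.statistics.min, col.statistics.max)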

  • @manishsingh455
    @manishsingh455 4 years ago +14

    This content explained most of the things, and it is really amazing.

  • @SunilBuge
    @SunilBuge 3 years ago +7

    Great overview to address performance issues with storage layer design 👍

  • @nagusameta366
    @nagusameta366 1 month ago

    How do I calculate the optimal numPartitions for repartition or coalesce on a DataFrame?
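
    One common heuristic, not an official formula: aim for roughly 128 MB of input data per partition. A minimal sketch, with the target size and the input-size estimate as assumptions:

        # Rule-of-thumb sketch: ~128 MB of data per partition.
        TARGET_BYTES = 128 * 1024 * 1024

        def estimated_num_partitions(total_input_bytes: int) -> int:
            return max(1, total_input_bytes // TARGET_BYTES)

        n = estimated_num_partitions(10 * 1024**3)  # 10 GB input -> 80
        # df.repartition(n) does a full shuffle and can raise or lower the
        # partition count; df.coalesce(n) only merges existing partitions,
        # so it is cheaper but can only reduce it.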

  • @tadastadux
    @tadastadux 3 years ago +2

    @databricks, what is the best practice for using or not using nested columns? For example, I have a customer struct with Age, Gender, Name, etc. attributes. Is it better to keep it as a struct or separate it into its own columns?
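
    For what it's worth, Parquet shreds each struct leaf into its own column chunk, so a nested field can be read without touching its siblings. A hypothetical PySpark sketch (path and column names assumed; nested schema pruning is on by default in Spark 3.x):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Assumed dataset with a column customer: struct<age, gender, name>.
        df = spark.read.parquet("/tmp/customers")

        # With nested schema pruning, this scans only the customer.age leaf
        # column, not the whole struct.
        df.select("customer.age").show()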

  • @lhok
    @lhok 1 year ago

    Best Parquet file presentation I've watched.

  • @higiniofuentes2551
    @higiniofuentes2551 1 year ago

    It seems the time and I/O needed to sort the data first, before it can be used, is not considered?

  • @YinghuaShen-kw5ys
    @YinghuaShen-kw5ys 7 months ago

    Great, this taught me more about Parquet. Thanks for the presentation!

  • @jefftao257
    @jefftao257 2 months ago

    Very helpful, thanks a lot for sharing.

  • @vt1454
    @vt1454 1 year ago +1

    Great presentation 👏 👌

  • @mallikarjunyadav7839
    @mallikarjunyadav7839 2 years ago +1

    Awesome video with great content and explanation. Very very useful.

  • @elalemanpaisa
    @elalemanpaisa 3 months ago

    Why shouldn't CSV sit right next to TXT? It is literally the same.

  • @Chrisratata
    @Chrisratata 2 years ago

    I haven't watched this yet, but for the sake of prioritizing when I do: how well does this topic apply to platforms and systems other than Spark?

  • @BuvanAlmighty
    @BuvanAlmighty 3 years ago +1

    Best presentation on Parquet.

  • @raviiit6415
    @raviiit6415 1 year ago

    Great talk with simple explanations.

  • @kehaarable
    @kehaarable 3 years ago +1

    Awesome video, not too many extraneous or labored points. Thank you!

  • @flaviofukabori2149
    @flaviofukabori2149 3 years ago +1

    Amazing. All concepts really well explained.

  • @AM-iz8gk
    @AM-iz8gk 2 years ago

    Impressive presentation, well-structured explanations.

  • @higiniofuentes2551
    @higiniofuentes2551 1 year ago

    Thank you for this very useful video!

  • @raghudesparado
    @raghudesparado 4 years ago +1

    Great Presentation. Thank you

  • @spacedustpi
    @spacedustpi 4 years ago +1

    Thanks for posting this presentation. Could you clarify something? How does performance improve when you compress pages, only to decompress them again to read them? I'm sure I'm not understanding something, but I'm not sure what.

    • @rescuemay
      @rescuemay 4 years ago +4

      He mentions around @19:30 that you only see a benefit when the I/O savings outweigh the cost of decompressing.

    • @SQLwithManoj
      @SQLwithManoj 4 years ago +1

      I/O is more expensive than the CPU time taken to decompress the data, which is also why a column store is faster than a row store.

    • @rajeshgupta4466
      @rajeshgupta4466 4 years ago +1

      Snappy provides good compression with low CPU overhead during compression/decompression. The real performance win comes from the reduced I/O cost when reading a column chunk's pages. The overall cost (CPU + I/O) of reading snappy-compressed data is generally lower than that of reading uncompressed data.
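
      A small pyarrow sketch of that trade-off (illustrative only; actual sizes and timings depend on the data and the storage medium):

          import os, time
          import pyarrow as pa
          import pyarrow.parquet as pq

          table = pa.table({"v": list(range(1_000_000)) * 2})
          pq.write_table(table, "/tmp/plain.parquet", compression="none")
          pq.write_table(table, "/tmp/snappy.parquet", compression="snappy")

          for path in ("/tmp/plain.parquet", "/tmp/snappy.parquet"):
              t0 = time.perf_counter()
              pq.read_table(path)
              print(path, os.path.getsize(path), "bytes,",
                    f"{time.perf_counter() - t0:.3f}s to read")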

  • @salookie8000
    @salookie8000 1 year ago

    Interesting how Parquet (columnar, analytics-focused) data can be optimized using dictionary-based compression and partitioning.
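
    Both are visible from Spark with a short sketch (paths and column names are assumptions; Parquet writers apply dictionary encoding to low-cardinality columns by default, and partitionBy lets readers skip whole directories):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(
            [("NL", 1), ("NL", 2), ("US", 3)], ["country", "value"])

        # Files land under country=NL/, country=US/, ...
        df.write.mode("overwrite").partitionBy("country").parquet("/tmp/by_country")

        # ...so this read opens only the country=NL directory.
        spark.read.parquet("/tmp/by_country").where("country = 'NL'").show()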

  • @hatemsiyala4944
    @hatemsiyala4944 2 years ago

    Great talk. Thank you!

  • @payalbhatia6927
    @payalbhatia6927 5 months ago

    Superb

  • @ashokkumarsivasankaran5428
    @ashokkumarsivasankaran5428 1 year ago

    Great! Well explained!

  • @ravann123
    @ravann123 3 years ago

    Very helpful, thank you 😊

  • @Pavi950
    @Pavi950 4 years ago +2

    Thanks for the content!

  • @AmitYadav-h9e
    @AmitYadav-h9e 1 year ago

    Just excellent 👍

  • @pavanreddy3321
    @pavanreddy3321 4 years ago

    Thanks for the great explanation.

  • @jeremygiaco
    @jeremygiaco 3 years ago +1

    How is storing JSON/XML (not Parquet) more efficient than CSV? You literally store the "column names" in each "row" in XML/JSON (at least when stored as a text file). Also, there is definitely a notion of a "record" in CSV.

    • @happywednesday6741
      @happywednesday6741 3 years ago

      Example 1. If you want to add new properties to records over time, you only need to add them to the new records (no need to backfill blanks for legacy records, for example). So think scale, and change at scale.

    • @happywednesday6741
      @happywednesday6741 3 years ago

      Example 2. You can leverage hash/dictionary data structures in programming, which find records with much better scaling; look up hash functions and big-O. Again, think scaling related to data access: hashing vs., at best, search trees.

    • @happywednesday6741
      @happywednesday6741 3 years ago

      Example 3. You can more easily partition records via a collections paradigm. Again, storage and access at scale.

    • @happywednesday6741
      @happywednesday6741 3 years ago

      Example 4. You can more easily access and operate on XML/JSON-like data from applications via APIs. Systems and interoperability at scale.

    • @jeremygiaco
      @jeremygiaco 3 years ago +1

      @happywednesday6741 I asked how it is more efficient to store it. If I have 500 million "entries" in a text file, I'm definitely storing them in a delimited format or Parquet to take advantage of said dictionaries, not JSON/XML. You can parse either into objects directly from the file, or bulk insert into a DB table. The JSON/XML formats would be 10x slower to parse/read based on sheer disk/network I/O alone, if we're talking about efficiency in processing. No one is going to load CSV into memory and start scanning row by row for data; it's going to get converted into objects or a DB anyway. My concern is when people store JSON-formatted files to disk to be read into objects later. What does that buy you?
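
      A quick back-of-envelope check of the size point, in plain Python (illustrative numbers only):

          import csv, io, json

          rows = [{"id": i, "name": f"user{i}", "age": 30} for i in range(1000)]

          # JSON Lines repeats every key name in every record...
          json_len = sum(len(json.dumps(r)) + 1 for r in rows)

          # ...while CSV stores the column names once, in the header.
          buf = io.StringIO()
          w = csv.DictWriter(buf, fieldnames=["id", "name", "age"])
          w.writeheader()
          w.writerows(rows)

          print(json_len, len(buf.getvalue()))  # JSON Lines comes out several times larger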

  • @aratithakare8016
    @aratithakare8016 3 years ago

    Too good a video. Excellent.

  • @tasak_5542
    @tasak_5542 1 year ago

    Great talk.

  • @dayserivera
    @dayserivera 1 year ago

    Great!

  • @CDALearningHub
    @CDALearningHub 3 years ago

    Thank you!

  • @maxcoteclearning
    @maxcoteclearning 2 years ago

    Thank you :)

  • @rum81
    @rum81 4 years ago +13

    Anyone who says Parquet is a columnar format has only bookish knowledge.

    • @immaculatesethu
      @immaculatesethu 3 years ago +1

      It's a mixture of both horizontal and vertical partitioning, and combines the best of both worlds.

    • @jeremygiaco
      @jeremygiaco 3 years ago

      I like the way it compresses the data into dictionaries per file. Reminds me a bit of an EAV database stored as a file.
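
      The hybrid point is easy to verify with pyarrow (toy data assumed): row groups slice the rows horizontally, and each row group stores one column chunk per column:

          import pyarrow as pa
          import pyarrow.parquet as pq

          table = pa.table({"a": list(range(10_000)), "b": ["x"] * 10_000})
          pq.write_table(table, "/tmp/hybrid.parquet", row_group_size=2_500)

          meta = pq.ParquetFile("/tmp/hybrid.parquet").metadata
          print(meta.num_row_groups)            # 4 horizontal slices
          print(meta.row_group(0).num_columns)  # 2 column chunks in each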

  • @thevijayraj34
    @thevijayraj34 3 years ago

    The bucketing explanation was not great. The rest was fantabulous.

  • @chriskeo392
    @chriskeo392 3 years ago

    Or whatever.... 😂

  • @lax976
    @lax976 1 year ago

    Worst lecture ever