How are integers encoded in Apache Parquet?

  • Published: 17 Oct 2024

Comments • 11

  • @danieleden1856 1 year ago +2

    Fantastic breakdown, thanks Mark

  • @andreasnordbass 9 months ago +1

    Awesome, hard to find good content on this topic!

  • @dominikseljan3043 9 months ago

    Amazing video Mark, your explanation and visualisation of everything was so nice!

  • @pawarbi4675 1 year ago

    Excellent! How do you use this in practice? Do you check the cardinality of each column and then choose an encoding before saving the Parquet file? If the schema is defined in Spark for each column before saving, are we effectively doing the same thing?

    • @learndatawithmark 1 year ago

      I'm not sure what Spark does actually - I'd have to check. I still find it kinda surprising that the Parquet writers don't just optimise everything for you - that would make more sense to me!
      I need to see how much saving on space impacts the query side. In theory there should be a trade-off between the two, but I'm not sure how big it is.

  • @EugenioG2021 1 year ago

    Amazing videos! Thanks so much for doing these!
    One question: in the video we saw that 27 bits were enough to represent the max value in our data, but I see you used 32 bits for the dictionary. Was there a reason not to use 27 if that was already enough to hold the maximum value?

    • @learndatawithmark 1 year ago +1

      We have to choose one of Parquet's supported types. There is no 27-bit type, so the smallest one that fits integers whose maximum value needs 27 bits is int32.
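The reasoning in this reply can be sketched in a few lines of Python. This is only an illustration, not the code from the video: the function names `bits_needed` and `smallest_parquet_int_type` are made up here, and the sample maximum is simply the largest value that fits in 27 bits, not the actual maximum from the video's dataset. It assumes non-negative values stored in Parquet's signed INT32/INT64 physical types.

```python
def bits_needed(max_value: int) -> int:
    """Bits required to represent a non-negative integer (at least 1)."""
    return max(1, max_value.bit_length())


def smallest_parquet_int_type(max_value: int) -> str:
    """Pick the narrowest Parquet physical integer type that holds max_value.

    INT32 and INT64 are signed, so one extra bit is reserved for the sign.
    """
    bits = bits_needed(max_value) + 1
    if bits <= 32:
        return "INT32"
    if bits <= 64:
        return "INT64"
    raise ValueError("value does not fit in a 64-bit signed integer")


# Hypothetical maximum: the largest value representable in 27 bits.
sample_max = 2**27 - 1
print(bits_needed(sample_max), smallest_parquet_int_type(sample_max))
```

The point is that the bit count only tells you which supported type to pick; the stored width is that of the type, not of the value.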

  • @nmstoker 1 year ago

    Thanks for the nice video.
    This makes sense where you have one or a few massive files, but if you've got a boatload of such files, is there a way to make the computer apply rules of thumb for you (so it scales as a process rather than having a person spend five mins per file thousands of times over)?

    • @learndatawithmark 1 year ago

      Which bit in particular do you mean, or do you mean in general? I reckon you could probably automate everything that I did in this video to retrospectively look at a bunch of existing Parquet files and see if there's a better way to store things.
      Definitely wouldn't recommend doing it manually!
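As a rough idea of what such automated rules of thumb might look like, here is a minimal Python sketch for a single integer column. Everything in it is an assumption for illustration: the function name `suggest_encoding`, the 10% distinct-value threshold, and the delta check are arbitrary heuristics, not what the video, Spark, or any Parquet writer actually does. Only the encoding names (`PLAIN`, `RLE_DICTIONARY`, `DELTA_BINARY_PACKED`) come from the Parquet format.

```python
def suggest_encoding(values, dict_ratio=0.1):
    """Suggest a Parquet encoding for a list of ints (hypothetical heuristic).

    dict_ratio is an arbitrary cut-off: at or below this share of distinct
    values, a dictionary is assumed to pay off.
    """
    if not values:
        return "PLAIN"

    # Low cardinality: few distinct values relative to the row count.
    if len(set(values)) / len(values) <= dict_ratio:
        return "RLE_DICTIONARY"

    # Neighbouring values close together: deltas need fewer bits than values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if deltas:
        delta_bits = max(abs(d) for d in deltas).bit_length()
        value_bits = max(abs(v) for v in values).bit_length()
        if delta_bits < value_bits:
            return "DELTA_BINARY_PACKED"

    return "PLAIN"


for column in ([1, 2, 3] * 100, list(range(1_000_000, 1_000_500))):
    print(suggest_encoding(column))
```

Running something like this over a sample of each column in each file would give a first guess; actually rewriting the files with the suggested encoding and comparing sizes would still be the real test.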