Excellent, how do you use this in practice? Check the cardinality of each column and then choose an encoding before saving to Parquet? And if a schema is defined in Spark for each column before saving, are we effectively doing the same thing?
I'm not sure what Spark does actually - I'd have to check. I still find it kinda surprising that the Parquet writers don't just optimise everything for you - that would make more sense to me! I need to see how much saving on space impacts the query side. In theory there should be a trade-off between the two, but I'm not sure how big it is.
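If you want to experiment with the check-cardinality-then-choose-encoding idea yourself, here's a minimal sketch using pyarrow (the column names and data are made up for illustration - it just counts distinct values and dictionary-encodes the low-cardinality column):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table - "city" is low cardinality, "user_id" is high cardinality
table = pa.table({
    "city": ["London", "London", "Paris", "London"],
    "user_id": [101, 202, 303, 404],
})

# Check the cardinality of each column
for name in table.column_names:
    distinct = len(table[name].unique())
    print(f"{name}: {distinct} distinct values out of {table.num_rows} rows")

# Dictionary-encode only the low-cardinality column when writing
pq.write_table(table, "example.parquet", use_dictionary=["city"])
```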
Amazing videos! Thanks so much for doing these! One question: in the video we saw that 27 bits was enough to represent the max value in our data, but I see you used 32 bits for the dictionary. Was there a reason not to use 27 bits if that was already enough to hold the maximum value?
We have to pick one of Parquet's supported physical types - in this case the smallest one that fits our integers, whose maximum value needs 27 bits, is the int32 type.
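To make that concrete, here's a tiny sketch (the max value is a made-up example; the point is that Parquet only offers int32 and int64 as physical integer widths, so a 27-bit value lands in int32):

```python
max_value = 100_000_000        # hypothetical maximum value in the column
bits_needed = max_value.bit_length()  # 27 for this value

# Parquet's physical integer types are only INT32 and INT64,
# so we take the smallest one that can hold a 27-bit value
physical_type = "INT32" if bits_needed <= 32 else "INT64"
print(bits_needed, physical_type)  # 27 INT32
```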
Thanks for the nice video! This makes sense when you have one or a few massive files, but if you've got a boatload of them, is there a way to make the computer apply these rules of thumb for you (so it scales as a process rather than having a person spend five minutes per file thousands of times over)?
Which bit in particular do you mean, or just in general? I reckon you could probably automate everything I did in this video to retrospectively look at a bunch of existing Parquet files and see if there's a better way to store things. Definitely wouldn't recommend doing it manually!
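As a starting point, here's a rough sketch of that kind of automated audit with pyarrow - the directory path is just an example, point it at your own files:

```python
import glob
import pyarrow.parquet as pq

# Walk a batch of parquet files and report how each column chunk is stored
for path in glob.glob("data/*.parquet"):  # example path
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        for i in range(meta.num_columns):
            col = meta.row_group(rg).column(i)
            ratio = col.total_compressed_size / col.total_uncompressed_size
            print(path, col.path_in_schema, col.encodings, f"ratio={ratio:.2f}")
```

Columns with a poor compression ratio or an unexpected encoding are the ones worth rewriting with different settings.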
Fantastic breakdown, thanks Mark
Glad you liked it!
Awesome, hard to find good content on this topic!
Amazing video Mark, your explanation and visualisation of everything was so nice!
Thanks! That's very kind of you :)