Excellent, how do you use this in practice? Check the cardinality of each column and then choose an encoding before saving to Parquet? And if a schema is defined in Spark for each column before saving, are we effectively doing the same thing?
I'm not sure what Spark does actually - I'd have to check. I still find it kinda surprising that the Parquet writers don't just optimise everything for you - that would make more sense to me! I need to see how much saving on space impacts the query side. In theory there should be a trade-off between the two, but I'm not sure how big it is.
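If you want to experiment with the check-cardinality-then-choose-encoding idea yourself, here's a minimal sketch using pyarrow (the column names and data are made up for illustration - it just counts distinct values and dictionary-encodes the low-cardinality column):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table - "city" is low cardinality, "user_id" is high cardinality
table = pa.table({
    "city": ["London", "London", "Paris", "London"],
    "user_id": [101, 202, 303, 404],
})

# Check the cardinality of each column
for name in table.column_names:
    distinct = len(table[name].unique())
    print(f"{name}: {distinct} distinct values out of {table.num_rows} rows")

# Dictionary-encode only the low-cardinality column when writing
pq.write_table(table, "example.parquet", use_dictionary=["city"])
```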
Amazing videos! Thanks so much for doing these! One question: in the video we saw that 27 bits was enough to represent the max value in our data, but I see you used 32 bits for the dictionary. Was there a reason not to use 27 bits if that was already enough to hold the maximum value?
We have to pick one of Parquet's supported physical types - in this case the smallest one that fits our integers, whose maximum value needs 27 bits, is the int32 type.
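To make that concrete, here's a tiny sketch (the max value is a made-up example; the point is that Parquet only offers int32 and int64 as physical integer widths, so a 27-bit value lands in int32):

```python
max_value = 100_000_000        # hypothetical maximum value in the column
bits_needed = max_value.bit_length()  # 27 for this value

# Parquet's physical integer types are only INT32 and INT64,
# so we take the smallest one that can hold a 27-bit value
physical_type = "INT32" if bits_needed <= 32 else "INT64"
print(bits_needed, physical_type)  # 27 INT32
```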
Thanks for the nice video! This makes sense when you have one or a few massive files, but if you've got a boatload of them, is there a way to make the computer apply these rules of thumb for you (so it scales as a process rather than having a person spend five minutes per file thousands of times over)?
Which bit in particular do you mean, or just in general? I reckon you could probably automate everything I did in this video to retrospectively look at a bunch of existing Parquet files and see if there's a better way to store things. Definitely wouldn't recommend doing it manually!
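As a starting point, here's a rough sketch of that kind of automated audit with pyarrow - the directory path is just an example, point it at your own files:

```python
import glob
import pyarrow.parquet as pq

# Walk a batch of parquet files and report how each column chunk is stored
for path in glob.glob("data/*.parquet"):  # example path
    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        for i in range(meta.num_columns):
            col = meta.row_group(rg).column(i)
            ratio = col.total_compressed_size / col.total_uncompressed_size
            print(path, col.path_in_schema, col.encodings, f"ratio={ratio:.2f}")
```

Columns with a poor compression ratio or an unexpected encoding are the ones worth rewriting with different settings.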
Fantastic breakdown, thanks Mark
Glad you liked it!
Awesome, hard to find good content on this topic!
Amazing video Mark, your explanation and visualisation of everything was so nice!
Thanks! That's very kind of you :)