Apache Parquet: Parquet file internals and inspecting Parquet file structure

  • Published: Sep 20, 2024

Comments • 20

  • @srividyaus 4 years ago +3

    Best explanation of the Parquet file and columnar file formats I have come across so far. Thank you very much.

  • @markevogt 4 years ago +4

    Interesting video showing a single RowGroup...
    You present well, and clearly have a solid grasp of the Parquet file format.
    If you're interested in preparing a sequel to your video...
    ... consider showing a diagram of MULTIPLE row groups, each stored on a different disk in a different node in a cluster, so that a row group represents a "shard" (a split across the rows of the logical table), with shards-as-row-groups distributed across DIFFERENT nodes.
    Then you could explore what happens during a query like "What is the average square footage in ZIP code 60542?"
    This query can & will be PARALLELIZED into one query per disk where a portion of the larger (logical) table has been stored.
    What's COOL about Parquet is this:
    - in a ROW-based storage format, to get the sqft from a single record I have to read EACH row, FIND the sqft field, and return it.
    - therefore in a row-based "shard" containing (say) 10,000 rows across (say) 10 disks (so 1,000 rows per disk), I have to make 10,000 READS across different regions of my disks... VERY INEFFICIENT just to get a SINGLE field (sqft) :-(
    - in a COLUMN-based storage format I simply make ONE read, starting where the sqft data begins and stopping where it ends. In a SINGLE read (NOT 1,000) I have ALL the sqft values in that shard representing those rows in my larger (logical) table :-)
    - MEANWHILE, on my other (say 10) disks also containing this (logical) table, there is likewise only 1 READ per disk.
    The result?
    Instead of 10,000 reads across 10 disks just to get 10,000 measly sqft values to average...
    ... the Parquet format lets me make only 10 reads and get the same 10,000 values :-O
    Illustrate THIS in your next video ;-)
    You'll be a hero :-)
    -Mark in North Aurora IL ...

  • @flwi 4 years ago +2

    Great overview! Thanks for taking the time to record it!

  • @abhijeetzagade3349 3 years ago

    Best explanation of the columnar storage format.

  • @aniruddhnathani5518 10 months ago

    Nice video, but I don't see any row group tuning parameter directly. It is tuned via block.size itself. Is my understanding correct?

  • @nkantkumar 6 years ago +4

    Excellent talk!!

  • @charanjeetsingh1100 6 years ago +1

    Very nice. Brilliant. Thanks.

  • @sunilmali8483 6 months ago

    Hi all, I am searching for a way to load a Parquet file, but not in one go; I want to load it in parts. How can I achieve this in Java? Any implementation reference would be highly appreciated. I have gone through a few articles, but none were up to the mark.

  • @debashishkheti5010 7 years ago +2

    Nice Explanation !!

  • @meditating010 6 years ago +1

    crazy good videos .... you are godly

  • @aharonwsmith 5 years ago +1

    Good lecture. Play at 1.25x

  • @rogermenezes 6 years ago +3

    Awesome talk. Melvin, can you share your slides via SlideShare or something?

    • @melvinl5797 6 years ago +1

      Thanks! Unfortunately I don't have the slides anymore. The images used in the slides were sourced from the official Parquet site: parquet.apache.org/documentation/latest/

  • @brianz2011 5 years ago

    Why does Parquet store the data in a row layout (row groups)? Does it store the data as columns side by side?

  • @djibb.7876 7 years ago

    Great talk!
    I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path on each worker. The issue is that I am able to save it, but when I try to read it back I get these errors: - Could not read footer for file FileStatus ...... - unable to specify Schema ... Any suggestions?

  • @__dio 5 years ago

    What happens if I write a Parquet file that has 2 row groups?

  • @rambabuchamakuri1780 5 years ago

    excellent..

  • @karthikgolagani6844 7 years ago +1

    learnt new things

  • @rajatsharma1570 4 years ago

    parquet-tools is not working...