You are here to make our lives simple. Thank you so much !!
Thank you Omprakash
No one can explain better than this..Thanks raja for your efforts and time.
Thanks for your comment. Glad it helps you
to-the-point explanation, thanks!
Glad it was helpful! Thanks
Best channel for Databricks
Thank you
Hats off to you sir, your explanation is next level.
Thank you, Suresh!
Excellent Explanation!!!
The best explanation I've seen.
Thank you
Say we have deptid 111 in the emp table a million times and deptid 111 in the dept table over 500k times.
During the shuffle, Spark would create 200 partitions. So deptid 111 of the emp table may split across 20 partitions and deptid 111 of the dept table may split across 10 partitions, and if the sort and merge is performed on these partitions, this would result in a partial join. How does Spark handle it internally?
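A note on the question above: Spark hash-partitions shuffle output by the join key, so every row with deptid 111, from either table, is routed to the same shuffle partition; a single key value is never split across partitions, and the join is never partial. A minimal Python sketch of the routing idea (illustrative only — Spark actually uses Murmur3 hashing, and `shuffle_partition` is a made-up name, not a Spark API):

```python
NUM_SHUFFLE_PARTITIONS = 200  # the spark.sql.shuffle.partitions default

def shuffle_partition(join_key, n=NUM_SHUFFLE_PARTITIONS):
    # Every row is routed by hash(key) % n, so identical keys from the
    # emp side and the dept side land in the SAME shuffle partition.
    return hash(join_key) % n

# deptid 111 from both tables goes to one partition, so the sort-merge
# for that key happens entirely within that partition.
assert shuffle_partition(111) == shuffle_partition(111)
```

The flip side of this guarantee is data skew: all million emp rows for the hot key 111 pile into one partition, which is why skewed keys make one task much slower than the rest.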
Very good explanation
Thank you
Excellent explanation, thank you
Glad it was helpful! Thanks Prathap
Thank you for the clear explanation
Is this the same as the Sort-Merge-Bucket (SMB) join?
Hats off
Thank you
Thank You Sir
Most welcome
Sir, I have seen there are multiple join strategies. I could find them in your playlist.
That's great
You're an awesome Spark guru
Thanks
But the third stage isn't completed at that point, right? Say there is one more filter operation on the DataFrame; it will still be in that same stage. But if the DataFrame encounters a shuffle operation like a join, there will be another stage, correct?
Yes, that's right. A new stage is created only when there is a shuffle through a wide transformation.
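The rule in the reply above can be sketched as a toy stage counter (purely illustrative — Spark derives stages from shuffle boundaries in the DAG, not from a list of operation names):

```python
# Wide (shuffle) transformations start a new stage; narrow ones
# (filter, select, map, ...) stay in the current stage.
WIDE = {"join", "groupBy", "repartition", "sortBy", "distinct"}

def count_stages(ops):
    stages = 1
    for op in ops:
        if op in WIDE:
            stages += 1
    return stages

# a filter after the join stays in the join's stage...
assert count_stages(["read", "filter", "join", "filter"]) == 2
# ...but a second shuffle (another join) adds a new stage
assert count_stages(["read", "filter", "join", "filter", "join"]) == 3
```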
@@rajasdataengineering7585 thanks Raja
Thanks, Helpful
Thanks
Is this why we use BROADCAST join? Because normal joins are expensive?
Exactly, that is the reason we use a broadcast join: to avoid an expensive sort-merge join
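A toy sketch of why the broadcast side is cheap (plain Python, not the Spark API): the small dept table is shipped whole to every task as a hash map, so the big emp table is joined in place, with no shuffle and no sort.

```python
# Small dimension table, "broadcast" to every task as a lookup dict.
dept = {111: "Engineering", 222: "Finance"}

# Large fact table, processed partition by partition where it already lives.
emp = [("alice", 111), ("bob", 222), ("carol", 111)]

# Map-side hash join: one dict lookup per emp row, no shuffle of emp.
joined = [(name, deptid, dept[deptid])
          for name, deptid in emp
          if deptid in dept]

assert ("bob", 222, "Finance") in joined
```

This is exactly why broadcast only works when one side is small: the whole dict must fit in each executor's memory.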
@@rajasdataengineering7585 One more question: how can we use broadcast if the small df can't fit in memory? Wouldn't the data spill from memory?
Hello!
Isn't 1 executor unit the same as 1 worker node unit? Or is a worker node a rack or a small cluster? Or are executors actually containers (with cores) on one worker?
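For context on the question above: a worker node is a machine, and each worker can host one or more executors (JVM processes); each executor in turn runs several task slots (cores). A hedged spark-submit sketch on YARN — the flags are standard spark-submit options, but the application file and the sizes are made up:

```shell
# Each executor is a JVM process placed on some worker node; one node
# can host several executors, depending on its memory and cores.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_app.py
```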
Good explanation, Raja. A few questions: 1) Is the number of partitions determined by the number of cores in the cluster or by the input split size (for example, 128 MB in an S3 bucket)? 2) What happens if the partition size is greater than the executor size? Does it spill to disk? Does that impact performance?
Thanks Venkat.
1. The number of partitions is determined by various factors. If the input file is in a splittable format, each core will read the data in parallel, and each core can produce one partition of 128 MB. If the input file is much bigger, each core will produce multiple partitions of 128 MB each. So the number of partitions will be a multiple of the number of cores.
2. Usually the partition size does not exceed the executor's on-heap memory. If the DataFrame (multiple partitions distributed across the cluster) exceeds the total on-heap memory, it leads to data spill, so a few partitions will be stored on the local disk of the worker node. Spilled data hurts performance, as it needs to be recalculated every time.
Hope it helps
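The arithmetic in point 1 can be sketched as follows (simplified — the actual split size in Spark SQL is governed by `spark.sql.files.maxPartitionBytes` plus an open-cost heuristic, which this toy formula ignores):

```python
import math

MB = 1024 * 1024

def estimated_partitions(file_size_bytes, split_size=128 * MB):
    # A splittable file is cut into ~128 MB input splits; each split
    # becomes one partition, read in parallel across the cores.
    return max(1, math.ceil(file_size_bytes / split_size))

# a 1 GB splittable file -> 8 partitions of ~128 MB each
assert estimated_partitions(1024 * MB) == 8
```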
@@rajasdataengineering7585 thanks raja