Thanks!
Bro, your work is awesome, so please continue making Spark videos.
Thank you bro! Sure will make more videos
Awesome explanation Sir and wow super content as well
Thanks! Hope it helps
Great explanation, nice work dude!
Glad you liked it! Thanks
Great video and energy :)
Thank you
You are awesome. Thank you for the clear explanation.
Welcome
Great Work thank you
Thank you
Well done, thank you
Thank you
You are a great teacher
In the video at 16:48, I can see that there are 3 jobs, but in my case when I joined df3 and df4, Databricks shows 5 jobs. Can you please explain why it is different? Also, is it possible to know what each job does? Thank you!
Hi Sir, can you tell me why many jobs are created for each of the join queries you executed in these examples? I have understood the stages, the explain plan and the DAG, but the number-of-jobs part is not clear to me. Can you shed some light on it?
Hello Raja, is bucketing deprecated for the "delta" format?
Hi Welder, yes, bucketing is not supported for Delta tables. Z-ordering can be used as an alternative.
@@rajasdataengineering7585 I did a quick search and found Z-ordering plus OPTIMIZE. Thanks for your suggestion.
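A minimal sketch of that combination, assuming a hypothetical Delta table sales_delta and a frequently-joined column customer_id (both names are placeholders, not from the video):

```python
# OPTIMIZE compacts small files; ZORDER BY co-locates rows with similar
# customer_id values, which gives data-skipping benefits similar to bucketing.
spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")
```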
You have a subscriber.
Hi Raja, thank you for the concept. You mentioned that if a dataframe is smaller than 10MB, Spark will use a Broadcast Join by default, right? I took two dataframes with 2 rows each (so the size is in bytes) and applied a join, but it was showing a Sort Merge Join. Could you please tell me the reason?
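A sketch of how one might investigate that scenario (df1, df2 and the join key id are placeholders; spark is the session):

```python
from pyspark.sql.functions import broadcast

# The auto-broadcast threshold; the default is 10485760 bytes (10 MB),
# and a value of -1 means auto broadcast has been disabled.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Compare the physical plans with and without an explicit broadcast hint
df1.join(df2, "id").explain()
df1.join(broadcast(df2), "id").explain()
```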
Thank you Raja sir for this informative lecture.
In the demo, 10 partitions were used. Suppose we have 2 dataframes with 400 partitions, bucketed by 5, and we didn't change the default shuffle partition number, which is 200.
Does that mean that, while shuffling, the 400 partitions will be confined to 200 shuffle partitions?
I.e. each shuffle partition will have data from 2 partitions, i.e. 10 buckets, and the resultant dataframe will have 200 partitions instead of 500.
Please answer, and correct the question itself if I have confused the concept.
Thank you a lot.
Hello Sir,
I had one doubt: if I'm processing 1TB of data and my cluster has storage of 500GB and 2TB, will it always load the entire 1TB of data, or how does that work? Can you please help me here? Also, it would be great if you could make a video covering the performance aspects.
If your dataframe is bigger than the cluster memory, it will lead to data spill, which means part of the data would be stored on the local storage disk.
Hi,
You have provided beautiful insights about Databricks.
I am using the Photon accelerator in my Databricks cluster, so I am not able to understand the stages part. Please make videos on the Photon accelerator and provide insights about jobs, stages and tasks.
Sure, will create a video on the Photon engine, which plays a crucial role in performance.
How is the size of a bucket decided in a bucketed table?
How is the partition size decided in a non-bucketed table?
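For the non-bucketed case, the read-time partition size is driven by a Spark setting rather than anything stored on the table itself; a quick sketch of how to inspect it:

```python
# Maximum number of bytes packed into a single partition when reading files
# (defaults to 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
```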
Super
Thanks
When we do a join on two dataframes (non-bucketed and non-partitioned), it involves a shuffle. So during the shuffle, will it make sure that each partition has only 1 distinct key?
For example: if I join 2 dataframes on column 'A' and there are 600 distinct column 'A' values, does the shuffle create 600 partitions?
For any wide transformation (join, group by), a data shuffle occurs. During the shuffle, the default number of partitions is 200, but this parameter can be configured through Spark conf settings.
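For reference, a minimal sketch of checking and overriding that setting (the value 50 is just an example):

```python
# Number of partitions used for shuffles triggered by wide transformations (default 200)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override it for the current session
spark.conf.set("spark.sql.shuffle.partitions", 50)
```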
Thank you
Super! Can you please cover scenario-based PySpark interview questions for optimization?
Let's say we wanted to repartition our data and we have configured our partition size as 128 MB. Our total data is 1 GB and we need to repartition it into 2, while each partition size can be 128 MB. What will happen?
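A small sketch of what that looks like in practice, assuming a hypothetical ~1 GB dataframe df; note that repartition only controls the partition count, not the configured 128 MB size, which applies at file-read time:

```python
# Explicitly asking for 2 partitions overrides any size-based splitting,
# so each partition would hold roughly half of the ~1 GB of data.
df2 = df.repartition(2)
print(df2.rdd.getNumPartitions())  # 2
```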
Hi Raja sir,
Do you have any full course on full-fledged Databricks with Scala/Python?
How can we connect with you?
Please just give a video link if any reference video needs to be looked into. It really helps.
Sure Abhishek, will add reference link in description
Great explanation Sir!!! Can you please clarify the below doubts for me?
1) If we are using Spark SQL, how do we see the physical plan? (With a dataframe we can use something like df.explain, but how can I check it in Spark SQL?)
2) In this example you have mentioned the bucket count as 10. How do I determine the bucket number while I am bucketing the table?
Thanks in advance 🙂
Thank you for your comment.
1. To get the explain plan of a SQL statement, you can use a command like "EXPLAIN SELECT * FROM EMP" - the syntax is EXPLAIN followed by the query.
2. There is no magic number for deciding the number of buckets. It depends on the use case, the volume of data, the number of executors, etc. I would recommend keeping it a multiple of the number of cores in your cluster.
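A short sketch of both ways to view the plan; EMP is the example table from the reply above:

```python
# DataFrame API: print the physical plan (pass True for the parsed,
# analysed, optimised and physical plans)
df = spark.table("EMP")
df.explain()
df.explain(True)

# Spark SQL equivalent of df.explain()
spark.sql("EXPLAIN SELECT * FROM EMP").show(truncate=False)
```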
How do we decide on the number of buckets that we need to set? For example: bucketBy(id, X) -- X can be any number, right? How can we decide on the number of buckets that should be passed?
It should be a reasonable number. If given a large number with a large dataset, it will create numerous small files, which will eventually become an overhead for Spark to process. At the same time, the number of buckets should also not be too small, or else Spark will create extremely large parquet files. The most effective file size for each parquet file processed by Spark is 1 GB.
Hope this makes sense.
Good explanation Raja. 1) How does Spark know there is no need for a shuffle and sort? How does Spark collocate the data from the two datasets on the same executor? 2) Suppose 50 partitions are there and we want 50 buckets, so a total of 2500 files will be created. Is there any way we can create one file per bucket?
Thanks Venkat.
1. When we create buckets, a hash key is associated with each bucket. When the hash keys match across the partition files of the 2 dataframes, Spark knows that a shuffle and sort are not needed for that operation, because the data is already bucketed.
2. If we want to create one file per bucket, partitioning should be avoided in those use cases, or just one partition should be created using repartition or coalesce.
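A sketch of one common way to get close to one file per bucket, assuming a hypothetical dataframe df bucketed on id into 10 buckets: repartitioning on the bucket column into the same number of partitions as buckets keeps each bucket's data within a single task.

```python
# Repartition on the bucket column first so each task writes (approximately)
# one bucket, giving roughly one file per bucket instead of partitions x buckets files.
(df.repartition(10, "id")
   .write
   .bucketBy(10, "id")
   .sortBy("id")
   .mode("overwrite")
   .saveAsTable("bucketed_demo_table"))
```

Note that bucketBy has to be used with saveAsTable, since the bucketing metadata is stored in the table catalog.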
@@rajasdataengineering7585 Are we able to see the hash key in the part files?
Hi Raja, if you get some time, can you please explain the difference between normal functions and UDFs (in some cases I have observed that we use a normal Python function in our code, and in other cases it is registered as a UDF) and when to use which?
Sure Omprakash, will do the video on this request
@@rajasdataengineering7585 You are really awesome !!
Thank you
Nice video Sir. I got one interview question: if I have 1000s of datasets and I don't know the sizes of all those datasets, and I have to perform a join, how will I decide whether to go for a broadcast join or not?
By default, if a dataframe is smaller than 10MB, Spark will do a broadcast join. This parameter is configurable. Apart from this, we can enforce a broadcast explicitly in our code. For that we should ensure that the dataframe fits into driver memory (as it passes through the driver) and does not consume a major part of executor memory.
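A minimal sketch of both options mentioned in the reply (df_large, df_small and the join key id are placeholders):

```python
from pyspark.sql.functions import broadcast

# Adjust the auto-broadcast threshold (value in bytes; -1 disables auto broadcast)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Or enforce the broadcast explicitly, regardless of the threshold
result = df_large.join(broadcast(df_small), "id")
```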
Hello sir.
Please make a video on Databricks connectivity with Azure Event Hubs, with basic transformations. Hope you will make it. Thank you.
Sure Aman, I will do that
@@rajasdataengineering7585 Thank you so much sir 😊. We are waiting...
Hi sir, I was asked in one interview: suppose you have a 10GB input file, and your cluster should automatically allocate the number of nodes depending on the input file. I said autoscaling, sir, but they said there is another option. Can you let me know what that is?
repartition ?
Hi bhai, can you please share the notes for the videos? It would be great.