Excellent series. Thank you very much.
Very nice. A great programmatic way to find out whether the data is skewed or not.
Very nice and concise video. Do you have the video where you show how to resolve the skew, as you mentioned at the end?
When can I expect the video on resolving the data skewness issue to be uploaded? Waiting.
Brother amazing👍😍
Good content, explained well. Thanks much!
Please post the continuation.
Hi bro, can you explain dynamic memory allocation in the spark-submit command?
We can check this with a groupBy, so why are we using partitioning? I am new to Spark, please explain.
Clearly demonstrates how to identify/detect column data skewness.
Great session. Could you please share the notebook or a link to learntospark?
Thanks, I haven't published the notebook yet. Once done, I will share the link in the description.
I have been asked the same question 😊 "How do you find out whether there is a data skew problem?" I have asked this to lots of people and nobody was able to answer it. I have one supplementary question: if I am not using partitionBy on that DF, will a data skew problem still arise, and if yes, how do we find out?
in_data=df1.repartition("Card_Category")
in_data.rdd.getNumPartitions()
The answer is 1 on a Databricks cluster.
It should be 200, as shown in this video, or 4.
Yes, Databricks enables AQE (Adaptive Query Execution) by default, which is why it gives 1. Try disabling AQE and check.
@@AzarudeenShahul : How to do that?
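For reference, AQE is controlled by a standard Spark SQL config. A minimal sketch (assuming a Spark 3.x / Databricks notebook where spark is the SparkSession and df1 is the DataFrame from the comment above):

# Turn off Adaptive Query Execution so Spark does not coalesce
# the shuffle partitions produced by repartition() at runtime
spark.conf.set("spark.sql.adaptive.enabled", "false")

in_data = df1.repartition("Card_Category")
# With AQE off, this reports spark.sql.shuffle.partitions (200 by default)
# instead of the AQE-coalesced value of 1
print(in_data.rdd.getNumPartitions())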
Thanks for the very nice video.
Nice explanation of the data skewness. Could you please explain how we can achieve the same using Scala?
Thanks for your support :)
There is not much syntactical change. You can try the code below:
import org.apache.spark.sql.functions.{spark_partition_id, asc}
// Count rows per partition and sort ascending; a few very large counts indicate skew
df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()
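For anyone following along in PySpark, an equivalent sketch (assuming a DataFrame named df) would be:

from pyspark.sql.functions import spark_partition_id, asc
# Count rows per physical partition; one count far larger than the rest signals skew
df.groupBy(spark_partition_id().alias("partition_id")).count().orderBy(asc("count")).show()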
@@AzarudeenShahul Thank you
Bro, you are awesome
Good one
Nice explanation bro 👍
Thanks for your support 🙂
Do you have sample code for this? Kindly share.
Can you share the Scala code for the same?
There is not much syntactical change. You can try the code below:
import org.apache.spark.sql.functions.{spark_partition_id, asc}
// Count rows per partition and sort ascending; a few very large counts indicate skew
df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()
Hi sir, I don't understand: it shows 200 partitions, but you say it is 4 partitions, even though we did not do repartition(4). So how is it 4 partitions?
Sorry, my bad, there are only 4 categories of card, so that's 4 partitions when applying repartition("Card_Category").
@@rupeshdeoria1 rdd.getNumPartitions is giving 200!! Why?
We should get 4, right, for df.repartition("Card_Category")?
@@ashwinc9867 Yes bro, I realise now it should show 4 partitions, but it shows 200. I am not clear on this.
Ya... if you understand why it is showing 200... let me know.
Please go through the following blog for a better understanding of repartitioning:
kontext.tech/column/spark/296/data-partitioning-in-spark-pyspark-in-depth-walkthrough
By default Spark creates 200 partitions even though the data lands in only 2 or 3 of them; that's why we see 200 from getNumPartitions when repartitioning on a specific column.
Please note that in the video the repartition was done on the whole dataset, not on an individual column, so the result varies.
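To make the 200 concrete: repartition("Card_Category") hashes rows into spark.sql.shuffle.partitions buckets (200 by default), so most buckets stay empty when the column has only a few distinct values. A minimal sketch (assuming AQE is disabled and df1 has 4 card categories):

print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200' by default
print(df1.repartition("Card_Category").rdd.getNumPartitions())  # 200, most of them empty
# Lower the shuffle-partition count to match the expected number of distinct values
spark.conf.set("spark.sql.shuffle.partitions", "4")
print(df1.repartition("Card_Category").rdd.getNumPartitions())  # 4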
Where is the video that explains how to handle the skewness?