Excellent series. Thank you very much.
Very nice. A great programmatic way to find out whether the data is skewed or not.
Very nice and concise video. Do you have the video where you show how to resolve the skew, as you mentioned at the end?
When can I expect the video on resolving the data skewness issue to be uploaded? Waiting.
Brother amazing👍😍
Good content, explained well. Thanks much!
Please post the continuation.
Hi bro, can you explain dynamic memory allocation in the spark-submit command?
We can check this with a groupBy, so why are we using partitioning? I am new to Spark, please explain.
Clearly demonstrates how to identify/detect column data skewness.
Great session. Could you please share the notebook or a link to learntospark?
Thanks, I haven't published the notebook yet. Once done, I will share the link in the description.
I have been asked the same question 😊 "How do you find out whether there is a data skew problem?" I have asked this to lots of people and nobody was able to answer it. I have one supplementary question: if I am not using partitionBy on that DF, will a data skew problem still arise, and if yes, how do we find out?
in_data=df1.repartition("Card_Category")
in_data.rdd.getNumPartitions()
The answer is 1 on a Databricks cluster.
It should be 200, as shown in this video, or 4.
Yes, Databricks enables AQE (Adaptive Query Execution) by default, which is why it gives 1. Try disabling AQE and check.
@@AzarudeenShahul : How to do that?
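For reference, AQE is controlled by a standard Spark SQL config. A minimal sketch (assuming a Spark 3.x / Databricks notebook where spark is the SparkSession and df1 is the DataFrame from the comment above):

# Turn off Adaptive Query Execution so Spark does not coalesce
# the shuffle partitions produced by repartition() at runtime
spark.conf.set("spark.sql.adaptive.enabled", "false")

in_data = df1.repartition("Card_Category")
# With AQE off, this reports spark.sql.shuffle.partitions (200 by default)
# instead of the AQE-coalesced value of 1
print(in_data.rdd.getNumPartitions())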
Thanks for the very nice video.
Nice explanation of the data skewness. Could you please explain how we can achieve the same using Scala?
Thanks for your support :)
There is not much syntactical change. You can try the code below:
import org.apache.spark.sql.functions.{spark_partition_id, asc}
// Count rows per partition and sort ascending; a few very large counts indicate skew
df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()
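For anyone following along in PySpark, an equivalent sketch (assuming a DataFrame named df) would be:

from pyspark.sql.functions import spark_partition_id, asc
# Count rows per physical partition; one count far larger than the rest signals skew
df.groupBy(spark_partition_id().alias("partition_id")).count().orderBy(asc("count")).show()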
@@AzarudeenShahul Thank you
Bro, you are awesome
Good one
Nice explanation bro 👍
Thanks for your support 🙂
Do you have sample code for this? Kindly share.
Can you share the Scala code for the same?
There is not much syntactical change. You can try the code below:
import org.apache.spark.sql.functions.{spark_partition_id, asc}
// Count rows per partition and sort ascending; a few very large counts indicate skew
df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()
Hi sir, I don't understand: it shows 200 partitions, but you say it is 4 partitions, even though we did not do repartition(4). So how is it 4 partitions?
Sorry, my bad, there are only 4 categories of card, so that's 4 partitions when applying repartition("Card_Category").
@@rupeshdeoria1 rdd.getNumPartitions is giving 200!! Why?
We should get 4, right, for df.repartition("Card_Category")?
@@ashwinc9867 Yes bro, I realise now it should show 4 partitions, but it shows 200. I am not clear on this.
Ya... if you understand why it is showing 200... let me know.
Please go through the following blog for a better understanding of repartitioning:
kontext.tech/column/spark/296/data-partitioning-in-spark-pyspark-in-depth-walkthrough
By default Spark creates 200 partitions even though the data lands in only 2 or 3 of them; that's why we see 200 from getNumPartitions when repartitioning on a specific column.
Please note that in the video the repartition was done on the whole dataset, not on an individual column, so the result varies.
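To make the 200 concrete: repartition("Card_Category") hashes rows into spark.sql.shuffle.partitions buckets (200 by default), so most buckets stay empty when the column has only a few distinct values. A minimal sketch (assuming AQE is disabled and df1 has 4 card categories):

print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200' by default
print(df1.repartition("Card_Category").rdd.getNumPartitions())  # 200, most of them empty
# Lower the shuffle-partition count to match the expected number of distinct values
spark.conf.set("spark.sql.shuffle.partitions", "4")
print(df1.repartition("Card_Category").rdd.getNumPartitions())  # 4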
Where is the video that explains how to handle the skewness?