Apache Spark | Spark Scenario Based Question | Data Skewed or Not ? | Count of Each Partition in DF

  • Published: 7 Nov 2024

Comments • 33

  • @sanyoge389
    @sanyoge389 3 months ago

    Excellent series. Thank you very much.

  • @dattaningole8063
    @dattaningole8063 1 year ago

    Very nice. A great programmatic way to find out whether data is skewed or not.

  • @rishigc
    @rishigc 2 years ago +1

    Very nice and concise video. Do you have the video where you show how to resolve the skew, as you mentioned at the end?

  • @ashwinc9867
    @ashwinc9867 3 years ago

    When can I expect the video on resolving the data-skewness issue to be uploaded? Waiting.

  • @WolfmaninKannada
    @WolfmaninKannada 6 months ago

    Brother, amazing 👍😍

  • @Gamer_Dooby
    @Gamer_Dooby 3 years ago

    Good content, explained well. Thanks much.
    Please post the continuation.

  • @sasim6339
    @sasim6339 3 years ago +1

    Hi bro, can you explain dynamic memory allocation in the spark-submit command?
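
    For reference: this presumably refers to Spark's dynamic resource allocation. A minimal sketch of the relevant settings, applied here on a SparkSession (the same keys can be passed to spark-submit as --conf flags); the app name and executor counts are illustrative assumptions:

    from pyspark.sql import SparkSession

    # Dynamic allocation lets Spark add and remove executors based on load.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-demo")                    # illustrative name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "1")   # illustrative
        .config("spark.dynamicAllocation.maxExecutors", "10")  # illustrative
        .config("spark.shuffle.service.enabled", "true")       # needed on YARN
        .getOrCreate()
    )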

  • @rupeshdeoria1
    @rupeshdeoria1 3 years ago

    We can check this with a group by, so why are we using partitions? I am new to Spark, please explain.
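
    For reference: a groupBy on the column shows key-level skew, while grouping on spark_partition_id() shows how rows are physically spread across partitions. A minimal PySpark sketch, assuming a DataFrame df with the video's "Card_Category" column:

    from pyspark.sql.functions import spark_partition_id

    # Key-level view: how many rows each category has.
    df.groupBy("Card_Category").count().show()

    # Partition-level view: how many rows landed in each physical partition.
    df.groupBy(spark_partition_id().alias("partition_id")).count().show()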

  • @cswanghan
    @cswanghan 2 years ago

    Clearly demonstrates how to identify/detect column data skewness.

  • @nikhilmeghnani3487
    @nikhilmeghnani3487 3 years ago +1

    Great session. Could you please share the notebook or a link to learntospark?

    • @AzarudeenShahul
      @AzarudeenShahul 3 years ago

      Thanks, I haven't published the notebook; once done, I will share the link in the description.

  • @soumyakantarath5078
    @soumyakantarath5078 3 years ago

    I have been asked the same question 😊 how do you find out there is a data-skew problem; I have asked this question to lots of people and nobody was able to answer it. I have one supplementary question: let's say I am not using partitionBy on that DF. Will a data-skew problem arise, and if yes, how will we find out?
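
    For reference: skew can appear even without an explicit repartition, since the initial partitions follow the input splits. A minimal PySpark sketch of the per-partition count check from the video; the file path is an illustrative assumption:

    from pyspark.sql.functions import spark_partition_id, asc

    # Partitions here come from the input splits, not from any repartition call.
    df = spark.read.csv("/path/to/cards.csv", header=True)  # illustrative path

    # Count rows per physical partition; a few oversized partitions indicate skew.
    df.groupBy(spark_partition_id().alias("partition_id")).count().orderBy(asc("count")).show()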

  • @trilokinathji31
    @trilokinathji31 2 years ago +1

    in_data = df1.repartition("Card_Category")
    in_data.rdd.getNumPartitions()
    The answer is 1 on a Databricks cluster.
    It should be 200 as shown in this video, or 4.

    • @AzarudeenShahul
      @AzarudeenShahul 2 years ago

      Yes, in Databricks AQE (Adaptive Query Execution) is enabled by default, because of which it gives 1. Try disabling AQE and check.

    • @trilokinathji31
      @trilokinathji31 2 years ago

      @@AzarudeenShahul: How to do that?
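
      For reference: AQE can be disabled at the session level via the spark.sql.adaptive.enabled config key; a minimal PySpark sketch:

      # Disable Adaptive Query Execution for the current session so the
      # shuffle partitions are not coalesced after the repartition.
      spark.conf.set("spark.sql.adaptive.enabled", "false")

      in_data = df1.repartition("Card_Category")
      in_data.rdd.getNumPartitions()  # now spark.sql.shuffle.partitions (default 200)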

  • @tanyasaxena7968
    @tanyasaxena7968 3 years ago

    Thanks for the very nice video.

  • @priyankas6354
    @priyankas6354 3 years ago +1

    Nice explanation of data skewness. Could you please explain how we can achieve the same using Scala?

    • @AzarudeenShahul
      @AzarudeenShahul 3 years ago +3

      Thanks for your support :)
      There is not much syntactical change; you can try the code below:
      import org.apache.spark.sql.functions.{spark_partition_id, asc, desc}
      df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()

    • @priyankas6354
      @priyankas6354 3 years ago

      @@AzarudeenShahul Thank you

  • @sanjeev5149
    @sanjeev5149 3 years ago +1

    Bro, you are awesome

  • @mohamedbilal7011
    @mohamedbilal7011 3 years ago

    Good one

  • @sravankumar1767
    @sravankumar1767 2 years ago

    Nice explanation bro 👍

  • @subramanyamsibbala
    @subramanyamsibbala 3 years ago

    Do you have sample code for this? Kindly share.

  • @ashwinc9867
    @ashwinc9867 3 years ago

    Can you share Scala code for the same?

    • @Tech-Nature-IND
      @Tech-Nature-IND 2 years ago

      There is not much syntactical change; you can try the code below:
      import org.apache.spark.sql.functions.{spark_partition_id, asc, desc}
      df.groupBy(spark_partition_id()).count().orderBy(asc("count")).show()

  • @rupeshdeoria1
    @rupeshdeoria1 3 years ago

    Hi sir, I don't understand: it shows 200 partitions but you say it is 4 partitions, even though we did not do repartition(4). So how is it 4 partitions?

    • @rupeshdeoria1
      @rupeshdeoria1 3 years ago +1

      Sorry, my bad: there are only 4 categories of card, so it's 4 partitions when applying repartition("Card_Category").

    • @ashwinc9867
      @ashwinc9867 3 years ago

      @@rupeshdeoria1 rdd.getNumPartitions is giving 200!! Why?
      We should get 4, right, for df.repartition("Card_Category")?

    • @rupeshdeoria1
      @rupeshdeoria1 3 years ago

      @@ashwinc9867 Yes bro, I realise now it should show 4 partitions, but it shows 200. I am not clear on this.

    • @ashwinc9867
      @ashwinc9867 3 years ago

      Ya... if you understand why it is showing 200, let me know.

    • @srinuch9531
      @srinuch9531 3 years ago +2

      Please go through the following blog for a better understanding of repartitioning:
      kontext.tech/column/spark/296/data-partitioning-in-spark-pyspark-in-depth-walkthrough
      In Spark, repartitioning on a column creates 200 partitions by default (spark.sql.shuffle.partitions), even though the data may sit in only 2 or 3 of them; that's why getNumPartitions returns 200 for a specific column.
      Please note he did the repartition on the whole dataset, not on an individual column, so it varies.
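
      For reference: a minimal PySpark sketch of this behaviour, assuming df has the video's "Card_Category" column with 4 distinct values and AQE is disabled:

      from pyspark.sql.functions import spark_partition_id

      # Hash repartitioning on a column creates spark.sql.shuffle.partitions
      # partitions (default 200); only a few of them actually hold rows.
      in_data = df.repartition("Card_Category")
      in_data.rdd.getNumPartitions()  # 200

      # Count only the non-empty partitions: at most 4, one per distinct category
      # (fewer if two categories hash into the same partition).
      in_data.groupBy(spark_partition_id()).count().show()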

  • @asyakatanani8181
    @asyakatanani8181 2 years ago

    Where is the video on how to handle the skewness?