coalesce vs repartition vs partitionBy in Spark | Interview Question Explained

  • Published: 25 Aug 2021
  • Hi All,
    In this video, I explain the concepts of coalesce, repartition, and partitionBy in Apache Spark.
    To become a GKCodelabs Extended plan member, check the links below and purchase the Big Data end-to-end pipeline course in your preferred language, Python or Scala.
    PySpark course available at
    courses.gkcodelabs.com/produc...
    Spark + SCALA course available at
    courses.gkcodelabs.com/produc...
    End to End pipeline Introduction Videos:
    PySpark End to End Pipeline
    • BIG DATA COMPLETE PROJ... ​
    Spark + Scala End to End Pipeline
    • BIG DATA complete PROJ... ​
    Starter Pack available at just: ₹549 (For Indian Payments) or $9 (For non-Indian payments)
    Extended Pack available at just: ₹1299 (For Indian Payments) or $19 (For non-Indian payments)
    Queries? Write to us at support@gkcodelabs.com
    Website: www.gkcodelabs.com
    In this video I have shared my day-2 experience as a Big Data Engineer, including the usual tasks, assignments, calls, and routines in my life as a Big Data engineer.

Comments • 5

  • @johnsonrajendran6194
    2 years ago

    Nice explanation!!

  • @user-lp7sb5dw7l
    7 months ago

    When you do repartition and then partitionBy, the data is already partitioned. Since the output is then split by the partitionBy column, why does the number of part files still depend on repartition()?
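The question above can be explored with a small plain-Python model. This is not Spark itself, only a sketch under one stated assumption about how a partitioned write usually lays out files: one directory per distinct value of the partition column, and inside each directory one part file per in-memory partition that holds rows for that value. The function names and sample rows are hypothetical, and settings that split large files further are ignored.

```python
from collections import defaultdict


def simulate_partitioned_write(memory_partitions, column):
    """Model the file layout of a write partitioned by `column`.

    memory_partitions: list of in-memory partitions, each a list of row dicts.
    Returns {column value: number of part files in that value's directory}.
    """
    files_per_directory = defaultdict(int)
    for rows in memory_partitions:
        # Assumption: each in-memory partition is written by one task, and
        # that task emits one part file per distinct value it holds.
        for value in {row[column] for row in rows}:
            files_per_directory[value] += 1
    return dict(files_per_directory)


def round_robin(rows, n):
    """Spread rows evenly over n partitions (a stand-in for repartition(n))."""
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions


# Hypothetical sample: 12 rows with three distinct ages.
rows = [{"id": i, "age": age} for i, age in enumerate([20, 30, 40] * 4)]

two = simulate_partitioned_write(round_robin(rows, 2), "age")
four = simulate_partitioned_write(round_robin(rows, 4), "age")
print(two)
print(four)
```

In this model the number of directories stays at three (one per age) in both runs, while the number of part files inside each directory follows the number of in-memory partitions: two files per directory with two partitions, four with four. That is why the file count would still depend on repartition(): every in-memory partition can contain rows for every age. If the rows were instead grouped so that each age lived in a single in-memory partition, the same model would give one file per directory.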

  • @srikanthk8261
    2 years ago +2

    Good explanation. I have a question. As you mentioned, partitioning by the age column creates 3 partitions because there are three age groups. Suppose I have 1000 unique IDs in a dataset and I partition by the Id column: how many partitions will it create, and on what basis? Could you please explain briefly if you have time?
    Thanks
    Srikanth kita

    • @GKCodelabs
      2 years ago +2

      Good catch 😊! I will try to answer this as simply as possible, but it comes with some conditions 😉 (distributed computing always has a lot of givens and provideds) 😜
      So for your case:
      It will be 1000 partitions (condition: you have 1000+ cores on your cluster).
      Otherwise it will be equal to your number of cores (condition: each core can handle the amount of data it is processing).
      Otherwise it can be slightly more than your number of cores, in case some cores could not process all the data given to them and processed the rest in the next cycle (task).
      Hope I was able to answer your question!

  • @MiRayalaseemaPillakai
    2 years ago

    1st view