22 Optimize Joins in Spark & Understand Bucketing for Faster Joins | Sort Merge Join | Broadcast Join

  • Published: 1 Jan 2025

Comments • 52

  • @NileshPatil-z4w
    @NileshPatil-z4w 9 months ago +5

    Very nice, so far the best video for beginners on joins

  • @sureshraina321
    @sureshraina321 1 year ago

    Most awaited video 😊
    Thank you

  • @evgeniy7069
    @evgeniy7069 8 days ago +1

    Thanks a lot!
    Does bucketing work with Hive?
    How should bucketing be done if I need to join on several columns?

    • @easewithdata
      @easewithdata  7 days ago

      Absolutely, bucketing works with Hive. If you bucket on more than one column, the hashing is done on the combination of those columns.
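
      For reference, a minimal sketch of bucketing on more than one column (the SparkSession setup, input path, table and column names are hypothetical):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("multi-column-bucketing").enableHiveSupport().getOrCreate()

      orders = spark.read.parquet("/data/orders")             # hypothetical input

      (orders.write
          .format("parquet")
          .bucketBy(16, "customer_id", "order_date")          # bucket hash is computed over the combination of both columns
          .sortBy("customer_id")
          .mode("overwrite")
          .saveAsTable("orders_bucketed"))                    # bucketed data must be saved as a table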

  • @chetanphalak7192
    @chetanphalak7192 9 months ago

    Amazingly explained

  • @Basket-hb5jc
    @Basket-hb5jc 1 month ago

    Great work

    • @easewithdata
      @easewithdata  1 month ago

      Thanks 👍 Please make sure to share with your network over LinkedIn ❤️

  • @anuragdwivedi1804
    @anuragdwivedi1804 4 months ago

    truly an amazing video

    • @easewithdata
      @easewithdata  4 months ago

      Thank you 👍 Please make sure to share with your network over LinkedIn 🙂

  • @dasaratimadanagopalan-rf9ow
    @dasaratimadanagopalan-rf9ow 21 days ago

    Thanks so much for the video. I have a follow-up question for you: can bucketing be used on high-cardinality columns? Thanks in advance

  • @prathamesh_a_k
    @prathamesh_a_k 7 months ago

    Nice explanation

    • @easewithdata
      @easewithdata  7 months ago

      Thanks, please make sure to share with your network on LinkedIn ❤️

  • @adulterrier
    @adulterrier 2 months ago

    First of all, a big kudos!
    Fun fact: I added up the times on my cluster. The two bucket writes took 7 s and 11 s, the unbucketed join took 40 s, and the bucketed one took 15 s. So 7 + 11 + 15 = 33 is less than 40. It looks like it pays off to bucket the data first, right?

    • @easewithdata
      @easewithdata  2 months ago

      Awesome observation 👏 Yes, if your data is already bucketed on the required join keys, you will see significant improvements in joins. And it seems you already figured that out with your experiment 👍
      In case you like my content, please make sure to share it with your network over LinkedIn 🙂
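
      For reference, a minimal sketch of that setup (assuming an active SparkSession named spark with Hive support; paths, table and column names are hypothetical). Both sides are written bucketed on the join key, so the later sort merge join needs no shuffle:

      orders = spark.read.parquet("/data/orders")
      customers = spark.read.parquet("/data/customers")

      orders.write.bucketBy(16, "customer_id").mode("overwrite").saveAsTable("orders_b")
      customers.write.bucketBy(16, "customer_id").mode("overwrite").saveAsTable("customers_b")

      joined = spark.table("orders_b").join(spark.table("customers_b"), on="customer_id")
      joined.explain()   # expect SortMergeJoin with no Exchange on either side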

  • @Aravind-gz3gx
    @Aravind-gz3gx 9 months ago

    At 23:03, only 4 tasks are shown here; usually it would come up with 16 tasks due to the actual cluster config, but only 4 tasks are used because the data was bucketed before reading. Is that correct?

    • @easewithdata
      @easewithdata  9 months ago

      Yes, bucketing restricts the number of tasks to avoid shuffling, so it's important to decide the number of buckets carefully.
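
      For reference, a minimal sketch of checking this (assuming an active SparkSession named spark and a hypothetical table written with 4 buckets); the scan of a bucketed table typically produces one partition per bucket:

      bucketed = spark.table("orders_bucketed_4")   # hypothetical table written with 4 buckets
      print(bucketed.rdd.getNumPartitions())        # 4 -> one task per bucket in the scan/join stage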

    • @gagansingh3481
      @gagansingh3481 7 days ago

      Brother, teaching is not something everyone can do; the person seems to be in a hurry. You make 4 buckets for each partition, brother, but show the data of each partition too; instead he switches to the Spark terminal. Someone who doesn't know the terminal commands, what will they do?

  • @jeevanbarma9457
    @jeevanbarma9457 1 month ago +1

    I need the material for this, can I get it?

    • @easewithdata
      @easewithdata  1 month ago

      Hello,
      What exactly are you looking for?

  • @keen8five
    @keen8five 1 year ago

    Bucketing can't be applied when the data resides in a Delta Lake table, right?

    • @easewithdata
      @easewithdata  1 year ago

      Delta Lake tables don't support bucketing, so please avoid using it with them. Try other optimizations like Z-ordering when dealing with Delta Lake tables.
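
      For reference, a minimal sketch (assuming a Delta Lake setup where the OPTIMIZE command is available, an active SparkSession named spark, and hypothetical table/column names):

      # Instead of bucketing, co-locate rows on the join/filter key with Z-ordering.
      spark.sql("OPTIMIZE orders_delta ZORDER BY (customer_id)")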

    • @svsci323
      @svsci323 11 months ago

      @@easewithdata So, in a real-world project, does bucketing need to be applied on RDBMS tables or on files?

    • @PrajwalTaneja
      @PrajwalTaneja 5 months ago

      @@svsci323 On DataFrames and Datasets

  • @ahmedaly6999
    @ahmedaly6999 8 months ago

    How do I join a small table with a big table when I want to fetch all the data in the small table? For example,
    the small table has 100k records and the large table has 1 million records:
    df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
    It runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.

    • @Abhisheksingh-vd6yo
      @Abhisheksingh-vd6yo 7 months ago

      Maybe this will work here: df = largedf.join(broadcast(smalldf), smalldf.id == largedf.id, how='right_outer')
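
      For reference, a minimal sketch of the usual broadcast pattern with the DataFrame names from the question; note that for an outer join Spark can only broadcast the non-preserved side, so a broadcast hint on the side whose rows must all be kept may simply be ignored:

      from pyspark.sql.functions import broadcast

      # Left outer join that preserves largedf and broadcasts the small lookup side
      # (for a left outer join the build/broadcast side must be on the right).
      enriched = largedf.join(broadcast(smalldf), largedf.id == smalldf.id, how="left_outer")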

  • @NiteeshKumarPinjala
    @NiteeshKumarPinjala 3 months ago

    Hi Subham, one quick question.
    Can we un-broadcast a broadcasted DataFrame? We can uncache a cached Dataset, right? In the same way, can we do un-broadcasting?

    • @easewithdata
      @easewithdata  3 months ago

      If you don't want a DataFrame to be broadcast in a join, suppress it by setting spark.sql.autoBroadcastJoinThreshold to -1.
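
      For reference, a minimal sketch of that setting (assuming an active SparkSession named spark); note that an explicit broadcast() hint still takes precedence over it:

      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   # disable automatic broadcast joins
      # Joins without an explicit broadcast() hint now fall back to sort merge / shuffle joins.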

  • @avinash7003
    @avinash7003 11 months ago +1

    High cardinality -> bucketing and low cardinality -> partitioning?

  • @divit00
    @divit00 4 months ago

    Good stuff. Can you provide me with the dataset?

    • @easewithdata
      @easewithdata  4 months ago

      Thanks 👍 The datasets are huge and it's very difficult to upload them. However, you can find most of them at this GitHub URL:
      github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets
      If you like my content, please make sure to share it with your network over LinkedIn 👍 This helps a lot 💓

    • @divit00
      @divit00 4 months ago

      @@easewithdata For me, even with AQE disabled it's doing a broadcast join. What could be the reason? I have used your dataset and code.
      Spark 3.3.2
      df_joined = emp.join(broadcast(dept), on=emp.department_id==dept.department_id, how="left")
      df_joined.explain()
      == Physical Plan ==
      *(2) BroadcastHashJoin [department_id#7], [department_id#58], LeftOuter, BuildRight, false

    • @NiteeshKumarPinjala
      @NiteeshKumarPinjala 3 months ago

      @@divit00 I guess post Spark 2.0, data smaller than 10 MB is broadcast by default, and otherwise the join operation will be sort merge.

    • @divit00
      @divit00 3 months ago

      @@NiteeshKumarPinjala But his Spark is newer than 2.0, and we are setting auto-broadcast to 0.
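
      For reference, a minimal sketch of one likely explanation, reusing the emp/dept DataFrames from the code above: an explicit broadcast() hint forces a BroadcastHashJoin regardless of AQE or the auto-broadcast threshold.

      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)          # auto-broadcast off

      emp.join(broadcast(dept), emp.department_id == dept.department_id, "left").explain()
      # -> BroadcastHashJoin: the explicit hint overrides the threshold

      emp.join(dept, emp.department_id == dept.department_id, "left").explain()
      # -> SortMergeJoin: with no hint, the disabled threshold applies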

  • @Abhisheksingh-vd6yo
    @Abhisheksingh-vd6yo 7 months ago

    How are 16 partitions (tasks) created when the partition size is 128 MB and we have only 94.8 MB of data?
    Please explain.

    • @easewithdata
      @easewithdata  7 months ago

      Hello,
      The number of partitions for the data is not determined by partition size alone; there are some other factors too.
      Check out this article: blog.devgenius.io/pyspark-estimate-partition-count-for-file-read-72d7b5704be5
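
      For reference, a minimal sketch of the other inputs involved (assuming an active SparkSession named spark). Spark sizes read splits from maxPartitionBytes, openCostInBytes and the available parallelism, so roughly 95 MB can still be split into 16 partitions on a 16-core setup:

      print(spark.conf.get("spark.sql.files.maxPartitionBytes"))   # upper bound per partition (default 128 MB)
      print(spark.conf.get("spark.sql.files.openCostInBytes"))     # per-file padding cost (default 4 MB)
      print(spark.sparkContext.defaultParallelism)                 # cores available to the application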

  • @adulterrier
    @adulterrier 2 months ago

    I increased the number of buckets to 16 and got the join in 3 seconds, while writing the buckets took 3 and 6 seconds. Can I draw any conclusions from this?

    • @easewithdata
      @easewithdata  2 months ago

      Yes, increasing the number of buckets will improve performance, as the number of tasks executing the join also increases. The only thing to keep in mind is the small-file issue: don't create so many buckets that you end up with lots of small files.

  • @DEwithDhairy
    @DEwithDhairy 1 year ago +2

    PySpark Coding Interview Questions and Answer of Top Companies
    ruclips.net/p/PLqGLh1jt697zXpQy8WyyDr194qoCLNg_0

  • @alishmanvar8592
    @alishmanvar8592 7 months ago

    Hello Subham, why didn't you cover shuffle hash join practically over here? As far as I can see, you have explained it only in theory.

    • @easewithdata
      @easewithdata  7 months ago

      Hello,
      There is very little chance that someone will run into issues with shuffle hash join. The majority of challenges come when you have to optimize sort merge, which is usually used for bigger datasets. And in the case of smaller datasets, nowadays everyone prefers broadcasting.
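
      For anyone who wants to experiment anyway, a minimal sketch of nudging Spark toward a shuffle hash join (assuming an active SparkSession named spark; the DataFrame names are hypothetical):

      spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")   # let the planner consider shuffle hash join

      joined = big_df.join(medium_df.hint("shuffle_hash"), "id")      # Spark 3.x join strategy hint
      joined.explain()                                                # expect ShuffledHashJoin in the plan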

    • @alishmanvar8592
      @alishmanvar8592 7 months ago

      @@easewithdata Suppose we don't choose any join behavior; do you mean that shuffle hash join is the default join?

    • @easewithdata
      @easewithdata  7 months ago

      AQE would optimize and choose the best possible join

    • @alishmanvar8592
      @alishmanvar8592 7 months ago

      @@easewithdata Hello Subham, can you please come up with a session showing how we can use a Delta table (residing in the gold layer) for Power BI reporting, or how to import it into Power BI?

    • @PrajwalTaneja
      @PrajwalTaneja 5 months ago

      @@alishmanvar8592 Save the table in Delta format, open Power BI, load that file and do your visualization.

  • @subhashkumar209
    @subhashkumar209 1 year ago

    Hi,
    I have noticed that you use "noop" to perform an action. Any particular reason not to use .show() or .display()?

    • @easewithdata
      @easewithdata  1 year ago +1

      Hello,
      show and display don't trigger processing of the complete dataset. The best way to trigger the complete dataset is using count or write, and for write we use noop.
      This was already explained in earlier videos of the series. Have a look.
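
      For reference, a minimal sketch of the noop sink used this way (df_joined is a hypothetical DataFrame):

      # Materializes every row of the plan (a full action) without writing any output,
      # which makes it handy for benchmarking joins.
      df_joined.write.format("noop").mode("overwrite").save()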

  • @gagansingh3481
    @gagansingh3481 7 days ago

    It seems the person is teaching himself... so much confusion. For teaching, you need to work more... a simple person can't understand anything.