Advancing Spark - Crazy Performance with Spark 3 Adaptive Query Execution

  • Published: 5 Jan 2025

Comments • 21

  • @JustLikeBuildingThings
    @JustLikeBuildingThings 5 months ago

    One thing to note, at least from what I was told on the Databricks performance course, is that when AQE is on, the shuffle.partitions setting still acts as your ceiling.
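    A minimal sketch of what the commenter describes. The setting names below are real Spark 3 configs; the dict is just a stand-in for `spark.conf.set(...)` calls so the sketch runs without a live SparkSession:

```python
# AQE coalesces shuffle partitions downward at runtime, but it starts
# from spark.sql.shuffle.partitions and only merges from there - so that
# setting effectively acts as the ceiling the commenter mentions.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    # Lets AQE merge small shuffle partitions after each stage:
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # The starting (and therefore maximum) shuffle partition count:
    "spark.sql.shuffle.partitions": "200",
}

# In a real session you would apply these with spark.conf.set(k, v).
for key, value in aqe_conf.items():
    print(f"{key}={value}")
```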

  • @carltonpatterson5539
    @carltonpatterson5539 4 years ago +5

    This is the most enjoyable video on Spark Execution plans - I never thought in my wildest dreams I would enjoy listening to someone talk about data skew, execution plans and query engines.
    I definitely need to follow this guy

  • @carltonpatterson5539
    @carltonpatterson5539 4 years ago

    Oops! I should not have said engine... I meant execution :-)

  • @mohitsanghai5455
    @mohitsanghai5455 3 years ago

    Wow...Just Perfect...Subscribed.

  • @akhilannan
    @akhilannan 4 years ago

    Exactly what I needed! Thanks for this video! I was struggling with building a big datamart with lots of joins to huge tables (600m+ rows). Even with an 8-node 56GB cluster, it didn't finish after 1.5hrs and I had to cancel the query. As per the explain plan it was doing lots of Sort Merge joins, and execution was hanging with 1 executor doing all the work. I even tried computing table/column stats, but it still didn't help. With Adaptive Query Execution the SQL ran in 28min!!!

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 years ago +1

      Fantastic news - glad it helped! I expect we'll see quite a few similar stories as people discover the config option, switch it on and watch their queries suddenly speed up! Definitely need to investigate to see if there are any potential downsides, but looks like a winner so far!
      Simon

    • @GhernieM
      @GhernieM 3 years ago +1

      @@AdvancingAnalytics Hi, have you found any downsides after 9 months of use?

  • @Dreamer-xj3ms
    @Dreamer-xj3ms 6 months ago

    Great video.
    Do I need to turn it on manually in 2024?

    • @AdvancingAnalytics
      @AdvancingAnalytics  6 months ago

      Nope - turned on by default since Spark 3.2.0 (released end of 2021)!
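      A small illustrative helper for the version cut-off in this reply. The function name is invented; the underlying fact (AQE on by default since Spark 3.2.0) is from the reply above:

```python
# Hypothetical helper: does this Spark version enable AQE by default?
# AQE shipped enabled-by-default starting with Spark 3.2.0 (late 2021);
# on 3.0/3.1 you still set spark.sql.adaptive.enabled=true yourself.
def aqe_on_by_default(spark_version: str) -> bool:
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 2)

print(aqe_on_by_default("3.1.2"))  # False - enable it manually
print(aqe_on_by_default("3.5.0"))  # True
```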

  • @shakezi
    @shakezi 2 years ago

    Thanks for the video!
    Any idea why Spark AQE creates multiple jobs, instead of multiple stages as in the past...?
    You mentioned the "jobs" but gave no explanation... :)

  • @mohitsanghai5455
    @mohitsanghai5455 3 years ago

    @Advanced Analytics Can you make a video on Dynamic Partition Pruning in Spark? It is quite confusing.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      Yep, there's a DPP vid from a while ago - ruclips.net/video/-86iMCKeYxI/видео.html.
      Let us know if that answers your questions!

    • @mohitsanghai5455
      @mohitsanghai5455 3 years ago

      @@AdvancingAnalytics Yes..that was a great video.

    • @mohitsanghai5455
      @mohitsanghai5455 3 years ago

      @@AdvancingAnalytics I have added a few questions on the DPP vid. Can you please have a look at them?

  • @richardcarpenter6775
    @richardcarpenter6775 4 years ago

    Great video and thanks for sharing!! I'm currently using Delta Lake on Databricks 2.4.5 and considering migrating to 3.x, do you know if AQE will bring speed enhancements when working with Delta files?

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 years ago

      Absolutely - the stats collected by Delta give you some speed boosts from dynamic partition pruning in Spark 3, which is a huge boost in itself. AQE should help any time there's a shuffle/exchange - so if you're joining tables, performing wide transformations etc., you should see the benefits.
      The only gotcha I found when jumping up from Spark 2.4 to 3.x was the shift in date compatibility (i.e. strict case sensitivity on date format strings). Otherwise, it was a pain-free upgrade with a few nice performance gains.
      Simon
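      A sketch of the escape hatch for the date gotcha Simon mentions. `spark.sql.legacy.timeParserPolicy` is a real Spark 3 config (the dict is a stand-in for `spark.conf.set` so the sketch stays self-contained):

```python
# Spark 3's new datetime parser treats format pattern letters strictly
# (case-sensitive, e.g. "dd" = day-of-month vs "DD" = day-of-year),
# where Spark 2.4 was lenient - hence the upgrade breakage.
upgrade_conf = {
    # "EXCEPTION" (the default): fail when old and new parsers disagree.
    # "LEGACY": keep Spark 2.4 parsing behaviour during migration.
    # "CORRECTED": always use the new strict parser.
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
}
for key, value in upgrade_conf.items():
    print(f"{key}={value}")
```

Setting `LEGACY` is a stopgap while you fix the format strings themselves; `CORRECTED` is where you want to end up.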

    • @richardcarpenter6775
      @richardcarpenter6775 4 years ago

      @@AdvancingAnalytics Thanks for this, much appreciated.

  • @radhsri
    @radhsri 4 years ago

    Hi, what exactly is the 26 here? I get the 200 default partitions that the cluster creates with the data, but I'm not sure what the "26" is... thank you.

    • @AdvancingAnalytics
      @AdvancingAnalytics  4 years ago +1

      Hey - it's been a while since I ran the demo, but from what I recall the "sales" table I'd saved down was from a dataframe with 26 RDD blocks, so it had created 26 parquet files within the table. In any of the examples going back to that table, it needs 26 tasks - one to read each of those files - before we can do any of the shuffle/exchange steps that AQE is actually helping us with.
      Simon

  • @WKhan-fh2pp
    @WKhan-fh2pp 3 years ago

    I have a window operation on 8M records, and even with AQE enabled it is still causing a data skew issue - any idea how to resolve it?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      That's always gonna be an "it depends" :)
      If the window is partitioned by a skewed key, and there's no way you can break the partition down another way, you're going to get elements of skew during the stage that performs that transform. You could potentially use bucketing to collect some of the smaller partitions together - but if the issue is around one partition being way too big, it's not going to help too much.
      That said, 8M records isn't a crazy amount of data to just brute force through, unless it's an issue trying to serve data directly into dashboards etc that needs the faster execution time? If that's the case, you can always go for a larger worker size, but fewer nodes, so at least you're not hitting any memory boundaries?
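      One other common trick for a skewed partition key is salting. A toy sketch (key names invented, pure Python rather than Spark) showing how a salt spreads one hot key across several smaller partitions - note that for window functions this changes semantics unless you recombine the partial results afterwards, so it is illustration only:

```python
import random
from collections import Counter

# One "hot" key holds 1000 rows while a "cold" key holds 10 -
# the classic skew shape where one task does all the work.
SALT_BUCKETS = 4
rows = [("hot_key", i) for i in range(1000)] + \
       [("cold_key", i) for i in range(10)]

random.seed(42)  # deterministic for the example
# Append a random salt suffix so the hot key fans out over 4 partitions.
salted = [(f"{key}#{random.randrange(SALT_BUCKETS)}", value)
          for key, value in rows]

sizes = Counter(key for key, _ in salted)
# The single 1000-row "hot_key" partition becomes four ~250-row ones.
print(sizes.most_common())
```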