One thing to note, at least from what I was told on the dbx performance course, is that when AQE is on, the shuffle.partitions setting still acts as your ceiling.
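For anyone who wants to see that ceiling in practice, here's a minimal PySpark sketch of my own (not from the course) - with coalescing on, AQE starts from spark.sql.shuffle.partitions (unless you raise spark.sql.adaptive.coalescePartitions.initialPartitionNum) and can only merge partitions downwards from there:

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("aqe-ceiling-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # The ceiling: AQE's initial post-shuffle partition count defaults to this
    # value, and coalescing can only merge below it, never grow above it.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# A small aggregation whose shuffle output AQE can coalesce well below 200.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)
agg = df.groupBy("key").count()
agg.collect()
print(agg.rdd.getNumPartitions())  # typically far fewer than 200 with AQE on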
This is the most enjoyable video on Spark execution plans - I never thought in my wildest dreams I would enjoy listening to someone talk about data skew, execution plans and query engines.
I definitely need to follow this guy!
Oops! I should not have said engine ... I meant execution :-)
Wow...Just Perfect...Subscribed.
Exactly what I needed! Thanks for this video! I was struggling with building a big datamart with lots of joins to huge tables (600M+). Even with an 8-node 56GB cluster, it didn't finish after 1.5hrs and I had to cancel the query. As per the explain plan it was doing lots of Sort Merge Joins, with execution hanging on 1 executor doing all the work. I even tried computing table/column stats, but it still didn't help. With Adaptive Query Execution the SQL ran in 28min!!!
Fantastic news - glad it helped! I expect we'll see quite a few similar stories as people discover the config option, switch it on and watch their queries suddenly speed up! Definitely need to investigate to see if there are any potential downsides, but looks like a winner so far!
Simon
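For anyone hitting the same pattern (sort-merge joins everywhere, one executor doing all the work), these are the standard AQE switches most relevant to that story - a hedged sketch assuming Spark 3.x and an existing session named spark, not a prescription:

# Let AQE re-plan at runtime (e.g. convert sort-merge joins to broadcast):
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Split oversized shuffle partitions so one executor isn't stuck with everything:
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Merge the many tiny post-shuffle partitions into sensibly sized ones:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")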
@AdvancingAnalytics Hi, have you found any downsides after 9 months of use?
Great video.
Do I need to turn it on manually in 2024?
Nope - turned on by default since Spark 3.2.0 (released end of 2021)!
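Quick way to confirm on your own cluster (a sketch assuming an existing session named spark):

print(spark.conf.get("spark.sql.adaptive.enabled"))  # 'true' by default on Spark >= 3.2.0
# On 3.0/3.1 runtimes you still flip it on yourself:
spark.conf.set("spark.sql.adaptive.enabled", "true")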
Thanks for the video!
Any idea why Spark AQE creates multiple jobs, instead of the multiple stages it used to in the past...?
You mentioned the "jobs" but with no explanation... :)
@Advanced Analytics Can you make a video on Dynamic Partition Pruning in Spark? It is quite confusing.
Yep, there's a DPP vid from a while ago - ruclips.net/video/-86iMCKeYxI/видео.html.
Let us know if that answers your questions!
@AdvancingAnalytics Yes, that was a great video.
@AdvancingAnalytics I have added a few questions on the DPP vid. Can you please have a look at them?
Great video and thanks for sharing!! I'm currently using Delta Lake on Databricks 2.4.5 and considering migrating to 3.x, do you know if AQE will bring speed enhancements when working with Delta files?
Absolutely - the stats collected by Delta give you some speed boosts from dynamic partition pruning in Spark 3, which is a huge boost in itself. AQE should help any time there's a shuffle/exchange - so if you're joining tables, performing wide transformations etc, you should see the benefits.
The only gotcha I found when jumping up from Spark 2.4 to 3.x was the shift in date compatibility (i.e. strict case sensitivity on date format strings). Otherwise, it was a pain-free upgrade with a few nice performance gains.
Simon
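To make that date gotcha concrete, a small sketch of my own (not from the video): Spark 3's new parser treats pattern letters case-sensitively (e.g. 'd' is day-of-month but 'D' is day-of-year), so a 2.4-era format string can fail or silently change meaning, and there's a legacy escape hatch while you migrate:

from pyspark.sql import functions as F

df = spark.createDataFrame([("2020-06-01",)], ["raw"])
# The correct lower-case pattern parses fine on both 2.4 and 3.x:
df.select(F.to_date("raw", "yyyy-MM-dd").alias("d")).show()
# A sloppy pattern like "YYYY-MM-DD" that 2.4 tolerated errors on 3.x;
# this restores the old lenient parser while you fix the format strings:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")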
@AdvancingAnalytics Thanks for this, much appreciated.
Hi, what exactly is 26 here? I get the 200 default partitions that the cluster creates with the data, but I'm not sure what the "26" is... thank you!
Hey - it's been a while since I ran the demo, but from what I recall the "sales" table that I'd saved down was from a dataframe with 26 RDD blocks, so it had created 26 parquet files within the table. In any of the examples going back to that table, it needs 26 tasks, one to read each of those files, before we can do any of the actual shuffle/exchange steps that AQE is actually helping us with.
Simon
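If you want to check the equivalent number on your own table, a quick sketch (the "sales" name is from the demo; the rest is my own illustration):

# The scan's task count generally follows the file count for small files:
sales = spark.table("sales")
print(sales.rdd.getNumPartitions())  # 26 in the demo - one per parquet file
# And when writing, the dataframe's partition count sets the file count:
# df.repartition(26).write.mode("overwrite").saveAsTable("sales")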
I have a window operation on 8M records, and even with AQE enabled it is causing a data skew issue. Any idea how to resolve it?
That's always gonna be an "it depends" :)
If the window is partitioned by a skewed key, and there's no way you can break the partition down another way, you're going to get elements of skew during the stage that performs that transform. You could potentially use bucketing to collect some of the smaller partitions together - but if the issue is around one partition being way too big, it's not going to help too much.
That said, 8M records isn't a crazy amount of data to just brute force through, unless it's an issue trying to serve data directly into dashboards etc that needs the faster execution time? If that's the case, you can always go for a larger worker size, but fewer nodes, so at least you're not hitting any memory boundaries?
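If it helps anyone facing the same thing, a tiny diagnostic sketch (df and "partition_key" are placeholders for your dataframe and window key):

from pyspark.sql import functions as F

# Count rows per window key to see how lopsided the partitions really are;
# one dominant key confirms skew that AQE can't split inside a window.
(df.groupBy("partition_key")
   .count()
   .orderBy(F.desc("count"))
   .show(10))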