Afaque Ahmad
  • Videos: 12
  • Views: 113,301
Apache Spark Executor Tuning | Executor Cores & Memory
Welcome back to our comprehensive series on Apache Spark Performance Tuning & Optimisation! In this guide, we dive deep into the art of executor tuning in Apache Spark to ensure your data engineering tasks run efficiently.
🔹 What is inside:
Learn how to properly allocate CPU and memory resources to your Spark executors and the number of executors to create to achieve optimal performance. Whether you're new to Apache Spark or an experienced data engineer looking to refine your Spark jobs, this video provides valuable insights into configuring the number of executors, memory, and cores for peak performance. I’ve covered everything from understanding the basic structure of Spark executors wit...
Views: 9,892
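The resource math this video walks through can be sketched as a quick calculation. This is a rule-of-thumb sketch with illustrative node sizes (10 nodes, 16 cores, 64 GB each are assumptions, not figures from the video):

```python
# Rule-of-thumb executor sizing: reserve 1 core and 1 GB per node for OS/daemons,
# use ~5 cores per executor, and keep ~90% of the per-executor memory as heap
# (the rest approximates spark.executor.memoryOverhead).

def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, os_cores=1, os_mem_gb=1):
    usable_cores = cores_per_node - os_cores
    usable_mem = mem_per_node_gb - os_mem_gb
    executors_per_node = usable_cores // cores_per_executor
    total_executors = executors_per_node * nodes - 1   # leave 1 slot for the driver
    mem_per_executor = usable_mem / executors_per_node
    heap_gb = round(mem_per_executor * 0.90)           # ~10% goes to memory overhead
    return total_executors, cores_per_executor, heap_gb

execs, cores, heap = size_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64)
print(execs, cores, heap)  # 29 5 19
```

The 5-cores-per-executor figure is the common guideline for keeping HDFS I/O throughput high; the exact numbers should always be validated against the actual cluster manager's overhead settings.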

Videos

Apache Spark Memory Management
Views: 10K · 5 months ago
Welcome back to our comprehensive series on Apache Spark Performance Tuning/Optimisation! In this video, we dive deep into the intricacies of Spark's internal memory allocation and how it divides memory resources for optimal performance. 🔹 What you'll learn: 1. On-Heap Memory: Learn about the parts of memory where Spark stores data for computation (shuffling, joins, sorting, aggregation) and ca...
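The on-heap split described here can be sketched numerically. A minimal sketch using Spark's default knobs (300 MB reserved, `spark.memory.fraction = 0.6`, `spark.memory.storageFraction = 0.5`); the 10 GB heap is an illustrative assumption:

```python
# Sketch of Spark's on-heap memory layout for one executor.

def memory_breakdown(executor_mem_mb, reserved_mb=300,
                     memory_fraction=0.6, storage_fraction=0.5):
    usable = executor_mem_mb - reserved_mb
    unified = usable * memory_fraction        # execution + storage (shared region)
    user = usable - unified                   # user data structures, UDF objects
    storage = unified * storage_fraction      # cached blocks (borrowable by execution)
    execution = unified - storage             # shuffles, joins, sorts, aggregations
    return {"reserved": reserved_mb, "user": user,
            "storage": storage, "execution": execution}

print(memory_breakdown(10 * 1024))  # breakdown for a 10 GB executor heap
```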
Shuffle Partition Spark Optimization: 10x Faster!
Views: 8K · 8 months ago
Welcome to our comprehensive guide on understanding and optimising shuffle operations in Apache Spark! In this deep-dive video, we uncover the complexities of shuffle partitions and how shuffling works in Spark, providing you with the knowledge to enhance your big data processing tasks. Whether you're a beginner or an experienced Spark developer, this video is designed to elevate your skills an...
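The usual sizing heuristic for `spark.sql.shuffle.partitions` can be sketched in two lines. The 128 MB target per shuffle partition is the common rule of thumb, not a hard Spark limit:

```python
# Rough sizing for spark.sql.shuffle.partitions:
# partitions = total shuffle write / target partition size (~128 MB).
import math

def shuffle_partitions(shuffle_write_gb, target_partition_mb=128):
    return max(1, math.ceil(shuffle_write_gb * 1024 / target_partition_mb))

print(shuffle_partitions(300))  # 300 GB of shuffle data -> 2400 partitions
```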
Bucketing - The One Spark Optimization You're Not Doing
Views: 7K · 9 months ago
Dive deep into the world of Apache Spark performance tuning in this comprehensive guide. We unpack the intricacies of Spark's bucketing feature, exploring its practical applications, benefits, and limitations. We discuss the following real-world scenarios where bucketing is most effective, enhancing your data processing tasks. 🔥 What's Inside: 1. Filter Join Aggregation Operations: A comparison...
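The core idea of bucketing can be sketched without a cluster: every row's join key is hashed to a fixed bucket number at write time, so two tables bucketed identically on the key can join bucket-by-bucket without a shuffle. Plain-Python illustration (Spark's actual API is `df.write.bucketBy(n, "key").saveAsTable(...)` and it uses Murmur3; `crc32` stands in here for determinism):

```python
import zlib

NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Deterministic hash -> bucket id; same key always lands in the same bucket,
    # in every table written with the same bucket count.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

orders_keys = ["cust_1", "cust_2", "cust_3", "cust_1"]
print({k: bucket_of(k) for k in orders_keys})
```

Because bucket assignment is a pure function of the key, matching rows from both join sides are already co-located, which is what lets Spark skip the exchange.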
Dynamic Partition Pruning: How It Works (And When It Doesn’t)
Views: 3.7K · 9 months ago
Dive deep into Dynamic Partition Pruning (DPP) in Apache Spark with this comprehensive tutorial. If you've already explored my previous video on partitioning, you're perfectly set up for this one. In this video, I explain the concept of static partition pruning and then transition into the more advanced and efficient technique of dynamic partition pruning. You'll learn through practical example...
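The DPP mechanism can be sketched with hypothetical tables: the filter on the small dimension table is evaluated first, and only the fact-table partitions whose keys survive it are scanned. (Spark does this automatically when `spark.sql.optimizer.dynamicPartitionPruning.enabled` is true; the data below is invented for illustration.)

```python
# Fact table stored as one partition per date
fact_partitions = {
    "2024-01-01": ["row_a", "row_b"],
    "2024-01-02": ["row_c"],
    "2024-01-03": ["row_d", "row_e"],
}
dim_table = [("2024-01-01", "holiday"), ("2024-01-02", "weekday"),
             ("2024-01-03", "holiday")]

# Runtime filter on the dimension side yields the partition keys worth scanning
keys_to_scan = {day for day, kind in dim_table if kind == "holiday"}

# Prune: only the matching fact partitions are read
scanned = {k: rows for k, rows in fact_partitions.items() if k in keys_to_scan}
print(sorted(scanned))  # ['2024-01-01', '2024-01-03']
```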
The TRUTH About High Performance Data Partitioning
Views: 6K · 9 months ago
Welcome back to our comprehensive series on Apache Spark performance optimization techniques! In today's episode, we dive deep into the world of partitioning in Spark - a crucial concept for anyone looking to master Apache Spark for big data processing. 🔥 What's Inside: 1. Partitioning Basics in Spark: Understand the fundamental principles of partitioning in Apache Spark and why it's essential ...
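The partitioning basics reduce to a directory layout. A minimal sketch of what `df.write.partitionBy("country")` produces (the rows are invented for illustration):

```python
# partitionBy lays data out as one directory per key value,
# so a filter on that column reads only the matching directory.
from collections import defaultdict

rows = [("US", 1), ("IN", 2), ("US", 3), ("FR", 4)]

layout = defaultdict(list)              # directory name -> rows, e.g. country=US/
for country, value in rows:
    layout[f"country={country}"].append(value)

# "Partition pruning": a filter on country touches a single directory
print(layout["country=US"])  # [1, 3]
```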
Speed Up Your Spark Jobs Using Caching
Views: 4.2K · 11 months ago
Welcome to our easy-to-follow guide on Spark Performance Tuning, honing in on the essentials of Caching in Apache Spark. Ever been curious about Lazy Evaluation in Spark? I've got it broken down for you. Dive into the world of Spark's Lineage Graph and understand its role in performance. The age-old debate, Spark Persist vs. Cache, is also tackled in this video to clear up any confusion. Learn...
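The cache-vs-recompute effect can be sketched without Spark at all; the counter below stands in for a lineage that re-runs on every action:

```python
# Without caching, every action re-runs the whole lineage; caching
# materializes the intermediate result once and reuses it.
calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1                     # stands in for a long chain of transformations
    return [x * 2 for x in data]

data = [1, 2, 3]

# No caching: two "actions" trigger the lineage twice
total = sum(expensive_transform(data))
biggest = max(expensive_transform(data))
print(calls["n"])  # 2

# "Caching": materialize once (like df.cache()), reuse for both actions
cached = expensive_transform(data)
total, biggest = sum(cached), max(cached)
print(calls["n"])  # 3 (one extra run, not two)
```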
How Salting Can Reduce Data Skew By 99%
Views: 8K · 1 year ago
Spark Performance Tuning: Master the art of Spark Performance Tuning and Data Engineering in this comprehensive Apache Spark tutorial! Data skew is a common issue in big data processing, leading to performance bottlenecks by overloading some nodes while underutilizing others. This video dives deep into a practical example of data skew and demonstrates how to optimize Spark performance by using a...
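The salting trick can be sketched in plain Python: a hot key is split into N sub-keys so its rows spread across N tasks instead of overloading one. (On the other join side, each key would be exploded with all N salts so every salted row still finds its match; the data below is invented.)

```python
import random
from collections import Counter

N_SALTS = 4
random.seed(0)  # deterministic for this example

hot_rows = [("customer_42", i) for i in range(1000)]  # one heavily skewed key

# Append a random salt to the skewed key
salted = [(f"{key}_{random.randrange(N_SALTS)}", value) for key, value in hot_rows]

# The single hot key now spreads across 4 salted keys, i.e. 4 parallel tasks
dist = Counter(key for key, _ in salted)
print(len(dist))  # 4
```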
Data Skew Drama? Not Anymore With Broadcast Joins & AQE
Views: 6K · 1 year ago
Spark Performance Tuning: Welcome back to another engaging Apache Spark tutorial! In this Apache Spark performance optimization hands-on tutorial, we dive deep into the techniques to fix data skew, focusing on Adaptive Query Execution (AQE) and broadcast join. AQE, a feature introduced in Spark 3.0, uses runtime statistics to select the most efficient query plan, optimizing shuffle partitions, j...
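Why a broadcast join sidesteps skew can be sketched in plain Python: the small table is shipped whole to every task, so the large table joins map-side with no shuffle to skew. (This mirrors Spark's `broadcast(df)` hint; the tables are invented.)

```python
# Small table: fits in memory, so it can be replicated to every task
small_customers = {1: "Alice", 2: "Bob"}

# Large table: (customer_id, amount), arbitrarily skewed toward key 1
big_orders = [(1, 250), (2, 90), (1, 40)]

# Each "task" probes the broadcast dict locally; no shuffle, so no skewed shuffle
joined = [(cid, amount, small_customers[cid]) for cid, amount in big_orders]
print(joined)  # [(1, 250, 'Alice'), (2, 90, 'Bob'), (1, 40, 'Alice')]
```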
Why Data Skew Will Ruin Your Spark Performance
Views: 5K · 1 year ago
Spark Performance Tuning: Welcome back to my channel. In this comprehensive Apache Spark tutorial, we cover Apache Spark optimization techniques. Are you struggling with data skew and uneven partitioning while running Spark jobs? You're not alone! In this video, we dive deep into the world of Spark Performance Tuning and Data Engineering to tackle the common...
Master Reading Spark DAGs
Views: 16K · 1 year ago
Spark Performance Tuning: In this tutorial, we dive deep into the core of Apache Spark performance tuning by exploring Spark DAGs (Directed Acyclic Graphs). We cover the DAGs for a range of operations: reading files, narrow and wide transformations with examples, and aggregations using groupBy count and groupBy count distinct. Understand the differences betwee...
Master Reading Spark Query Plans
Views: 32K · 1 year ago
Spark Performance Tuning: Dive deep into Apache Spark Query Plans to better understand how Apache Spark operates under the hood. We'll cover how Spark creates logical and physical plans, as well as the role of the Catalyst Optimizer in utilizing optimization techniques such as filter (predicate) pushdown and projection pushdown. The video covers intermediate concepts of Apache Spark in-depth, de...

Comments

  • @yuvanshankarm5260
    @yuvanshankarm5260 1 day ago

    Hi bro, I have a question. My cluster config is 65 nodes, each with 16 cores and 128 GB memory. I am reading a file covering 7 days with 4500 total partitions (some unevenly partitioned) of 500 GB of data; after joins and filters it becomes 1.3 TB, and I have set shuffle partitions to 10000 (each of size 128 MB). So if I give 5 cores and 195 total executors based on the calculation you provided, how will it process the data? Will it be able to process the 500 GB and write out 1.3 TB? If yes, how will that be done? Should I do coalesce()? Or how can I approach this?
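For reference, the 195 figure in this comment follows from the rule-of-thumb executor math, sketched here with one core per node reserved for OS daemons (an assumption, not stated in the comment):

```python
# 65 nodes x 16 cores, 5 cores per executor, 1 core per node held back
nodes, cores_per_node, cores_per_executor = 65, 16, 5
executors_per_node = (cores_per_node - 1) // cores_per_executor   # 3
total_executors = nodes * executors_per_node
print(total_executors)  # 195
```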

  • @mirli33
    @mirli33 2 days ago

    Currently watching your playlist one by one. Great content, very detailed explanations. In the first scenario you had 5 executors with 4 cores each. If you have 1500 shuffle partitions, how are they going to be accommodated?

  • @JustDinesh-1934
    @JustDinesh-1934 3 days ago

    I have learned somewhere that the max partition size can only be 128 MB in Spark. Doesn't that contradict what you mentioned when explaining the 300 GB example? Just asking to correct myself if I'm wrong.

  • @ybalasaireddy1248
    @ybalasaireddy1248 3 days ago

    Hey @afaqueahamd7117, the explanation is excellent. I watched all of your videos, and the way you explain things in detail actually makes me watch them again and again whenever I am attending an interview. Eagerly waiting for your next videos. Kudos🙌

    • @afaqueahmad7117
      @afaqueahmad7117 2 days ago

      Hey @ybalasaireddy1248, this means a lot to me, thanks a lot for the kind appreciation. More coming soon :)

  • @NaveenKumar-fm5yg
    @NaveenKumar-fm5yg 4 days ago

    You are a lifesaver, bro. I have tried to learn this concept so many times but never got it very well; with your video I totally understand how to calculate the resources for a Spark job.

    • @afaqueahmad7117
      @afaqueahmad7117 2 days ago

      Glad to hear that @NaveenKumar-fm5yg, appreciate the kind words :)

  • @AshishStudyDE
    @AshishStudyDE 7 days ago

    Sir, still waiting for a dedicated, very detailed video on driver and executor OOM. Question: if the file is 100 GB, can we sort it? If yes, will there be data spill? Basically an interview question for 8+ years' experience.

  • @kartikjaiswal8923
    @kartikjaiswal8923 8 days ago

    I love you bro for such a crisp explanation; the way you experiment and teach helps a lot!

  • @srinivasjagalla7864
    @srinivasjagalla7864 9 days ago

    Nice explanation

  • @BabaiChakraborty-ss8pt
    @BabaiChakraborty-ss8pt 11 days ago

    Great Work Bro.

  • @BabaiChakraborty-ss8pt
    @BabaiChakraborty-ss8pt 12 days ago

    amazing job bro

  • @joseduarte5663
    @joseduarte5663 14 days ago

    Hey Afaque, awesome video as always! Quick question: if we have the chance to increase the memory of the Spark execution container, how can we decide between assigning that extra memory to on-heap memory or to off-heap memory, if in the end the total available memory is always the sum of the two? I know you mentioned that off-heap memory is not affected by the garbage collection process, but it is also slower than on-heap memory, so wouldn't it be better to always assign all possible memory to on-heap right from the beginning instead of waiting for off-heap memory to come into play?

    • @afaqueahmad7117
      @afaqueahmad7117 6 days ago

      Hey @joseduarte5663, good question! Generally, all memory should be assigned to on-heap, with off-heap mostly disabled. However, it's best to monitor job performance and look out for cases where the overall run is affected by, e.g., GC cleanup time; in such cases you may prefer to change strategy and allocate 10-20% of memory to off-heap.
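A minimal config sketch for the off-heap fallback described in this reply; `spark.memory.offHeap.*` are the actual Spark settings, while the 2g size is purely illustrative:

```
# spark-defaults.conf sketch: carve out an off-heap slice if GC pauses dominate
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      2g
```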

  • @joseduarte5663
    @joseduarte5663 17 days ago

    Awesome video as always. Would really appreciate more videos explaining how DAGs can be read.

  • @abhisheknigam3768
    @abhisheknigam3768 17 days ago

    Industry-level content.

  • @9666gaurav
    @9666gaurav 18 days ago

    Is this applicable to cloud platforms?

  • @janb4637
    @janb4637 19 days ago

    I have never seen such a detailed explanation. Thank you very much, @afaque Ahmad. Is there any way we can get the document?

    • @afaqueahmad7117
      @afaqueahmad7117 7 days ago

      Appreciate it @janb4637, let me try and put it on GitHub :)

  • @joseduarte5663
    @joseduarte5663 21 days ago

    Awesome video! I've been searching for something like this, and all the other videos I found don't get to the point or explain things as well as you do. I'm definitely subscribing and sharing this with other DEs from my team; please keep posting content like this!

    • @afaqueahmad7117
      @afaqueahmad7117 7 days ago

      Appreciate the kind words @joseduarte5663 :)

  • @dwipalshrirao499
    @dwipalshrirao499 21 days ago

    Very informative video, Afaque. Please create more videos.

    • @afaqueahmad7117
      @afaqueahmad7117 7 days ago

      Thank you, appreciate it, @dwipalshrirao499 :)

  • @tridipdas5445
    @tridipdas5445 24 days ago

    What if the nodes are of unequal size?

  • @cantcatchme8368
    @cantcatchme8368 24 days ago

    Excellent.. Keep going..

  • @cantcatchme8368
    @cantcatchme8368 24 days ago

    I'm not able to see spill details in the Spark 3.5.2 UI?

  • @arghyakundu8558
    @arghyakundu8558 24 days ago

    Excellent content! Loved it. Such a detailed explanation of the salting technique, with graphical representation.

  • @the_gamer2416
    @the_gamer2416 25 days ago

    Hi sir, please make a detailed course on Apache Spark covering every aspect of Spark for the Data Engineer role. There are already a lot of beginner courses in the market, so please keep the course from intermediate to advanced level. Please try to make the videos in Hindi; it would be very helpful.

  • @vishalpathak8266
    @vishalpathak8266 25 days ago

    Thank you for this video !!

  • @bhargaviakkineni
    @bhargaviakkineni 26 days ago

    Sir, please do a video on executor out-of-memory and driver out-of-memory errors in Spark.

  • @snehitvaddi
    @snehitvaddi 28 days ago

    This is helpful, but I still have a few doubts. 1. If broadcast join is immune to skewness, why is there a salting technique? 2. In the broadcast join example, the customer dataset appeared to be outside of any executor. Where is it actually stored? How can we specify its storage location?

    • @shaifalipal9415
      @shaifalipal9415 27 days ago

      Broadcast is only possible if the other table is small enough to be replicated.

  • @narutomaverick
    @narutomaverick 28 days ago

    Want to understand it better? Read this LLM-generated summary of Spark caching:
    1. Why use caching? Caching can significantly improve performance by reusing persisted data instead of recomputing it, and helps avoid redundant computations on the same dataset across multiple actions.
    2. Lazy evaluation and caching: Apache Spark uses lazy evaluation, where transformations are not executed until an action is triggered. Caching materializes the result of a long sequence of transformations, avoiding recomputation.
    3. Spark's lineage graph: Spark tracks the lineage of transformations using a lineage graph. Caching breaks the lineage, reducing the size of the graph and improving performance.
    4. Caching vs. no caching: The demo shows a significant performance improvement when caching is used, as seen in the Spark UI.
    5. Persist and storage levels: The `persist()` method is used for caching, with different storage levels available. Storage levels like `MEMORY_ONLY`, `DISK_ONLY`, and combinations control memory/disk usage and replication; choose the appropriate level based on your requirements and cluster resources.
    6. When to cache: Cache datasets that are reused multiple times, especially after a long sequence of transformations, and intermediate datasets that are expensive to recompute. Be mindful of cluster resources and cache judiciously.
    7. Unpersist: Use `unpersist()` to remove cached data and free up resources when no longer needed. Spark may automatically evict cached data if memory is needed.
    If you liked it, upvote it.

  • @choubeysumit246
    @choubeysumit246 29 days ago

    Great tutorials 🙏, please create more videos on Spark from a beginner's point of view.

  • @narutomaverick
    @narutomaverick 1 month ago

    Your channel is so underrated. Please don't stop!

  • @vijaykumar-b6i7t
    @vijaykumar-b6i7t 1 month ago

    I like your videos very much; they're insightful. Can you please make a series of videos on Spark interview-oriented questions? Thanks in advance.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 month ago

    Hi Afaque, a suggestion: you could start from the beginning to connect the dots. For example, if in your scenario we have a machine with X nodes, Y workers, and Z executors, and you do REPARTITION and fit the data like this, then this could happen; otherwise the machine would sit idle, and so on.

  • @tumbler8324
    @tumbler8324 1 month ago

    Perfect explanation and perfect examples throughout the playlist. Brother, please also explain the ins and outs of Change Data Capture and Slowly Changing Dimensions, whichever of them get applied in a project.

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Thanks for the kind words, bhai @tumbler8324. It's all coming in a little while, bhai; it's in the pipeline :)

  • @vijaykumar-b6i7t
    @vijaykumar-b6i7t 1 month ago

    A lot of knowledge in just one video.

  • @skybluelearner4198
    @skybluelearner4198 1 month ago

    I spent INR 42,000 on a Big Data course but could not understand this concept clearly because the trainer himself lacked clarity. Here I understood it completely.

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Appreciate the kind words @skybluelearner4198 :)

  • @Dhawal-ld2mc
    @Dhawal-ld2mc 1 month ago

    Great explanation of such a complex topic, thanks and keep up the good work.

  • @mahendranarayana1744
    @mahendranarayana1744 1 month ago

    Great explanation, thank you. But how do we know the exact (or at least the best) spark.sql.shuffle.partitions to configure at run time? The volume of data changes with each run/day, so how do we determine the data volume at run time to set the shuffle partition number?
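One hedged answer to this question for Spark 3.x: let AQE coalesce shuffle partitions from runtime statistics instead of hard-coding a number. The `spark.sql.adaptive.*` keys are real Spark settings; the advisory size below is illustrative:

```
spark.sql.adaptive.enabled                        true
spark.sql.adaptive.coalescePartitions.enabled     true
spark.sql.adaptive.advisoryPartitionSizeInBytes   128m
```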

  • @SurendraKumar-qj9tv
    @SurendraKumar-qj9tv 1 month ago

    Awesome explanations! Please share more relevant videos.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 month ago

    Drop-dead gorgeous stuff.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 month ago

    Hey Afaque, great tutorials. You should consider doing a full end-to-end Spark project with a big volume of data so we can understand the challenges faced and how to tackle them. Would be really helpful!

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      A full-fledged in-depth project using Spark and the modern data stack coming soon, stay tuned @mohitupadhayay1439 :)

  • @sonlh81
    @sonlh81 1 month ago

    Not easy to understand, but it's great.

  • @Akshaykumar-pu4vi
    @Akshaykumar-pu4vi 1 month ago

    Useful information

  • @leonardopetraglia6040
    @leonardopetraglia6040 1 month ago

    Thanks for the video! I also have a question: when I execute a complex query, there are multiple stages with different shuffle write sizes. Which one do I have to take into consideration when computing the optimal number of shuffle partitions?

  • @deepikas7462
    @deepikas7462 1 month ago

    All the concepts are clearly explained. Please do more videos.

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Appreciate the kind words @deepikas7462, more coming soon :)

  • @abusayed.mondal
    @abusayed.mondal 1 month ago

    Your teaching skill is very good; please make a full series on PySpark. That'll be helpful for so many aspiring data engineers.

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Appreciate the kind words @abusayed.mondal, more coming soon, stay tuned :)

  • @muhammadzakiahmad8069
    @muhammadzakiahmad8069 1 month ago

    Please make one on AQE as well.

  • @Ravi_Teja_Padala_tAlKs
    @Ravi_Teja_Padala_tAlKs 1 month ago

    Excellent 🎉 👍 appreciate your effort

  • @leonardopetraglia6040
    @leonardopetraglia6040 1 month ago

    Correct me if I'm wrong, but these calculations assume the execution of only one job at a time. How do the calculations change when there are multiple jobs running in a cluster, as often happens?

  • @snehitvaddi
    @snehitvaddi 1 month ago

    Buddy! You got a new sub here. Loved your detailed explanation. I see no one explaining the query plan in this much detail, and I believe this is the right way of learning. But I would love to see an entire Spark series.

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Thank you @snehitvaddi for the kind appreciation. A full-fledged, in-depth course on Spark coming soon :)

    • @snehitvaddi
      @snehitvaddi 1 month ago

      @@afaqueahmad7117 Most awaited. Keep it up 🚀

  • @piyushkumawat8042
    @piyushkumawat8042 1 month ago

    Why give such a large fraction (0.4) to user memory when, in the end, the transformations performed in a stage (whether a user-defined function or any other function) use only execution memory? So what exactly is the role of user memory?

  • @fitness_thakur
    @fitness_thakur 1 month ago

    Could you please make a video on stack overflow errors: the scenarios in which they can occur and how to fix them?

    • @afaqueahmad7117
      @afaqueahmad7117 1 month ago

      Are you referring to OOM (out-of-memory) errors, for the driver and executors?

    • @fitness_thakur
      @fitness_thakur 1 month ago

      @@afaqueahmad7117 No. Basically, when we have multiple layers under a single session, the stack memory gets full, so we have to make sure we use one session per layer. E.g., suppose we have 3 layers (internal, external, combined); if you run them in a single session, it will throw a StackOverflow error wherever the stack overflows. We tried increasing the stack size as well, but that didn't work. So in the end we settled on an approach of running one layer, then closing the session, and so on for each layer.

  • @dasaratimadanagopalan-rf9ow
    @dasaratimadanagopalan-rf9ow 1 month ago

    Thanks for the content, really appreciate it. My understanding is that AQE takes care of shuffle partition optimization and we don't need to manually intervene (starting with Spark 3) to optimize shuffle partitions. Could you clarify this, please?