The way you explain complex stuff is incredible. I am a data engineer with more than 8 years of experience, and I totally loved your content.
Incredibly useful. I appreciate the way it is explained. I suggest you pick a use case and resolve a long-running problem by changing the cluster configuration.
Extremely helpful, and great that it touches on different aspects of Databricks.
Thank you - very useful as I prep for the ADV DBX DE cert!
Excellent video! I am in the middle of optimizing a script for a client, and while I have seen a lot of videos that show the UI as the first thing, nobody talks about exactly how to take advantage of this resource. Thanks for sharing, and subscribing!
Very useful one! Thanks for making this. What would also be interesting to see is, once you find the cause of a performance issue via the Spark UI, how you go about fixing it. Like the skew issue you mentioned: how do we fix it? Maybe a video on Spark performance tuning? :)
You can often eliminate skew by repartitioning. If, however, your operation is based on grouped data and the groups are skewed then you might need to rethink your approach or resize your nodes to fit the largest partition without spill.
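To make that concrete, here's a minimal PySpark sketch of the repartition / key-salting idea. The table names, the customer_id key, and the partition/salt counts are purely illustrative assumptions, not something from the video:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative tables - swap in your own skewed dataframe and join key.
sales = spark.table("sales")          # large fact table, skewed on customer_id
customers = spark.table("customers")  # smaller dimension

# Option 1: plain repartition (round-robin) to even out partition sizes
# when the downstream operation doesn't need rows co-located by key.
sales_even = sales.repartition(200)

# Option 2: salt the hot key so one huge group is split across several tasks.
num_salts = 8
sales_salted = sales.withColumn("salt", (F.rand() * num_salts).cast("int"))
customers_salted = customers.crossJoin(
    spark.range(num_salts).withColumnRenamed("id", "salt")
)
joined = sales_salted.join(customers_salted, ["customer_id", "salt"])
```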
I skipped over the troubleshooting for that one as I go through the same example in the AQE demo. Adaptive Query Execution targets that exact problem, and you can use the Spark UI to see where it might be happening - ruclips.net/video/jlr8_RpAGuU/видео.html
But yes, in general there's a Spark performance tuning video/session I should probably write at some point!
@@phy2sll Yep, absolutely - though in 7.0 we've got AQE as an alternative path to fix some of that. Doesn't catch everything, which is when you'd drop back to the fixes you mentioned!
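For anyone reading along, these are the kinds of AQE settings being referred to in Spark 3.0 / DBR 7.0. Just a sketch; check your runtime's defaults, as newer Databricks runtimes already turn most of these on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn on Adaptive Query Execution (Spark 3.0+ / DBR 7.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE split skewed join partitions at runtime instead of hand-tuning them.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Let AQE coalesce lots of tiny post-shuffle partitions into fewer, larger ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```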
Great video Simon - thanks!
You are too good! Lots of important info, tons of it. Thx for sharing!
Thank you for such a simple and powerful explanation.
Fantastic explanation. Thanks a lot!
This video is super helpful, thank you very much! :)
I would be very interested in the topic you mentioned briefly at the beginning about the JVM. Do you explain this somewhere in more detail? Also, how does PySpark interact with the JVM, for example, and how does Scala come into play here?
I love you Sir... Please keep on adding such videos
Thanks for introducing Ganglia. Could you also make a video on how to understand it and make more sense of the graphs and data it's showing, please? That would be super useful.
Super helpful!! Thank you so much!!!
I was waiting for this !!!!!!
Finally ! Thanks 😊
Really really appreciate this.
I was hoping you were going to end by showing how we might be able to use Ganglia to assess how to choose the appropriate cluster size for a particular job.
Yeah, definitely a lot more to dive into around Ganglia & specific performance tuning. I need to find some time to put together some specific troubleshooting examples for that video - when I got close to 30 mins I thought I should probably stop for this one!
Simon
Thanks for this intro. Get ready, Spark jobs, you're gonna be examined!
Thanks for the great video! Please do make a Ganglia-focused one when you have time :)
Hi! Do you know why executors on the Executors tab sometimes turn blue?
Thank you very much for this uplifting video :). I was used to working with the Cloudera interface, so I'm wondering where the application name is. Have we lost it?
Thanks a lot for this.
Thank you, it is very clear. Would you be willing to share the code used during the explanation?
I generally don't have time to tidy the code up and make it separately runnable - maybe in future :)
For this one, I grabbed the AQE demo from the Databricks blog; it's good for forcing skew, small partitions etc. to use as diagnosis practice:
docs.databricks.com/_static/notebooks/aqe-demo.html
Really wonderful stuff, Simon!
Was wondering how Spark/Databricks handles keys. Does Databricks get data that already has keys from the upstream source, or do you know how a dimension is created with generated keys, like a typical merge dimension procedure would do in SQL Server?
Hey, thanks for watching! Like any analytics tool, you get business keys from upstream/source systems, then usually need to build new/composite keys. It's fairly common to create a hash over the business keys. If you need to generate new "identity" columns there are a few patterns to follow - the monotonically_increasing_id() function is great for adding new unique values on top of a dataframe, and you can simply add the current max value of the dim if you are trying to create "surrogate key" style functionality.
Quite often it's easier to stick with hash values and deal with the SCD/latest version problems downstream.
Simon
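For anyone wanting to try the two patterns above, a rough PySpark sketch - the table and column names are purely illustrative, not from the video:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative source dataframe carrying business keys from the upstream system.
src = spark.table("staging.customers")

# Pattern 1: hash over the business key columns.
hashed = src.withColumn(
    "customer_hash_key",
    F.sha2(F.concat_ws("||", "source_system", "customer_number"), 256)
)

# Pattern 2: "surrogate key" style - unique (not consecutive) ids, offset by
# the dimension's current max key so new rows don't collide with existing ones.
dim = spark.table("dw.dim_customer")
max_key = dim.agg(F.max("customer_sk")).first()[0] or 0
with_sk = src.withColumn(
    "customer_sk",
    F.monotonically_increasing_id() + F.lit(max_key + 1)
)
```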
@@AdvancingAnalytics ah I see, thanks for the insights !
You are awesome!!!
How should we decide, using the UI, whether increasing the number of nodes (cores) or increasing the SKU (memory) of the node would give more performance benefit? Thank you! :)
How can we choose the right number of worker nodes for my job? My job was using the max number of worker nodes when I changed from 75 to 90, but both times the job ran fine. I did not see any change in performance.
While the job is running you can check the number of tasks against the number of slots - if your job has fewer tasks than there are slots, then adding more workers won't change the processing time. If you want to utilise more of the cluster, you can repartition the dataframe up to spread it across more RDD blocks. That's a careful balance, as it introduces a new shuffle which may cause more performance problems than the increased parallelism!
Hopefully that gives you something to look at, even if it's not that helpful!
Simon
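As a rough illustration of that tasks-vs-slots check in PySpark (the table name and the decision to repartition are placeholders/assumptions, not a recipe):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_big_table")  # placeholder for your own dataframe

# Roughly the total task slots across the cluster (executors * cores each).
total_slots = spark.sparkContext.defaultParallelism

# Number of partitions (and therefore tasks) the next stage over df will run.
num_tasks = df.rdd.getNumPartitions()

print(f"slots: {total_slots}, tasks: {num_tasks}")

# If tasks are well below slots, adding workers won't help; repartitioning
# might, but it adds a shuffle, so only do it when the parallelism pays off.
if num_tasks < total_slots:
    df = df.repartition(total_slots)
```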
@@AdvancingAnalytics Are you suggesting to use repartition or not? My notebook takes an hour plus.
💡🔐🔥