Advancing Spark - Understanding the Spark UI

  • Published: 16 Sep 2020
  • When we first started with Spark, the Spark UI pages were something of a mystery, an arcane source of hidden knowledge. Looking back, it's something that is so, so useful for understanding your Spark cluster, diagnosing user issues and deciding whether your cluster is correctly sized.
    In this video, Simon gives a quick tour of the Spark UI, talking through the various tabs and the kind of troubleshooting/information they can provide.
    As always, don't forget to like & subscribe for more Sparky news, and stop by our website if you need some deeper advice, help or training around Databricks & Azure Analytics: www.advancinganalytics.co.uk/...

Comments • 39

  • @yashodhannn
    @yashodhannn 2 years ago +10

    Incredibly useful. I appreciate the way it is explained. I suggest you pick a use case and resolve a long-running problem by changing the cluster configuration.

  • @WKhan-fh2pp
    @WKhan-fh2pp 2 years ago +2

    Extremely helpful, and great to touch on different aspects of Databricks.

  • @LuciaCasucci
    @LuciaCasucci 2 years ago

    Excellent video! I am in the middle of optimizing a script for a client, and I have seen a lot of videos showing the UI as the first thing, but nobody talks about exactly how to take advantage of this resource. Thanks for sharing; subscribing!

  • @raviv5109
    @raviv5109 3 years ago

    You are too good! Lots of important info, tons of it. Thanks for sharing!

  • @matthow91
    @matthow91 3 years ago +1

    Great video Simon - thanks!

  • @jaimetirado6130
    @jaimetirado6130 1 year ago

    Thank you - very useful as I prep for the ADV DBX DE cert!

  • @sergeypryvala7750
    @sergeypryvala7750 1 year ago

    Thank you for such a simple and powerful explanation.

  • @the.activist.nightingale
    @the.activist.nightingale 3 years ago

    I was waiting for this!!!
    Finally! Thanks 😊

  • @alacrty9290
    @alacrty9290 1 year ago

    Fantastic explanation. Thanks a lot!

  • @joerokcz
    @joerokcz 11 months ago +3

    The way you explain complex stuff is incredible. I am a data engineer with more than 8 years of experience, and I totally loved your content.

  • @tqw1423
    @tqw1423 2 years ago

    Super helpful!! Thank you so much!!!

  • @samirdesai6438
    @samirdesai6438 3 years ago

    I love you, Sir... Please keep on adding such videos.

  • @Debarghyo1
    @Debarghyo1 3 years ago

    Thanks a lot for this.

  • @akhilannan
    @akhilannan 3 years ago +12

    Very useful one! Thanks for making this. What would also be interesting to see is, once you find the cause of a performance issue via the Spark UI, how you go about fixing it. Like the skew issue you mentioned, how do we fix it? Maybe a video on Spark Performance Tuning? :)

    • @phy2sll
      @phy2sll 3 years ago

      You can often eliminate skew by repartitioning (sketched at the end of this thread). If, however, your operation is based on grouped data and the groups are skewed, then you might need to rethink your approach or resize your nodes to fit the largest partition without spill.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +5

      I skipped over the troubleshooting for that one as I go through the same example in the AQE demo. Adaptive Query Execution targets that exact problem, and you can use the Spark UI to see where it might be happening - ruclips.net/video/jlr8_RpAGuU/видео.html
      But yes, in general there's a Spark performance tuning video/session I should probably write at some point!

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +2

      @@phy2sll Yep, absolutely - though in 7.0 we've got AQE as an alternative path to fix some of that. Doesn't catch everything, which is when you'd drop back to the fixes you mentioned!
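
      A minimal PySpark sketch of the two approaches discussed in this thread - enabling AQE's skew handling, and evening partitions out with a manual repartition. The table name and the partition count are made up for illustration; this is not the code from the video.

      ```python
      # Hypothetical sketch - assumes Spark 3.0+ / DBR 7.0+ for the AQE settings.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Option 1: let Adaptive Query Execution split oversized shuffle partitions
      # automatically when it detects a skewed join.
      spark.conf.set("spark.sql.adaptive.enabled", "true")
      spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

      # Option 2: round-robin repartition to even out partition sizes before heavy
      # per-row work. Note this won't fix a groupBy/join that is skewed on the key
      # itself - that's where AQE (or key salting) comes in.
      sales = spark.table("sales")          # hypothetical table
      evened_out = sales.repartition(200)   # 200 partitions is just a starting point
      ```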

  • @carltonpatterson5539
    @carltonpatterson5539 3 years ago

    Really, really appreciate this.
    I was hoping you were going to end with showing how we might be able to use Ganglia to make assessments on how to choose the appropriate cluster size for a particular job.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Yeah, definitely a lot more to dive into around Ganglia & specific performance tuning. I need to find some time to make some specific troubleshooting examples for that video - when I got close to 30 mins I thought I should probably stop for this one!
      Simon

  • @PakIslam2012
    @PakIslam2012 3 years ago +1

    Thanks for introducing Ganglia. Can you also make a video on how I can understand it and make more sense of the graphs and data it's showing, please? That would be super useful.

  • @Pl4sm4feresi
    @Pl4sm4feresi 3 years ago

    You are awesome!!!

  • @nastasiasaby8086
    @nastasiasaby8086 3 years ago

    Thank you very much for this uplifting video :). I was used to working with the Cloudera interface, so I'm wondering where the application name is. Have we lost it?

  • @MrDeedeeck
    @MrDeedeeck 2 years ago

    Thanks for the great video! Please do make a Ganglia-focused one when you have time :)

  • @auroraw6357
    @auroraw6357 6 months ago

    This video is super helpful, thank you very much! :)
    I would be very interested in the topic you mentioned briefly at the beginning about the JVM. Do you explain this somewhere in more detail? Also, how does e.g. PySpark interact with the JVM, and how does Scala come into play here?

  • @GhernieM
    @GhernieM 3 years ago

    Thanks for this intro. Get ready, Spark jobs, you're gonna be examined.

  • @skms31
    @skms31 3 years ago

    Really wonderful stuff, Simon!
    I was wondering how Spark/Databricks handles keys. Does Databricks get data that already has keys from the upstream source, or do you know how a dimension is created with generated keys, like a typical merge-dimension procedure would do in SQL Server?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Hey, thanks for watching! Like any analytics tool, you get business keys from upstream/source systems, then usually need to build new/composite keys. It's fairly common to create a hash over the business keys. If you need to generate new "identity" columns there are a few patterns to follow - the monotonically_increasing_id() function is great for adding new unique values on top of a dataframe, and you can simply add the current max value of the dim if you are trying to create "surrogate key" style functionality (a rough sketch follows at the end of this thread).
      Quite often it's easier to stick with hash values and deal with the SCD/latest-version problems downstream.
      Simon

    • @skms31
      @skms31 3 years ago

      @@AdvancingAnalytics Ah I see, thanks for the insights!
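
      A rough sketch of the key patterns described above, under assumed table and column names (staging_customer, dim_customer, customer_bk, customer_sk); it is not the exact code from the video.

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()
      new_rows = spark.table("staging_customer")   # hypothetical staging data

      # Pattern 1: hash the business key columns into a deterministic key.
      hashed = new_rows.withColumn(
          "customer_key",
          F.sha2(F.concat_ws("||", "customer_bk", "source_system"), 256)
      )

      # Pattern 2: "surrogate key" style - unique (not consecutive) ids from
      # monotonically_increasing_id(), offset by the dimension's current max.
      existing_max = (
          spark.table("dim_customer")
               .agg(F.coalesce(F.max("customer_sk"), F.lit(0)).alias("mx"))
               .collect()[0]["mx"]
      )
      keyed = new_rows.withColumn(
          "customer_sk",
          F.monotonically_increasing_id() + F.lit(existing_max) + 1
      )
      ```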

  • @sid0000009
    @sid0000009 2 years ago

    How should we decide, using the UI, whether increasing the number of nodes (cores) or increasing the SKU (memory) of the node would give more performance benefit? Thank you! :)

  • @eduardopalmiero6701
    @eduardopalmiero6701 1 year ago

    Hi! Do you know why executors on the Executors tab sometimes turn blue?

  • @NoahPitts713
    @NoahPitts713 5 days ago

    💡🔐🔥

  • @iParsaa
    @iParsaa 3 years ago

    Thank you, it is very clear. Would you agree to share the code used during the explanation?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      I generally don't have time to tidy the code up and make it separately runnable - maybe in the future :)
      For this one, I grabbed the AQE demo from the Databricks blog; it's good for forcing skew, small partitions, etc., to use as diagnosis practice:
      docs.databricks.com/_static/notebooks/aqe-demo.html
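
      For anyone who wants something quick to practise on before opening that notebook, here is a rough, hypothetical way to manufacture a skewed dataset and then watch the uneven task durations in the Spark UI; it is not the Databricks demo itself.

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()

      # ~10 million rows where roughly 80% share a single key, so one shuffle
      # partition ends up far larger than the rest when grouping or joining.
      skewed = spark.range(10_000_000).withColumn(
          "key",
          F.when(F.rand() < 0.8, F.lit(0)).otherwise((F.rand() * 1000).cast("int"))
      )

      # Trigger a wide operation, then inspect task durations on the Stages tab.
      skewed.groupBy("key").count().collect()
      ```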

  • @rashmimalhotra123
    @rashmimalhotra123 3 years ago

    How can we choose the right number of worker nodes for a job? My job was using the max number of worker nodes when I changed from 75 to 90, but both times the job ran fine... I did not see any change in performance.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago

      While the job is running you can check the number of tasks against the number of slots - if your job has fewer tasks than there are slots, then adding more workers won't change the processing time. If you want to utilise more of the cluster, you can repartition the dataframe to spread it across more RDD blocks (sketched at the end of this thread). That's a careful balance, as it introduces a new shuffle which may cause more performance problems than the increased parallelism!
      Hopefully that gives you something to look at, even if it's not that helpful!
      Simon

    • @rashmimalhotra123
      @rashmimalhotra123 3 years ago

      @@AdvancingAnalytics Are you suggesting using repartition or not? My notebook takes one hour plus.
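
      A rough sketch of the slots-versus-tasks check described in the reply above. The table name is hypothetical, and defaultParallelism is used as an approximation of the total number of task slots.

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.table("my_big_table")  # hypothetical table

      # Roughly the total number of task slots (cores across the executors).
      total_slots = spark.sparkContext.defaultParallelism

      # Number of tasks the next stage over this dataframe would run.
      current_partitions = df.rdd.getNumPartitions()

      print(f"slots: {total_slots}, partitions: {current_partitions}")

      # If there are far fewer partitions than slots, extra workers sit idle;
      # repartitioning spreads the work wider at the cost of an extra shuffle.
      if current_partitions < total_slots:
          df = df.repartition(total_slots)
      ```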