Top 5 Mistakes When Writing Spark Applications

  • Published: 22 Nov 2024

Comments • 39

  • @johnengelhart4453
    @johnengelhart4453 8 years ago +19

    I would love to see an example of the salting side that is missing
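
For readers looking for the same thing, here is a minimal sketch of key salting for a skewed aggregation. It is plain Python rather than Spark, and `salted_sum`, `num_salts`, and the sample records are hypothetical names chosen for illustration, not anything shown in the talk. The idea is to append a random salt to each key so that one hot key is split into several smaller groups, aggregate those, and then strip the salt and aggregate again.

```python
import random
from collections import defaultdict


def salted_sum(records, num_salts=8, seed=0):
    """Sum values per key in two stages, spreading each key over num_salts sub-keys."""
    rng = random.Random(seed)

    # Stage 1: attach a random salt so a single hot key becomes several smaller groups.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal

    return dict(totals), dict(partial)


# Hypothetical skewed input: one key carries almost all of the records.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = salted_sum(records)
print(totals)
print(len([k for k in partial if k[0] == "hot"]), "salted groups for the hot key")
```

In a Spark job the two stages would typically correspond to two key-based reductions, the first on the salted key and the second on the original key. A salted join is a different case: the other side of the join has to be replicated once per salt value so that every salted key still finds its match.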

  • @35sherminator
    @35sherminator 2 years ago

    Thanks for superbly breaking down the mistakes and their solutions. Thanks for the excellent presentation.

  • @kumarrohit8311
    @kumarrohit8311 4 years ago +1

    Did anyone notice Sameer Farooqui taking photos when the Q&A started?
    Awesome guys, all of them!

  • @Machin396
    @Machin396 4 years ago +2

    I am new to Spark and after viewing this presentation I see there's a lot to learn. I liked it a lot, thanks!

  • @rangarajanrao1994
    @rangarajanrao1994 1 year ago

    Excellent. Best wishes.

  • @bensums
    @bensums 6 years ago +2

    At 6:21 it should say divide by 1 + 0.07 not multiply by 1 - 0.07. Also, on more recent versions of Spark it's gone up from 7% to 10%.
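
The difference between the two formulas can be checked with a short calculation. This is a sketch under stated assumptions: the 21 GB per-executor budget is an illustrative figure, `heap_from_container` is a hypothetical helper, and any minimum overhead floor is ignored. If the overhead is a fraction of the heap, then heap plus overhead equals heap times (1 + fraction), so the heap that fits a given budget is the budget divided by (1 + fraction).

```python
def heap_from_container(container_gb, overhead_fraction):
    """Largest heap such that heap + heap * overhead_fraction fits in container_gb."""
    return container_gb / (1.0 + overhead_fraction)


container_gb = 21.0  # assumed per-executor memory budget, for illustration only

for fraction in (0.07, 0.10):
    exact = heap_from_container(container_gb, fraction)
    # Multiplying by (1 - fraction) is the approximation the comment objects to.
    approx = container_gb * (1.0 - fraction)
    print(f"overhead={fraction:.0%}: divide -> {exact:.2f} GB, multiply -> {approx:.2f} GB")
```

The two results differ only slightly at these fractions, with the multiply form being a little more conservative, but the divide form is the one that actually uses the whole budget.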

  • @支那湾倭好好贱
    @支那湾倭好好贱 5 years ago +1

    5 cores per executor did not work for us. Our best numbers are 3 on-prem and 2 on EMR; anything larger gave us IO exceptions. You need to adjust case by case.

  • @PizdaRusni2023
    @PizdaRusni2023 2 years ago

    Great

  • @sahebbhattu
    @sahebbhattu 7 years ago +2

    Hi Mark, awesome explanation of the executor count and executor memory calculations. But that approach is about using the maximum number of cores and executors the environment provides in order to achieve maximum parallelism. I would like to add one more point: if we have a very heavy memory load to deal with, we have to trade off the number of executors and cores against executor memory. That means that with a massive memory load we may have to go with fewer executors (fewer than 17) and more memory per executor (more than 19 GB). Please correct me if I am wrong. Thanks.

  • @VasileSurdu
    @VasileSurdu 6 years ago +3

    why can't they just let them speak and end their presentation for god's sake?? was it that big of a problem letting them finish their last 2 mistakes ? lol.. the last one (caching vs persisting) was very interesting

  • @charlesli5809
    @charlesli5809 8 years ago

    awesome sharing, great thanks

  • @madhavareddy3927
    @madhavareddy3927 8 years ago

    Thank you guys! Done a great job..

  • @sailpawar6164
    @sailpawar6164 1 year ago

    Damn, 5 years ago... I absolutely loved the presentation.
    Keeping an audience engaged is a difficult job, and you did great.
    Also, is it just me, or do these 2 faces look very familiar by the time the video ends?

  • @gounna1795
    @gounna1795 7 years ago +3

    Great topic, Great explanation!

  • @andriimed6408
    @andriimed6408 5 years ago +1

    it's awesome, thanks a lot!

  • @dtsleite
    @dtsleite 5 years ago +2

    What does Cloudera know about Spark applications? They don't even update their versions.

  • @gauravkataria1
    @gauravkataria1 7 years ago

    Thanks a lot. Very helpful!

  • @AlexanderWhillas
    @AlexanderWhillas 5 years ago +1

    These are also the top reasons Spark is still relatively unpopular :-/

    • @Machin396
      @Machin396 4 years ago +2

      Really? I thought it was already popular in 2020. If not, what else is gaining attention instead?

  • @kambaalayashwanth123
    @kambaalayashwanth123 5 years ago

    What about loading small files?

  • @popicf
    @popicf 7 years ago +1

    But what do you do if you have only a 7-node cluster with 4 cores and 8 GB RAM?
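
One way to explore a small cluster like this is to run the usual sizing arithmetic as a function. This is only a sketch under stated assumptions: the core and memory figures are read as per node, one core and 1 GB per node are reserved for the OS and cluster daemons, one executor slot is given up for the YARN application master, and memory overhead is taken as 10% of the heap. `size_executors` and all of its parameters are hypothetical names, not a Spark API, and as other comments here note, the right cores-per-executor value has to be tuned case by case.

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing; every reservation below is an assumption."""
    usable_cores = cores_per_node - 1    # assumed: 1 core left for OS and daemons
    usable_ram_gb = ram_gb_per_node - 1  # assumed: 1 GB left for OS and daemons

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("cores_per_executor exceeds the usable cores per node")

    # Assumed: one executor slot across the cluster goes to the application master.
    total_executors = nodes * executors_per_node - 1

    # Split node memory across its executors, then remove the overhead share.
    container_gb = usable_ram_gb / executors_per_node
    heap_gb = container_gb / (1.0 + overhead_fraction)

    return {
        "executors_per_node": executors_per_node,
        "total_executors": total_executors,
        "heap_gb": heap_gb,
    }


# The cluster from the comment above, trying 3 cores per executor.
small = size_executors(nodes=7, cores_per_node=4, ram_gb_per_node=8,
                       cores_per_executor=3)
print(small)
```

Under these assumptions the small cluster ends up with one executor per node. Trying `cores_per_executor=1` instead gives more, smaller executors, which is the same parallelism versus memory trade-off raised elsewhere in this thread.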

  • @CRTagadiya
    @CRTagadiya 8 years ago

    awesome

  • @vinothsmart1
    @vinothsmart1 6 years ago

    What was the tool he was talking about for Spark unit testing?

  • @nguyen4so9
    @nguyen4so9 7 years ago

    Very cool :) ..!

  • @TheSmartTrendTrader
    @TheSmartTrendTrader 5 years ago

    What is that special collection to do ETL?

    • @letscodewithvivek5191
      @letscodewithvivek5191 3 years ago

      I have the same question. Until now I have been doing ETL using DataFrames only and have never used any custom collections.

  • @sakthivel021
    @sakthivel021 5 years ago

    What is the solution to the 2 GB Spark shuffle size problem?

  • @rajjad
    @rajjad 7 years ago

    Where are the slides?

  • @JoHeN1990
    @JoHeN1990 4 years ago +1

    The data quality check article mentioned at 22:52 can be found here: web.archive.org/web/20181116232422/blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

  • @nakget
    @nakget 3 years ago

    How does each node get 3 executors at ruclips.net/video/WyfHUNnMutg/видео.html?

  • @StuggleIsSurreal
    @StuggleIsSurreal 3 years ago

    Spark, by itself, is not intended to handle CPU-intensive operations on your data. If you have a process against the data that requires a lot of CPU or memory resources and/or is consuming CPU time, move that process into a microservice or competing consumer pattern. This problem will bog down your data handling and prevent you from using Spark effectively.

  • @MisterKhash
    @MisterKhash 5 years ago

    I can't understand what he is saying !!