You Should Be Using Spark, Not MapReduce | Systems Design Interview 0 to 1 With Ex-Google SWE

  • Published: 18 Sep 2024

Comments • 30

  • @tarun4705 · 8 months ago · +8

    This playlist is 100 times better than all those paid system design courses, and it offers more in-depth explanations.

  • @LeoLeo-nx5gi · 1 year ago · +4

    Your explanations are so clear, thanks a ton!! Also, 10k coming soon 💪

  • @Maxim6431 · 4 months ago · +1

    Thanks for the video, I found some new info in it.
    The gaps I found:
    - Lineage is not explained, and lineage is how Spark recovers from worker failures
    - 8:29 is not correct: as a user you can trigger a write to disk (a checkpoint) to make it easier to recover from failures in a long lineage, but Spark does not do that automatically (see the sketch after this thread)
    - The API difference is not mentioned; that one is also a very big benefit of Spark over MR

    • @jordanhasnolife5163 · 4 months ago

      Guessing you mean lineage here.
      1) Which part specifically am I not explaining about how we recover from worker failures? We either hope the node comes back up, or we restore state from the previous checkpoint to another node and recompute any work that we need to.
      2) Good to know, thank you!
      3) Which API benefit? It seems to me that what you're mentioning is being able to use arbitrary operators beyond a mapper + reducer, which I do believe I mentioned, but maybe I forgot this time around.

    • @Maxim6431 · 4 months ago · +1

      @jordanhasnolife5163
      1) I meant that you could have just defined an RDD's lineage/query plan, explicitly stating that the sequence of upstream RDD transformations from the last checkpoint is maintained and used for failure recovery if needed.
      3) Spark provides a much richer and more powerful API, and after working with Spark, engineers don't want to go back to using bare-bones MR.
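
      A minimal PySpark sketch of the lineage and checkpointing points above, assuming a local SparkContext (the input path and names here are illustrative, not from the video):

          from pyspark import SparkContext

          sc = SparkContext("local[*]", "lineage-demo")
          sc.setCheckpointDir("/tmp/spark-checkpoints")  # reliable storage for checkpoints

          # Each transformation extends the RDD's lineage (its logical plan);
          # nothing executes until an action is called.
          lines = sc.textFile("events.txt")                  # illustrative input
          pairs = lines.map(lambda l: (l.split(",")[0], 1))  # narrow dependency
          counts = pairs.reduceByKey(lambda a, b: a + b)     # wide dependency (shuffle)

          # Print the lineage Spark would replay to recover a lost partition.
          print(counts.toDebugString().decode())

          # Checkpointing is opt-in: the user asks Spark to persist this RDD
          # so recovery replays from here instead of from the original source.
          counts.checkpoint()
          counts.count()  # action: runs the job and materializes the checkpoint

      This matches the point above: recovery normally replays the lineage, and checkpoint() is the user-triggered way to truncate it.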

  • @firoufirou3161 · 1 year ago · +3

    Thanks for the content. With Spark there is no need to store the entire dataset in memory; we can also persist part or all of it on disk, though that makes things slower. Also, the data is loaded partition by partition into the executors, so we don't need to fit the entire dataset in memory unless the input (file) format doesn't support reading in chunks (rare). I hope I am not missing something. (See the sketch after this thread.)

    • @jordanhasnolife5163 · 1 year ago · +2

      Yep! Right, fair point about loading by partition, and totally true with regards to not needing to fit it all; things just get slower haha
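
      A minimal PySpark sketch of the persistence point above, assuming a local SparkSession (the dataset is illustrative):

          from pyspark import StorageLevel
          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("persist-demo").getOrCreate()
          df = spark.range(100_000_000)  # illustrative dataset

          # MEMORY_AND_DISK keeps partitions in memory and spills whatever
          # doesn't fit to disk, rather than failing or recomputing.
          df.persist(StorageLevel.MEMORY_AND_DISK)

          # DISK_ONLY trades speed for memory entirely:
          # df.persist(StorageLevel.DISK_ONLY)

          df.count()  # action: materializes and caches the partitions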

  • @andydataguy · 11 months ago · +1

    Drunk vids are the best. Thanks for recording this!

  • @matveyshishov · 11 days ago · +1

    When I want to know how something hyped works, I usually look for someone who has beef with it from having been there and done that.
    For MapReduce, that's Stonebraker (he's basically the Schmidhuber of the DB world), who wrote a paper "Why MapReduce is a dumb hype and sucks big time" (well, he later renamed it "MapReduce: A major step backwards"), where he nicely shts on MR, with references.
    So yeah, MR is bs, and if it weren't for Google, nobody would've even touched it, but 2011 was the year resume-driven development exploded big time, and most architects used every opportunity to prepare for an interview at Google at the expense of their pointy-haired managers.
    HOWEVER, Jeff Dean isn't dumb, and IIRC MR was never developed to be efficient and whatnot; rather, there was a massive underutilization problem, power-saving tech hadn't entered the picture yet, and so if a shtty piece of commodity hardware wasn't used at 100%, it would fail for nothing, leaving behind only an electricity bill. MR was an attempt to run SOMETHING as low-priority jobs which, if the machine was needed, would be killed with no remorse (i.e. spot instances). Thus, however dumb an idea was (like that famous ML project discovering that the major eigenvalues of all the YouTube videos in the world look like kittens), running it at Google scale was considered a good opportunity for the 20% projects. Those were good times, kids, and don't ask me what Google Wave was; we don't talk about it in decent societies.
    On the other hand, blitzscaling was entering the picture, with the IQ of CS grads dropping like a stone, faster than the expectations of team leads, and map/fold (functor/monoid) was chosen as a concept simple enough for anyone to operate on, basically the microwave of the FP world. So Jeff and his team wrote the difficult parts (shuffling, coordinating) and dumbed down the exposed parts as much as possible. Stragglers? Who cares! Skewed reducers, all outputs mapped to the same key? You go girl!
    Don't quote me on this; I only heard this story from unreliable sources who probably lied.
    Better to listen to Stonebraker and watch Andy Pavlo.

    • @jordanhasnolife5163 · 11 days ago · +1

      Really funny that you commented this; I read the article last night. Yeah, it's an interesting one, but funny to see MR's popularity nonetheless.
      Andy Pavlo is great as well!

  • @msebrahim-007 · 2 months ago · +1

    I have trouble following how Spark addresses a wide-dependency failure (8:15).
    Is the solution for dealing with a wide-dependency failure to assume the wide dependency succeeded and then write the results to disk?
    How does that address a failed wide dependency? For instance, using your example diagram, what if the top node failed after {a: 3} and never got {a: 6}? Writing a partial result to disk here wouldn't be entirely useful.

    • @jordanhasnolife5163 · 2 months ago · +1

      You'd have to redo the computation from the last checkpoint up to this point.
      In the example you provided, that would mean the bottom node went down: we'd spin up another node and have it redo that local computation. (See the sketch below.)
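
      A minimal sketch of the recovery idea being discussed, assuming a PySpark RDD job (the data is illustrative): if a partition of the wide (shuffled) result is lost, Spark replays the lineage for that partition from the last materialized point.

          from pyspark import SparkContext

          sc = SparkContext("local[*]", "wide-dep-demo")
          sc.setCheckpointDir("/tmp/spark-checkpoints")

          words = sc.parallelize(["a", "b", "a", "c", "a"])
          pairs = words.map(lambda w: (w, 1))             # narrow: per-partition
          counts = pairs.reduceByKey(lambda x, y: x + y)  # wide: shuffle boundary

          # Without a checkpoint, losing a partition of `counts` means
          # re-running the map and re-fetching that partition's shuffle inputs.
          counts.checkpoint()  # once materialized, recovery starts from disk
          counts.collect()     # action: runs the job and writes the checkpoint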

  • @KENTOSI · 1 year ago · +1

    Excellent explanation and summary. Thanks!

  • @navdeepredhu4081 · 1 year ago · +5

    Why do you only have 10k subscribers!!!

    • @jordanhasnolife5163 · 1 year ago · +2

      Haha, you guys gotta tell your friends about the channel - randomly having a nice day subs-wise here though

    • @navdeepredhu4081 · 1 year ago · +2

      You gotta start showing some skin if you want more subs😂

    • @jordanhasnolife5163 · 1 year ago

      @navdeepredhu4081 incoming toes next video

    • @user-se9zv8hq9r · 1 year ago · +2

      @jordanhasnolife5163 instead of a facecam, how about a feetcam

    • @jordanhasnolife5163 · 1 year ago · +1

      @user-se9zv8hq9r Now these are the ideas I'm looking for, you're hired

  • @adrian333dev · 4 months ago · +2

    Dude what are these jokes??? 😂😂

  • @varshard0 · 8 months ago · +1

    With Flink around, why do we still want to use Spark?

    • @jordanhasnolife5163 · 8 months ago · +1

      Flink is only useful for processing data as it comes in. With Spark, the goal is to take existing data on disk and output more data to disk. (A sketch below contrasts the two.)

    • @varshard0 · 8 months ago

      @jordanhasnolife5163 But with Spark Streaming, that would allow it to perform the same role as Flink, right?
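
      For context, a minimal sketch of the batch vs. streaming distinction, assuming PySpark with illustrative paths and schema: classic Spark reads data at rest and writes a result, while Structured Streaming runs the same kind of query continuously in micro-batches (vs. Flink's per-event model).

          from pyspark.sql import SparkSession
          from pyspark.sql.types import StringType, StructField, StructType

          spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

          # Batch Spark: existing data on disk in, new data on disk out.
          batch = spark.read.json("events/")                   # illustrative path
          batch.groupBy("user").count().write.parquet("out/")  # illustrative sink

          # Structured Streaming: the same query shape over an unbounded source.
          schema = StructType([StructField("user", StringType())])
          stream = spark.readStream.schema(schema).json("incoming/")
          query = (stream.groupBy("user").count()
                         .writeStream.outputMode("complete")
                         .format("console")
                         .start())
          query.awaitTermination()  # blocks; stop with query.stop()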

  • @SicknessesPVP · 5 months ago · +1

    this old intro xD

  • @Kris-zy5qm · 2 days ago · +1

    MapReduce has already been dead for over a decade

  • @harris1801 · 1 month ago · +1

    when I'm in an "open your video with a joke about sex, dropping a deuce, masturbation, or romantic self-deprecation" competition and my opponent is Jordan