You Should Be Using Spark, Not MapReduce | Systems Design Interview 0 to 1 With Ex-Google SWE

  • Published: 18 Sep 2024

Comments • 30

  • @tarun4705 · 8 months ago · +8

    This playlist is 100 times better than all those paid system design courses, and it offers more in-depth explanations.

  • @LeoLeo-nx5gi · 1 year ago · +4

    Your explanations are so clear, thanks a ton!! Also, 10k coming soon 💪

  • @Maxim6431 · 4 months ago · +1

    Thanks for the video, I found some new info in it.
    The gaps I found:
    - Lineage is not explained, and lineage is how Spark recovers from worker failures
    - 8:29 is not correct: as a user you can trigger a write to disk (a checkpoint) to make it easier to recover from failures in a long lineage, but Spark does not do that automatically (see the sketch after this thread)
    - The API difference is not mentioned; that one is also a very big benefit of Spark over MR

    • @jordanhasnolife5163 · 4 months ago

      Guessing you mean lineage here.
      1) Which part specifically am I not explaining about how we recover from worker failures? We either hope the node comes back up, or we restore state from the previous checkpoint to another node and recompute any work that we need to.
      2) Good to know, thank you!
      3) Which API benefit? It seems to me that what you're mentioning is being able to use arbitrary operators beyond a mapper + reducer, which I do believe I mentioned, but maybe I forgot this time around.

    • @Maxim6431 · 4 months ago · +1

      @jordanhasnolife5163
      1) I meant that you could have just defined an RDD's lineage/query plan, explicitly stating that the sequence of upstream RDD transformations from the last checkpoint is maintained and used for failure recovery if needed.
      3) Spark provides a much richer and more powerful API, and after working with Spark, engineers don't want to go back to using bare-bones MR.
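
      A minimal PySpark sketch of the lineage and checkpointing points above, assuming a local SparkContext (the input path and names here are illustrative, not from the video):

          from pyspark import SparkContext

          sc = SparkContext("local[*]", "lineage-demo")
          sc.setCheckpointDir("/tmp/spark-checkpoints")  # reliable storage for checkpoints

          # Each transformation extends the RDD's lineage (its logical plan);
          # nothing executes until an action is called.
          lines = sc.textFile("events.txt")                  # illustrative input
          pairs = lines.map(lambda l: (l.split(",")[0], 1))  # narrow dependency
          counts = pairs.reduceByKey(lambda a, b: a + b)     # wide dependency (shuffle)

          # Print the lineage Spark would replay to recover a lost partition.
          print(counts.toDebugString().decode())

          # Checkpointing is opt-in: the user asks Spark to persist this RDD
          # so recovery replays from here instead of from the original source.
          counts.checkpoint()
          counts.count()  # action: runs the job and materializes the checkpoint

      This matches the point above: recovery normally replays the lineage, and checkpoint() is the user-triggered way to truncate it.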

  • @firoufirou3161 · 1 year ago · +3

    Thanks for the content. With Spark there is no need to store the entire dataset in memory; we can also persist part or all of it on disk, though that makes things slower. Also, the data is loaded partition by partition into the executors, so we don't need to fit the entire dataset in memory unless the input (file) format doesn't support reading in chunks (rare). I hope I am not missing something. (See the sketch after this thread.)

    • @jordanhasnolife5163 · 1 year ago · +2

      Yep! Right, fair point about loading by partition, and totally true with regards to not needing to fit it all; things just get slower haha
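
      A minimal PySpark sketch of the persistence point above, assuming a local SparkSession (the dataset is illustrative):

          from pyspark import StorageLevel
          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("persist-demo").getOrCreate()
          df = spark.range(100_000_000)  # illustrative dataset

          # MEMORY_AND_DISK keeps partitions in memory and spills whatever
          # doesn't fit to disk, rather than failing or recomputing.
          df.persist(StorageLevel.MEMORY_AND_DISK)

          # DISK_ONLY trades speed for memory entirely:
          # df.persist(StorageLevel.DISK_ONLY)

          df.count()  # action: materializes and caches the partitions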

  • @andydataguy · 11 months ago · +1

    Drunk vids are the best. Thanks for recording this!

  • @matveyshishov · 11 days ago · +1

    When I want to know how something hyped works, I usually look for someone who has beef with it from having been there and done that.
    For MapReduce, that's Stonebraker (he's basically the Schmidhuber of the DB world), who wrote a paper "Why MapReduce is a dumb hype and sucks big time" (well, he later renamed it "MapReduce: A major step backwards"), where he nicely shts on MR, with references.
    So yeah, MR is bs, and if it weren't for Google, nobody would've even touched it, but 2011 was the year resume-driven development exploded big time, and most architects used every opportunity to prepare for an interview at Google at the expense of their pointy-haired managers.
    HOWEVER, Jeff Dean isn't dumb, and IIRC MR was never developed to be efficient and whatnot; rather, there was a massive underutilization problem, power-saving tech hadn't entered the picture yet, and so if a shtty piece of commodity hardware wasn't used at 100%, it would fail for nothing, leaving behind only an electricity bill. MR was an attempt to run SOMETHING as low-priority jobs which, if the machine was needed, would be killed with no remorse (i.e. spot instances). Thus, however dumb an idea was (like that famous ML project discovering that the major eigenvalues of all the YouTube videos in the world look like kittens), running it at Google scale was considered a good opportunity for the 20% projects. Those were good times, kids, and don't ask me what Google Wave was; we don't talk about it in decent societies.
    On the other hand, blitzscaling was entering the picture, with the IQ of CS grads dropping like a stone, faster than the expectations of team leads, and map/fold (functor/monoid) was chosen as a concept simple enough for anyone to operate on, basically the microwave of the FP world. So Jeff and his team wrote the difficult parts (shuffling, coordinating) and dumbed down the exposed parts as much as possible. Stragglers? Who cares! Skewed reducers, all outputs mapped to the same key? You go girl!
    Don't quote me on this; I only heard this story from unreliable sources who probably lied.
    Better to listen to Stonebraker and watch Andy Pavlo.

    • @jordanhasnolife5163 · 11 days ago · +1

      Really funny that you commented this; I read the article last night. Yeah, it's an interesting one, but funny to see MR's popularity nonetheless.
      Andy Pavlo is great as well!

  • @msebrahim-007 · 2 months ago · +1

    I have trouble following how Spark addresses a wide-dependency failure (8:15).
    Is the solution for dealing with a wide-dependency failure to assume the wide dependency succeeded and then write the results to disk?
    How does that address a failed wide dependency? For instance, using your example diagram, what if the top node failed after {a: 3} and never got {a: 6}? Writing a partial result to disk here wouldn't be entirely useful.

    • @jordanhasnolife5163 · 2 months ago · +1

      You'd have to redo the computation from the last checkpoint up to this point.
      In the example you provided, that would mean the bottom node went down: we'd spin up another node and have it redo that local computation. (See the sketch below.)
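
      A minimal sketch of the recovery idea being discussed, assuming a PySpark RDD job (the data is illustrative): if a partition of the wide (shuffled) result is lost, Spark replays the lineage for that partition from the last materialized point.

          from pyspark import SparkContext

          sc = SparkContext("local[*]", "wide-dep-demo")
          sc.setCheckpointDir("/tmp/spark-checkpoints")

          words = sc.parallelize(["a", "b", "a", "c", "a"])
          pairs = words.map(lambda w: (w, 1))             # narrow: per-partition
          counts = pairs.reduceByKey(lambda x, y: x + y)  # wide: shuffle boundary

          # Without a checkpoint, losing a partition of `counts` means
          # re-running the map and re-fetching that partition's shuffle inputs.
          counts.checkpoint()  # once materialized, recovery starts from disk
          counts.collect()     # action: runs the job and writes the checkpoint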

  • @KENTOSI · 1 year ago · +1

    Excellent explanation and summary. Thanks!

  • @navdeepredhu4081 · 1 year ago · +5

    Why do you only have 10k subscribers!!!

    • @jordanhasnolife5163 · 1 year ago · +2

      Haha, you guys gotta tell your friends about the channel - randomly having a nice day subs-wise here though

    • @navdeepredhu4081 · 1 year ago · +2

      You gotta start showing some skin if you want more subs😂

    • @jordanhasnolife5163 · 1 year ago

      @navdeepredhu4081 incoming toes next video

    • @user-se9zv8hq9r · 1 year ago · +2

      @jordanhasnolife5163 instead of a facecam, how about a feetcam

    • @jordanhasnolife5163 · 1 year ago · +1

      @user-se9zv8hq9r Now these are the ideas I'm looking for, you're hired

  • @adrian333dev · 4 months ago · +2

    Dude what are these jokes??? 😂😂

  • @varshard0 · 8 months ago · +1

    With Flink around, why do we still want to use Spark?

    • @jordanhasnolife5163 · 8 months ago · +1

      Flink is only useful for processing data as it comes in. With Spark, the goal is to take existing data on disk and output more data to disk. (A sketch below contrasts the two.)

    • @varshard0 · 8 months ago

      @jordanhasnolife5163 But with Spark Streaming, that would allow it to perform the same role as Flink, right?
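
      For context, a minimal sketch of the batch vs. streaming distinction, assuming PySpark with illustrative paths and schema: classic Spark reads data at rest and writes a result, while Structured Streaming runs the same kind of query continuously in micro-batches (vs. Flink's per-event model).

          from pyspark.sql import SparkSession
          from pyspark.sql.types import StringType, StructField, StructType

          spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

          # Batch Spark: existing data on disk in, new data on disk out.
          batch = spark.read.json("events/")                   # illustrative path
          batch.groupBy("user").count().write.parquet("out/")  # illustrative sink

          # Structured Streaming: the same query shape over an unbounded source.
          schema = StructType([StructField("user", StringType())])
          stream = spark.readStream.schema(schema).json("incoming/")
          query = (stream.groupBy("user").count()
                         .writeStream.outputMode("complete")
                         .format("console")
                         .start())
          query.awaitTermination()  # blocks; stop with query.stop()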

  • @SicknessesPVP · 5 months ago · +1

    this old intro xD

  • @Kris-zy5qm · 2 days ago · +1

    MapReduce has already been dead for over a decade

  • @harris1801 · 1 month ago · +1

    when I'm in an "open your video with a joke about sex, dropping a deuce, masturbation, or romantic self-deprecation" competition and my opponent is Jordan