This channel’s title is becoming more and more accurate as the series continues.
Amen to that brother
@@jordanhasnolife5163 my advice to you as you go down this route is that if you make your channel more specialized, you risk losing a general audience. you'll have to decide whether the tradeoff is worth it -- or you'll have to put in more effort trying to fit the specialized topic to the general audience. I am interested in these papers but also feel like I get lost in a lot of the details -- and I am still struggling to feel competent at general system design. I think what's helpful / interesting here is that we see how real-world problems get solved, not just the toy examples you have to work through. but if you want to make these papers relevant to a broader audience, you might have to work a little harder to uncover the lessons that aren't specific to the particular system -- the ones we can think about when we get interview questions, when we're working, etc.
This is the best video of yours I've checked out so far! Great job!
Jordan, keep up the amazing work. Thank you.
gold, pure gold, thanks for this
Oh my, I wanted to replicate this GFS all by myself and was searching for good resources. This is literally so golden, thanks a lot :D
Always enjoy your system design videos and following your no life strategy 💯👍
Jordan sat alone in his dimly lit room, eyes fixed on the screen. Kafka Streams flowed before him like a slow, seductive dance. His fingers moved over the keyboard, sending commands that made the data bend to his will: smooth, precise, and totally in his control.
“Processing this much data feels like... processing my love life,” he chuckled, leaning in closer. “A little messy at first, but once I get my hands on it, everything falls into place... perfectly.”
The logs rolled in real time, a rhythm that matched his heartbeat. Who needed real life when the streams responded to him this way? Controlled. Obedient. Alive. Jordan didn’t just process data; he made it swoon.
Real or ChatGPT? Either way, well done, I do think about kafka streams a lot
Hey Jordan! It took me a while to realize that writing is done in chunks and that chunks can get stored on different servers! I think this is a basic principle and it would be beneficial to share it in the video. Actually, why do we want to store multiple chunks of the same file on different servers? We could achieve high throughput by assigning multiple servers to store files, with each server handling one whole file. Thanks for the video!
Thanks Mark!
1) I feel like I tried to cover this at 3:48
2) We use chunks to better distribute data for super large files, but 64MB is still big enough that there isn't a massive overhead to having to jump around every once in a while to a new server to load the next chunk
3) We'll eagerly load the metadata of all chunk locations as an optimization when we try to read one file, and at that point you're pretty much not losing any read performance
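To make the chunk math concrete, here's a minimal sketch of how a client might turn a file offset into a chunk lookup, assuming the paper's fixed 64MB chunk size (the function name is hypothetical, not from GFS):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def locate(byte_offset: int) -> tuple[int, int]:
    """Map a file byte offset to (chunk index, offset within that chunk).

    The client sends the file name and chunk index to the master, gets back
    the chunk handle plus replica locations, and then reads directly from a
    chunk server at the within-chunk offset.
    """
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Reading at the 200 MB mark lands in chunk 3, 8 MB into that chunk.
print(locate(200 * 1024 * 1024))  # (3, 8388608)
```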
Thanks a lot for such great content!
A question about the checksums on writes at 33:47: when/why could we have a corrupted fragment of a 64KB region like C1? I didn't get why that could happen... Thanks!
With enough standard commodity disks data can just get corrupted randomly sometimes due to hardware failures.
@@jordanhasnolife5163 Since data can be corrupted randomly, shouldn't region checksums be recalculated all the time, not only when there is a write involving multiple 64KB regions? I guess not, since that sounds pretty expensive...
In the provided example, it seems the reason to recalculate all three regions' checksums comes from the fact that the write involves those regions, but maybe I'm missing something. My confusion is about why potential corruption even matters.
@@ВалерийГоловко-т9я On every write certainly!
Corruption matters because then we no longer have data that we thought we could access.
Now if we just had one replica, sure, detecting corruption would be useless. But when you have 3, you can see that one has been corrupted and make another replica off of one of the uncorrupted ones.
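To spell out the mechanics, here's a rough sketch of per-region checksumming; CRC32 is a stand-in, since the paper specifies 32-bit checksums over 64KB regions but not a particular algorithm, and the function names are hypothetical:

```python
import zlib

REGION = 64 * 1024  # each 64 KB region of a chunk gets its own 32-bit checksum

def region_checksums(chunk: bytes) -> list[int]:
    """Compute one checksum per 64 KB region of a chunk."""
    return [zlib.crc32(chunk[i:i + REGION]) for i in range(0, len(chunk), REGION)]

def corrupt_regions(chunk: bytes, stored: list[int]) -> list[int]:
    """On a read, recompute checksums and flag any region that no longer matches."""
    return [i for i, crc in enumerate(region_checksums(chunk)) if crc != stored[i]]
```

A chunk server that finds a mismatch returns an error to the reader and reports it to the master, which re-replicates the chunk from an intact replica, which is exactly why detection only pays off when more than one copy exists.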
love the subtle pun, whether intended or not 😂
Thanks for the great video! Question about atomic appends. Why is interleaving bad? Is it because 2 different chunks could be appended next to each other, so adding a buffer still allows concurrent writes? Confusing 😢
If I want to write some data, it's possible that data doesn't make any sense if I only see the first few lines and then somewhere much later in the next chunk I see the last three lines. If two clients append data A and data B, I don't want to read the first half of A, then B, then the second half of A (because there was a chunk boundary).
A = "My name is Jordan"
B = "Corinna Kopf"
If we interleave these they become "My name is Corinna Kopf Jordan" Doesn't make much sense anymore does it?
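The boundary problem is easy to simulate. A toy sketch of the interleaving above, with an artificially tiny chunk size so that A straddles a boundary:

```python
A = "My name is Jordan"
B = "Corinna Kopf"

CHUNK = 11  # tiny chunk size, just to force A across a chunk boundary

# A fills its first chunk; B's append lands before A's remainder does.
interleaved = A[:CHUNK] + B + " " + A[CHUNK:]
print(interleaved)  # My name is Corinna Kopf Jordan
```

GFS's record append sidesteps this by writing each record atomically within a single chunk: if a record won't fit in the current chunk, that chunk gets padded out and the record starts fresh in a new one, so no record ever straddles a boundary.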
Great Content !!! ♥
if network bandwidth was not a concern, then would it be more efficient for the client to write to all three replicas in parallel rather than "data pipelining" to the closest chunk server?
also, after successful replication and storing in memory, would the client get an "ACK" from the closest chunk server which then forwards that on to the master where it would perhaps append to its op log?
Not really sure what you mean by most efficient. If you mean the fastest write throughput, then possibly, but I imagine that depends on the client's outgoing bandwidth.
The client receives an ack from the primary chunk server, as that's the one that initiates the write.
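For a sense of scale, the paper estimates the pipelined transfer time for B bytes over R replicas as roughly B/T + R*L, where T is per-link throughput and L is per-hop latency. A quick back-of-envelope using the paper's 100 Mbps links and an assumed 1 ms hop latency:

```python
B = 64 * 1024 * 1024       # one full 64 MB chunk
T = 100e6 / 8              # 100 Mbps link, in bytes per second
L = 0.001                  # assumed 1 ms per-hop latency
R = 3                      # three replicas

pipelined = B / T + R * L  # each chunk server forwards bytes as they arrive
fan_out = R * B / T        # client pushes all three copies over its own uplink
print(f"{pipelined:.2f}s vs {fan_out:.2f}s")  # 5.37s vs 16.11s
```

Pipelining lets every machine use its full outbound bandwidth, while fanning out from the client triples the load on the client's single uplink.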
Great content
This is just to let you know that you didn't have anything in the description of this video, i.e. no pointers to the paper, etc.
I'm almost certain the first link in the description is a link to the paper.
Sipes Roads
hi, I am sleeping. how can you get hired at Google?
I don't know what this means
Dude u lost a lot of weight. Nice work
Ha thank you man though I'm afraid it's about to be bulking season so I'll be right back
One must imagine Sisyphus happy
@@huz1 I like that
Has anyone tried implementing it from scratch?
Google, Hadoop, I'm sure plenty of others
A quick question though, what did you have for starters today? 😅
Care to elaborate?