Hey Jordan, just wanted to drop a huge thank you for your system design videos! They were crucial in helping me land an E4 offer at Facebook Singapore (I did product architecture instead of system design). Really appreciate the knowledge and insights you've shared. Cheers!
That's amazing man! Congrats, your hard work paid off!
Are you able to give me some guidance on what to expect and the aspects that you felt were important to cover for the Product Architecture interview? I have one coming up and I'm at a complete loss as to where they'll steer the conversations.
@@hl4113 1. Contact your recruiter for a detailed outline of the interview's structure, focusing on timelines and key areas.
2. Use Jordan's channel and Grokking the API design course for preparation
All the best!
this is production level detail - definitely requires a second sweep to memorize better!
I've always wondered how Google Docs is built, this is an amazing video, thank you so much!
This channel is so underrated.
Talking points
4:42 Naive solution with locks (avoid conflicts), can talk about file formats (markdown?), security, old files in cold storage, web sockets, offline behaviour
Write path
6:30 Operational transform single node, can discuss problems with multi nodes (order of changes inconsistent)
10:35 CRDT overview
13:10 State vs Operation CRDT, idempotency, edge case assignment to same field
21:00 Idempotency, version vectors
33:33 Data schema, writes -cdc-> snapshots, partition on doc_id, old writes may be archived after snapshot.
42:35 Diagram
Got an offer from LinkedIn. Your videos were great help in system design interview ❤.
Legend!! Congratulations!
Great explanation on CRDT and version vector. Couple thoughts:
1. still not convinced if we need one document on multiple shards. we could just shard by document id, and no matter what one shard can still easily serve many documents. someone could argue that there will be a cross-region latency between DB and the document if it's in a different region, but for the operation to propagate from one user to another in different region, we will have that anyway.
2. besides, we can always have the most popular document operations on regional redis caches in the form of a sorted set for example (again sharded by document id), so that it also serves as a snapshot with strong consistency.
3. we also don't need to wait for the database to save successfully to fanout to others, we can produce to the next kafka queue and then once the database commits, we commit the last queue. as long as there's idempotency here we're fine.
maybe I'm missing something here
Fair points!
1) I think cross region and offline editing for it is pretty compelling
2) If you want cross data center redis caches to be strongly consistent, you need consensus between them no?
3) Generally agree with your point, though it could be possible that unsuccessful writes to the DB that we don't end up retrying for some reason become present on the documents of others!
1. The problem is not with big documents; rather, OT requires complex transformation functions to resolve conflicts between concurrent operations. As the number of concurrent users increases, the number of transformations required grows, making it difficult to manage and ensure consistency on a single server.
Thanks for the great content Jordan. I have a thought and I want your input:
Since your text CRDT basically eliminates conflicts, you don't need a version vector for resolving conflicts anymore. But I think we can still use it for:
1. avoid unnecessarily merging the CRDTs (i.e. if two nodes have the same VV, then they don't need to merge, or if one vector is strictly smaller than the other, then it can just simply discard itself and the other's CRDT values)
2. use the VV to filter out the unnecessary writes. (i think you covered this implicitly)
3. we use VV to create a partial order and thus achieve causality, although I can't be sure whether we need causality? (conflict free should already promise convergence, we might not care about how we eventually get there, except that we want document versioning?)
So this isn't a state-based CRDT (sending the whole document), it's an operation-based CRDT (just sending incremental changes), which is why I believe the version vector is important here: operation-based CRDT messages are not necessarily idempotent.
It enables us to
1) Make sure we don't apply any message more than once
2) Make sure we apply all messages
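A minimal sketch of what that version-vector bookkeeping might look like (all names here are hypothetical, and it assumes in-order delivery of each replica's operations):

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Op:
    replica_id: str   # which client/server produced this edit
    seq: int          # that replica's monotonically increasing counter
    payload: dict     # e.g. {"position": 0.75, "char": "i"}

class Order(Enum):
    EQUAL = 0
    BEFORE = 1        # strictly behind the other vector
    AFTER = 2         # strictly ahead of the other vector
    CONCURRENT = 3    # neither dominates: a real merge is needed

class MissingWrites(Exception):
    """Signals a gap: fetch the missing writes (e.g. from the writes table)."""

class VersionVector:
    """Maps replica_id -> count of operations seen from that replica."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def compare(self, other):
        keys = set(self.counts) | set(other.counts)
        le = all(self.counts.get(k, 0) <= other.counts.get(k, 0) for k in keys)
        ge = all(self.counts.get(k, 0) >= other.counts.get(k, 0) for k in keys)
        if le and ge:
            return Order.EQUAL
        if le:
            return Order.BEFORE
        if ge:
            return Order.AFTER
        return Order.CONCURRENT

    def record(self, replica_id, seq):
        self.counts[replica_id] = max(self.counts.get(replica_id, 0), seq)

def maybe_apply(vv: VersionVector, op: Op, apply_fn) -> None:
    """Apply an operation-based CRDT message at most once, and detect gaps."""
    applied = vv.counts.get(op.replica_id, 0)
    if op.seq <= applied:
        return                                        # duplicate delivery: ignore (point 1)
    if op.seq > applied + 1:
        raise MissingWrites(op.replica_id, applied)   # a message was skipped (point 2)
    apply_fn(op)                                      # assumed single-op CRDT merge
    vv.record(op.replica_id, op.seq)
```

Points 1 and 2 above fall out of the duplicate/gap check, and the `compare` result is what lets two replicas skip merging entirely when their vectors are equal or strictly ordered, as the parent comment suggests.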
writing an ot has operationally transformed my free time into wasted free time
Sounds like a solid operation to me!
Jordan ! Great video as always 🎉.
I have a question , have you considered expanding into maybe dissecting an open source product in a video explaining why certain design decisions were made & discuss maybe how you would alternatively try to solve them ? Once again love all the work you put in, this is GOLD. Thanks !
That's an interesting idea! To tell you the truth, while I'm curious about doing this, the truth is that the amount of time that I'd probably have to put into looking into those codebases would be pretty wild haha.
Not to mention that the guys working on open source software are a lot more talented than me!
Excellent detailed coverage of online text editor. And you made it easy to understand the concepts.
Huge help in landing L4 at Netflix. Much thanks!
Legend!! Congrats :)
The level of detail in this video makes me want to burn all those stupid superficial bs i have been reading all these years. Imma name my 3rd kid after your channel dude ;).... the 2nd one is gotta be martin tho
"Jordan has no life" is a great name for a kid, I endorse
Great content as always! One thing that's not clear to me is the need for write tables and Kafka queues. Why don't we (write servers) write directly to the "snapshot" table? Since we eventually do that and sort the entries on the Kafka consumer side, wasn't the main purpose of the snapshot to have the edits sorted by their position, rather than the order in which they arrive? Is the reason for using CDC-Kafka and sorting to slow down the incoming edits, ensuring that no user edits are lost? Can you clarify that a bit? These new components impose delay and overhead.
You're correct that if there weren't a writes table I think it would be totally fine to write directly to the snapshot table.
The reason I want a writes table is so that we can ask it for any versions of writes that we need to replay (how can we ask a document server to replay prior writes when it just keeps track of its own local state that isn't necessarily persistent?), all of which should be associated with a version vector and easily accessible for clients that are loading up. I then figured that in order to keep all of our data consistent we could use change data capture.
I think an additional point is that because the writes tables exist, I don't particularly care whether the snapshot tables are a bit delayed.
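To make the relationship between the two tables concrete, here's a rough sketch of what the row shapes and the client replay query could look like; the field names are illustrative, not taken from the video:

```python
from dataclasses import dataclass

@dataclass
class WriteRow:            # writes table: source of truth, CDC'd into snapshots
    doc_id: str
    replica_id: str        # which write/document server produced the edit
    seq: int               # per-replica counter, i.e. one version vector entry
    position: float        # fractional index of the character
    char: str
    deleted: bool = False

@dataclass
class SnapshotRow:         # snapshot table: derived, indexed on (doc_id, position)
    doc_id: str
    position: float
    char: str

def writes_to_replay(writes: list[WriteRow], client_vv: dict[str, int]) -> list[WriteRow]:
    """Writes a loading client hasn't seen yet, given the version vector
    associated with the snapshot it just loaded."""
    return [w for w in writes if w.seq > client_vv.get(w.replica_id, 0)]
```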
Hands down the most in-depth coverage of the topic!
One question that I had - is MySQL a good choice for the write DB considering that it will be write-heavy?
Well, maybe not, since I wonder how good of a write throughput we can get with an ACID database using B-trees. That being said, I'm sure it's fine realistically.
Jordan, I've been watching all your videos and can't thank you enough for how useful they have been for going into depth in system design interviews. However, my question is: is one supposed to cover all the subsystems and areas in the effectively 50 minutes of the interview? I found I can only cover 1-2 subsystems within that time, between all the back and forth with the interviewer.
I think that you basically have to try and tell from your interviewer which specific pieces of the problem that they'd like to focus on!
@ I thought so, as I get too hard on myself when I don't get enough time to go into depth on multiple areas.
@@akshayd82 yeah I think you should just go into a couple of them or try to mock one out with friends to prove it to yourself
dang this guy is really good! Thanks for making the video!
Hey Jordan, great video. Learnt a lot, thanks for sharing your thoughts. I'm just wondering what level of detail you would go into for things like conflict resolution. Would it be better to give a hand-wavy but convincing example and move on?
But the explanation of CRDT reminded me of a very elevated experience I’ve had haha! Amazing
I think that for a problem like this, the sequence CRDT is something that PhDs are actively working on. So I highly doubt your interviewer is really going to press you on that unless you're literally going to the google docs team to go work in it yourself lmao.
That being said, I think the overarching methodology of the operation based CRDT is worth mentioning!
Thanks for the great video as always! a few questions:
1. When getting the doc for a new user, do we really need to sort writes by their character position? Couldn't we just replay every write in its arrival order in linear time (with maybe more CRDT conflict resolution needed)?
2. Why would the single snapshot table solution make this DB a bottleneck? (Is it because we need to process all writes in the server instead of quickly writing into the DB, or because we have more writes per server?)
3. When we use the more flexible approach (not forcing the same server to broadcast the change to all users), does that mean we need to keep every server connected to every user (via WebSockets if we want real time)?
1) Replaying every write is expensive, so I'd expect that to take a while to load. I'd prefer to load a snapshot of the document, where each character is itself already sorted in the DB index.
2) It doesn't inherently become a bottleneck, but you can have thousands of servers processing updates for a single document, so for those to all get coalesced into a single snapshot database seems like a potential flaw in our system, unless we're able to throttle those somehow into kafka or something.
3) Not necessarily, a user could be connected to one server, and the servers could send messages amongst themselves to maintain their own local state of the document and then send that to the user.
@@jordanhasnolife5163 Thanks!
For 3, I see... I was thinking it might be easy to assign the same server to one doc (but yeah, some popular docs might have a lot of users). Some follow-up questions:
1. Even with CRDTs, do we still have a cursor-level lock for each doc (it feels like we cannot edit the same position, from a user-experience standpoint)? If that lock exists and spans multiple servers, do we need something like a distributed lock?
2. Also, if data is lost while being sent from one server to another, how can we ensure it gets retried (a queue? but that might hurt latency a bit), and will there be eventual consistency between servers?
@@Anonymous-ym6st On a per server level you'd have to do some locking yeah, assuming it can handle many requests at once.
2) We have a bunch of retry mechanisms that we use for loading document state to a client, you can do the same for loading them to a server.
Hi Jordan! Just watching the CRDT part of the video where you mention giving fractional ids to the characters, between 0 and 1. I was wondering how/at what point these ids are assigned. For instance, if you create a blank document and start typing, what would it look like? And if you then add a few paragraphs at the end, how would these new indexes be assigned? The example you gave (and that I've seen in other places) treats it as an already existing document with already-assigned indexes, and you're just inserting stuff in between.
I was thinking it might be a session thing - i.e. the first user that opens a connection to the file gets these assigned and stores in memory or something, but I watched another video where you mention it being indexed in a database. I'd love to know!
I think I understood in the end, maybe? Indexes 0 and 1 don't actually exist - your first character will be around 0.5, second character around 0.75, and so on... you're only going to get indexes < 0.5 if you go back in the text and add characters before the first character you added. If you write without stopping or going back, you'll get 0.5, 0.75, 0.875, 0.9375 and so on?
Hey! I think this is probably implementation dependent, but I imagine the idea here is that there's some frontend logic to batch quick keystrokes together so that they're all assigned similar indices, as opposed to constantly bisecting the outer characters (see the BIRD and CAT example).
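For anyone curious, here's a tiny sketch of the fractional-index assignment being described in this thread (real implementations typically use arbitrary-precision strings or lists rather than floats, which would eventually run out of bits):

```python
def position_between(left: float, right: float) -> float:
    """Pick an index strictly between two neighbouring characters."""
    return left + (right - left) / 2.0

# Typing "hi!" into a blank document, always appending at the end
# (the virtual bounds 0.0 and 1.0 never hold real characters):
positions, left = [], 0.0
for ch in "hi!":
    left = position_between(left, 1.0)
    positions.append((left, ch))
# positions == [(0.5, 'h'), (0.75, 'i'), (0.875, '!')]
```

Inserting before the first character would bisect toward 0.0 instead (0.25, 0.125, ...), which is why only back-edits produce indexes below 0.5.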
Hey Jordan, is there any way you can make some content regarding how to tackle the product architecture interview? I have one from Meta coming up and couldn't find many sources or examples of content more focused on API design, client-server interactions, extensibility, etc. There are no examples I can find related to this on YouTube. Thank you for all your content!
Hey! I've never done this interview myself so perhaps I'm not the most qualified. But considering that I've had multiple people on here say that they've passed meta interviews, I imagine it's pretty similar to systems design.
In the final design of this video, the write server persists a write to MySQL before broadcasting the write. What do you think of using a different database, perhaps one using a log-structured merge tree, to reduce the time it takes for the write to get to other clients?
Under the current design, if I am not mistaken, it may take a couple of seconds for other clients to see a change. I can see this being frustrating if, say, two collaborators are working on a Google Doc together while sharing it using Google Meet screen-sharing.
An even better option than that may be to use a log-structured merge tree and write-ahead log on the write server itself. The write server can asynchronously replicate to a persistent backing store; effectively, the write server would be a write-back cache.
1) Totally agreed on the LSM tree DB!
2) Yeah was just thinking about the log on the server itself in response to some other comment. Agree that it should work, but with a cache approach like that you have to evaluate whether you're willing to deal with potential consistency issues if the server itself goes down.
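A rough sketch of the write-back idea being discussed, with the caveat Jordan mentions baked in (anything still in the buffer is lost if the server dies before a flush); the class and callback names are made up for illustration:

```python
import threading
import time
from collections import deque

class WriteBackBuffer:
    """Accept writes locally, broadcast immediately, flush to the durable
    writes DB asynchronously (the write server acts as a write-back cache)."""

    def __init__(self, persist_batch, broadcast, flush_interval=0.05):
        self._pending = deque()
        self._lock = threading.Lock()
        self._persist_batch = persist_batch   # e.g. bulk insert into the writes table
        self._broadcast = broadcast           # push to connected clients / peer servers
        self._flush_interval = flush_interval
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def accept(self, write):
        with self._lock:
            self._pending.append(write)       # the in-memory "log"
        self._broadcast(write)                # don't wait for the DB commit

    def _flush_loop(self):
        while True:
            time.sleep(self._flush_interval)
            with self._lock:
                batch, self._pending = list(self._pending), deque()
            if batch:
                self._persist_batch(batch)
```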
Great video Jordan.
Two questions on final design screen:
1. Write DB sharding: What is the difference between sharding by DocId vs DocId+ServerId?
2. Document Snapshot DB: We are sharding by docID and indexing by docId+character position, is this correct?
1) With just DocId, all writes for a document go through a single database node, which becomes a bottleneck. If we shard by doc ID and server ID, each server can write to a nearby database.
2) Yep!
Hey Jordan!
You did a great job with this one, thanks for your hard work!
After watching this video and looking at the final design, I didn't quite get which place a reader would connect to in order to receive updates about new changes in the document.
I see that there are arrows to cache, to vectors db, to snapshots db and to write db, but don't see any WS server or something
Could you clarify please?
The reader first gets a snapshot with a version vector from the vectors and snapshot DBs, and from there subscribes to changes on document servers, applying any subsequent updates.
Much appreciate it
Great video! just curious what might be different if this was for a google sheets like product, rather than a document.
Frankly I think you'd have fewer collisions, which probably means you can get away with using a single leader node and not be that bottlenecked. If for some reason you did need to do this, you'd basically need a way of combining writes to the same cells, which doesn't really make much sense intuitively. I'd say if you want to do multi-leader you should probably at least incorporate a distributed lock so that if two people decide to edit cells at the same time, we conclusively know which one came first.
@@jordanhasnolife5163 Was thinking the same thing, have them write to the same leader, and let the leader's own concurrent write detection decide.
Great video, and the information at 07:54 ("Fortunately there are engineers who have no life..." 😂😂) added a practical touch.
I was searching for a comment quoting this. What subtle low-profile sarcasm! :D ... Loved it!
38:41 I have a question about bottleneck being the database. Isn't it expected that there are a single database handling many documents that many different people are handling, and a single document will not overload a single database?
While that is most likely the case in reality, I don't mind creating the partitions this way in the event that they get prohibitively big. I want to be able to decouple all of the servers from one another in the event we have multiple geographic regions or something.
How does the derived data through CDC magic work? In our CDC, writes are arriving out of order and we want to keep them sorted by their float values. We can batch read them and sort them before a write, sure, but once we write them to the database how can we make sure the snapshot DB is always kept sorted? Moreover, how do we partition our snapshot DB?
We use an index on float value in the snapshot db
Our snapshot db is partitioned by document id
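In other words, the table never has to be written in order; the index is what keeps reads sorted. A toy illustration, with single-node SQLite standing in for one partition of the snapshot DB:

```python
import sqlite3

# Rows arrive from CDC in arbitrary order; the index on (doc_id, position)
# is what makes reads come back sorted, so nothing has to "keep the table
# sorted". Partitioning by doc_id is emulated here by filtering on doc_id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE snapshot (doc_id TEXT, position REAL, ch TEXT)")
db.execute("CREATE INDEX idx_pos ON snapshot (doc_id, position)")

cdc_events = [("doc1", 0.75, "i"), ("doc1", 0.5, "h"), ("doc1", 0.875, "!")]
db.executemany("INSERT INTO snapshot VALUES (?, ?, ?)", cdc_events)

rows = db.execute(
    "SELECT ch FROM snapshot WHERE doc_id = ? ORDER BY position", ("doc1",)
).fetchall()
print("".join(c for (c,) in rows))   # -> "hi!"
```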
Hey Jordan, first of all, thnx for the great video! I have a question: can we use event-sourcing design approach instead of CDC? Meaning that using Kafka topics as the main source of truth instead of the writes' DB. We can consume from Kafka and build snapshots DB, and also users can consume from the needed Kafka partition to get the latest document changes. Thus we automatically get an order for writes inside any single partition and have persistence for writes. WDYT?
Absolutely! Keep in mind though that this implies that the single kafka queue becomes a point that all writes need to go through, which we want to avoid. If we do event sourcing with multiple kafka queues and assign ids to each write based on the queue id and the position in the queue, then use the vector resolution logic that I discuss, I think that this would be better!
Thanks, of course I have in mind using separate Kafka partitions for each document (or set of documents), and storing the topic's offsets for use with snapshots. I'm not sure, though, if we can use only one topic with multiple partitions for all writes, because if we have too many partitions for one topic it can increase latency. Maybe it's better to somehow split the incoming data and use many topics to avoid this problem. @@jordanhasnolife5163
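If one went the multi-queue event-sourcing route described above, the dedup bookkeeping could be as simple as a per-partition high-water mark keyed by (topic, partition), which behaves exactly like the version vector resolution Jordan mentions (a sketch with made-up names):

```python
# Highest offset applied so far, per (topic, partition).
high_water: dict[tuple[str, int], int] = {}

def should_apply(topic: str, partition: int, offset: int) -> bool:
    """Skip a write we've already folded into our local document state."""
    return offset > high_water.get((topic, partition), -1)

def mark_applied(topic: str, partition: int, offset: int) -> None:
    key = (topic, partition)
    high_water[key] = max(high_water.get(key, -1), offset)
```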
What if we used some kind of gossip protocol to send writes to clients? i.e.
- Each server that receives writes from clients sends them to x random servers (and those random servers send those writes to their connected clients). This is useful since we want a persistent connection with the client anyway to send them data, and each server doesn't have to know about all clients?
- This obviously means we need to know about other servers using something like ZK, which we likely will do anyway because we want all writes to reach other servers? (Maybe not in the latter section where all writes go to a DB.)
That's basically what we're doing. You can use any protocol you want so long as document state goes from one client to the rest, but you need things like version vectors in order to deduplicate writes.
🙇 interesting concepts covered. Thank you
I guess Cassandra is a good choice for Snapshot DB since we can use the character position as the clustering key. WDYT?
I think it's an interesting idea, though my thinking was we really want a single leader here so that snapshots are consistent with the entry in the version vector DB
@@jordanhasnolife5163 Would you also use something like s3 to store big docs' snapshots in your system?
I’m 30 minutes in. Got the sense each client just gets all these messages from other clients and applies them using some merge function that guarantees the result of applying messages in the order received makes sense - with a little bit greater consistency (via version vectors) for writes from the same client. But I’m wondering - is there any sync point at which all users are guaranteed to see the same version of the document? Because if not clients could just diverge more and more over time…
Yep - no, there is not any sync point. If we wanted to, we could occasionally poll the DB on an interval to ensure we don't get too out of whack.
I wonder why you would use another DB plus two-phase commit for the version vector table, instead of using the same DB with transactions.
If I have to partition the data for a big document over multiple tables I need a distributed transaction
If we assume all documents can fit on a single database totally agree that's a much better approach
The version vector for a document can exist on the same partition as the documents partition. If we assume a document can only reach megabytes and not gigabytes it's safe to assume a single document can exist on a single partition. Even if a single document has to be chunked, then we can still colocate the version vector for that chunk.
@@joshg7097 Hey Josh, you can co-locate it, but it still becomes a distributed transaction which needs to use 2PC. Also, ideally, I feel like we shouldn't have to be too rack-aware in how we store writes, because if we were to use something like AWS we don't necessarily have those controls.
I agree with your point though, probably 99.9% of the time a document won't span multiple partitions and in such an event you should store the version vector local to its partition and don't need 2pc.
@@jordanhasnolife5163 I accepted an L5 Meta offer a few months ago. I watched every single one of your videos, huge thanks to the gigachad 😁
I have an onsite with Meta, but I am not sure which one to choose. I don't have much frontend experience, but I also don't know the various flavors of DBs and such in much depth, although I can come up with some kind of flow like in these videos. Any suggestions?
"I am not sure which one to choose"
Choose what exactly?
Thank you for this video! Pretty cool
Thanks!
Thank you amazing video 🎉
19:15 the result of interleaving of "cat" and "bird" should be "bciartd", right?
Ah yeah maybe a typo on my end
@@jordanhasnolife5163 Yeh right. No worries. Great video, thanks man
Beautiful!!!
Dayum boi
Easy peasy
Thanks for the video Jordan. At ruclips.net/video/YCjVIDv0zQY/видео.html How does the new client that has no content fetched so far get the content from Snapshot DB directly? What does it ask the Write DB or Document DB at this point?
You go to the snapshot DB, get a snapshot at some time T, and then poll the writes db for writes beyond that snapshot until you're caught up (e.g. incoming writes to the document are the next ones on top of what you already have).
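A sketch of that catch-up loop; `Document`, `snapshot_db`, and `writes_db` are stand-ins, not actual components named in the video:

```python
def load_document(doc_id, snapshot_db, writes_db):
    """Load a snapshot plus the version vector it reflects, then pull writes
    newer than that vector until we're caught up and can follow live updates."""
    snapshot, vv = snapshot_db.get_snapshot(doc_id)   # vv: replica_id -> seq
    doc = Document.from_snapshot(snapshot)
    while True:
        newer = writes_db.writes_after(doc_id, vv)    # writes not yet folded into vv
        if not newer:
            break                                     # caught up: subscribe to document servers
        for w in newer:
            doc.apply(w)
            vv[w.replica_id] = max(vv.get(w.replica_id, 0), w.seq)
    return doc, vv
```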