Brief Outline
00:00:41 Brief Introduction to Yelp
00:01:21 Problem Requirements
00:02:25 Capacity Estimates
00:03:45 Reading Reviews
00:05:59 Finding Restaurants
00:08:48 Finding Restaurants
00:11:54 Finding Restaurants - Example
00:16:38 Restaurant Data Choices
00:19:48 Data Caching
00:21:56 Geo Sharding
00:23:20 Geo Sharding
00:25:40 Partition Balancing Illustrated
00:27:07 Searching by Name
00:28:32 Final Diagram - Yelp
Thanks, Jordan~
Thank you, Jordan! Your system design videos are fantastic! I really appreciate the opportunity to learn such valuable content for free!!
Very good explanation to understand geospatial index. Thank you so much
Bookmarks
Read Path
3:50 Schema, replication, partitioning, caching
6:20 lat, long solution vs Geospatial index/Quad trees, finding nearby places can add range query using geohash.
16:40 Restaurant schema, data choices
20:00 Caching for reviews and popular places
22:00 Geo Sharding based on geo hash prefix/range, can talk more about geohash precision.
Diagram 28:30
In the earlier section of the design, the restaurant table is partitioned by geohash, but when it comes to the final architecture, the partition key suddenly changes to restaurant id. Are we using two global indexes here?
Yeah I ended up just going with derived data from the original restaurants table to build out the geospatial index
Do you have a Patreon? Had some successful system design interviews and I feel like your high level coverage of most common building blocks really got me over a hump in my designs, specifically with learning how to incorporate replication and more complex streaming into trade off discussions. Would love to show support!
Nope! No need to pay me man, just pay it forwards!
Thank you so much for the video, God bless you.
How come your explanation is so clear for every single video?!
I've got a lot of practice gaslighting the romantic partners in my life, gotta learn how to talk clearly
Thank you so much Jordan!
- For geospatial partitioning you could probably use population as an initial proxy until some usage data is collected
- CDC for reviews updating restaurant rows is overkill. There just aren't that many reviews per item on Yelp (e.g. the super touristy Katz's deli in NYC only has 15k). 2PC would be fine.
1) agree
2) agree, but it's a systems design interview so figured I may as well show the overoptimized solution just in case
@@jordanhasnolife5163 Both the reviews and restaurants table are sharded by restaurantId, so I wonder if 2PC or CDC is really required here. Since the sharding is the same, can't both these tables live on the same database instance, but on two separate tables. Assuming we're using MySQL, which is ACID-compliant, we'd be able to update both these tables atomically.
Really appreciate your system design videos! In case you run out of ideas, I'd love to see a Slack/Discord system design, especially in regards to SaaS and multitenant architecture. I also assume the messaging aspect heavily differs between Slack/Discord and Facebook Messenger/WhatsApp. I am working on such a platform as a side project for some time and plan to release it as open source soon. Even though I don't plan to make it as scalable as Slack or Discord, it would be interesting to see your take on such a system. Anyway, thanks for the great work you've done so far.
Hey Buddha! Why do you feel that the architecture is much different from messenger?
@@jordanhasnolife5163 Hey, thanks for answering. The main reason for my assumption is that platforms like Discord support significantly more users; instead of supporting just 2-10 users per chat, a Discord server can handle up to 500,000 concurrent users (per chat). Additionally, a Slack workspace or Discord server acts as its own tenant, compared to a global chat system. But yeah, 'heavily' probably is not the right word. Some other general differences include Slack's support for various integrations (e.g., Webhooks, OAuth), channels, threads, organizational hierarchy, permissions, etc. I know some of these details might be too specific for a system design overview, just for the purpose of inspiration. There is also an interesting article about how Discord evolved from MongoDB to Cassandra to ScyllaDB.
Hope I don't come across as the 'Please design my system for me' type of guy. My side project is more or less complete and in terms of system architecture will be much simpler than any system you discussed on your channel, I'm just interested in this type of systems at the moment.
@@buddh4r Haha not at all, thanks for sharing! Yeah I'll try to look into it/give it some thought, and if it feels like there's some unique elements in terms of the design here I'll eventually make a video on it!
Hi Jordan, thank you for these system design videos; I'm pretty sure most people take their quality INSANELY for granted. These videos are better than the ones creators on Udemy literally charge $100 for. But anyways, I had a request. Can you make a guide on recruiting for new grad 2025 (or just new grad in general)? I don't have anything lined up for the summer, and I heard system design is apparently expected ON TOP of all the leetcode patterns you need to be well practiced in for technical interviews. It's overwhelming and I'm not sure how to best prepare myself to secure a new grad role. I would really appreciate it if you could make a video on that (if you can/have the time). I'd gladly donate to a Patreon or PayPal link for all the insane trouble you go through to make these videos (and the one I'm requesting, if possible).
I don't think you need to prepare that much for system design because you are new. New engineers are expected to have strong technical coding skills, because most high-level technical decisions are made by senior and staff engineers, who expect the engineers under their supervision to implement their direction competently. tldr: focus more on algos and on not bombing the DSA interview.
Agreed with the above comment. As a new grad, I would mainly focus on leetcoding and just getting referrals for my applications. In my day, I had what I would consider to be two genuine systems design interviews as a new grad (robinhood and hubspot), but the rest mainly cared much more about coding. Other companies have "system design" interviews at the new grad level but it's really more low level stuff rather than distributed systems.
Not to say you shouldn't learn this stuff! Just want you to prioritize your time well. I have videos on this channel devoted to recruitment, you've gotta look through them but they're there - mainly it boils down to:
1) have a resume with some internships or personal projects that show you can write code
2) apply with referrals, you can get them pretty easily off of blind or linkedin
3) Study a LOT of leetcode
Good luck! I don't want your money for now man, keep killing it ;)
Great video! I was thinking that for the reviews database one could make the case for DynamoDB. It would allow reviews to be kept sorted by timestamp and speed up reads (i.e., no need to sort them before showing them to the user). A bit of a stretch - we could cache sorted results - but an interviewer might like to hear some pros/cons.
Apart from this, I know you love CDC (I also do), but in a situation with a low volume of writes, one could argue for 2PC, since it is cheaper than maintaining Kafka, Spark, and Flink clusters. WDYT? Anyway, we don't pay the bills lol
Yeah fair enough on the 2pc point. As for DynamoDB, why can't other databases sort by timestamp? This is just an index you're describing
@@jordanhasnolife5163 indeed, I got confused and mixed things up. Thanks!
Thanks for the video!
As a follow-up constraint, how would we ensure users can only submit 1 review per business? We could use a composite key with both businessId and userId, but then we would lose sorting on time, right?
I think what you're proposing works if you use a local secondary index. Also, there's probably not gonna be more than like 1000 reviews per restaurant, so if you have to scan all of them it's not too bad.
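To make the composite-key idea concrete, here's a minimal sketch using SQLite as a stand-in (the table and column names are made up, not from the video): a composite primary key on (business_id, user_id) enforces one review per user per business, and a secondary index on (business_id, created_at) keeps time-ordered reads cheap, so you don't actually lose sorting on time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reviews (
    business_id INTEGER NOT NULL,
    user_id     INTEGER NOT NULL,
    stars       INTEGER NOT NULL,
    body        TEXT,
    created_at  TEXT NOT NULL,
    PRIMARY KEY (business_id, user_id)   -- one review per user per business
);
-- secondary index so "latest reviews for this business" stays an index scan
CREATE INDEX idx_reviews_time ON reviews (business_id, created_at);
""")

conn.execute("INSERT INTO reviews VALUES (1, 42, 5, 'great tacos', '2024-01-01')")
try:
    conn.execute("INSERT INTO reviews VALUES (1, 42, 1, 'changed my mind', '2024-02-01')")
except sqlite3.IntegrityError:
    print("duplicate review rejected")   # second review by the same user is blocked

print(conn.execute(
    "SELECT user_id, stars FROM reviews WHERE business_id = ? ORDER BY created_at DESC",
    (1,),
).fetchall())
```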
Hey Jordan! Thanks a lot for the incredible content. I have a quick question: I've seen a bunch of system design videos about Proximity Services (such as Yelp), and most of them seem to use GeoHashes and/or QuadTrees. Is there a reason R-Trees aren't typically presented in these solutions? Aren't they considered to be faster than QuadTrees for KNN as well as Range queries (such as the ones covered in this video)? 🤔 Would really appreciate your insight here.
Thank you, again!
Hey!! R-Trees are mainly for indexing actual shapes on a map, whereas we're mostly concerned with radius search here.
Hope that makes sense!
Thank you, Jordan!
Just wondering why you search da and dd? They are not in the radius.
I'm not searching them, they're the upper bound of my search, meaning I'm explicitly excluding them
Hi, had a general question about the write flow. How do you determine whether we should write directly to a DB or write through a message queue? Is there a rule of thumb you use to determine at what TPS you want to introduce a message queue?
TPS, probably not realistically.
But I'd say that if you have any fear that a write is going to involve a significant amount of processing (such that returning an instant result could hog our server capacity), or that it would be better to just be able to handle all of the writes on one thread for correctness's sake (but maybe later), introducing a message broker could be a good idea!
I am wondering how you feel about the tradeoff between updating the reviews / restaurant info offline vs. in real time? My thinking on the key points:
1. Update frequency (if the update frequency is not insane, I think online updates should be fine?)
2. Real-time updates could be a better customer experience, but could lead to more outages if not handled carefully (here we use CDC to get at-least-once delivery, but Spark's micro-batching adds some delay to data freshness, which I think is still much shorter than any offline solution like a snapshot service running every hour?)
I didn't find a lot of material discussing online vs. offline data updates on the write path (given that this is a read-heavy service), and would love to learn from your expertise here!
To be honest, I think that's a requirement of the service, and basically just whatever your interviewer asks of you. Running batch jobs is easier 100% of the time, but sometimes people want their data quick haha
Very good videos. Lots to learn.
Hey Jordan! great content as usual. Thank you for doing all the hard work.
I have a question. We are capturing the restaurant change data and processing it in Flink. However, if the user just wants to read about the restaurant or its stars, it will fetch the data from the follower replica of the Restaurants table, right? Is my understanding correct? Because the read arrow comes directly from the Flink node, so I got confused.
Yes you're correct
I am a bit confused about the final design diagram. So you have a MySQL table for restaurants (populated by business owners, I guess, which is not shown here), constantly updated with the aggregated results of likes/views etc., which makes sense, but then there is a geo index at the bottom. I thought the geo index would be an embedded part of the restaurant DB (is that possible)? Else how do we ensure the range query and all that good stuff if we simply shard by restaurant id (unless this is prefixed with the geo index)?
Depends on the DB, it is possible - I believe you can do this in Postgres for example.
I personally would just denormalize the data, put a "preview" of each restaurant in some elasticsearch cluster which is sharded by location, and then upon clicking the restaurant use its ID to go find it in the MySQL table.
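To make that denormalized read path concrete, here's a rough sketch of the two steps (the index name, field names, and glue function are hypothetical; the geo_distance filter itself is standard Elasticsearch query DSL):

```python
# Step 1: radius query against the location-sharded "preview" index.
# Index and field names here are made up for illustration.
preview_query = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "5km",
                    "location": {"lat": 40.7128, "lon": -74.0060},
                }
            }
        }
    },
    "_source": ["restaurant_id", "name", "avg_stars", "thumbnail_url"],
    "size": 20,
}

# Step 2: when the user clicks a preview, fetch the authoritative row from the
# MySQL restaurants table (sharded by restaurant_id) using its primary key.
DETAIL_SQL = "SELECT * FROM restaurants WHERE restaurant_id = %s"

def open_restaurant(sql_cursor, restaurant_id):
    """Hypothetical glue code: preview hit -> full detail row."""
    sql_cursor.execute(DETAIL_SQL, (restaurant_id,))
    return sql_cursor.fetchone()
```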
Do you have a video that explains the CDC / derived data / Kafka / Flink pattern? You use this a lot and I'd like to understand it better. Great video as always!
Yep! So I have like 3 videos dedicated to stream processing in the 2.0 playlist and one for flink as well, would recommend watching those!
If we use the same hash function on restaurantId for both the reviews and restaurants table, wouldn't the review and restaurant data end up in the same partition (ie. all the reviews for a particular restaurant will end up on the same partition as the details of that restaurant)? Once the tables get particularly large, couldn't we just repartition, and move the reviews for a restaurant, as well as its details, together, to a new partition? By having the two tables colocated together on the same MySQL partition, I think we can get rid of the "Review Change Data Kafka" and "Spark Streaming" components. We'd be able to update the restaurant's rating and post a user's review atomically because both tables would live on the same machine.
Agree here, mostly did this since I didn't want any contention on the actual "star count" or "number of reviews" column on an individual restaurant. That being said, there's probably not nearly enough reviews IRL for this to ever be an issue.
Makes sense!
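A minimal sketch of the colocated, single-transaction approach discussed in this thread, with SQLite standing in for one MySQL shard (schema and names are illustrative only): because the review row and the aggregate columns live on the same node, one local transaction covers both writes and no 2PC or CDC is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT,
                          total_stars INTEGER DEFAULT 0, num_reviews INTEGER DEFAULT 0);
CREATE TABLE reviews (restaurant_id INTEGER, user_id INTEGER, stars INTEGER,
                      PRIMARY KEY (restaurant_id, user_id));
INSERT INTO restaurants (id, name) VALUES (1, 'Katz''s Deli');
""")

def post_review(restaurant_id: int, user_id: int, stars: int) -> None:
    # One local transaction: insert the review and bump the aggregates together.
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO reviews VALUES (?, ?, ?)",
                     (restaurant_id, user_id, stars))
        conn.execute("""UPDATE restaurants
                        SET total_stars = total_stars + ?, num_reviews = num_reviews + 1
                        WHERE id = ?""", (stars, restaurant_id))

post_review(1, 42, 5)
print(conn.execute("SELECT total_stars, num_reviews FROM restaurants WHERE id = 1").fetchone())
# -> (5, 1)
```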
Hey Jordan, thanks for the great video.
One question I have, do we really need to partition the geohash table? 10 million restaurants isn't a lot, maybe doesn't exceed 2GBs in size.
We can distribute the workload by replication. If we want to further reduce the latency and achieve data locality for dense areas, we can utilize CDNs.
I also saw that in the System Design Interview book the database was partitioned without a clear justification?
So I don't get what is the point of partitioning here, is there anything I am missing? Thanks
Agree with your point, in reality we may not need to partition. It's just helpful to understand how partitioning works for location based data.
One scenario where you may want to partition is caching, since caches are smaller.
In some of the videos I have seen that use geohashes for locality services, the idea is to present a customer with restaurants within, let's say, 1/5/10 miles, and essentially the length of the geohash ID can determine this. For example, if the geohash for the customer's current location is 98bewp, then show the restaurants whose geohash ID matches 98bew% or 98be%%. This makes sense to me and is simple to implement, mostly because I think you can cover up to a 1-mile radius with just 6 characters (hence no issue with scalability). However, the idea you presented here with the sorting, binary search, and Pythagorean theorem isn't fully clear to me.
I think what I said is just a generalizable approach to get the exact radius that you want, as opposed to yours which relies on the size of the bounding box.
Same idea though. I just get all of the points in the neighboring bounding boxes, and then use their lat long to determine the exact distance from me, and filter them out if they're outside the radius I care about.
The sorting is just to demonstrate that if you make an index on the geohashes (which is effectively sorting them), points that are closest to one another will be next to each other in the index.
@@jordanhasnolife5163 This makes sense. Thank you for the response
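Here's a rough sketch of that generalizable approach - emulating the sorted geohash index with a sorted list, doing a prefix range scan for the bounding box, and then filtering candidates by exact distance. The geohash strings and restaurant names are illustrative, not exact encodings; a real system would use a proper geohash library.

```python
import bisect
import math

# (geohash, lat, lon, name) rows kept sorted by geohash -- this emulates a
# B-tree index on the geohash column.
rows = sorted([
    ("dr5ru7", 40.7420, -73.9890, "Shake Shack"),
    ("dr5ru9", 40.7527, -73.9772, "Katz-ish Deli"),
    ("dr5rue", 40.7580, -73.9855, "Halal Cart"),
    ("9q8yyk", 37.7793, -122.4193, "Taco Bell SF"),
])
geohashes = [r[0] for r in rows]

def candidates_with_prefix(prefix: str):
    """Range scan: every row whose geohash starts with `prefix`."""
    lo = bisect.bisect_left(geohashes, prefix)
    hi = bisect.bisect_right(geohashes, prefix + "\x7f")  # '\x7f' sorts after base32 chars
    return rows[lo:hi]

def km_between(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def nearby(user_lat, user_lon, prefixes, radius_km):
    # 1) cheap index range scans over the user's box (and its neighbors),
    # 2) exact distance filter on the handful of candidates that come back.
    out = []
    for p in prefixes:
        for gh, lat, lon, name in candidates_with_prefix(p):
            if km_between(user_lat, user_lon, lat, lon) <= radius_km:
                out.append(name)
    return out

print(nearby(40.7527, -73.9772, ["dr5ru"], radius_km=2.0))  # the three Manhattan rows
```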
I am wondering if you think we should keep a separate DB for location (geohash indexed) data for query? or just separate table (but under the same DB with restaurant)?
I'm not sure it really matters for the sake of our interview beyond like convenience for the developer/maybe overloading the physical machine.
May I ask what widget you use to draw on your laptop? Are you using a graphics pen and drawing tablet? I am interviewing with companies, and I find it hard to draw SD diagrams on a site like Miro using the touchpad on my MacBook Pro.
Drawing on my iPad with OneNote and screen recording. Don't know if this will work/be allowed for interviews.
Hey there! Thank you for the video. I was wondering if the reads will feel slow, since for every new nearby call we first spend O(log n) time to figure out the geohash, another O(log n) to get all placeIds from the geohash index, and then filter for radius r. Is there any optimization besides caching that we can do to make reads faster?
Hey! I think in reality, partitioning will make things feel faster, you're not doing a logarithmic search on the whole table. Other than that though, yeah you're basically out of luck
Hi Jordan, I am a big fan of your videos. I want to know if we can get the slides of these videos as well. Playlist: "system design questions 2.0".
Hey! I am going to get around to this eventually, but will probably upload them all at once in bulk after I stop making this round of videos (give it a couple months).
Thanks for the great content Jordan. Just had a question about geohashes - in this case do they represent the location of a restaurant, or a square area that could contain multiple restaurants? It seems like the former, but if that's the case, does a geohash represent a fixed point that a restaurant may not be assigned to? i.e. there are restaurants at aaa and aac but none at aab?
Yep GeoHashes are just bounding boxes. However, eventually geohashes become so small that it's more or less accurate to just say some geohash is equal to my current location. As a restaurant, I am technically located within multiple bounding boxes, but the geohash we'd want to use here is the smallest one that contains my whole restaurant.
Great Video.
Can you please share the slides of system design 2.0 series as well as the design questions series. Thanks.
Going to do this in batch once I finish the current series
How does the geohash get calculated? Also, will the geohash be in the DB as well as in Elasticsearch?
You basically know the lat/lng bounds of every box, and then you traverse down the tree in logarithmic time to find your geohash for a given lat/lng.
@@jordanhasnolife5163 Thank you so much
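For anyone curious, here's a minimal sketch of that traversal - standard geohash encoding, which alternately halves the longitude and latitude ranges and turns every 5 bits into one base32 character, so longer hashes mean smaller boxes:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, length: int = 6) -> str:
    """Encode (lat, lon) by repeatedly halving the bounding box."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, even, result = 0, 0, True, []
    while len(result) < length:
        if even:  # split on longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:     # split on latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:          # every 5 bits -> one base32 character
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)

print(geohash_encode(40.7128, -74.0060))  # lower Manhattan -> 'dr5reg'
```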
Hi Jordan, I have a question developed around the overall DB choice after following each of your videos.
I noticed that you always look at "read heavy/write heavy" + "replication pattern" to make decisions, but almost never look at the overall load.
Like in this video, for deciding the DB choice for comments: since it is obviously read-heavy and there aren't really any challenges with single-leader replication, you chose MySQL.
I wonder, if the number of comments grows very large, let's say 100B comments (I think YouTube has more than that scale) and the read QPS is in the hundreds of thousands, should we still stay with MySQL? What is the threshold of overall item count and QPS load at which you would stay with or shy away from MySQL? (My old impression of SQL's QPS limit is just thousands of QPS, but maybe that is for a single MySQL instance, and may be the story from before "MySQL Cluster" was introduced?)
In between SQL and NoSQL, there are also "NewSQL" DBs like Spanner and CockroachDB. As far as I'm aware, both of them use LSM trees + SSTables (or a variant), which are actually slower for reads than a B-tree index. But in reality, I feel the sentiment is that Spanner is favored over MySQL in such large-scale use cases. Does the slight edge of "B-tree indexes make lookups/range scans faster" get offset by other considerations?
Thanks a lot!
1) I think regardless of how many QPS we need, your database choice isn't going to make you able to handle orders of magnitudes more - at that point you just need more shards.
2) Spanner is nice when you want to make consistent reads spanning multiple partitions, or you want strong consistency, neither of which we really need here. I'm not saying spanner will be slower in reality, but I can't make any theoretical argument in favor of it here other than looking up some actual benchmarks.
Thanks for the great content as always.
I am a bit lost about the second binary search relative to the # of restaurants - we are doing a range query since the restaurants are indexed on their geohash, so we basically describe that range query as log(N)?
Another question: I am wondering what we usually select for the bounding box dimension (say the radius is 0.3 miles - what's the tradeoff between choosing 1 mile, 0.5 miles, and 0.25 miles)? Does it depend on the user's location (like, if they're close to a smaller box's boundary, do we choose a bigger box)?
Thanks a lot in advance!
1) I think you described it very well, what exactly are you confused about?
2) You'd just want to choose the smallest bounding box that includes your whole radius ideally (or potentially a smaller one than that but now you have to look at multiple bounding boxes). It all just boils down to whatever is going to let you have to evaluate the fewest number of points. A bigger bounding box will have more points that you need to check to see if they're actually in the radius that you want.
@@jordanhasnolife5163 Thanks a lot! It makes sense! I am wondering, if the returned list of restaurants is very long and we only want the nearest 100 results, how can we do it? (I learned from the video that we might want to avoid sending the raw list to the application layer - is there a way we can do it in the DB, like keeping a heap of them?)
@@Anonymous-ym6st Yeah so to be clear, the process that I described is probably actually something that would run on a db itself. If we were building this from scratch, we'd have to code that logic ourselves, but I think that for something like Postgres, you can get it all done there!
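If you did need "nearest 100" on the application side, the standard trick is a bounded heap over the candidate set returned by the bounding-box scan; a tiny sketch (the candidate tuples and k value are hypothetical):

```python
import heapq
import math

def nearest_k(user_lat, user_lon, candidates, k=100):
    """candidates: (lat, lon, restaurant_id) tuples already narrowed down by
    the geohash bounding-box scan. heapq.nsmallest keeps a bounded heap, so we
    never materialize a fully sorted list of every candidate."""
    def dist_sq(c):
        lat, lon, _ = c
        # equirectangular approximation -- fine for ranking points a few miles apart
        dx = (lon - user_lon) * math.cos(math.radians(user_lat))
        dy = lat - user_lat
        return dx * dx + dy * dy
    return heapq.nsmallest(k, candidates, key=dist_sq)

candidates = [(40.742, -73.989, "r1"), (40.753, -73.977, "r2"), (40.758, -73.985, "r3")]
print(nearest_k(40.7527, -73.9772, candidates, k=2))  # -> r2 then r3
```

In Postgres with a geospatial extension you'd typically push this down into the database itself as an ORDER BY distance LIMIT 100 backed by a KNN-capable index, as mentioned above.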
do we really need to partition? Can't we just have replicas , any thoughts?
How does having more replicas solve the need to partition due to lots of data?
@@jordanhasnolife5163 Thanks for your reply. I meant do we really need to partition, since we may not have a lot of reviews... ignore me :) ... hard to explain in a comment... thanks
Curious, is Elasticsearch a DB table? i.e. are the geohash index and the inverted index stored in 2 separate DBs?
I wouldn't call it a database, it just provides a few different types of indexing methods. Conceptually, think of those indexes as separate tables though yeah
Thanks for the response! Just to confirm, search indexes and geohashes are built on top of our existing tables, not created as new and separate tables, right? I always assumed that we are building new tables on top of that, but this is not the case. They are mainly secondary indexes of existing tables. Can you help confirm that my understanding is correct here? Thanks again! :) @@jordanhasnolife5163
@@ariali2067 no, they're new tables in this case
Both the reviews and restaurants table are sharded by restaurantId, so I wonder if CDC (or 2PC) is really required here. Since the sharding is the same, can't both these tables live on the same database instance, but on two separate tables. Assuming we're using MySQL, which is ACID-compliant, we'd be able to update both these tables atomically.
Definitely could, it's just a matter of whether that's feasible when these tables get particularly large.
Sharding by restaurantId doesn't necessarily mean that the same groups of restaurant ids are in identical partitions for both tables.
@@jordanhasnolife5163 If we use the same hash function on restaurantId for both the reviews and restaurants table, wouldn't the review and restaurant data end up in the same partition (ie. all the reviews for a particular restaurant will end up on the same partition as the details of that restaurant)? Once the tables get particularly large, couldn't we just repartition, and move the reviews for a restaurant, as well as its details, together, to a new partition? By having the two tables colocated together on the same MySQL partition, I think we can get rid of the "Review Change Data Kafka" and "Spark Streaming" components. We'd be able to update the restaurant's rating and post a user's review atomically because both tables would live on the same machine.
@@jordanhasnolife5163 I'm confused. Why doesn't sharding by restaurantId guarantee that the same groups of restaurant ids are in identical partitions for both tables?
Is it logarithmic to the base 4 in the search?
Yeah
What happens when your num_reviews and total_stars aren't aligned? Let's say you see 501 total stars on 100 reviews where the max stars per review is 5.
Nothing - it's not the end of the world for this to happen, it's okay to deal with eventual consistency
Why don't you use timecodes in the video? I remember you did this before.
General laziness - maybe I'll go back and add them in at some point
@@jordanhasnolife5163 This is not a problem, just a question. Content is the best, but I am currently prepping and it would be easier for me to navigate through the video.
@@maxvettel7337 Understood - I've gotta get around to doing this but I will at some point, thanks for the suggestion!
Nice video, though you admitting being an avid taco bell lover just highlights that you know very little about authentic mexican system design.
Murica number 1