So are you guys interested in working at Twitter? 😅 Btw, don't forget to "Batch" click the like & subscribe buttons. 🚀 neetcode.io/ - Get lifetime access to every course I ever create!
@@NeetCode you should look at your website. I tried to go pro, but for some reason the Google API won't let me sign up. I don't know if I'm the only one with the problem or if it is general.
Man, finding a job as a Software Engineer is just crazy. You need to go through at least 4 to 6 rounds of interviews, starting with a technical take home challenge, then a follow up discussion about that challenge, and then another live technical coding interview, and then a live behavioral interview, and then a live system design interview, and then maybe a product delivery interview, and probably a chat with CTO or VP at the end of that. And then, once you're hired, you're just gonna be focused on fixing bugs and building features, it is very rare that you are creating a fresh system from scratch, unless you're working at a start-up, and even then, you're going to be working with other Engineers to design that system. In most other industries, you typically learn what you need for the job over time, through hands on experience. Only in Software Engineering do all Companies just expect you to be a data structure and algorithm wiz, have previous experience so you can answer those behavioral questions, and then design some abstract system from scratch within 1 hour, just to get hired.
Being an SWE these days is just insane. Any other job, you'd get hired then learn the system over time and by working with people at the company. As an SWE you have to already know how Twitter works just to get through one of the six or so interviews to get a job fixing bugs or writing new features. Does every other SWE know this shit just from going to school or working in the field for a few years? I've been an SWE for 10 years and these are all semi-new concepts to me. I've never once had to design a system like this, but I guess now companies want you to be an expert on day one. I thought I could avoid cramming algorithms and system design stuff if I didn't try to get a job at FAANG, but now every little startup expects you to be a senior-level engineer just to make 140k. I feel like my 10 years of experience count for literally nothing.
10 Years and you barely did system design? Typically getting up in seniority means having to take a higher level approach to problems and leaving the implementation to juniors
@@garlicpress6121 I feel like web-based software has skewed everyone's perception and makes people think that this is the only kind of software dev in the world. There are so many other domains where you would never need to know this sort of stuff for interviews or even for the work itself. For example, someone working on low-level programming for drivers, OS-level software, or desktop applications.
I hear you. I have 15 years of experience and I am finding this very strange. I functioned without knowing Leetcode algos and this insane System Design stuff. And I did pretty well! I don't know what value these things are adding, TBH.
@@garlicpress6121 I think for a typical SWE, doing system design is common, but not to this degree. Normally it's working on top of or improving existing systems to add features or improve performance/scale/reliability etc.
Starting from 4:26 to 7:38, that's pretty much superfluous arithmetic you're going to be doing during a systems design interview. The time you spend mentally calculating those numbers is going to be wasted, just to arrive at a conclusion of "it's a lot", which is almost a given in any systems design interview. Your time will be better spent calculating those numbers while you're doing your high-level design portion, if needed. One example of needing to calculate those numbers is a TopK system for trending topics in a social media feed (which doesn't pertain to a basic Twitter implementation). Ask your interviewer for DAUs and if it's anything over 100M, move on to the core components section (Tweet, User, Feed), rather than calculate capacity estimations.
I can't help but find it slightly hilarious that you released this video during the ongoing controversies happening at Twitter. But in all seriousness, amazing content!
This is a fantastic example of a realistic architecture screen. I would note for viewers that you will almost certainly not be able to think of and describe everything that was covered here and as someone who conducts 3 or 4 of these every week, I don't expect candidates to cover everything here in the 20-30 minutes I have with them. But as you go through this video, the issues presented scale really well with the expectations that go along with the seniority of the candidate and position. We actually skip a lot of the preliminary setup so that we can delve into the more complex issues for more senior candidates. If you're a mid level, I'm not expecting you to come at me talking about batching out feeds and dynamically updating them based on high popularity tweets.
No, with such a test you already filter for ex-Twitter employees. That would be fine if you build a social network, but you'd miss out on all the brilliant devs who, for example, designed large e-commerce or data-pipeline architectures, because that requires a very different approach.
It will also be a case study on if these software companies are truly over staffed or not. If Twitter survives after laying off so many people it may inspire other companies to consider down staffing
@@KennethBoneth I think the main issue with scaling down on employees is that the remaining employees will essentially have to monitor and handle the same amount of work as before scaling down, which will cause additional stress and probably a less than healthy work life balance.
Not really, Tesla and SpaceX both are well known for the horrendous work environment. So it depends on the management and the owner of the company in this case.
@@Mattarii That is true if you were properly staffed to begin with. If twitter is as overstaffed as many people believe, then a large chunk of employees are effectively doing nothing. IF twitter goes from properly staffed to understaffed, you are correct. If twitter is going from overstaffed to properly staffed, then that won't happen.
Great video! One question (or perhaps a mistake): at 18:20, you say all the people this guy follows should be on one shard, but I don't think that's possible. If person A follows B and C, then B and C should be on one shard. If person E follows C and D, then C and D should be on one shard, but C is already on a different shard. Maybe B, C, D are all on one shard, but as long as each person follows another different person, we will end up with only one shard.
Thanks for this comment, I really didn't get this sharding thing :) It looks impossible to shard per user. I thought that maybe I misunderstood this point, but after your comment it's clear.
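To make the argument above concrete, here is a minimal sketch (not from the video) that treats "all of a user's followees must share a shard" as a grouping constraint and merges users with a union-find. The function name `colocation_groups` and the tiny follow graph are hypothetical; the graph reuses the A/B/C/D/E example from the comment.

```python
def colocation_groups(follows):
    """Group users that would be forced onto the same shard if every
    user's followees had to live together on one shard."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for follower, followees in follows.items():
        find(follower)  # register the follower even if nobody follows them
        first = None
        for followee in followees:
            if first is None:
                first = followee
                find(followee)
            else:
                union(first, followee)  # followees of one user must be together

    groups = {}
    for user in parent:
        groups.setdefault(find(user), set()).add(user)
    return sorted(sorted(g) for g in groups.values())


# A follows B and C; E follows C and D (the example from the comment).
follows = {"A": ["B", "C"], "E": ["C", "D"]}
groups = colocation_groups(follows)
print(groups)
```

With only two follow lists, B, C and D already collapse into one group. On a densely connected follow graph the same merging would tend toward one giant group, which is the concern raised in the comment.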
Sounds like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard". The phrasing isn't great.
Love your content, your videos helped me land a position at Twitter one year ago. But I just got laid off from Twitter and will start checking your videos again 😅
Once I had an interview explaining how to design something. I totally missed the point. This definitely gives us a clear idea. It's not about writing a user story, and not even building the actual application, but identifying the most critical points and possible components and coming up with how to solve it. Thanks again.
This level of quality content is available for free, it blows my mind! Also, I am churning through your Blind 75 list of questions and I am loving your solution videos.
Loved it! The only issue I see is sharding having all the people who follow each other in the same shard. That's just not possible, as a friend of yours will follow someone in another shard group at some point. I haven't got a good answer for that yet, apart from saying we should use a GraphDB here that hopefully is optimised for sharding this kind of data...
Yes, that seems like a big oversight. Each shard will have a subset of a user's followees, so the proposed user id as a shard key really doesn't do anything for us.
Just paused at that part, seems incorrect. The best sharding I think may be tweet id (assuming using chronological IDs like snowflake) as people are generally accessing the latest tweets so can grab them in a single request if it misses cache
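For readers unfamiliar with chronological IDs, here is a rough sketch of a Snowflake-style layout in which the timestamp occupies the high bits, so numeric order follows creation time. The 41/10/12 bit split, the `EPOCH_MS` constant, and the function names are assumptions for illustration, not something stated in the video.

```python
EPOCH_MS = 1288834974657  # assumed custom epoch in milliseconds (illustrative)
WORKER_BITS = 10          # assumed width of the worker/machine field
SEQUENCE_BITS = 12        # assumed width of the per-millisecond counter


def make_id(timestamp_ms, worker_id, sequence):
    """Pack (time, worker, sequence) into one integer, time in the high bits."""
    worker = worker_id & ((1 << WORKER_BITS) - 1)
    seq = sequence & ((1 << SEQUENCE_BITS) - 1)
    elapsed = timestamp_ms - EPOCH_MS
    return (elapsed << (WORKER_BITS + SEQUENCE_BITS)) | (worker << SEQUENCE_BITS) | seq


def timestamp_ms_of(tweet_id):
    """Recover the creation time from the high bits of the ID."""
    return (tweet_id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS


older = make_id(EPOCH_MS + 1000, worker_id=1, sequence=0)
newer = make_id(EPOCH_MS + 1001, worker_id=0, sequence=0)
print(older < newer)  # a later timestamp always yields a larger ID
```

Because IDs sort by time, "latest tweets" becomes a range on the ID, which is what makes the tweet-id shard key in the comment attractive for recent reads.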
You've got it wrong. The idea is to have all the _followers_ of the user in one shard. This way, when the user posts a tweet, you would get all their followers' ids from one shard with one query. Then you'd use this list of ids to update their respective feeds with the tweet. When the user requests their feed, they get it pre-computed from the cache, not built on the fly.
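The flow described above (look up followers once, push the tweet into each follower's pre-computed feed, serve reads straight from that cache) can be sketched roughly as follows. The in-memory dicts stand in for the follower table and the feed cache; all names here are hypothetical.

```python
from collections import defaultdict

# Stand-in for the follower lookup: author -> list of follower ids.
followers_of = {"alice": ["bob", "carol"], "bob": ["carol"]}

# Stand-in for the feed cache: user -> tweet ids, newest first.
feed_cache = defaultdict(list)


def post_tweet(author, tweet_id):
    """Fan out on write: one follower lookup, then one push per follower."""
    for follower in followers_of.get(author, []):
        feed_cache[follower].insert(0, tweet_id)


def get_feed(user):
    """Reads never touch the tweet store; they return the pre-computed list."""
    return list(feed_cache.get(user, []))


post_tweet("alice", 1)
post_tweet("bob", 2)
print(get_feed("carol"))
```

The write cost is proportional to the author's follower count, which is why the thread treats very popular accounts as a separate case.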
DDIA is the most comprehensive resource (assuming you have at least some experience). Also, most companies (including twitter) release blog posts and white papers about technical challenges they faced and how they overcame them. I think many beginners miss these, but they are an extremely valuable and free resource, which is why they are commonly referenced by system design textbooks.
At 18:12, how do you manage to get all the users one follows in a single shard? It seems strange to me. If that can work, then all the users need to be in a single shard, which fails the purpose of sharding.
Sounds to me like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard"
@@arthur723 For a given user following users a, b, c, with shards 1, 2, 3: user a's posts -> shard 2; user b's posts -> shard 3; user c's posts -> shard 1. But not: user a -> some posts in shard 1 and some posts in shard 2. So for any given user, all of their data will be in a single shard. However, not all of the people you follow will have their data in the same shard.
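A minimal sketch of that reading, assuming a plain hash-mod shard function (the video does not say how the shard is picked, so `shard_for`, `NUM_SHARDS`, and the helper below are hypothetical):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for the example


def shard_for(user_id):
    """Map a user id to exactly one shard using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_to_query(followees):
    """Shards a feed build must touch: one per followee, possibly overlapping."""
    return {shard_for(f) for f in followees}


followees = ["a", "b", "c"]
for f in followees:
    print(f, "-> shard", shard_for(f))
print("shards touched:", sorted(shards_to_query(followees)))
```

Every tweet by a given user lands on one shard, but building a feed may still touch as many shards as there are followees, which matches the clarification in the comment.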
If sharding by user id then, to retrieve a single tweet (e.g. by a direct link), you would need to request all shards. Is it something tolerable or how do you overcome it? And what about hot user problem? Sharding by user id does not work well in this case.
Very good tutorial as always from NeetCode. Kudos. One confusion though: I am aware of the publisher/subscriber pattern and I am also aware of message queues. What is new is "Pub/Sub message queue". Not sure what that is. From what I can tell, the author is describing message queue behaviour rather than pub/sub. The impact you are creating is far better and bigger than that of anyone working for FAANG.
That's correct, I meant that while NoSQL is easier to scale (automatically or by specifying a shard key), we can still scale relational DBs via sharding.
So we want to return the cache only, but if the user follows a celebrity then it will not be up to date. That means every time the user comes we still need to query the list of people that user is subscribed to, right? To check whether there is a celebrity?
I just got asked this question in an interview, but with the added feature to follow interests too, and I am surprised I answered pretty much the same thing that is stated here and I passed the interview! One thing to mention is that some companies/interviewers want to see SQL queries written out in order to see how you join the tables, so be prepared for that, I would say.
Caching the Feed page in the CDN and purge it on update(feed is tagged with User_ids), the infrastructure is basically a multi layer data retrieval, uid->followee->tweets(sorted by timestamps) and then merge to get the final result. The uid->followee mapping can be compactly stored and updated if needed. (K/V or RDB) followee->tweets would be a sharded DB with all tweets posted. (K/V). it would just be a simple backend and most of the load would be handled by the CDN.
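The uid -> followee -> tweets lookup followed by a merge can be sketched as a k-way merge of per-author timelines. This is only an illustration of the merge step, assuming each author's tweets are already stored newest first; the dict names and sample data are made up.

```python
import heapq
from itertools import islice

# uid -> followees (stand-in for the compact K/V mapping)
follows = {"u1": ["b", "c"]}

# followee -> [(timestamp, text)], each list sorted newest first
tweets_by_author = {
    "b": [(30, "b2"), (10, "b1")],
    "c": [(20, "c1")],
}


def build_feed(user, limit=10):
    """Merge the followees' timelines into one feed, newest first."""
    timelines = [tweets_by_author.get(f, []) for f in follows.get(user, [])]
    merged = heapq.merge(*timelines, key=lambda t: t[0], reverse=True)
    return list(islice(merged, limit))


print(build_feed("u1"))
```

`heapq.merge` is lazy, so with a small `limit` only the head of each timeline is consumed.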
That more or less is I think what he described for his feed cache description. But it doesn't solve the problem he brings up where we don't want to update all the followers' feed cache whenever a popular user posts a tweet. Also, I don't know how to do it, but when you say "on update", I'm assuming that whenever a person posts a tweet, all the users following that person gets "updated". In that case, then only thing that needs to be changed is inserting that new tweet into the feed (and probably popping out whatever oldest or least important tweet that is in the feed that this new tweet will replace). In that case, I don't think retrieving and merging all the relevant tweets each time there is an "update" makes sense. I think that's why he brought up pub/sub. So it's just a queue where whenever a new one comes the least important one gets popped out.
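The "insert the new tweet and pop the oldest one" idea maps naturally onto a fixed-size queue. A small sketch, assuming a hypothetical feed size of 3 and eviction purely by age (ranking by importance would need something more than this):

```python
from collections import deque

FEED_SIZE = 3  # assumed cap on cached feed entries per user

# Newest tweet sits at the left; the oldest falls off the right when full.
feed = deque(maxlen=FEED_SIZE)

for tweet_id in [1, 2, 3, 4]:
    feed.appendleft(tweet_id)

print(list(feed))
```

Each update is a constant-time push, so nothing has to be re-fetched or re-merged when a followed user posts.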
@@marspark6351 Maybe it's possible to determine a "popular" user and when those users create a tweet, only cache that tweet instead of allowing a message to go through the pub/sub when they post a tweet.
I have a question. In most read-intensive applications, the design is to add a cache layer like Redis to shield the DB from traffic. Can I skip the cache and instead add as many read-only MySQL replicas as needed to distribute the traffic? With a cache you also need to consider the sync problem between Redis and MySQL, but read-only replicas can get rid of this hassle.
I would believe this has less to do with whether it's SQL or noSQL, but probably more to do with that Redis makes better use of RAM than mysql. Don't take my word tho. Just a possible assumption
Pretty sure he means that someone shouldn't be able to use something like Postman to send a request with someone else's user id and retrieve all of their tweets.
I have a question about 23:46, on how that "update of the feed upon request instead of when a tweet is created" would work. So would the feed of a user keep getting continuously updated via the message queue whenever there's a new tweet, except for the tweets of the popular ones? And when that user requests the feed, it will somehow just fetch that missing tweet and fill it in the feed? How would that work? Isn't that the same issue as what's described at 19:57, where 19 of your 20 tweets could be cached but you'll have to go to the disk to find that one tweet?
Before returning the feed, the app server would check if the user follows any celebrity (one query to follow table). Then get the tweets of the celebrities which user follows from the cache, and inject them into the feed based on the timestamp. This approach has significant downsides like increasing latency for all users, so I believe this problem is addressed differently in real world.
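A rough sketch of that read path, assuming the celebrity set, the follow table, the pre-computed feed, and the celebrity tweet cache are all simple in-memory structures (every name and value below is hypothetical):

```python
CELEBRITIES = {"kim"}  # assumed set of accounts excluded from fan-out

follows = {"dave": {"kim", "bob"}}

# Pre-computed feed from fan-out on write: (timestamp, text), newest first.
precomputed_feed = {"dave": [(50, "bob: hi"), (20, "bob: old")]}

# Recent tweets of celebrities, kept in a cache keyed by author.
celebrity_tweets = {"kim": [(40, "kim: hello")]}


def get_feed(user, limit=20):
    """Inject followed celebrities' tweets into the cached feed by timestamp."""
    feed = list(precomputed_feed.get(user, []))
    for celeb in follows.get(user, set()) & CELEBRITIES:
        feed.extend(celebrity_tweets.get(celeb, []))
    feed.sort(key=lambda t: t[0], reverse=True)
    return feed[:limit]


print(get_feed("dave"))
```

The extra follow lookup and merge on every read is the latency cost the comment points out.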
If you have the capacity for asynchronously pre-building timelines for all (active) users, why don't you increase the capacity of the cache layer for the RLDB, or store the tweets in a fast KV NoSQL?
Probably, having NoSQL KV-store with such massive reads you'd have to deal with its sharding anyways. Don't think you'd just set up Cassandra and start throwing in nodes to the cluster mindlessly. So, author, choosing SQL DB, just makes that logic explicit.
Something I didn't understand: You suggest sharding on user ID as then the people a user follows will be grouped on the same shard. However, users can have a lot of followers and their followers will be distributed across different shards. So you have to duplicate a user's tweets across every shard that has someone following them in it, in which case you probably have enough fanout that you're not really sharding anymore, it's just replicas with more steps (at least for the read case, writes would be meaningfully sharded). Am I missing something here? It feels like to get any value out of sharding you'd have to do something MUCH more complicated like assign users to shards based off similarity graphs.
Don't guess the capacity, there are infinite servers, infinite ram, infinite disk. Don't calculate. Only poor calculate. Is the design horizontally scalable? Yes. Go home now
If user A follows B and C, and B follows A back, then all three should be on the same shard. In the same way, if B follows 10 more people and even one of them follows back, then all those 10 should be on the same shard, and so it goes on until all the data is on a single shard. It looks like a very abstract explanation; I am not sure why people don't think a little more rather than explaining it in that abstract way.
I am a little confused about the DB schema - can someone explain why would we favour indexing based on the follower rather than the followee? What's the advantage here? I would assume the former is more logical to implement but I can be wrong.
If we wanted all the people that user1 follows (in order to generate their news feed) we could run a query like: SELECT * FROM followers WHERE followers.followerId = user1 Notice we are filtering by followerId.
Whenever we create the newsfeed we want to populate it with tweets of people that a user follows(followee). So when we index the follower, we can query the followees relatively quickly, meaning that we can get all the people that a user follows, which makes it easier to create their newsfeed.
I understood it as: when the timeline is loaded and we need to populate tweets, you're going to be querying for tweets based on the follower of that tweet as opposed to the person being followed (followee). So if we say the user loading the timeline is the current_user, a query (in a simplified world) would be like: SELECT tweets.content FROM tweets WHERE tweets.follower_id = current_user.id Note how our WHERE clause is on follower_id of that tweet and NOT the person who wrote that tweet (followee).
Agree on the part that the data is more on the relational side. But why can't we put the tweets in a NoSQL DB like Cassandra or Scylla? From our follow table I know which followees' tweets I have to fetch. Now that I know, I simply have to search the shards where the followees' tweets are stored.
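To see the follower-side lookup end to end, here is a small runnable sketch using SQLite. It assumes a `follows(follower_id, followee_id)` table and a `tweets` table that stores only the author, so the feed query joins through the follow table; the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The composite primary key leads with follower_id, so
    -- "who does this user follow?" is an index lookup.
    CREATE TABLE follows (
        follower_id INTEGER NOT NULL,
        followee_id INTEGER NOT NULL,
        PRIMARY KEY (follower_id, followee_id)
    );
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL,
        content TEXT NOT NULL,
        created_at INTEGER NOT NULL
    );
    CREATE INDEX idx_tweets_author ON tweets (author_id, created_at);
    """
)
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [(1, 2, "from 2", 100), (2, 3, "from 3", 200), (3, 4, "from 4", 300)],
)


def feed_for(user_id):
    """Tweets written by the people user_id follows, newest first."""
    rows = conn.execute(
        """
        SELECT t.content
        FROM follows AS f
        JOIN tweets AS t ON t.author_id = f.followee_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        """,
        (user_id,),
    )
    return [content for (content,) in rows]


print(feed_for(1))
```

The filter is on `follower_id`, which is why indexing that side of the follow table helps feed generation; an index led by `followee_id` would instead serve "who follows this user?".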
Splendid! Solid content with crystal clear pronunciation and comfortable speed. How did you practice your speaking? I wish I could speak without those meaningless filler words ("er", "en", "aa") in a system design interview.
I understood why the userId helps as shard key but I did not understand why choosing "tweetId" as shard key does not help. Why do we have to query all the shards if we shard based on "tweetId"? Can someone explain, please?
There shouldn't be any userId in the POST /v1/tweet/create endpoint. This is because we will get the id of the user initiating the request from the authentication token in the request header. Putting sensitive information like authentication tokens in the request body is a security risk
There's no difference in security, whether you put the token in the headers or the body. But it's better to put it in the headers because your gateway can start checking it or sending the request to the destination API before it downloads the body. Putting the userId in the body doesn't make sense here, but it would allow you to have other features like "postponed tweets". And another service with an internal token (without the userId) could call the existing API to post those messages.
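A minimal sketch of taking the author from the token in the header rather than from the request body. The token format (`user_id.signature`, signed with HMAC-SHA256), the `SECRET`, and the handler shape are assumptions for illustration only and not how any real service issues tokens.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical signing key, for the example only


def issue_token(user_id):
    """Return 'user_id.signature' so the server can verify the id later."""
    sig = hmac.new(SECRET, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"


def user_from_token(token):
    """Verify the signature and return the user id, or raise PermissionError."""
    user_id, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    if not user_id or not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        raise PermissionError("invalid token")
    return user_id


def create_tweet(headers, body):
    """Handler sketch: the author comes from the header; body userId is ignored."""
    auth = headers.get("Authorization", "")
    token = auth[len("Bearer "):] if auth.startswith("Bearer ") else ""
    author = user_from_token(token)
    return {"author": author, "text": body["text"]}


headers = {"Authorization": "Bearer " + issue_token("alice")}
print(create_tweet(headers, {"userId": "mallory", "text": "hi"}))
```

Because the author is derived from the verified token, a client cannot post as someone else by editing the body, which is the concern raised a few comments earlier about tools like Postman.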
One of the defining features of twitter is timely notifications about new tweets from people you follow. Could you please describe how could it be implemented in this architecture? Likes and comments allow users to attach their content to a potentially popular tweet. How would it affect our storage layer? What challenges, if any, we would face with multi-az deployment of such system? Thank you for your time and interest in our company.
Why do you need a relational DB? All this relationship data can be stored in a document DB as JSON for high performance, low latency and scalability. A relational DB will not be efficient in this scenario because we need high availability and can accept eventual consistency, so a NoSQL store like MongoDB is preferred over a relational DB here. Correct me if I am wrong.
Speaking of popular users. We can separate tweet data by some follower threshold (say 10k followers) and, when popular profile post a new tweet, we only need to update that feed. Every normal profile will check that feed in case they follow popular profiles.
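The write side of that threshold idea might look like the sketch below: accounts at or above the threshold write once to a separate celebrity feed, everyone else fans out to followers. The 10k threshold comes from the comment; the dict names and sample accounts are hypothetical.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # "say 10k followers", per the comment

followers = {
    "kim": [f"fan{i}" for i in range(10_000)],
    "bob": ["dave"],
}
feed_cache = defaultdict(list)      # per-follower pre-computed feeds
celebrity_feed = defaultdict(list)  # one list per popular author


def on_new_tweet(author, tweet_id):
    """Route a new tweet: single write for popular authors, fan-out otherwise."""
    fans = followers.get(author, [])
    if len(fans) >= CELEBRITY_THRESHOLD:
        celebrity_feed[author].append(tweet_id)
        return "celebrity_feed"
    for fan in fans:
        feed_cache[fan].append(tweet_id)
    return "fan_out"
```

For example, `on_new_tweet("kim", 1)` performs one write instead of 10,000, while `on_new_tweet("bob", 2)` updates only dave's feed. Readers who follow a popular account then check `celebrity_feed` when their feed is requested.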
That’s amazing how this kind of large-scale system can grow and become so complex with amount of components and “moving parts”, also it’s impressive how it works with a massive amount of users and data storage like petabytes. In the end, I didn’t understand if your solution was using sharding or not on the database, if it is using, how do you solve the issue about the sharding-key, ‘cause it looks like not possible to use the “every account followed by someone” strategy due the reasons you even talked about. Is it possible to have sharding and reading replicas at the same time? And how to handle it, using many load balancers, each one after sharding for a single replicas cluster?
Use a DB like Cassandra: users, tweets, followers, follows, feed. Everything sharded by user ID to colocate relevant data. Fan out to followers feeds on tweet. For celebrity users, fetch the celebrity tweets from cache when building the feed. Have some background jobs pre-populate some other good feed candidates, Rank the feed by some scoring system. Push likes, retweets to an event stream and update cached like counters in Redis from the stream every so often. Shard on tweet ID and spin up some read replicas if needed
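The "update cached like counters from the stream every so often" step can be sketched as a periodic batch fold. The event shape and the plain dict standing in for the Redis counters are assumptions made for the example.

```python
from collections import Counter


def flush_likes(events, like_counts):
    """Fold a batch of stream events into the cached like counters."""
    deltas = Counter(e["tweet_id"] for e in events if e["type"] == "like")
    for tweet_id, n in deltas.items():
        like_counts[tweet_id] = like_counts.get(tweet_id, 0) + n
    return like_counts


batch = [
    {"type": "like", "tweet_id": 7},
    {"type": "retweet", "tweet_id": 7},
    {"type": "like", "tweet_id": 7},
    {"type": "like", "tweet_id": 9},
]
like_counts = flush_likes(batch, {7: 10})
print(like_counts)
```

Batching means a tweet with many likes in one interval costs one counter update rather than one per like, at the price of slightly stale counts.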
I wouldn't combine reads of tweets with "reads" of videos into a single number of data we're going to read from our "storage" as storing videos and streaming videos and storing and reading text tweets + meta data are completely different tasks which access and deals with data in a completely different way.
I would send the tweet timestamp from the client. If you handle it server-side and something breaks and delays the server-side ingestion of the tweet, you'd have an incorrect timestamp. ("Wow, what an amazing touchdown!" posted 2 hours after the touchdown and way out of context on feeds etc)
Thank you for the interesting video. However, I doubt that a relational database can store the tweets. I was just asked to design Twitter during a job interview and constructed something very similar. But I suggested using Aerospike for messages with the following schema: id -> list of messages. Aerospike scales horizontally, so there is no need to think about sharding.
Ngl as a aspiring software engineer, I find this video helpful in terms of macro design. New video style over the different duties of a software engineer? 👀👀
I don't understand the logic behind the statement at 18:20: "All the people this user follows will be on the single shard". Why so? Tweets of one user will be on one shard, but the accounts they follow (and their tweets) can be scattered across all the shards. Or maybe you meant it as the "logic of our sharding", but it would be impossible to maintain that sharding on every user's follow/unfollow.
Thank you so much for this video and its good content. One thing to correct, maybe: at 12:24, it's not good to save the authorization token in the DB, for security reasons. If one says that in an interview, the interviewer may think the interviewee does not care about this and reject him/her.
do you mean by passing user ids along with the request implies that the auth token is stored in db? because I don't see him mention it explicitly in the video where to store auth tokens. Also out of curiosity where do we store the auth tokens then?
Yes, you must be really disliking Elon Musk so much (to say it mildly ). > Who is most popular on Twitter? Kim Kardashian. probably over 100 million followers . ..... -- Putting the subject aside - you made a good content - thank you!
Would it make sense to only store the tweetId of the tweets in the feed cache, so when someone popular edits their tweet, the edited version will probably be in the tweet cache already, from where we can quickly grab it?
You’re already pushing tweet related info to the feed cache why would we limit it to tweet id only? That’s actually more of an overhead since we’ll need to do another request to actually fetch the tweet detail. Also for update we can always use a lastUpdate timestamp to compare and only push to the cache if it changed
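A small sketch of the lastUpdate comparison mentioned above: an edit is pushed into the cache only when it is newer than what is already there. The cache dict, field names, and function name are hypothetical.

```python
tweet_cache = {}  # tweet_id -> {"text": ..., "last_update": ...}


def push_if_newer(tweet_id, text, last_update):
    """Write to the cache only when this version is newer than the cached one."""
    cached = tweet_cache.get(tweet_id)
    if cached is not None and cached["last_update"] >= last_update:
        return False  # cache already holds this version or a later one
    tweet_cache[tweet_id] = {"text": text, "last_update": last_update}
    return True
```

For instance, pushing version 100 and then version 90 of the same tweet leaves version 100 in place, so out-of-order updates do not overwrite newer edits.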
I appreciate the effort and care you put into this video but I think it could use a little more focus. Especially at the sharding-for-writes portion. You jumped around a lot to digressions that made that line of thought hard to follow.
I loved your video very much, and thanks a lot for the effort you made. These are the questions we actually face when working on the BE side. One small question: if someone asks you what kind/type of architecture this is, what will your answer be?
It's so silly and cursed situation with system design interviews. Usually you have functional requirements to support 10e6+ users but you can't make even remotely viable design to support these requirements. It's always a hand wavy "a thing" you can't apply to real life in any way. And the most outrageous thing: in real life you never design for scale without already working product. It's always post tweaks for current and near future loads, numbers you have on hand.
You should leave Google for Twitter
tweet this video to Elon , he might make you CEO, he is weird like that.
@@BhargavSushant lol maybe i should
Yes hire then next day fire
and then they lay you off
@@vhchoang if you work at a startup and a new product is created from scratch, you get the opportunity to design these kinds of things.
I would love to see more System Design content !! nice video man
Thank you, more to come!
@@NeetCode I think he meant on your youtube channel haha..
@@sanskarkaazi3830 obviously, what else could he mean?
@@indiging8330 neetcode has premium courses on his website as well so not there but here.. you get what i mean?
Can you talk about Pinterest, or someone link some available content.
wow !!! from algorithms to system design, love to see more on system design videos
Musk will hire him
It's because Musk tweeted the HLD of Twitter on Twitter. You can see that in the thumbnail of this video too.
I guess twitter will be a case study in “does talent matter” and “how interchangeable/disposable are sw engineers”.
@@bryanyang7626 that's true, might not work well with other companies once people start realizing their lives are worth more than slaving away
@@Mattarii That is true if you were properly staffed to begin with. If twitter is as overstaffed as many people believe, then a large chunk of employees are effectively doing nothing. IF twitter goes from properly staffed to understaffed, you are correct. If twitter is going from overstaffed to properly staffed, then that won't happen.
the biggest thing about sharding is that we could potentially lose the joins, and it adds a huge layer of complexity on the application.
Great video! One question (or perhaps a mistake): at 18:20, you say all the people this guy follows should be on one shard, but I don't think that's possible. If person A follows B and C, then B and C should be on one shard. If person E follows C and D, then C and D should be on one shard, but C is already on a different shard. Maybe B, C, and D are all on one shard, but as long as each person follows another different person, we will end up with only one shard.
Thanks for this comment, I really didn't get this sharding thing :) Sharding per user looks impossible. I thought that maybe I had misunderstood this point, but after your comment it's clear.
Sounds like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard". The phrasing isn't great.
Your content is way, WAY better than the others on YouTube! Great work!
Love your content. Your videos helped me land a position at Twitter one year ago, but I just got laid off from Twitter and will start checking your videos again 😅
I'm sorry to hear that, wish you the best - it's only a matter of time!!!
me too😂😂
I once had an interview where I had to explain how to design something, and I totally missed the point. This definitely gives us a clear idea.
It's not about writing a user story, or even building the actual application, but about identifying the most critical points and possible components and coming up with ways to solve them.
Thanks again.
This level of quality content is available for free, it blows my mind! Also, I am churning through your Blind 75 list of questions and I am loving your solution videos.
how is this a quality content?
@@umarqureshi8499 what's wrong with it?
Loved it!
The only issue I see is sharding having all the people who follow each other in the same shard. That's just not possible, as a friend of yours will follow someone in another shard group at some point.
I haven't got a good answer for that yet, apart from saying we should use a GraphDB here that hopefully is optimised for sharding this kind of data...
Yes, that seems like a big oversight. Each shard will have a subset of a user's followees, so the proposed user ID as a shard key really doesn't do anything for us.
Yeah I felt like I was missing something when he said sharding and scrolled down to the comments to confirm
Just paused at that part, seems incorrect. The best sharding I think may be tweet id (assuming using chronological IDs like snowflake) as people are generally accessing the latest tweets so can grab them in a single request if it misses cache
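To make the "chronological IDs like snowflake" idea concrete, here is a minimal sketch of a time-ordered 64-bit ID. It assumes the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the epoch value and the names make_id and timestamp_ms_of are illustrative assumptions, not Twitter's actual code.

```python
# Sketch of a Snowflake-style ID: newer tweets always get larger IDs,
# so "latest tweets" can be found by sorting or range-scanning on the ID.
EPOCH_MS = 1288834974657  # custom epoch in ms (assumed value, for illustration)
WORKER_BITS = 10
SEQUENCE_BITS = 12


def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack timestamp, worker ID, and per-millisecond sequence into one int."""
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    elapsed = timestamp_ms - EPOCH_MS
    if elapsed < 0:
        raise ValueError("timestamp is before the epoch")
    return (elapsed << (WORKER_BITS + SEQUENCE_BITS)) | (worker_id << SEQUENCE_BITS) | sequence


def timestamp_ms_of(tweet_id: int) -> int:
    """Recover the creation time (ms since Unix epoch) from an ID."""
    return (tweet_id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS


older = make_id(EPOCH_MS + 1000, worker_id=5, sequence=0)
newer = make_id(EPOCH_MS + 1001, worker_id=1, sequence=0)
```

Because the timestamp occupies the high bits, older < newer holds regardless of which worker produced each ID.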
@@salient244 yeah, but you'd still need to store the friend relationships somehow, and you'd run into the sharding issue when it scales up
You've got it wrong. The idea is to have all the _followers_ of the user in one shard. This way, when the user posts a tweet, you get all their followers' IDs from one shard with one query. Then you use this list of IDs to update their respective feeds with the tweet. When the user requests their feed, they get it pre-computed from the cache, not built on the fly.
Very much enjoyed the video, the explanation, the simplicity and the clarity it brought out. Thank you
Glad it was helpful!
Extremely good discussion in this video, more of this please!
What books / sources did you refer to get a strong grip on system design?
DDIA is the most comprehensive resource (assuming you have at least some experience).
Also, most companies (including twitter) release blog posts and white papers about technical challenges they faced and how they overcame them. I think many beginners miss these, but they are an extremely valuable and free resource, which is why they are commonly referenced by system design textbooks.
Thanks!!
@@NeetCode Is there a central url where you find those blog posts or do you just google them?
A good place to start is by learning the classic OOP design patterns. It's less about the OOP and more about the patterns.
I can't believe that I just found this channel now. Great content
Wow! That's a lot to take in maybe because I'm sleepy but sparked at the same time. Put out more of this please.
Thanks!
If we shard based on user ID, won't it become a hotspot (if the user is a celebrity or has a large number of tweets)?
Amazing! This one of the best System Design videos I watched :) Great job!
Thank you for explaining in such detail. I learned about sharding, definitely will use in my projects.
At 18:12, how do you manage to get all the users one follows in a single shard? It seems strange to me. If that can work, then all the users need to be in a single shard, which fails the purpose of sharding.
Sounds to me like he meant "each person the user follows will be located uniquely within a single shard" and not "all the people he follows will be in the same shard"
@@Squigglybiggly the two sentences you said mean the same to me. Or maybe my English is bad.
@@arthur723 For a given user following users a, b, and c, with shards 1, 2, and 3:
User a's posts --> shard 2
User b's posts --> shard 3
User c's posts --> shard 1
But not:
User a --> some posts in shard 1 and some posts in shard 2
So for any given user, all of their data will be in a single shard. However, not all of the people you follow will all have data in the same shard
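A rough sketch of what this reply describes, assuming a simple hash-based shard key (the shard count and function names are made up for illustration). Each user ID maps to exactly one shard, while the people a user follows can land on several shards, so building a feed means grouping followees by shard and querying each shard once.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to one shard using a stable hash (same input, same shard)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_followees_by_shard(followee_ids: list, num_shards: int = NUM_SHARDS) -> dict:
    """Group followees so that each shard is queried at most once per feed build."""
    groups = {}
    for uid in followee_ids:
        groups.setdefault(shard_for_user(uid, num_shards), []).append(uid)
    return groups


followees = ["user_a", "user_b", "user_c", "user_d", "user_e"]
plan = group_followees_by_shard(followees)
```

All of user_a's tweets live on shard_for_user("user_a"), but plan may contain several shard keys, which is the distinction made above.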
14:11: why put an index on follower? I think once we index by followee, the follower list would be grouped inside the DB.
Literally Amazing man. Take a bow🙇♂️
Looking forward to part 2!!! More in-depth
First time Kim Kardashian has come up in any tech video I've watched
If sharding by user id then, to retrieve a single tweet (e.g. by a direct link), you would need to request all shards. Is it something tolerable or how do you overcome it?
And what about hot user problem? Sharding by user id does not work well in this case.
Yep, but there is no requirement in this case to be able to request tweet by id directly without knowing the author of the tweet.
Very good tutorial as always from NeetCode. Kudos.
One confusion though: I am aware of the publisher/subscriber pattern and I am also aware of message queues. What is new is the "Pub/Sub message queue". Not sure what that is. It looks more like the author is describing message queue behaviour than pub/sub.
The impact you are creating is far better and huge than anyone working for FAANG.
I almost spilled my coffee when i heard the word "How hard can it be?" LOL
Correction 9:01 We can also implement sharding in most nosql databases.
That's correct, I meant that while NoSQL is easier to scale (automatically or by specifying a shard key), we can still scale relational DBs via sharding.
Nice catch boss
Nice video! Gotta love some systems design
So if we want to return the cache only, but the user follows a celebrity, then it will not be up to date. That means every time the user comes we still need to query the list of people that user is subscribed to, right? To check whether there is a celebrity.
I just got asked this question in an interview, but with the added feature to follow interests too, and I am surprised I answered pretty much the same thing that is stated here and I passed the interview!, one thing to mention is that some companies/interviewers want to see SQL queries written in order to see how you make joins to the tables, so be prepared on that I would say
Caching the Feed page in the CDN and purge it on update(feed is tagged with User_ids), the infrastructure is basically a multi layer data retrieval, uid->followee->tweets(sorted by timestamps) and then merge to get the final result.
The uid->followee mapping can be compactly stored and updated if needed. (K/V or RDB)
followee->tweets would be a sharded DB with all tweets posted. (K/V).
it would just be a simple backend and most of the load would be handled by the CDN.
That more or less is I think what he described for his feed cache description.
But it doesn't solve the problem he brings up where we don't want to update all the followers' feed cache whenever a popular user posts a tweet.
Also, I don't know how to do it, but when you say "on update", I'm assuming that whenever a person posts a tweet, all the users following that person gets "updated". In that case, then only thing that needs to be changed is inserting that new tweet into the feed (and probably popping out whatever oldest or least important tweet that is in the feed that this new tweet will replace). In that case, I don't think retrieving and merging all the relevant tweets each time there is an "update" makes sense. I think that's why he brought up pub/sub. So it's just a queue where whenever a new one comes the least important one gets popped out.
@@marspark6351 Maybe it's possible to determine a "popular" user and when those users create a tweet, only cache that tweet instead of allowing a message to go through the pub/sub when they post a tweet.
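Here is a small in-memory sketch of the idea discussed in this thread: push each new tweet into every follower's bounded feed (the oldest entry drops out), except for "popular" authors, whose tweets are cached once instead of being fanned out. The class, the threshold, and the feed size are hypothetical, and a real system would use a message queue and a cache service rather than Python dicts.

```python
from collections import defaultdict, deque

FEED_SIZE = 3            # hypothetical: keep only the newest N tweets per feed
POPULAR_THRESHOLD = 2    # hypothetical: tiny value so the demo stays short


class FeedCache:
    def __init__(self):
        self.followers = defaultdict(set)  # author -> set of follower IDs
        # deque(maxlen=...) silently drops the oldest item when full
        self.feeds = defaultdict(lambda: deque(maxlen=FEED_SIZE))
        self.popular_tweets = defaultdict(lambda: deque(maxlen=FEED_SIZE))

    def follow(self, follower: str, followee: str) -> None:
        self.followers[followee].add(follower)

    def publish(self, author: str, tweet: str) -> int:
        """Fan out a tweet; return how many follower feeds were updated."""
        fans = self.followers[author]
        if len(fans) >= POPULAR_THRESHOLD:
            # Popular author: store once, merge into feeds at read time instead.
            self.popular_tweets[author].appendleft(tweet)
            return 0
        for fan in fans:
            self.feeds[fan].appendleft(tweet)  # newest first
        return len(fans)


cache = FeedCache()
cache.follow("bob", "alice")
for i in range(1, 5):
    cache.publish("alice", f"t{i}")
```

After the four publishes, bob's feed holds only the three newest tweets, which matches the "new one comes in, least important one gets popped out" behaviour described above.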
I have a question. In most read-intensive applications, the design adds a cache layer like Redis to shield the DB from traffic. Can I skip the cache and instead add as many read-only MySQL replicas as needed to distribute the traffic? A cache also brings the problem of keeping Redis and MySQL in sync, whereas read-only replicas avoid this hassle.
I would believe this has less to do with whether it's SQL or noSQL, but probably more to do with that Redis makes better use of RAM than mysql. Don't take my word tho. Just a possible assumption
12:48
Can you clarify what you meant by "I shouldn't be able to pass in your uid"?
Are you saying that function should actually not take uid as input?
Pretty sure he means that someone shouldn't be able to use something like Postman to send a request with someone else's user id and retrieve all of their tweets.
I have a question on how on 23:46 on how that "update of the feed upon request instead of during when a tweet is created" would work.
So would the feed of a user keep continuously get updated via the message queue whenever there's a new tweet, except for the tweets of the popular one? And when that user requests the feed, it will somehow just fetch that missing tweet and fill it in the feed? How would that work?
Isn't that the same issue as what it's described at 19:57 where 19 of your 20 tweets could be cached but you'll have to go to the disk to find that one tweet?
Before returning the feed, the app server would check if the user follows any celebrity (one query to follow table). Then get the tweets of the celebrities which user follows from the cache, and inject them into the feed based on the timestamp. This approach has significant downsides like increasing latency for all users, so I believe this problem is addressed differently in real world.
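The read-time injection described here can be sketched as a merge of two lists that are already sorted newest first. This is only an illustration under that assumption; the Tweet type and build_feed name are made up, and the lookup of which celebrities a user follows is left out.

```python
import heapq
import itertools
from typing import NamedTuple


class Tweet(NamedTuple):
    timestamp: int  # seconds since epoch; larger means newer
    author: str
    text: str


def build_feed(cached_feed, celebrity_tweets, limit=20):
    """Merge the pre-computed feed with celebrity tweets, newest first.

    Both inputs must already be sorted by timestamp in descending order.
    """
    merged = heapq.merge(
        cached_feed, celebrity_tweets, key=lambda t: t.timestamp, reverse=True
    )
    return list(itertools.islice(merged, limit))


cached = [Tweet(50, "alice", "lunch"), Tweet(10, "bob", "hello")]
celebs = [Tweet(30, "kim", "new post")]
feed = build_feed(cached, celebs, limit=2)
```

The merge itself is cheap; the extra latency the comment worries about comes from the additional follow-table query and cache lookups on every feed request.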
If you have the capacity for asynchronously pre-building timelines for all (active) users, why don't you increase the capacity of the cache layer for the RLDB, or store the tweets in a fast KV NoSQL?
Probably, having NoSQL KV-store with such massive reads you'd have to deal with its sharding anyways. Don't think you'd just set up Cassandra and start throwing in nodes to the cluster mindlessly. So, author, choosing SQL DB, just makes that logic explicit.
Something I didn't understand:
You suggest sharding on user ID, as then the people a user follows will be grouped on the same shard.
However, users can have a lot of followers and their followers will be distributed across different shards. So you have to duplicate a user's tweets across every shard that has someone following them in it in which case you probably have enough fanout that you're not really sharding anymore, it's just replicas with more steps (at least for the read case, writes would be meaningfully sharded).
Am I missing something here? It feels like to get any value out of sharding you'd have to do something MUCH more complicated like assign users to shards based off similarity graphs.
Don't guess the capacity, there are infinite servers, infinite ram, infinite disk. Don't calculate. Only poor calculate. Is the design horizontally scalable? Yes. Go home now
If user A follows B and C, and B follows A back, then all three should be on the same shard. In the same way, if B follows 10 more people and even one of them follows back, then all those 10 should be on the same shard too, and it goes on until all the data is on a single shard. It looks like a very abstract approach; I am not sure why people don't think it through a little more rather than explaining it in that abstract way.
which tool are you using to draw the diagrams?
If we have a read-heavy system, why are we not using a master/slave design?
I really enjoy watching your video!!
The abstract design is vital! Now I have realized this point.
I am a little confused about the DB schema - can someone explain why would we favour indexing based on the follower rather than the followee? What's the advantage here?
I would assume the former is more logical to implement but I can be wrong.
If we wanted all the people that user1 follows (in order to generate their news feed) we could run a query like:
SELECT *
FROM followers
WHERE followers.followerId = user1
Notice we are filtering by followerId.
Whenever we create the newsfeed we want to populate it with tweets of people that a user follows(followee). So when we index the follower, we can query the followees relatively quickly, meaning that we can get all the people that a user follows, which makes it easier to create their newsfeed.
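A quick way to check this reasoning is with SQLite, used here only as a stand-in for the real database. The followers table and followerId column come from the query above; the followeeId column and the index name are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE followers (followerId INTEGER NOT NULL, followeeId INTEGER NOT NULL);
    CREATE INDEX idx_followers_follower ON followers (followerId);
    INSERT INTO followers VALUES (1, 2), (1, 3), (2, 3), (3, 1);
    """
)


def followees_of(user_id: int) -> list:
    """Return the IDs of everyone user_id follows (the feed's source accounts)."""
    rows = conn.execute(
        "SELECT followeeId FROM followers WHERE followerId = ? ORDER BY followeeId",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Ask SQLite how it would run the lookup; the last column holds the plan text.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT followeeId FROM followers WHERE followerId = ?",
    (1,),
).fetchall()
uses_index = any("idx_followers_follower" in row[-1] for row in plan)
print(followees_of(1), uses_index)
```

With the index on followerId, the lookup is an index search rather than a full table scan. Listing a user's followers (filtering by followeeId) would need a second index.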
I understood it as: when the timeline is loaded and we need to populate tweets, you're going to be querying for tweets based on the follower of that tweet as opposed to the person being followed (followee).
So if we say the user loading the timeline is the current_user a query (in a simplified world) would be like:
SELECT tweets.content FROM tweets WHERE tweets.follower_id = current_user.id
Note how our WHERE clause is on follower_id of that tweet and NOT the person who wrote that tweet (followee).
I see. Thank you all for explaining this!
Agreed that the data is more on the relational side. But why can't we put the tweets in a NoSQL DB like Cassandra or Scylla? From our follow table I know which followees' tweets I have to fetch. Once I know that, I simply have to look up the followees' tweets in the shards where they are stored.
Amazing video, this has made me curious about systems design roles in industry
What a nice video, I learnt a lot even being a junior developer.
Btw, how can I find the official twitter engineering paper you mentioned at the end?
I’d try checking their engineering blog for leads.
Splendid! Solid content with crystal clear pronunciation and comfortable speed. How did you practice your speaking? I wish I could speak no er----en-----aa those no meaning words in a system design interview.
why do you index on follower, instead of making 2 db index on both
I understood why the userId helps as a shard key, but I did not understand why choosing "tweetId" as the shard key does not help. Why do we have to query all the shards if we shard based on "tweetId"? Can someone explain please?
Awesome video, what are you using as your board?
how would feed cache work when user scrolls indefinitely
There shouldn't be any userId in the POST /v1/tweet/create endpoint. This is because we will get the id of the user initiating the request from the authentication token in the request header. Putting sensitive information like authentication tokens in the request body is a security risk
There's no difference in security, whether you put the token in the headers or the body. But it's better to put it in the headers because your gateway can start checking it or sending the request to the destination API before it downloads the body. Putting the userId in the body doesn't make sense here, but it would allow you to have other features like "postponed tweets". And another service with an internal token (without the userId) could call the existing API to post those messages.
One of the defining features of twitter is timely notifications about new tweets from people you follow. Could you please describe how could it be implemented in this architecture? Likes and comments allow users to attach their content to a potentially popular tweet. How would it affect our storage layer? What challenges, if any, we would face with multi-az deployment of such system? Thank you for your time and interest in our company.
What software and device do you use for the drawing?
That initial diss on twitter is everything 😂😂
Why you need relational db, all this relationship data you can store in document db as json for high performance, low latency and scalability. Usage of relational db will not be efficient in this scenario because we need to achieve high availability, we need eventual consistency so NoSql Mongo db is preferred over relational db in this scenario. Correct me if I am wrong.
Speaking of popular users. We can separate tweet data by some follower threshold (say 10k followers) and, when popular profile post a new tweet, we only need to update that feed. Every normal profile will check that feed in case they follow popular profiles.
So...use the average Twitterer's tweets as load dampening. They should do that. It will make Twitter even less popular.
Where do those caches live? Are they separate servers? Or are we caching on the app servers?
Twitter uses redis. Separate servers, sharded by tweet ID, with read replicas
omg, you're insane. thank you!
This is great! Thank you!
That’s amazing how this kind of large-scale system can grow and become so complex with amount of components and “moving parts”, also it’s impressive how it works with a massive amount of users and data storage like petabytes. In the end, I didn’t understand if your solution was using sharding or not on the database, if it is using, how do you solve the issue about the sharding-key, ‘cause it looks like not possible to use the “every account followed by someone” strategy due the reasons you even talked about.
Is it possible to have sharding and reading replicas at the same time? And how to handle it, using many load balancers, each one after sharding for a single replicas cluster?
I was left with the same impression. I don't see how this sharding could work
Use a DB like Cassandra: users, tweets, followers, follows, feed. Everything sharded by user ID to colocate relevant data.
Fan out to followers feeds on tweet. For celebrity users, fetch the celebrity tweets from cache when building the feed. Have some background jobs pre-populate some other good feed candidates, Rank the feed by some scoring system.
Push likes, retweets to an event stream and update cached like counters in Redis from the stream every so often. Shard on tweet ID and spin up some read replicas if needed
This is great. I loled at 0:48 .This video is neet.
we need more of these for sure
I don't even have a Twitter account, and I never really saw the need for one.
So do the interviewers give input on what Twitter is used for?
Don't forget ads. Imagine how complex this whole thing becomes when we add in ads.
my IT classes coming in clutch
I'm confused about how pub/sub works. Can anyone explain to me what it's supposed to do? If you can explain like I'm five, that would be great! Thanks.
considering how many joins you would have to do in a relational DB, it would be hard to justify that for twitter.
I wouldn't combine reads of tweets with "reads" of videos into a single number of data we're going to read from our "storage" as storing videos and streaming videos and storing and reading text tweets + meta data are completely different tasks which access and deals with data in a completely different way.
which hardware you use for writing?
I would send the tweet timestamp from the client. If you handle it server-side and something breaks and delays the server-side ingestion of the tweet, you'd have an incorrect timestamp. ("Wow, what an amazing touchdown!" posted 2 hours after the touchdown and way out of context on feeds etc)
Thank you for the interesting video. However, I doubt that a relational database can store the tweets. I was just asked to design Twitter during a job interview and constructed something very similar. But I suggested using Aerospike for messages with the following schema: id -> list of messages. Aerospike scales horizontally, so there is no need to think about sharding.
Ngl as a aspiring software engineer, I find this video helpful in terms of macro design. New video style over the different duties of a software engineer? 👀👀
If the interviewer is Elon, all you need to do is remember the word “turboencabulator”.
I don't understand the logic behind the statement on 18:20. "All the people this user follows will be on the single shard". Why so? Tweets of one user will be on one shard - but the following accounts (their tweets) can be scattered across all the shards. Or maybe you meant it by "logic of our sharding" - but it would be impossible to maintain our sharding on every users follow-unfollow.
Great video. Learnt a lot.
Thank you so much for this video and its good content. One thing to correct, maybe: at 12:24, it's not good to save the authorization token in the DB, for security reasons. If one says that in an interview, the interviewer may think the interviewee does not care about this and reject them.
do you mean by passing user ids along with the request implies that the auth token is stored in db? because I don't see him mention it explicitly in the video where to store auth tokens. Also out of curiosity where do we store the auth tokens then?
@@zhenghaohe4727 we don't store them, we validate them against our secrets
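One way to read "validate them against our secrets" is a signed token: the server signs the user ID with a secret key and later recomputes the signature instead of looking the token up in a database. The sketch below uses a plain HMAC to show the principle; the token format, key, and function names are made up, and real systems typically use a standard such as JWT with expiry and key rotation.

```python
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"demo-secret-change-me"  # hypothetical; keep real keys out of source code


def issue_token(user_id: str, secret: bytes = SECRET_KEY) -> str:
    """Create a token of the form '<user_id>.<hex signature>'."""
    signature = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{user_id}.{signature}"


def verify_token(token: str, secret: bytes = SECRET_KEY) -> Optional[str]:
    """Return the user ID if the signature is valid, otherwise None."""
    user_id, sep, signature = token.rpartition(".")
    if not sep:
        return None  # no separator, so this cannot be one of our tokens
    expected = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return user_id
    return None


token = issue_token("user42")
forged = "user43." + token.split(".")[1]
```

verify_token(token) gives back "user42", while the forged token fails because its signature was computed for a different user ID. This is also why the API does not need a user ID in the request body: the server derives it from the verified token.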
Yes, you must be really disliking Elon Musk so much (to say it mildly ).
> Who is most popular on Twitter? Kim Kardashian. probably over 100 million followers .
.....
--
Putting the subject aside - you made a good content - thank you!
Would it make sense to only store the tweetId of the tweets in the feed cache, so when someone popular edits their tweet, the edited version will probably be in the tweet cache already, from where we can quickly grab it?
You’re already pushing tweet related info to the feed cache why would we limit it to tweet id only? That’s actually more of an overhead since we’ll need to do another request to actually fetch the tweet detail. Also for update we can always use a lastUpdate timestamp to compare and only push to the cache if it changed
People watch netflix, I watch neetcode.
I appreciate the effort and care you put into this video but I think it could use a little more focus. Especially at the sharding-for-writes portion. You jumped around a lot to digressions that made that line of thought hard to follow.
so what's a batch RPC? Asking for a friend...
inserting/retrieving multiple things at once rather than separately
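A toy illustration of the difference, with a fake in-memory store standing in for a remote service (FakeTweetStore and insert_many are hypothetical names, not a real API). Instead of one call per tweet, items are grouped into chunks and each chunk is sent in a single call.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole chunk
        yield batch


class FakeTweetStore:
    """Stand-in for a remote service; counts how many calls it receives."""

    def __init__(self) -> None:
        self.calls = 0
        self.rows: List[str] = []

    def insert_many(self, tweets: List[str]) -> None:
        self.calls += 1  # one network round trip per call in a real system
        self.rows.extend(tweets)


store = FakeTweetStore()
for chunk in batched([f"tweet {i}" for i in range(10)], size=4):
    store.insert_many(chunk)
```

Ten tweets go out in three calls (4 + 4 + 2) instead of ten, which is where the savings in round trips come from.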
The problem right now is not about designing a workable system but a system that works smoothly without spending much $$$ on the infrastructure.
I loved your video very much, and thanks a lot for the effort you put in. These are the questions we actually face when working on the BE side.
One small question,
If someone asks you, what kind/type of architecture is this? What will be your answer?
Gracias - Thanks, great video.
What tools do you use to draw?
Can someone please tell me what drawing tool he use here?
What software do you use to record drawings like this?
Paint3d and streamlabs obs
Neet: How hard could it be?
Candidate: *sweats profusely seeing Elon* 😥
Bro how do you draw so good with the mouse
It's such a silly and cursed situation with system design interviews. Usually you have functional requirements to support 10e6+ users, but you can't make even a remotely viable design to support these requirements. It's always a hand-wavy "thing" you can't apply to real life in any way. And the most outrageous thing: in real life you never design for scale without an already working product. It's always after-the-fact tweaks for current and near-future loads, using numbers you have on hand.
I would argue you need an index on both followee and follower because in twitter you can see both ways