Of all the system design prep videos, I find Sandeep’s videos to be the most comprehensive. I’ve been interviewing the past month and I ended up with 7 offers with system design interview being my strongest in those. All thanks to code karle videos - I cannot appreciate enough how much of a help those were during interview prep.
@@AbhishekKumar-vf3cu I don't think I could get lucky 7 times over. Based on the feedback from those companies, my offers did not look like pity hires. But if you still think the hiring bar has been low across the board, well, everyone can benefit from it. Good luck with your interviews!
Videos by Sandeep are the most comprehensive system design videos anyone can see - The flow of information is smooth, logical and easy to follow. Heartfelt thanks to Sandeep for being one of the best tutors on the net.
No doubt the content of the video is informative for all system design enthusiasts, but a few points I was looking for were missing.
1. You could have talked about service interfaces, such as the design of some APIs and a high-level request and response design.
2. You could have talked about rough estimates such as requests per minute, number of users, and bandwidth consumption, and how we can optimize on those points.
3. A clear-cut explanation of why we used Kafka, as it's a fancy term: what is the actual use case, and why not a simple, lightweight messaging component such as ActiveMQ, if we need one at all?
4. A deep dive into the internals of Redis and how we use it in this use case.
5. How we can handle failure scenarios, as there can be many in such an interconnected system.
6. Some internals of database design: indexing, sharding, and partitioning, as there would be a lot of reads and writes.
Amazing video. A suggestion: Mentioning about sharding mechanisms, configuration svcs such as ZK, consistent hashing etc. might help clarify the complete picture for a particular service
This is for the first time i landed here and found it so fascinating out of all the system design content... Thank you so much for being here for us ❤️❤️
Great information and explanation! As a continuation, here are some thoughts and questions to help clarify things for myself.
1. Whenever a group message is sent, the client should show it in the group window, not in the individual message window, right? To do this, when the server sends a group message to the individual users in the group, it needs to tell the client that it's a group message.
2. If the data model were defined, showing the tables and the fields each table has, the above point might be clearer. I'm not sure how important it is to describe the data model in an interview.
3. The Last Seen service could get info from the "Web Socket Manager" in the previous diagram, as it knows which user is connected to which handler. The moment a user loses connection, it can inform the Last Seen service to stop updating the time, right? (SORRY, MY BAD, my understanding was wrong! Last seen time is based on when the user is "on the app", not just when they have a "live connection with the server".) This comment might help others as well.
4. How long does the WhatsApp server keep the connection alive? Is it forever, until the user logs out? Or until we kill the running instance of the app on the client side?
You are a PRO. Thanks for so many amazing videos. I have seen other videos on chat systems as well, but none can beat yours in terms of deep knowledge and clarity. What wonderful explanations.
Love your content, esp because you capture a lot of breadth for all problems. I don't think 64k limit is true on the server side. The server is always listening on a single port and supports simultaneous connections by accepting connections on a different socket. So for a server, the number of file descriptors a server can create is the limiting criterion AFAIK
That's correct. A connection is uniquely defined by the (client IP, client port, server IP, server port) tuple. A server can serve multiple clients on the same port. For each such connection, the server keeps track using file descriptors. So, essentially, file descriptors become the bottleneck. That can theoretically be in the millions. But practically, serving each request requires other resources like memory and CPU. Usually, modern commodity servers (8-core CPU, 16 GB RAM) can easily handle 60-100k "lightweight" connections.
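To make the tuple argument in the reply above concrete, here is a minimal Python sketch (not from the video). It accepts several loopback connections on one listening port and records the (client IP, client port, server IP, server port) tuple of each. The function name `open_connections` and the use of the loopback address are assumptions made only for illustration.

```python
import socket

def open_connections(n_clients=3):
    """Accept n_clients TCP connections on ONE listening port and return
    the (client_ip, client_port, server_ip, server_port) tuple of each."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(n_clients)
    server_port = server.getsockname()[1]

    clients, accepted, tuples = [], [], []
    for _ in range(n_clients):
        client = socket.create_connection(("127.0.0.1", server_port))
        conn, peer = server.accept()   # new socket (file descriptor), same port
        clients.append(client)
        accepted.append(conn)
        tuples.append(peer + conn.getsockname())

    for s in clients + accepted + [server]:
        s.close()
    return server_port, tuples

if __name__ == "__main__":
    port, conns = open_connections()
    for t in conns:
        print(t)   # server port is identical; only the client port differs
```

Every accepted connection reports the same server port, so on the server side the count is bounded by file descriptors and memory rather than by the number of ports.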
@@pankajsingh17787 A server can handle 64k websocket connections with one network interface; more network interfaces can be added to the server to allow it to handle more websocket connections.
@@pankajsingh17787 Not really. Any TCP connection has the 5-tuple, and every accepted connection needs to have a dynamic port associated with it. There are 64K ports. So per IP address, you can have at most 64K connections.
@@arushkharbanda This could be the case if the memory is less on the system. Assuming abundant memory, number of connections are limited by number of ports supported by the TCP/IP stack. First 1024 ports are special ports and are associated with processes with euid 0. You don't want to run your websocket server as root due to obvious security concerns. So, at max you can have 63K connections per IP address per server assuming resources hold up.
since you have mentioned cassandra and redis databases. Sharing sample table structure and queries would surely help us in the visualization of the data and would give us some ideas about the underlying databases as well. But Great Effort in making the concepts very simple
By far the best system design videos you can find on YouTube. Amazing how you have so much clarity on these. I would love to have a video on tools like Smartsheet. Thanks in advance.
Great video. Some of the issues, like the 64k limit, are already pointed out in other comments. At the end you talked about decreasing the lag in Kafka by adding more consumers. I guess you must be talking about adding more consumers to a consumer group, but unfortunately it won't help once the count exceeds the number of partitions. Excess consumers would sit idle in this case.
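The point about idle consumers can be illustrated with a small simulation. This is only a sketch of a round-robin style assignment within one consumer group, not Kafka's actual assignor code; `assign_partitions` and `idle_consumers` are hypothetical helper names.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.
    Each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

def idle_consumers(assignment):
    """Consumers that received no partition and therefore do no work."""
    return [c for c, parts in assignment.items() if not parts]

if __name__ == "__main__":
    group = ["c1", "c2", "c3", "c4", "c5"]
    result = assign_partitions(3, group)   # 3 partitions, 5 consumers
    print(result)
    print("idle:", idle_consumers(result))
```

With 3 partitions and 5 consumers, two consumers end up with nothing to read, so reducing lag further would require more partitions, not more consumers.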
Hi Sandeep, want to take a moment to express my gratitude to your video series on system design concepts and scenarios. It was immensely helpful to ace through the system design interview. The system design interview scenario was different but the thinking pattern that I got from your videos helped to have a great dialogue with the interviewer.
I liked it because even though its about a chat application you discussed about other avenues that will utilize this app and make money out of it like your touch on analytics, and also you go end to end like a finished product from beginning to end. its complete, and to the point. Good work.
This is a great video. It would be great if you could also talk about how message ordering is ensured in the case of FB Messenger. I doubt it can happen solely on the basis of timestamps (client/server), since depending on where users are and the timezones they are in, they might have different timestamp values. This is typically done through message IDs generated by the client. You did pick that up, but the point that they can be used to ensure a consistent ordering of messages was missed. Please do include that as well if you can.
I would rather do it as a combination of an auto-increment numeric message id from the user who is sending and a timestamp in UTC for a conversation/group. The message ID I talked about in the video was more of a UUID/primary key. It's not efficient to use it for ordering, I believe.
This is my 3rd comment on your videos. I'm really liking your videos. Thank you so much! Here are a few questions. Please don't mind if these questions seem dumb; I'm in the initial phase of learning. Questions: 1. Who decides which user will connect to which web socket handler? 2. You mentioned that in Redis we'll store 2 things: a) a UserId-WebSocketHandler mapping and b) a WebSocketHandler-List mapping. Why is b required? 3. What if a WebSocket Handler machine goes down? What will the behavior be in that case? Do we need to remove the mapping in Redis, and who will do this? 4. Is there a chance of losing a message? Let's say the WebSocket machine went down before the message was delivered and before the message was put in Cassandra. Will there be retry logic from the client? What are your thoughts on using a message queue here? 5. Is the WebSocket Manager a microservice? Is it a single point of failure? What if it goes down?
Hope this answers your questions. 1. The load balancer will connect the user to any one machine. It could maintain the connection from then on. In an advanced implementation, the web socket manager could act as a traffic distributor. 2. b is not required for any functional flow, but might be used for debugging or for analysis in case there is an issue with some users or some machines. 3. Users will initiate the connection again. Apps will have this heartbeat logic. 4. The client would retry if there is no ack from the server. A message queue would also have this data loss issue if it's not handled via ack. A message queue here is not much of an added advantage, but it does add a lot of latency and impacts the NFR of low latency. 5. It's a microservice, but that does not mean it's a single point of failure. There should be multiple instances in multiple data centres to handle any outage. That's why it's not storing data in memory.
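The two mappings discussed in points 2 and 3 above can be sketched as follows. Plain Python dicts stand in for Redis here, and the class name `ConnectionRegistry` and its methods are hypothetical. Using the reverse mapping for cleanup when a handler dies is an assumption of this sketch; the reply above only mentions it for debugging and analysis.

```python
class ConnectionRegistry:
    """In-memory stand-in for the two mappings kept in Redis."""

    def __init__(self):
        self.user_to_handler = {}    # a) user_id -> handler_id
        self.handler_to_users = {}   # b) handler_id -> set of user_ids

    def connect(self, user_id, handler_id):
        """Record a (re)connection; a user has at most one live handler."""
        self.disconnect(user_id)
        self.user_to_handler[user_id] = handler_id
        self.handler_to_users.setdefault(handler_id, set()).add(user_id)

    def disconnect(self, user_id):
        handler_id = self.user_to_handler.pop(user_id, None)
        if handler_id is not None:
            self.handler_to_users[handler_id].discard(user_id)

    def handler_down(self, handler_id):
        """Drop every mapping that points at a dead handler (assumed use of b)."""
        users = self.handler_to_users.pop(handler_id, set())
        for user_id in users:
            self.user_to_handler.pop(user_id, None)
        return users

if __name__ == "__main__":
    reg = ConnectionRegistry()
    reg.connect("u1", "h1")
    reg.connect("u2", "h1")
    reg.connect("u1", "h2")          # u1 reconnects via another handler
    print(reg.handler_down("h1"))    # only u2 was still on h1
    print(reg.user_to_handler)
```

After `handler_down`, the affected clients are expected to reconnect through the load balancer, as described in point 3 of the reply.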
thank you for the detailed explanation, so much better than any whatsapp design video i saw. The summary link helps a lot to revise on the design! Great work!
Really sad why you didn't receive same number of likes as views:) thank you for teaching such good content for free. I wish I could press the like button 100 times. Great efforts!!
Thank you so much for your dedication and the great content very helpful. In your blog summary, I think there's a slight correction to be done the diagram. In 1:1 messaging where "Websocket Manager" is not communicating with "Message Service" which I think it should given that "Message Service" is the one handling storing the messages to Cassandra
The best content I have come across from any all the content producers. All the components are explained with necessary technical details which interviewers drill down to and ask questions about. Thank you! Just one comment: Investment in a high-quality microphone would elevate your videos to a whole new level which are already unparalleled. Please keep making them.
Thank you so much for this wonderful video. I have a question. Isn't the web-socket manager a single point of failure? If it briefly goes down, how will it know the scope of the existing web-socket connections on the handlers?
Great content. Thanks for the hard work .. Could you please help answer a few questions 1) In the non functional requirements, you don't mention consistency. Shouldn't we talk about what consistency means for this system, and if it should be maintained 2) Why use MySql for storing User and Group Data? Why not noSql, it's easy to scale as compared to sql dbs as there could be millions of users 3) How is the message order taken care of in case of failures? Ex: timestamp1: U1 sends U2 message M1 timestamp2: U1 sends U2 message M2 Delivery of M1 fails during first attempt, but succeeds on successive retries M2 succeeds on first try .. as a result, M2 is sent before M1
Consistency should be maintained at a reasonable level. We cannot build an inconsistent system here. The reason it's not too big a problem is that there are two possibilities: 1. If the users are live and engaged in a continuous chat, or just online, the message flows through the WebSocket managers/handlers and we do not bother about what happens on the DB front. 2. The second scenario is when one of the users is offline. This is the scenario where consistency matters. But it does not take minutes for the DB to become consistent. It'll become consistent within a second (maybe a few seconds in the worst case of extreme traffic), and since the user is offline, we would have the message replicated by the time they come online. One edge case is a race condition between them coming online and the message being sent. To take care of that, I suggested fetching messages for the user a few seconds after they come online. MySQL would be able to scale to millions of rows, so that's not a concern. But more importantly, the data for these services is structured, which makes a good case for MySQL. We could use a NoSQL store though; I just don't see any amazing benefit of using NoSQL here. I would suggest looking at this video for this: ruclips.net/video/cODCpXtPHbQ/видео.html For the retry in case of message delivery, I think it is okay if messages are shown out of order, as long as we show the same thing on the UI to both parties. Alternatively, we can build sender-side message ordering using incrementing ids to handle this scenario.
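The sender-side ordering with incrementing ids mentioned at the end of the reply above could look roughly like this on the receiving client. This is a sketch under assumptions: each sender numbers its messages 1, 2, 3, ... per conversation, and the class name `SenderOrderBuffer` is hypothetical.

```python
class SenderOrderBuffer:
    """Release messages from each sender strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number expected
        self.pending = {}    # sender -> {seq: text} that arrived too early

    def receive(self, sender, seq, text):
        """Return the messages that can be displayed now, in order."""
        expected = self.next_seq.setdefault(sender, 1)
        held = self.pending.setdefault(sender, {})
        if seq < expected:
            return []        # duplicate caused by a retry; already displayed
        held[seq] = text
        ready = []
        while expected in held:
            ready.append(held.pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return ready

if __name__ == "__main__":
    buf = SenderOrderBuffer()
    print(buf.receive("U1", 2, "M2"))   # M1 not seen yet, so M2 is held back
    print(buf.receive("U1", 1, "M1"))   # M1 arrives after a retry
```

In the M1/M2 scenario from the question, M2 is held until the retried M1 arrives, and a second delivery of M1 is ignored as a duplicate.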
Can some answer my doubts? 1) How would the packet flow be if user1 sends a message to user 2? 2) Will all web socket handlers be connected in a mesh? 3) Will web socket handler 1 push the message to message service and web socket handler 2 take it from the message service? 4) what protocol would be maintained between web socket handler and web socket manager? Is REST enough? 5) The cache maintained by web socket handlers. Are they in-memory cache or are they distributed?
pretty good Sandeep. I feel you cover more in very less amount of time. I was wondering what your thought process was in having group messaging as separate. You can also think of direct 1-1 messaging as a group of 2 , no ? Also when in case of 1-1 messaging, when you are sending message directly from websocket_handler_A => websocket_handler_B, wouldn't it lead to system crash in case of rush hour ? In short my thought is like handling 1-1 messaging same as group messaging, in this case the async queue can buffer the messages and send with a slighter more delay, instead of a web_socker server crash. Let us know. Anyways it was fun video dude. Keep posting.
I was also thinking in that direction due to the decoupling provided inherently by Kafka but I think the problem with that is creating a topic for each user in Kafka can lead to performance issues since there are billions of users, but since there are lesser number of groups (maybe millions) this approach works for group messaging.
Thanks @Sandeep @codeKarle for the great explanation. Myself Rishi, currently working as a Senior Software Engineer having completed my Bachelors in Computer Science from BITS. As is the case with most Design discussions I do have a few questions: 1) For the Group messages how do we handle the sequence of messages from a particular user. I understand that messages from multiple users can/need not be guaranteed a specific sequence, but all messages in a Group chat need to be in a proper sequence to avoid context misses in the conversation. Do we use a messageId or ordering of some sort for this? Or maybe use the ID Cassandra generated by the initial entry of the message into Cassandra ? 2) I have some concern with you preferring Cassandra over Redis for the last seen service. Redis being distributed and highly scalable, along with proper sharding can be appropriate for this use case IMO. We used it extensively for our gaming application for both read/write serving millions of I/O requests per minute.
Great. I agree with Ayesha, that SanDeep's seem most complete. ex, he goes into race conditions example , and the websock handlers ... that i don't see in other videos.
Thanks for this video. Very crisp and informative. I have one doubt: a host supports about 65,000 ports, not connections. The number of connections does not depend on the number of ports. A single port can handle more than one connection. Please clarify.
This is great. Request you to make videos on the individual components such as load balancers, databases, web sockets, caching etc. That would help many of us know the details of these things and designing systems will be much easier then
Thanks buddy! We could use a MySQL as well here, and it'll work well to a great extent, but the parameters that you can consider to decide on a DB are many more other than just the scale. For example, in this case, we might always query just on one field(let's say user_id) to fetch the chats of a person. Here we don't need complicated where clauses, we can just live with a DB that provides all values(chat details) against a key(user_id). If you look at it from this angle, there are a lot of DBs that are more optimal than a MySQL for this kind of a query pattern. I would recommend to have a look at this video: ruclips.net/video/cODCpXtPHbQ/видео.html. Here, I am covering exactly this question of which DB should be used when, and hopefully it'll answer your question :)
This is great stuff. I have also bought your Udemy course. A couple of questions: 1. Shouldn't we offload some of the things the websocket handler does? 2. How is message order achieved? 3. Why isn't Kafka used for passing messages between services, as everything happening in the system is essentially async in nature? Thanks.
Big fan of you sandeep. One quick q - if the target is offline why store the messages locally on sender. Why not store in cassandra and have it delivered when the user is online again
There are various nuances of the group messaging service. - What should be the workflow for delivering messages to a group, Assuming the group has a huge membership. do we want to delay the delivery ? Does the Group service cache the information for user to machine mapping so as not to overwhelm the redis instance ? - How to handle duplicate message delivery ? Should we delegate the deduplication at the client end so that devices will keep track ? - Should all messages be delivered by a single node ? What happens if that crashes ? How to store the delivery status ? - How should the failed delivery messages be retried ?
Nice video .have two question . 1 when is the messages getting deleted from Cassandra ? 2 when socket handler 1 request socket handler 2 to a send message Does socket handler 1 sends message to socket handler 2 or socket handler 2 pulls from Cassandra
Hi Karl, Nice explanation. Can you please clarify how the communication between the WS Handlers happen? What protocol does WS Handler1 use to talk to WS Handler 2? Thanks!
I think a distributed messaging queue like Kafka might help there, as it will ensure at-least-once delivery and will also keep the websocket handlers from getting bombarded.
Thanks for explaining it such a nice manner. I have a question though - In this design, all web socket handlers talk to each other. Are these connections also web socket connections? When there are tens of thousands of servers, isn't this a problem? Is there any other alternative?
great video and really good details. Thank you. couple questions, 1- what are some over the counter 3rd party solutions or standards that can be used as a web socket handler and a web socket manager? I could not find any online. Maybe zookeeper for web socket manager. So every enterprise needs to build its own solution customer web socket handler or manager? 2- What is the protocol that can be used in between web socket handlers and between web socket handlers and web socket managers?
Hey you make such great designs and explanations, A+. But it would be so much better if you fixed the sound quality! Your videos would then be perfect.
Thanks for efforts. Your content is such a gem that it made a dumb(like me) good at these concepts. Thank you so much. Also, I do have question, at 18.50, you talked about web-socket handler would talk to asset service if there is any image/video upload. I was thinking if it is good that web-socket handler talk to message service in case of image/video upload as well. And then, message service can decide that I am getting a static media in message so I should call asset service first and tell asset service to do all the transformation with image/video and send the CDN link to message service. This way we can also store the link in cassandra as well. What do you think ? Does this make any sense(I am still a beginner and it is just a thought so pls bear me up)
Hi @codeKarle and @Sandeep, Thanks for the video! I did not get how one websockethandler will talk to another websockethandler on which user u2 is having an open connection .. Can you please share some thoughts on this?
You explained well what is going on in the background, thank you so much. But I could not understand what happens with offline group members. We can check the message service to fetch all undelivered messages the first time the user connects to wsh-x. But we do not keep any status indicator for group messages. So if a user who is a member of a group is not connected to a wsh, they miss all the messages that were sent before.
I have a query in image upload part, how will the msg reach asset service? mobile -> LB1 -> web socket handler -> LB2 -> asset service? If yes why so, why can we directly have something that will store blob data for msg and return msg_link? i.e., mobile -> LB1 -> asset service?
Very good content, but your presentation could be better. Instead of keeping the same tone, you should emphasize, pause, and highlight what is important to engage the audience. All the best.
First Kudos to quench my thirst for a proper system design interview. However - it would be much appreciated - if you can draw the diagram and discuss it simultaneously - instead of dumping the viewer with the FINAL DIAGRAM. Due to this issue - your videos are for informational purposes. Nobody can learn to draw those fancy diagrams because only you know how you came up with them.
Great Video! Thank you! Keep making videos like this one! I have one question though - which kind of strategy will the load balancer will use to maintain the web sockets? Is it IP based caching?
There are few improvements that can be added like which protocol to be used for message transfer like MQTT or XMPP and there pros and cons on device specification etc.
You explained WSM stores 2 sets of info: i) Which user is connected to which WSH ii) Which WSH is connected to which users . What is the purpose for storing the second piece of info here?
when user connects to load balancer, then how can load balancer forwards TCP request to websocket because it has to be between direct client and server?Can there be loadbalancer in between?
Cassandra has performance issues with the updates of done on regular basis because of its append only architecture, so Cassandra might not be a good choice , maybe we can use redis.
For the asset operation, where will we store the receipt for the asset: in the message service against some asset id? Also, what will happen if u1 gets to know machine1 for u2 from the websocket manager, but right then u2's connection resets and it connects to machine2?
Well prepared and gracefully delivered. It would be glad if you can back to this channel and post video whenever you get some time or may be short videos. Thanks BTW.
Very good content. I just have one request. Can you explain what do you mean when you say "Query pattern we have aligns with query pattern Cassandra is good at" What query pattern is Cassandra good at ?
That's a good question! I have tried to answer that in detail in this video: ruclips.net/video/cODCpXtPHbQ/видео.html But basically the idea is that Cassandra fits when you have a huge scale of queries but little variety in them. If most of your queries can include a common partition key in the where clause, then Cassandra works beautifully! All you need to make sure is that if you have, let's say, 10 varieties of queries, they all query on some partition key or other. Hope this helps! More details @ www.codekarle.com/system-design/Database-system-design.html
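As a rough illustration of why queries that include the partition key are cheap, here is a toy model in Python. It is not Cassandra code; the class `PartitionedTable`, the node count, and the CRC32-based placement are all assumptions chosen to keep the sketch small.

```python
import zlib
from collections import defaultdict

class PartitionedTable:
    """Toy model: rows are placed on a node chosen by hashing the partition key."""

    def __init__(self, num_nodes=4):
        self.nodes = [defaultdict(list) for _ in range(num_nodes)]

    def _node_for(self, user_id):
        return zlib.crc32(user_id.encode("utf-8")) % len(self.nodes)

    def insert(self, user_id, message):
        self.nodes[self._node_for(user_id)][user_id].append(message)

    def by_user(self, user_id):
        """Query WITH the partition key: one node is contacted."""
        node = self.nodes[self._node_for(user_id)]
        return list(node.get(user_id, [])), 1

    def containing(self, word):
        """Query WITHOUT the partition key: every node has to be scanned."""
        hits = [m for node in self.nodes
                for msgs in node.values() for m in msgs if word in m]
        return hits, len(self.nodes)

if __name__ == "__main__":
    table = PartitionedTable()
    table.insert("u1", "hello from u1")
    table.insert("u2", "hello from u2")
    print(table.by_user("u1"))        # (rows, nodes contacted)
    print(table.containing("hello"))  # touches all nodes
```

The second element of each result is the number of nodes touched, which is the cost that grows when a query cannot name its partition key.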
Great video! One question - For group messages why can't we give responsibility of fanning out to multiple users to WebSocket Handler? Instead of a separate Group service can we have this logic built in the WebSocket Handler?
Websocket handlers are lightweight servers whose responsibility is to send and receive messages from users. Its better to give fanout responsibility to a separate service
How will cassandra address queries like "Give me all unread messages for a user" ? If partition key is user_id for message table, then hot partitions will be a potential issue but if message_id is partition_key then all shards need to be queried for getting all the messages for the user and that will be costly.
Please explain why are we using kafka or any messaging queue for group messaging, can't the web socket handler directly call the group service and fetch which all users it needs to send the message to? I mean why do we need kafka here, isn't it same as one to one messaging ?
DOUBT : It seems like whatsapp stores all the messages in our mobile device's storage. When I turn off the internet, I can still view everything. So, I have 3 questions: 1. Is it reliable to store on mobile storage ? 2. How does it sync the messages from mobile storage and the actual DB ? 3. WhatsApp uses ~15 GB data on the phone. Is that a correct estimate for the data requirement for 1 user in an app like WhatsApp ?
Around 3:30 I think you are confusing ports and sockets. Theoretically a port can make billions of connections. And there are 65k ports. Please correct me if I'm wrong.
I have one question. When u1 is sending message m1 to u2, web socket handler1 connects to the message service to store the message m1. Would it be correct to shift this work from the web socket handler to the web socket manager, which then connects to the message service, so that web socket handler1's role is only to receive and send messages for a device?
why there is a loadbalancer in front of the websocket manager? is only first request is going through the load balancer? is loadbalancer is also communicating to websocket manager to make decision about where to forward the request?
agreed
7 offer you got because of diversity hiring benefit with low hiring bar..Not due to video
@@ayeshaarzbegi3178 hiring bar is low for diverse people. Every company is crazily hiring diverse people.
@@AbhishekKumar-vf3cu I think its time for you to get some diverse skills that'll help you navigate the interviews. Good luck!
This is by far the best explanation of WhatsApp messaging system design!
Good job. I watched similar videos on the same topic. This is actually the most informative for me. Thank you so much.
Thanks for the kind words!! We'll keep making more overtime.
Nonsense.
@@pankajsingh17787 No really. Any TCP connection has the 5 tuple and every accepted connection needs to have a dynamic port associated with it. There are 64K ports. So per IP address, you can have at max 64K connections.
@@arushkharbanda This could be the case if the memory is less on the system. Assuming abundant memory, number of connections are limited by number of ports supported by the TCP/IP stack. First 1024 ports are special ports and are associated with processes with euid 0. You don't want to run your websocket server as root due to obvious security concerns. So, at max you can have 63K connections per IP address per server assuming resources hold up.
since you have mentioned cassandra and redis databases. Sharing sample table structure and queries would surely help us in the visualization of the data and would give us some ideas about the underlying databases as well. But Great Effort in making the concepts very simple
By far the best system design videos you can find on YouTube. Amazing how you have so much clarity on these. I would love to have a video on tools like Smartsheet. Thanks in advance.
Brother, please keep making videos like this. It helps us a lot. Please don't stop. 🙏
Excellent analysis, In depth detail. This is what I was looking for.
This is goldmine for system design resources .. Thanks a lot for this
Great video. Some issues, like the 64k limit, are already pointed out in other comments. At the end you talked about decreasing the lag in Kafka by adding more consumers. I guess you must be talking about adding more consumers to a consumer group, but unfortunately that won't help once the count exceeds the number of partitions. Excess consumers would sit idle in that case.
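A rough sketch of why the extra consumers sit idle. This is a simplified round-robin stand-in for Kafka's partition assignors, not real client code, and the consumer names are made up. The only rule it models is that each partition has exactly one owner within a consumer group.

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# 4 partitions, 6 consumers in the same group (hypothetical names).
assignment = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, parts in assignment.items() if not parts]

print(assignment)
print(idle)
```

With 4 partitions, consumers 5 and 6 get nothing to read, so past that point the lag only goes down if the topic gets more partitions.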
Hi Sandeep, want to take a moment to express my gratitude to your video series on system design concepts and scenarios. It was immensely helpful to ace through the system design interview. The system design interview scenario was different but the thinking pattern that I got from your videos helped to have a great dialogue with the interviewer.
I liked it because even though its about a chat application you discussed about other avenues that will utilize this app and make money out of it like your touch on analytics, and also you go end to end like a finished product from beginning to end. its complete, and to the point. Good work.
You are a God man! Have seen all of your videos, it's been so so helpful for learning technical design for TPM roles! Thank you so much, brother!
This is a great video. It would be great if you could also talk about how message ordering is ensured in the case of FB Messenger. I doubt it can happen solely on the basis of timestamps (client/server), since depending on where users are and the time zones they are in, they might have different timestamp values. This is typically done through message IDs generated by the client. You did pick that up, but the point that they can be used to ensure a consistent ordering of messages was missed. Please do include that as well if you can.
I would rather do it as a combination of an auto-increment numeric message ID from the user who is sending and a timestamp in UTC for a conversation/group. The message ID was more of a UUID/primary key that I had talked about in the video. It's not efficient to use it for ordering, I believe.
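A minimal sketch of that combination, assuming each message carries a per-sender counter and a UTC epoch timestamp. The field names `seq` and `sent_at_ms` are hypothetical and not taken from the video.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: str
    seq: int         # per-sender auto-incrementing counter (hypothetical field)
    sent_at_ms: int  # UTC epoch milliseconds, so time zones do not matter
    text: str

def conversation_order(messages):
    # Order by UTC time first; the per-sender counter breaks ties between
    # messages from the same sender that share a timestamp.
    return sorted(messages, key=lambda m: (m.sent_at_ms, m.sender, m.seq))

msgs = [
    Message("u1", 2, 1000, "how are you?"),
    Message("u2", 1, 1005, "hi!"),
    Message("u1", 1, 1000, "hello"),
]
ordered = [m.text for m in conversation_order(msgs)]
print(ordered)
```

A random UUID could not break the tie between the two u1 messages, which is why a numeric counter is used here.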
This is my 3rd comment on your videos. I'm really liking your videos. Thank you so much! Here are my few questions. Please don't mind if you find these questions dumb. I'm in the initial phase of learning. Questions - 1. Who decides which user will connect to which web handler? 2. You mentioned that in Redis we'll store 2 things - a) UserId-WebSocketHandler mapping and b) WebSocketHandler-List mapping. Why is b required? 3. What if the WebSocket Handler machine goes down? What will be the behavior in that case? Do we need to remove the mapping in Redis, and who will do this? 4. Are there chances of losing the message? Let's say the WebSocket machine went down before the message was delivered and before the message was put in Cassandra. Will there be retry logic from the client? What are your thoughts on using a message queue here? 5. Is the WebSocket Manager a microservice? Is this a single point of failure? What if this goes down?
Hope this answers your questions.
1. Load balancer will connect the user to any one machine. It could maintain the connection then. In an advanced implementation web socket manager could act as a traffic distributor.
2. b is not required for any functional flow, but might be used for debugging or for any analysis incase there is some issue with some users or some machines.
3. Users will initiate the connection again. Apps will have this heartbeat logic.
4. Client would retry if there is no ack from the server. A message queue would also have this data loss issue if it's not handled via ack. A message queue here is not much of an added advantage, but it does add a lot of latency and impacts the NFR of low latency.
5. It's a microservice, but that does not mean it's a single point of failure. There should be multiple instances in multiple data centres to handle any outage. That's why it's not storing data in memory.
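To make answer 4 a bit more concrete, here is a small sketch of client retry on a missing ack. `FlakyServer` is a made-up stand-in for a WebSocket handler, and deduplicating by message ID on the server is an assumption of this sketch, not something stated above.

```python
class FlakyServer:
    """Stand-in for a WebSocket handler that drops the first `drops` sends."""
    def __init__(self, drops):
        self.drops = drops
        self.stored = {}  # message_id -> text, so retries do not create duplicates

    def receive(self, message_id, text):
        if self.drops > 0:
            self.drops -= 1
            return False  # no ack: the message (or its ack) was lost
        self.stored.setdefault(message_id, text)
        return True       # ack

def send_with_retry(server, message_id, text, max_attempts=5):
    """Resend the same message ID until the server acks; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if server.receive(message_id, text):
            return attempt
    raise TimeoutError(f"no ack for {message_id} after {max_attempts} attempts")

server = FlakyServer(drops=2)
attempts = send_with_retry(server, "m1", "hello")
send_with_retry(server, "m1", "hello")  # a duplicate retry is harmless
print(attempts, server.stored)
```

A real client would wait (with some backoff) between attempts instead of retrying in a tight loop.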
thank you for the detailed explanation, so much better than any whatsapp design video i saw. The summary link helps a lot to revise on the design! Great work!
Awesome explanation. Thanks so much for making such an insightful video
Really sad why you didn't receive same number of likes as views:) thank you for teaching such good content for free. I wish I could press the like button 100 times. Great efforts!!
Amazing content. Really this what i was looking for after watching many video from other creators.
Thank you so much for your dedication and the great content very helpful.
In your blog summary, I think there's a slight correction to be made to the diagram. In 1:1 messaging, the "Websocket Manager" is not communicating with the "Message Service", which I think it should, given that the "Message Service" is the one handling storing the messages to Cassandra.
Your videos are very thorough, with detailed explanations. Thanks for making and sharing them.
The best content I have come across from any of the content producers. All the components are explained with the necessary technical details which interviewers drill down to and ask questions about. Thank you! Just one comment: investment in a high-quality microphone would elevate your videos, which are already unparalleled, to a whole new level. Please keep making them.
Thank you so much for this wonderful video. I have a question. Isn't the web-socket manager a single point of failure? If it briefly goes down, how will it know the scope of the existing web-socket connections on the handlers?
Very well explained. Thank you for the video.
Great content. Thanks for the hard work ..
Could you please help answer a few questions
1) In the non functional requirements, you don't mention consistency. Shouldn't we talk about what consistency means for this system, and if it should be maintained
2) Why use MySql for storing User and Group Data? Why not noSql, it's easy to scale as compared to sql dbs as there could be millions of users
3) How is the message order taken care of in case of failures?
Ex:
timestamp1: U1 sends U2 message M1
timestamp2: U1 sends U2 message M2
Delivery of M1 fails during first attempt, but succeeds on successive retries
M2 succeeds on first try ..
as a result, M2 is sent before M1
I also have the same query, did you get answer for this?
Consistency should be maintained at a reasonable level. We cannot make an inconsistent system here.
The reason it's not too big a problem here is because there are these two possibilities:
1. In case the users are live and engaging in a continuous chat, or just online, the message flows through the WebSocket Managers/Handlers and we do not bother about what happens on the DB front.
2. The second scenario is when one of the users is offline. This is the scenario where consistency matters. But it's not that it takes minutes for the DB to get consistent. It'll get consistent within a second(maybe a few seconds on a worst case of extreme traffic), and since the user is offline, we would have the message replicated by the time they come online.
One edge case is that there is race condition between them coming online and the message being sent. In order to take care of that, I suggested to fetch messages for the user after a few seconds of them coming online to handle this scenario as well.
MySQL would be able to scale to millions of rows, that's not a concern. But more importantly, the data for these services is structured, which makes a good case for MySQL. We could use a NoSQL DB though, just that I don't see any amazing benefit of using NoSQL here. I would suggest looking at this video for this: ruclips.net/video/cODCpXtPHbQ/видео.html
For the retry in case of message delivery, I think it is okay if the messages are shown out of order, as long as we show them that way on the UI to both parties. Alternatively, we can build sender-side message ordering using some incrementing IDs to handle this scenario.
Can someone answer my doubts?
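One possible shape for the sender-side ordering with incrementing IDs, applied to the M1/M2 example from the question. The class and method names are hypothetical. The idea is that the receiving side holds back a message until every lower sequence number from that sender has arrived.

```python
class InOrderDelivery:
    """Releases one sender's messages strictly by sequence number."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}  # seq -> text, held until the gap before it is filled

    def on_receive(self, seq, text):
        if seq < self.next_seq:
            return []  # already delivered; a late duplicate from a retry
        self.pending[seq] = text
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

receiver = InOrderDelivery()
first = receiver.on_receive(2, "M2")   # M1 is still being retried, so hold M2
second = receiver.on_receive(1, "M1")  # gap filled: release both in order
print(first, second)
```

The trade-off is that M2 is shown a little later than it arrived, so a real client would probably cap how long it waits for the missing message.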
1) How would the packet flow be if user1 sends a message to user 2?
2) Will all web socket handlers be connected in a mesh?
3) Will web socket handler 1 push the message to message service and web socket handler 2 take it from the message service?
4) what protocol would be maintained between web socket handler and web socket manager? Is REST enough?
5) The cache maintained by web socket handlers. Are they in-memory cache or are they distributed?
I really like your videos, very crisp and clear, Detail oriented.
Pretty good, Sandeep. I feel you cover a lot in very little time. I was wondering what your thought process was in having group messaging as separate. You can also think of direct 1-1 messaging as a group of 2, no? Also, in the case of 1-1 messaging, when you are sending a message directly from websocket_handler_A => websocket_handler_B, wouldn't it lead to a system crash during rush hour? In short, my thought is to handle 1-1 messaging the same as group messaging; in this case the async queue can buffer the messages and send them with a slightly longer delay, instead of a web_socket server crash. Let us know.
Anyways it was fun video dude. Keep posting.
I was also thinking in that direction due to the decoupling provided inherently by Kafka but I think the problem with that is creating a topic for each user in Kafka can lead to performance issues since there are billions of users, but since there are lesser number of groups (maybe millions) this approach works for group messaging.
Thanks @Sandeep @codeKarle for the great explanation. Myself Rishi, currently working as a Senior Software Engineer having completed my Bachelors in Computer Science from BITS. As is the case with most Design discussions I do have a few questions:
1) For the Group messages how do we handle the sequence of messages from a particular user. I understand that messages from multiple users can/need not be guaranteed a specific sequence, but all messages in a Group chat need to be in a proper sequence to avoid context misses in the conversation. Do we use a messageId or ordering of some sort for this? Or maybe use the ID Cassandra generated by the initial entry of the message into Cassandra ?
2) I have some concern with you preferring Cassandra over Redis for the last seen service. Redis being distributed and highly scalable, along with proper sharding can be appropriate for this use case IMO. We used it extensively for our gaming application for both read/write serving millions of I/O requests per minute.
By using a consumer group and partitions in a topic in Kafka, a sequence can be maintained.
@@abisheksoni3354 can you explain this a bit
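A rough sketch of the idea, assuming the group ID is used as the message key. Kafka keeps order within a partition, and keyed messages with the same key land on the same partition, so one group's messages stay in the order they were produced. Kafka's default partitioner hashes the key with murmur2. The `crc32` below is only a stand-in, and the group IDs are made up.

```python
import zlib

def partition_for(key, num_partitions):
    # Stand-in hash: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}  # each list is an append-only log

events = [("group-42", "m1"), ("group-7", "a"), ("group-42", "m2"), ("group-42", "m3")]
for group_id, msg in events:
    partitions[partition_for(group_id, NUM_PARTITIONS)].append((group_id, msg))

# All of group-42's messages sit in one partition, in the order they were sent.
p = partition_for("group-42", NUM_PARTITIONS)
group_42 = [msg for group_id, msg in partitions[p] if group_id == "group-42"]
print(p, group_42)
```

Within a consumer group, a partition is read by a single consumer, so that consumer sees a given group's messages in sequence.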
Thanks a lot buddy for your detailed videos. These videos have helped me immensely. Sincere gratitude buddy🙏
Thank you for the amazing video! I have a question, what should to do if the group message fails to broadcast to some users?
A server can handle much more than 65k connections depending on its cpu and memory. Please do correct this in your video.
Liked the video. How are Web Socket Handers communicating with each another? Also, how is websocket handler communicating with the Websocket manager?
Great video.
What happens when a web socket handler fails. How does a web socket handler communicate with web socket manager.
Great. I agree with Ayesha that Sandeep's videos seem the most complete. For example, he goes into a race condition example and the websocket handlers, which I don't see in other videos.
Really like the way of explanation, covered most of usecases
Thanks for the kind words, Sushant. Do share the channel details with your friends. It helps :)
covered lot of scenarios, good video
Great videos, thank you. Subtitles will be much appreciated !
Thanks for this video. Very crisp and informative. I have one doubt: a host supports 65k ports, not connections. The number of connections does not depend on the number of ports. A single port can simply handle more than one connection. Please clarify.
This is great. Request you to make videos on the individual components such as load balancers, databases, web sockets, caching etc. That would help many of us know the details of these things and designing systems will be much easier then
Please make one on Dropbox. Good job! One question: why not use a MySQL cluster? It is multi-master and any node can accept writes.
Thanks buddy!
We could use a MySQL as well here, and it'll work well to a great extent, but the parameters that you can consider to decide on a DB are many more other than just the scale.
For example, in this case, we might always query just on one field(let's say user_id) to fetch the chats of a person. Here we don't need complicated where clauses, we can just live with a DB that provides all values(chat details) against a key(user_id). If you look at it from this angle, there are a lot of DBs that are more optimal than a MySQL for this kind of a query pattern.
I would recommend to have a look at this video: ruclips.net/video/cODCpXtPHbQ/видео.html.
Here, I am covering exactly this question of which DB should be used when, and hopefully it'll answer your question :)
This is great stuff. I have also bought your Udemy course. A couple of questions: 1. Shouldn't we offload some of the things the websocket handler does? 2. How is message order achieved? 3. Why isn't Kafka used for passing messages between services, as everything happening in the system is essentially async in nature? Thanks
Great video, just a question how WSH 1 will communicate with WSH2
Big fan of you sandeep. One quick q - if the target is offline why store the messages locally on sender. Why not store in cassandra and have it delivered when the user is online again
There are various nuances of the group messaging service.
- What should be the workflow for delivering messages to a group, Assuming the group has a huge membership. do we want to delay the delivery ? Does the Group service cache the information for user to machine mapping so as not to overwhelm the redis instance ?
- How to handle duplicate message delivery ? Should we delegate the deduplication at the client end so that devices will keep track ?
- Should all messages be delivered by a single node ? What happens if that crashes ? How to store the delivery status ?
- How should the failed delivery messages be retried ?
Amazing content as usual. Got to learn a lot from this
Nice video. I have two questions.
1. When are the messages deleted from Cassandra?
2. When socket handler 1 requests socket handler 2 to send a message,
does socket handler 1 send the message to socket handler 2, or does socket handler 2 pull it from Cassandra?
It would be great if you could explain the database design for Cassandra etc., like what to choose as the key (partition and clustering) etc.
Just wondering, what is the difference between 'Low latency' and 'No lag'?
Excellent Video! the best video so far! Thank you!
Hi Karl, Nice explanation. Can you please clarify how the communication between the WS Handlers happen? What protocol does WS Handler1 use to talk to WS Handler 2? Thanks!
I think a distributed messaging queue like Kafka might help there, as it will ensure at-least-once delivery and will also keep the websocket handlers from getting bombarded.
Thanks for explaining it such a nice manner.
I have a question though - In this design, all web socket handlers talk to each other. Are these connections also web socket connections?
When there are tens of thousands of servers, isn't this a problem? Is there any other alternative?
Great video and really good details. Thank you. A couple of questions: 1- What are some off-the-shelf 3rd party solutions or standards that can be used as a web socket handler and a web socket manager? I could not find any online. Maybe ZooKeeper for the web socket manager. So does every enterprise need to build its own custom web socket handler or manager? 2- What is the protocol that can be used between web socket handlers, and between web socket handlers and web socket managers?
Hey you make such great designs and explanations, A+. But it would be so much better if you fixed the sound quality! Your videos would then be perfect.
Thanks for efforts. Your content is such a gem that it made a dumb(like me) good at these concepts. Thank you so much.
Also, I do have a question: at 18:50, you said the web-socket handler would talk to the asset service if there is any image/video upload. I was thinking it might be good for the web-socket handler to talk to the message service in the case of image/video upload as well. Then the message service can see that it is getting static media in the message, call the asset service first, tell it to do all the transformation of the image/video, and have it send the CDN link back to the message service. This way we can also store the link in Cassandra. What do you think? Does this make any sense? (I am still a beginner and it is just a thought, so please bear with me.)
Hi @codeKarle and @Sandeep, Thanks for the video! I did not get how one websockethandler will talk to another websockethandler on which user u2 is having an open connection .. Can you please share some thoughts on this?
@sandeep could you please explain why we need Kafka to send group messages. Wouldn't it be easier to send message directly to group messages handler?
You explained well what is going on in the background, thank you so much. But I could not understand what happens with offline group members. We can check the message service to fetch all undelivered messages the first time a user connects to wsh-x. But we do not keep any status indicator for group messages. So if a user who is a member of a group is not connected to a wsh, they miss all the messages that were sent before.
If the lag increases in Kafka, we need to add more brokers (not consumers), right?
Awesome and detailed explanation... if you could make a system design for Locker service like Amazon provides would be great.
Love your content, really simple and easy.
I have a query in image upload part, how will the msg reach asset service? mobile -> LB1 -> web socket handler -> LB2 -> asset service? If yes why so, why can we directly have something that will store blob data for msg and return msg_link? i.e., mobile -> LB1 -> asset service?
Great video!! Can you create a series on LLD? There are hardly any resources available online.
Very good content, but your presentation could be better. Instead of keeping the same tone, you should emphasize / pause / highlight what is important to engage the audience. All the best.
First Kudos to quench my thirst for a proper system design interview. However - it would be much appreciated - if you can draw the diagram and discuss it simultaneously - instead of dumping the viewer with the FINAL DIAGRAM. Due to this issue - your videos are for informational purposes. Nobody can learn to draw those fancy diagrams because only you know how you came up with them.
Awesome tutorials. Your videos are really informative. Thank you so much!
Great Video! Thank you! Keep making videos like this one!
I have one question though - which kind of strategy will the load balancer use to maintain the web sockets? Is it IP based caching?
how did you come up with this design? what are the pro/con which led you to this
There are a few improvements that can be added, like which protocol to use for message transfer, such as MQTT or XMPP, and their pros and cons on device specification etc.
yeah definitely, just that it was getting too lengthy. we'll add those aspects in some video :)
Thank you so much for all the efforts. These are really great content.
For identifying duplicate image uploads can we use a bloom filter?
You explained WSM stores 2 sets of info: i) Which user is connected to which WSH ii) Which WSH is connected to which users . What is the purpose for storing the second piece of info here?
When a user connects to the load balancer, how can the load balancer forward the TCP request to the websocket handler, given that the connection has to be directly between client and server? Can there be a load balancer in between?
Does a web socket handler has a capability to talk to every other web socket handler? Even if they are like spread geographically?
For the group service, will there be a specific topic in Kafka for that group conversation?
One question - how are web socket handlers talking to each other? via rest, grpc, ws between themselves? why can't they talk via a queue?
Cassandra has performance issues with updates if they are done on a regular basis, because of its append-only architecture, so Cassandra might not be a good choice; maybe we can use Redis.
For the asset operation, where will we store the receipt for the asset in the message service, against some asset ID? Also, what will happen if u1 gets to know machine1 for u2 from the websocket manager, but right then u2's connection resets and it connects to machine2?
Well prepared and gracefully delivered. It would be great if you could come back to this channel and post videos whenever you get some time, or maybe short videos. Thanks BTW.
Amazing Content!
Very good content. I just have one request. Can you explain what do you mean when you say "Query pattern we have aligns with query pattern Cassandra is good at" What query pattern is Cassandra good at ?
That's a good question!
I have tried to answer that in detail in this video: ruclips.net/video/cODCpXtPHbQ/видео.html
But basically the idea is that Cassandra fits when you have a huge scale of queries but little variety in them. If most of your queries can include a common partition key in the where clause, then Cassandra works beautifully!
All you need to make sure is that if you have, let's say, 10 varieties of queries, they all query on some partition key or the other.
Hope this helps!
More details @ www.codekarle.com/system-design/Database-system-design.html
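A toy in-memory model of that query pattern, just to make "query on the partition key" concrete. The layout here (partition key `user_id`, rows sorted by a `sent_at` clustering key) is an assumption for illustration and not the schema from the video. Real Cassandra tables are defined in CQL.

```python
from collections import defaultdict
import bisect

class MessagesByUser:
    """Toy model: partition key = user_id, clustering key = sent_at (sorted within a partition)."""
    def __init__(self):
        self.partitions = defaultdict(list)  # user_id -> sorted [(sent_at, text)]

    def insert(self, user_id, sent_at, text):
        bisect.insort(self.partitions[user_id], (sent_at, text))

    def messages_since(self, user_id, since):
        # Only one partition is touched, and rows inside it are already sorted.
        rows = self.partitions[user_id]
        start = bisect.bisect_left(rows, (since, ""))
        return [text for _, text in rows[start:]]

table = MessagesByUser()
table.insert("u1", 30, "c")
table.insert("u1", 10, "a")
table.insert("u2", 20, "x")
table.insert("u1", 20, "b")

recent = table.messages_since("u1", 20)
print(recent)
```

A query that cannot name a `user_id` would have to scan every partition in this model, which is the kind of query Cassandra is not good at.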
Its elaborative !! Nice
Glad that it was helpful :)
Great video! One question - For group messages why can't we give responsibility of fanning out to multiple users to WebSocket Handler? Instead of a separate Group service can we have this logic built in the WebSocket Handler?
Websocket handlers are lightweight servers whose responsibility is to send and receive messages from users. Its better to give fanout responsibility to a separate service
U a System Design God
How will cassandra address queries like "Give me all unread messages for a user" ? If partition key is user_id for message table, then hot partitions will be a potential issue but if message_id is partition_key then all shards need to be queried for getting all the messages for the user and that will be costly.
Why will user_id as the sharding key create hot shards? Will a particular user be sending or receiving millions of messages in a day?
Please explain why are we using kafka or any messaging queue for group messaging, can't the web socket handler directly call the group service and fetch which all users it needs to send the message to? I mean why do we need kafka here, isn't it same as one to one messaging ?
DOUBT : It seems like whatsapp stores all the messages in our mobile device's storage.
When I turn off the internet, I can still view everything. So, I have 3 questions:
1. Is it reliable to store on mobile storage ?
2. How does it sync the messages from mobile storage and the actual DB ?
3. WhatsApp uses ~15 GB data on the phone. Is that a correct estimate for the data requirement for 1 user in an app like WhatsApp ?
Around 3:30 I think you are confused between ports and sockets. Theoretically a port can handle billions of connections. And there are 65k ports. Please correct me if I'm wrong.
I have one question. When u1 is sending message m1 to u2, web socket handler1 connects with the message service to store message m1. Would it be correct to shift this work from the web socket handler to the web socket manager, which then connects with the message service, so that web socket handler1's role will only be to receive and send messages for a device?
Thanks for the video!
Thank you for creating this amazing content.
Why is there a load balancer in front of the websocket manager? Does only the first request go through the load balancer? Does the load balancer also communicate with the websocket manager to decide where to forward the request?
Is WhatsApp generating a unique ID for users, or does it use the phone number as the unique ID?
I only see one Kafka queue. Don't you need multiple Kafka queues for multiple groups and/or p2p chats?