A principal engineer from DoorDash emailed me with a couple corrections:
"""
1. cascading failure: the point here is not that an error or latency experience in an RPC call graph is experienced by all upstream callers. Rather it's that one thing being broken or slow ends up breaking a totally other thing, usually in a different way. As an example, let's say a database gets slow and it's used in a call graph by multiple services. So like A->B->db, C->B->db. By cascading failure we mean that because db is slow, A or B develop a NEW kind of failure, like they run out of memory or threads or something like that.
2. death spiral: scaling out is often a part of it, but the real issue is how to handle an overload condition. If there's no central coordinator, then each node needs to make this determination around overload individually.
"""
I will try to make a follow-up video covering these.
Written version: neetcode.io/newsletter
System Design Playlist:
ruclips.net/video/lFomAYu_Ug0/видео.html
i'm starting to like this channel's content more than the main one
Same here
Didn't realise he had 2 channels lol
There’s a main one?
Hard agree. This content is much more practical
@@chaos9790 linus tech tips
I love this. It’s not just reading and reacting. It’s original content
Primeagen and Theo in shambles
Primeage is good at bringing his own thoughts and knowledge
People who read are curators and add their own thoughts. You probably wouldn't read all the articles yourself. And now you get 3 layers of input, yourself, RUclips reader, and the article.
Fun fact: load shedding is also a term used in the context of the electrical power grid. When power generation < demand, electric companies begin implementing rolling blackouts to shed load and prevent equipment failure.
Not only failure of equipment, but also loss of stability and total system shutdown.
This and the circuit-breaker metaphor made me think some guys at Doordash came from the power engineering field.
I actually helped implement a load shed feature a few years back for a Siemens EMS. Another reason they use it is to protect vulnerable houses. Say someone has critical medical equipment in their house; the power company will cut them off last.
Found the South African
I love these kinds of system design videos, would love to see more coming!
monoliths can scale (performance wise) too.
microservices are more about scaling deployments and teams.
Do not build distributed monoliths. Avoid temporal coupling. And if you can, avoid microservices unless you cannot run from them anymore.
Did they really try to implement microservices without circuit breakers? Lmaooooo😂😂😂😂
Nor bulkheads 😂😂
regarding the death spiral, wouldn't it also be wise to have some autoscaling rule like: if X many deaths within Y minutes, scale up by Z nodes?
i feel like this would mitigate the issue
also tbh, i feel like if they're using gRPC for every service anyway, they had it coming: use asynchrony where possible, e.g., MQs, rather than just jamming requests down a microservice's throat when it doesn't have the capacity to handle them
What's the point then?
Interesting that these are exactly the problems faced in networking. The retry storm issues / prevention techniques are exactly the same as the problems / solutions that one faces with retransmission logic in TCP
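The TCP parallel holds: the standard defense against retry storms in both worlds is exponential backoff with jitter. A minimal sketch of the "full jitter" variant (parameter values here are illustrative, not taken from the video or the blog post):

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5):
    """Exponential backoff with full jitter: before retry N, wait a
    random duration in [0, min(cap, base * 2**N)]. The randomness
    spreads retries out so thousands of clients don't all hammer a
    recovering service at the same instant (a retry storm)."""
    delays = []
    for attempt in range(attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays
```

Without the jitter, every client that failed at the same moment retries at the same moment too, which is exactly the synchronized-retransmission problem TCP had to solve.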
TBH for years the world was designing micro-services incorrectly and it was an uphill battle to correctly implement them. And then, over the crest, now there are attacks on micro-services. 😅
Most of the large ecosystems of micro services I've gotten to dive into have ironically been way more expensive and way slower than just building regular APIs with reasonable service boundaries. Then in order to make the microservice driven website appear performant, they spend a ton of money on monolithic caching layers between the client and the microservice APIs.
I love this type for content, u get to learn soo much concepts and overview of systems and understand whats going on. keep this up
This is exactly why message queuing became popular in the 2010s: it enables the decoupling of these services so as to avoid bottlenecks and timeouts. This isn't always possible (read: appropriate for the use case), but it's important for software architects to keep in mind as a potential solution.
This is a great presentation. Love this format. Keep it up!
network latency isn't that bad in my experience. The main latency is the serialisation/deserialisation overhead.
They should have gone with erlang
Perhaps doordash could have benefited from architecting their databases utilizing database sharding design patterns to minimize down time and to improve availability? Or minimally reduce the severity of the impact by isolating the database maintenance issue to a smaller area perhaps
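The sharding suggestion above amounts to routing each key to one of N partitions so maintenance on a single shard only affects a fraction of traffic. A hypothetical hash-based routing sketch (this is the commenter's idea, not how DoorDash actually partitions its data):

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(key: str) -> int:
    """Deterministically map a key (e.g. a user or order ID) to a shard.
    A stable hash means the same key always lands on the same shard, so
    taking one shard down for maintenance isolates the blast radius to
    roughly 1/NUM_SHARDS of users."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

assert 0 <= shard_for("user:12345") < NUM_SHARDS
assert shard_for("user:12345") == shard_for("user:12345")  # deterministic
```

The tradeoff is that simple modulo hashing reshuffles most keys when NUM_SHARDS changes, which is why real systems often reach for consistent hashing instead.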
First mistake was making "microservices" that directly depend on one another and hence coupled(which is what microservices shouldn't be).
Makes sense they had that spike in 2020 when the pandemic hit and most people were staying home.
I'm currently learning microservices using Java so this video is really helpful
as far as the death spiral goes, why not just lower the traffic/usage threshold?
We use microservices at the company that I work at and we categorize our services. One category is the coordinator service; such a service is responsible for making requests to all the other microservices that are necessary for a certain task. No other microservice is allowed to make requests to other microservices. I think it kinda solves this problem?
Like a mediator right?
Your solution solves some problems but brings new ones.
Imagine you have a service A which makes a call to service B. The client accesses the service A directly.
But you have a coordinator so you have service C which makes calls to both A and B. Two requests instead of one and the coordinator becomes an additional point of failure in your system.
DoorDash's problem wasn't that microservices call other microservices, but that there was not enough protection against outages and no graceful degradation.
Now the coordinator becomes a single point of failure with added latency.
imo, it's better to implement other microservice patterns like circuit breakers, bulkheads, etc.
Introducing some events driven in some areas where eventual consistency is allowed is also not a bad idea.
Google 'choreography vs orchestration'; these are known tradeoffs.
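The circuit breaker pattern mentioned in this thread can be sketched in a few lines: after enough consecutive failures the circuit "opens" and calls fail fast instead of piling load onto a struggling downstream service. Thresholds and names below are illustrative, not anyone's production values:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures the circuit opens, and further calls fail fast for
    `reset_timeout` seconds, giving the downstream service room to
    recover before a half-open trial call is allowed through."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Failing fast at the caller is what breaks the cascade: the slow dependency stops consuming the caller's threads and memory, which is precisely the "new kind of failure" the principal engineer's correction describes.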
A lot of the strategies used to mitigate these orchestration issues seem pretty complex. What’s the best way to implement these strategies at scale? Does every microservice implement their own request IO strategy? Or is there a scalable solution external to the microservices?
Very Interesting! Make more like this.
Another great explanation !!! Thank you !!!
Thanks for great video!
Am I right that Failure 4 (metastable failures) is just a consequence of Failure 3 (or similar), where the system cannot recover to a stable state by itself? Trying to understand why it was separated into a dedicated failure type.
Btw, during this video I was waiting for any mention of backpressure ...
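Since backpressure came up: the simplest form is a bounded queue that rejects new work when full, instead of buffering without limit until memory runs out. A minimal sketch (the capacity and function names are made up for illustration):

```python
import queue

# Backpressure via a bounded queue: when the consumer can't keep up,
# put_nowait raises queue.Full instead of letting work pile up
# unboundedly, forcing the producer to back off or shed the request.
work = queue.Queue(maxsize=2)  # capacity is illustrative

def submit(item):
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        return False  # signal the caller to back off

assert submit("a")
assert submit("b")
assert not submit("c")  # queue full: caller must slow down or shed
```

An unbounded buffer just hides overload until it becomes a memory-exhaustion failure, which is how metastable states get entrenched.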
2:35 as a red-green colour blind person this is hard to read for me
Slight suggestion - both healthy and unhealthy service have same pattern of lines, perhaps change just one of them
Good suggestion and sorry for the issue, I should've known that. Will keep it in mind for future videos.
Hi, I work at DoorDash on this kind of thing. A lot has changed since we wrote that blog post. I'd be happy to talk to you about the way we are thinking about microservices now if that's something you'd be interested in.
Thanks, just emailed you!
@@NeetCodeIO Alright, let's make this happen.
More of this. Great video!
using this video as white noise for my sleep
I appreciate your channel and the information you're trying to get across, but what is your actual real world experience with any of these types of architectures?
I am really enjoying this content. we need more of this
Thank you for the content.
Did Uber not use event-based architecture? Meaning most service to service communication is a fire-and-forget event that doesn't require a response.
Are these videos part of a live stream or something? Where can I watch?
Good content man. Are you in the Bay Area? Let’s meetup. I’m here until the end of September.
12:56 🇿🇦!
Make more of these videos!
what books or articles or websites do you follow to learn system design or architechture.
I think he's blowing this out of proportion, this is a simple problem, that is expected and can be accounted for
What a drama queen
Excellent video!
That's actually pretty wild that they were using a Python monolith up til 2020. Sounds like a stressful codebase.
Fr, I don't mind writing in Python but a single monolith written in it for such a big company? Jfc that sounds awful lol
google has a monolithic code base, too
the point is using a monolith wouldnt inherently cause problems. moreso the engineers that interpret, such as yourself, and the ones that actually contribute to it do. everyone on the same page and it will be easy to navigate it without even having to delve into microservices
We have too many people who don't have enough production experience 🤦
If you have a bad monolith then you also have a bad microservice.
FastAPI??
I wonder whether some form of request queues (different queues for each priorities) would have helped. What's your take on this Navi?
Possibly, but some of the requests were meant to receive payloads in the response, whereas a message queue would just respond with an 'ok' that the request has been queued.
Also, if the receiving service is failing a message queue would still be retrying the requests. I guess it would persist them though. Just my thoughts, someone else can chime in.
@@NeetCodeIO The core microservices at the company I work at just publish and subscribe to our Kafka stream using event-carried state transfer. So microservice one doesn't need to know anything about microservice two. But this approach doesn't really solve a "distributed monolith" approach to microservices.
Message queues certainly help avoid some of those problems, but like all engineering decisions there are tradeoffs.
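The "different queues for each priority" idea from earlier in this thread can be sketched with a heap, so that high-value requests are served first when capacity is scarce. The request names and priority scheme below are hypothetical, not DoorDash's design:

```python
import heapq

class PriorityRequestQueue:
    """Priority-aware request queue: lower number = higher priority,
    so e.g. checkout requests would be drained before low-stakes
    background traffic during an overload."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps FIFO order within a priority

    def push(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = PriorityRequestQueue()
q.push(2, "refresh-menu")
q.push(0, "checkout")
q.push(1, "track-order")
assert q.pop() == "checkout"
assert q.pop() == "track-order"
```

Combined with load shedding, this is essentially the power-grid idea from earlier in the thread: when you must drop load, drop the least critical load first.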
i dont know what this is but if you are making it then i am gonna watch it
wat
suppose, maybe, you know nothing and are reporting as if you do... wtf?
Interesting Video!
Thanks
Interesting
so good
Oh!
Python has static typing
You don’t understand microservices. Each microservice has its own data; otherwise it’s a distributed monolith
At what point did I state otherwise? If service 1 calls service 2, which reads from a database, how does that imply service 1 doesn't have its own data store?
Every image I had is paraphrased from DoorDash's article. That's all this video is, a summary of their blog with my thoughts added.
Don’t appreciate the way you explain concepts that you are new to as if you master them.
Counter point: I don't appreciate non actionable comments. Anything specific you would like to comment on or are you just leaving this to feel superior?
@@NeetCodeIO love your content, it keeps me motivated to learn more.. keep posting neetcode.. ppl like this will always be there thinking they are the only ones who know everything
Lol everyone that led their microservices migration should lose their job.
What they encountered are in fact well-known and understood failure modes of microservices. So much so that many smart folks invented service meshes with dedicated sidecars to handle these concerns, like throttling, load shedding, exponential backoff + jitter during retries, etc.
Not sure if you've ever been part of a migration like this, but it's really hard. The people doing this work were doing the best they could in the time they were given. Service mesh makes the migration take longer and introduces a whole new set of debugging challenges.