Microservices Gone Wrong at DoorDash

  • Published: 18 Sep 2024

Comments • 90

  • @NeetCodeIO
    @NeetCodeIO  1 day ago +3

    A principal engineer from DoorDash emailed me with a couple corrections:
    """
    1. cascading failure: the point here is not that an error or latency experienced in an RPC call graph is experienced by all upstream callers. Rather it's that one thing being broken or slow ends up breaking a totally different thing, usually in a different way. As an example, let's say a database gets slow and it's used in a call graph by multiple services. So like A->B->db, C->B->db. By cascading failure we mean that because db is slow, A or B develop a NEW kind of failure, like they run out of memory or threads or something like that.
    2. death spiral: scaling out is often a part of it, but the real issue is how to handle an overload condition. If there's no central coordinator, then each node needs to make this determination around overload individually.
    """
    I will try to make a follow-up video covering these.
    Written version: neetcode.io/newsletter
    System Design Playlist:
    ruclips.net/video/lFomAYu_Ug0/видео.html
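
    A minimal Python sketch of point 2 above, assuming each node tracks only its own in-flight requests and rejects new work past a limit; the class name and threshold are illustrative, not DoorDash's actual code:

      import threading

      MAX_IN_FLIGHT = 100  # hypothetical per-node limit

      class NodeLocalOverloadGuard:
          """Each node decides about overload on its own, with no central coordinator."""

          def __init__(self, max_in_flight: int = MAX_IN_FLIGHT):
              self.max_in_flight = max_in_flight
              self.in_flight = 0
              self.lock = threading.Lock()

          def try_acquire(self) -> bool:
              # Reject immediately instead of queueing when this node is saturated.
              with self.lock:
                  if self.in_flight >= self.max_in_flight:
                      return False
                  self.in_flight += 1
                  return True

          def release(self) -> None:
              with self.lock:
                  self.in_flight -= 1

      guard = NodeLocalOverloadGuard()

      def handle_request(payload):
          if not guard.try_acquire():
              return {"status": 503, "body": "overloaded, try again later"}
          try:
              return {"status": 200, "body": f"processed {payload}"}
          finally:
              guard.release()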

  • @ismbks
    @ismbks 3 days ago +136

    I'm starting to like this channel's content more than the main one

  • @funkdefied1
    @funkdefied1 3 days ago +56

    I love this. It’s not just reading and reacting. It’s original content

    • @robertluong3024
      @robertluong3024 2 days ago +11

      Primeagen and Theo in shambles

    • @JegErN0rsk
      @JegErN0rsk 2 days ago +2

      Primeagen is good at bringing his own thoughts and knowledge

    • @imdeadserious6102
      @imdeadserious6102 2 days ago

      People who read are curators and add their own thoughts. You probably wouldn't read all the articles yourself. And now you get three layers of input: yourself, the YouTube reader, and the article.

  • @Blezerker
    @Blezerker 3 days ago +57

    Fun fact: load shedding is also a term used in the context of the electrical power grid. When power generation < demand, electric companies begin implementing rolling blackouts to shed load and prevent equipment failure.

    • @finemechanic
      @finemechanic 3 days ago +9

      Not only equipment failure, but also loss of stability and total system shutdown.
      This and the circuit-breaker metaphor made me think some guys at DoorDash came from the power engineering field.

    • @Rustyshackleford20
      @Rustyshackleford20 2 days ago +4

      I actually helped implement a load-shed feature a few years back for a Siemens EMS. Another reason they use it is to protect vulnerable houses: if someone has critical medical equipment in their house, the power company will cut them off last.

    • @blankcanvas3554
      @blankcanvas3554 2 days ago +7

      Found the South African
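
    Carrying the grid analogy back to services: a tiny Python sketch of priority-based load shedding, where the most critical requests keep being served and the rest are dropped once capacity is exceeded (the priorities and capacity below are made up for illustration):

      import heapq

      CAPACITY = 3  # hypothetical number of requests we can serve per tick

      def shed_load(requests, capacity=CAPACITY):
          """Keep the highest-priority requests and shed the rest.

          Each request is (priority, name); lower numbers are more critical,
          mirroring how a grid operator cuts off critical customers last.
          """
          kept = heapq.nsmallest(capacity, requests)     # most critical requests fit the capacity
          shed = [r for r in requests if r not in kept]  # everything else is shed
          return kept, shed

      if __name__ == "__main__":
          incoming = [(0, "checkout"), (2, "analytics"), (1, "order-status"), (3, "recommendations")]
          kept, shed = shed_load(incoming)
          print("serving:", kept)   # checkout, order-status, analytics
          print("shedding:", shed)  # recommendations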

  • @arniie5288
    @arniie5288 3 days ago +22

    I love these kinds of system design videos, would love to see more coming!

  • @hellowill
    @hellowill 2 days ago +13

    Monoliths can scale (performance-wise) too.
    Microservices are more about scaling deployments and teams.

  • @caspera3193
    @caspera3193 2 days ago +7

    Do not build distributed monoliths. Avoid temporal coupling. And avoid microservices unless you can no longer run from them.

  • @spartanghost_17
    @spartanghost_17 3 days ago +11

    Did they really try to implement microservices without circuit breakers? Lmaooooo😂😂😂😂

    • @someguyO2W
      @someguyO2W 2 days ago

      Nor bulkheads 😂😂
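
    For reference, a bare-bones circuit breaker in Python: after enough consecutive failures it fails fast for a cooldown period instead of sending yet more calls to a struggling dependency. The thresholds and names are hypothetical, not DoorDash's implementation:

      import time

      class CircuitBreaker:
          """Stop calling a failing dependency for a cooldown period."""

          def __init__(self, failure_threshold=5, reset_timeout=30.0):
              self.failure_threshold = failure_threshold
              self.reset_timeout = reset_timeout
              self.failures = 0
              self.opened_at = None  # None means the circuit is closed (calls allowed)

          def call(self, fn, *args, **kwargs):
              if self.opened_at is not None:
                  if time.monotonic() - self.opened_at < self.reset_timeout:
                      raise RuntimeError("circuit open: failing fast instead of calling downstream")
                  self.opened_at = None  # cooldown elapsed, allow a trial call
              try:
                  result = fn(*args, **kwargs)
              except Exception:
                  self.failures += 1
                  if self.failures >= self.failure_threshold:
                      self.opened_at = time.monotonic()  # trip the breaker
                  raise
              self.failures = 0  # a success closes the breaker again
              return result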

  • @nark4837
    @nark4837 3 days ago +11

    Regarding the death spiral, wouldn't it also be wise to have an autoscaling rule like: if X nodes die within Y minutes, scale up by Z nodes? I feel like this would mitigate the issue.
    Also, tbh, if they're using gRPC for every service anyway, they had it coming. Use asynchrony where possible, e.g. MQs, rather than just jamming requests down a microservice's throat when it doesn't have the capacity to handle them.

    • @someguyO2W
      @someguyO2W 2 days ago

      What's the point then?

  • @sidasdf
    @sidasdf 2 days ago +2

    Interesting that these are exactly the problems faced in networking. The retry storm issues / prevention techniques are exactly the same as the problems / solutions one faces with retransmission logic in TCP.
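
    A rough Python sketch of the standard countermeasure, retries with exponential backoff and full jitter, so clients don't all retry in lockstep and turn a blip into a retry storm (the parameters are illustrative):

      import random
      import time

      def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
          """Retry a flaky call with exponential backoff and full jitter.

          The randomized sleep spreads retries out so clients don't all
          hammer the struggling service at the same instant.
          """
          for attempt in range(max_attempts):
              try:
                  return fn()
              except Exception:
                  if attempt == max_attempts - 1:
                      raise
                  delay = min(max_delay, base_delay * (2 ** attempt))
                  time.sleep(random.uniform(0, delay))  # full jitter

      if __name__ == "__main__":
          calls = {"n": 0}

          def flaky():
              calls["n"] += 1
              if calls["n"] < 3:
                  raise ConnectionError("transient failure")
              return "ok"

          print(retry_with_backoff(flaky))  # succeeds on the third attempt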

  • @dontdoit6986
    @dontdoit6986 22 hours ago +1

    TBH, for years the world was designing microservices incorrectly and it was an uphill battle to implement them correctly. And then, over the crest, now there are attacks on microservices. 😅

  • @purdysanchez
    @purdysanchez 6 hours ago +1

    Most of the large ecosystems of microservices I've gotten to dive into have ironically been way more expensive and way slower than just building regular APIs with reasonable service boundaries. Then, in order to make the microservice-driven website appear performant, they spend a ton of money on monolithic caching layers between the client and the microservice APIs.

  • @hapaise2924
    @hapaise2924 3 days ago +4

    I love this type of content, you get to learn so many concepts, get an overview of systems, and understand what's going on. Keep this up.

  • @nthmost
    @nthmost 22 hours ago

    This is exactly why message queuing became popular in the 2010s: to enable the decoupling of these services so as to avoid bottlenecks and timeouts. It isn't always possible (read: appropriate for the use case), but it's important for software architects to keep in mind as a potential solution.
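
    A small Python sketch of that decoupling idea, using an in-process queue as a stand-in for a real message broker: the producer fires and forgets, and a slow consumer drains the queue at its own pace instead of forcing the producer to wait or time out (names and sizes are made up):

      import queue
      import threading
      import time

      # Bounded queue between producer and consumer, so backpressure is explicit.
      work_queue = queue.Queue(maxsize=1000)

      def producer():
          for i in range(5):
              work_queue.put({"order_id": i})  # fire-and-forget: no response awaited
              print(f"enqueued order {i}")

      def consumer():
          while True:
              msg = work_queue.get()
              time.sleep(0.2)  # simulate slow downstream processing
              print(f"processed order {msg['order_id']}")
              work_queue.task_done()

      threading.Thread(target=consumer, daemon=True).start()
      producer()
      work_queue.join()  # wait until everything enqueued has been processed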

  • @gaulcore
    @gaulcore 3 days ago +3

    This is a great presentation. Love this format. Keep it up!

  • @hellowill
    @hellowill 2 days ago +2

    Network latency isn't that bad in my experience. The main latency is the serialisation/deserialisation overhead.

  • @DaRealCodeBlack
    @DaRealCodeBlack 3 days ago +5

    They should have gone with Erlang

  • @jrdtechnologies
    @jrdtechnologies 3 days ago +2

    Perhaps DoorDash could have benefited from architecting their databases using sharding design patterns to minimize downtime and improve availability? Or at least reduced the severity of the impact by isolating the database maintenance issue to a smaller area.

  • @debkanchan
    @debkanchan 2 days ago +2

    The first mistake was making "microservices" that directly depend on one another and are hence coupled (which is exactly what microservices shouldn't be).

  • @jamiebrs1
    @jamiebrs1 3 days ago +2

    Makes sense they had that spike in 2020 when the pandemic hit and most people were staying home.

  • @theprimecoder4981
    @theprimecoder4981 2 days ago +1

    I'm currently learning microservices using Java, so this video is really helpful

  • @montramedia
    @montramedia 3 days ago +1

    As far as the death spiral goes, why not just lower the traffic/usage threshold?

  • @Dave.Wattz100
    @Dave.Wattz100 3 days ago +14

    We use microservices at the company I work at, and we categorize our services. One category is the coordinator service, which is responsible for making requests to all the other microservices necessary for a certain task. No other microservice is allowed to make requests to other microservices. I think that kinda solves this problem?

    • @BigBrotha3459
      @BigBrotha3459 3 days ago +6

      Like a mediator, right?

    • @BlTemplar
      @BlTemplar 2 days ago +7

      Your solution solves some problems but brings new ones.
      Imagine you have a service A which makes a call to service B, and the client accesses service A directly.
      With a coordinator, you instead have a service C which makes calls to both A and B: two requests instead of one, and the coordinator becomes an additional point of failure in your system.
      DoorDash's problem wasn't that microservices call other microservices, but that there wasn't enough protection against outages and no graceful degradation.

    • @varshard0
      @varshard0 2 days ago +4

      Now the coordinator becomes a single point of failure with added latency.
      IMO, it's better to implement other microservice patterns like circuit breakers, bulkheads, etc.
      Introducing event-driven communication in areas where eventual consistency is allowed is also not a bad idea.

    • @adambickford8720
      @adambickford8720 1 day ago

      Google 'choreography vs orchestration'; these are known tradeoffs.
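
    A quick Python sketch of the bulkhead pattern mentioned above: each downstream dependency gets its own bounded pool of permits, so a slow or failing dependency can only exhaust its own compartment rather than every thread in the service (the dependency names and limits are made up):

      import threading

      # One bounded set of permits per downstream dependency.
      BULKHEADS = {
          "payments": threading.BoundedSemaphore(10),
          "recommendations": threading.BoundedSemaphore(2),
      }

      def call_dependency(name, fn, *args, **kwargs):
          sem = BULKHEADS[name]
          if not sem.acquire(blocking=False):
              raise RuntimeError(f"{name} bulkhead full: rejecting instead of tying up more threads")
          try:
              return fn(*args, **kwargs)
          finally:
              sem.release()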

  • @funkdefied1
    @funkdefied1 3 days ago +1

    A lot of the strategies used to mitigate these orchestration issues seem pretty complex. What's the best way to implement these strategies at scale? Does every microservice implement its own request I/O strategy? Or is there a scalable solution external to the microservices?

  • @mehuljain1991
    @mehuljain1991 3 days ago +2

    Very Interesting! Make more like this.

  • @VidyaBhandary
    @VidyaBhandary 3 days ago +1

    Another great explanation !!! Thank you !!!

  • @mikhail-t
    @mikhail-t 3 days ago +1

    Thanks for the great video!
    Am I right that Failure 4 (metastable failures) is just a consequence of Failure 3 (or similar) where the system cannot recover to a stable state by itself? Trying to understand why it was separated into a dedicated failure type.
    Btw, during this video I was waiting for any mention of backpressure...

  • @adityaanuragi6916
    @adityaanuragi6916 3 days ago +9

    2:35 As a red-green colour blind person, this is hard to read for me.
    Slight suggestion: both the healthy and unhealthy services have the same pattern of lines, so perhaps change just one of them.

    • @NeetCodeIO
      @NeetCodeIO  3 days ago +9

      Good suggestion and sorry for the issue, I should've known that. Will keep it in mind for future videos.

  • @mjrmjr4
    @mjrmjr4 1 day ago +1

    Hi, I work at DoorDash on this kind of thing. A lot has changed since we wrote that blog post. I'd be happy to talk to you about the way we are thinking about microservices now if that's something you'd be interested in.

    • @NeetCodeIO
      @NeetCodeIO  1 day ago

      Thanks, just emailed you!

    • @mjrmjr4
      @mjrmjr4 1 day ago +1

      @@NeetCodeIO Alright, let's make this happen.

  • @JegErN0rsk
    @JegErN0rsk 2 days ago +1

    More of this. Great video!

  • @ika_666
    @ika_666 2 days ago

    using this video as white noise for my sleep

  • @johnw.8782
    @johnw.8782 2 days ago +1

    I appreciate your channel and the information you're trying to get across, but what is your actual real world experience with any of these types of architectures?

  • @mohamed44324
    @mohamed44324 2 days ago

    I am really enjoying this content. We need more of this.

  • @dmc_xenon2411
    @dmc_xenon2411 1 day ago

    Thank you for the content.

  • @dumpling_byte
    @dumpling_byte 1 day ago

    Did Uber not use event-based architecture? Meaning most service to service communication is a fire-and-forget event that doesn't require a response.

  • @FreeDomSy-nk9ue
    @FreeDomSy-nk9ue 2 days ago

    Are these videos part of a live stream or something? Where can I watch?

  • @stephenghool8888
    @stephenghool8888 1 day ago

    Good content man. Are you in the Bay Area? Let’s meetup. I’m here until the end of September.

  • @abpdev
    @abpdev 3 days ago +2

    12:56 🇿🇦!

  • @dafivers4127
    @dafivers4127 3 days ago

    Make more of these videos!

  • @NurHossainRidoy
    @NurHossainRidoy 2 days ago

    What books, articles, or websites do you follow to learn system design or architecture?

  • @riser9644
    @riser9644 18 hours ago

    I think he's blowing this out of proportion; this is a simple problem that is expected and can be accounted for.
    What a drama queen

  • @tuber694
    @tuber694 3 days ago

    Excellent video!

  • @Dom-zy1qy
    @Dom-zy1qy 3 days ago +8

    That's actually pretty wild that they were using a Python monolith up until 2020. Sounds like a stressful codebase.

    • @MaxJM711
      @MaxJM711 3 days ago +1

      Fr, I don't mind writing in Python but a single monolith written in it for such a big company? Jfc that sounds awful lol

    • @antdok9573
      @antdok9573 2 days ago +1

      Google has a monolithic code base, too

    • @antdok9573
      @antdok9573 2 days ago +2

      The point is that using a monolith wouldn't inherently cause problems; it's more the engineers who interpret it, such as yourself, and the ones who actually contribute to it. With everyone on the same page, it would be easy to navigate without even having to delve into microservices.

    • @someguyO2W
      @someguyO2W 2 days ago +2

      We have too many people who don't have enough production experience 🤦

    • @Huntertje13
      @Huntertje13 6 hours ago

      If you have a bad monolith, then you'll also have bad microservices.

  • @mensurkhalid7102
    @mensurkhalid7102 2 days ago

    FastAPI??

  • @jamesisaacson6414
    @jamesisaacson6414 3 days ago

    I wonder whether some form of request queues (different queues for each priority) would have helped. What's your take on this, Navi?

    • @NeetCodeIO
      @NeetCodeIO  3 days ago

      Possibly, but some of the requests were meant to receive payloads in the response, whereas a message queue would just respond with an 'ok' that the request has been queued.
      Also, if the receiving service is failing, a message queue would still be retrying the requests. I guess it would persist them though. Just my thoughts, someone else can chime in.

    • @punkboi97k
      @punkboi97k 3 days ago

      @@NeetCodeIO The core microservices at the company I work at just publish and subscribe to our Kafka stream using event-carried state transfer, so microservice 1 doesn't need to know anything about microservice 2. But this approach doesn't really solve a "distributed monolith" approach to microservices.

    • @mjrmjr4
      @mjrmjr4 10 hours ago

      Message queues certainly help avoid some of those problems, but like all engineering decisions there are tradeoffs.
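
    A minimal Python sketch of event-carried state transfer, with an in-memory pub/sub standing in for Kafka: each event carries the full state a consumer needs, so the consumer never has to call the producing service back (topic, fields, and values are illustrative):

      from collections import defaultdict

      subscribers = defaultdict(list)  # topic name -> list of handlers

      def subscribe(topic, handler):
          subscribers[topic].append(handler)

      def publish(topic, event):
          for handler in subscribers[topic]:
              handler(event)

      # The consumer keeps its own local copy of the state it cares about.
      local_order_cache = {}

      def on_order_updated(event):
          local_order_cache[event["order_id"]] = event  # the event carries the whole state

      subscribe("order-updated", on_order_updated)

      # The producer publishes the complete new state, not just "order 42 changed".
      publish("order-updated", {"order_id": 42, "status": "delivered", "total": 23.50})
      print(local_order_cache[42]["status"])  # -> delivered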

  • @MuhammadAkbar-m4g
    @MuhammadAkbar-m4g 3 days ago +1

    I don't know what this is, but if you are making it then I am gonna watch it

  • @thepoonhound3003
    @thepoonhound3003 20 hours ago

    wat

  • @josersleal
    @josersleal 3 hours ago

    suppose, maybe, you know nothing and are reporting as if you do... wtf?

  • @GenAlchemist
    @GenAlchemist 3 days ago

    Interesting Video!
    Thanks

  • @blue_genes
    @blue_genes 3 days ago

    Interesting

  • @pastori2672
    @pastori2672 3 days ago

    so good

  • @babo4019
    @babo4019 3 days ago

    Oh!

  • @lukealadeen7836
    @lukealadeen7836 3 days ago

    Python has static typing

  • @mikerico6175
    @mikerico6175 3 days ago

    You don't understand microservices. Each microservice has its own data; otherwise it's a distributed monolith.

    • @NeetCodeIO
      @NeetCodeIO  3 days ago +5

      At what point did I state otherwise? If service 1 calls service 2 which reads from a database, how does that imply service 1 doesn't have its own data store?
      Every image I had is paraphrased from DoorDash's article. That's all this video is: a summary of their blog with my thoughts added.

  • @mikerico6175
    @mikerico6175 3 days ago

    Don't appreciate the way you explain concepts you are new to as if you've mastered them.

    • @NeetCodeIO
      @NeetCodeIO  3 days ago +11

      Counterpoint: I don't appreciate non-actionable comments. Anything specific you would like to comment on, or are you just leaving this to feel superior?

    • @p-lock786
      @p-lock786 3 days ago +3

      @@NeetCodeIO Love your content, it keeps me motivated to learn more... keep posting, NeetCode... ppl like this will always be there, thinking they are the only ones who know everything

  • @igboman2860
    @igboman2860 1 day ago +1

    Lol, everyone that led their microservices migration should lose their job.
    What they encountered are in fact well-known and understood failure modes of microservices. So much so that many smart folks invented service meshes with dedicated sidecars to handle these concerns, like throttling, load shedding, exponential backoff + jitter during retries, etc.

    • @mjrmjr4
      @mjrmjr4 10 hours ago

      Not sure if you've ever been part of a migration like this, but it's really hard. The people doing this work were doing the best they could in the time they were given. Service mesh makes the migration take longer and introduces a whole new set of debugging challenges.