Bankrupt In 45 Minutes From DevOps | Prime Reacts

  • Published: 24 Nov 2024
  • Recorded live on twitch, GET IN
    / theprimeagen
    Reviewed article: dougseven.com/...
    Author: Doug Seven | / dseven
    MY MAIN YT CHANNEL: Has well edited engineering videos
    / theprimeagen
    Discord
    / discord
    Have something for me to read or react to?: / theprimeagenreact
    Hey I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free and if you want you can get a decent amount off (the free tier is the best (better than planetscale or any other))
    turso.tech/dee...

Comments • 535

  • @wlockuz4467
    @wlockuz4467 1 year ago +769

    This is why you need a physical kill switch, something that blows up your server room.

    • @morganjonasson2947
      @morganjonasson2947 1 year ago +90

      OR you could just make sure your web app runs on Flask's development server without a production WSGI server. That way the server will immediately crash as soon as you reach thousands of users in a short span of time. Usually one doesn't want this to happen, but if you are scared of high volume and don't want to deal with it, then that solution is perfect lol.

    • @Its.all.goodman
      @Its.all.goodman 1 year ago +6

      @@morganjonasson2947 haha 😅

    • @smtp_yurzx
      @smtp_yurzx 1 year ago +7

      They laughed at me and said I was crazy! Who's laughing now?!

    • @awmy3109
      @awmy3109 1 year ago

      😂

    • @Tekner436
      @Tekner436 1 year ago +6

      just put a small incendiary device on your main fiber line

  • @spacemonky2000
    @spacemonky2000 1 year ago +221

    Imagine hundreds of millions of dollars are being dumped every 10 minutes while you try to debug your deployment in prod. jesus christ

    • @ninocraft1
      @ninocraft1 9 months ago +14

      the devs need a lot of therapy after that xD

    • @FourOneNineOneFourOne
      @FourOneNineOneFourOne 6 months ago +46

      I worked at MS and I saw a team showing us a demo of a stock trading API. The dev forgot to switch the account to the test one and used the real company account to execute a trade worth $30m of IBM stock. He started getting calls, the trade couldn't be reversed, but the trading floor closed the position at a small profit, so it's now just a funny story, nobody got in trouble. Couldn't imagine if it went the other way.

    • @BrandonSorenson-fb3gg
      @BrandonSorenson-fb3gg 5 months ago +9

      This is why robust testing is very important....and a staging environment as identical as you can get without actually making trades is crucial

    • @1adamuk
      @1adamuk 5 months ago +2

      I'd just run out of the door tbh can't deal with stress like that.

  • @darylphuah
    @darylphuah 1 year ago +377

    Man, the amount of things that went wrong here
    - Not removing dead code
    - Re-using old feature flag (wtf were they thinking)
    - No deployment review and validation that proper code was deployed on all servers
    - No post-deployment monitoring
    - No observability/traceability metrics? (They couldn't immediately pinpoint that one server making way more trades than it was supposed to?)
    - No Kill switch
    Any one of these in place would have prevented the whole thing or minimised the damage
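
A minimal sketch of the observability point in the list above: comparing per-server order counts against the fleet median to spot the one server trading far more than its peers. The function name, the `factor` threshold, and the sample counts are all hypothetical and are not taken from the article.

```python
from statistics import median


def find_outlier_servers(orders_per_server, factor=5.0):
    """Return servers whose order count exceeds `factor` times the fleet median."""
    baseline = median(orders_per_server.values())
    # Guard against a zero median so an idle fleet does not flag every server.
    threshold = max(baseline, 1) * factor
    return sorted(
        name for name, count in orders_per_server.items() if count > threshold
    )


# Hypothetical one-minute order counts for eight servers (made-up numbers).
counts = {f"srv{i}": 100 + i for i in range(1, 8)}
counts["srv8"] = 40_000

print(find_outlier_servers(counts))  # prints ['srv8']
```

A check like this only helps if something acts on the result, for example by paging a human or halting the flagged server.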

    • @nnnik3595
      @nnnik3595 1 year ago +15

      No rollbacks?

    • @adriangodoy4610
      @adriangodoy4610 1 year ago

      @@nnnik3595 when the stated deploy process is to copy-paste the new binaries onto the server... I guess rollback is not even a possibility

    • @rachitpulhani3478
      @rachitpulhani3478 1 year ago

      they did rollback, only the deployments that were working correctly though @@nnnik3595

    • @monad_tcp
      @monad_tcp 1 year ago

      @@nnnik3595 no staging ? no tests ? no automated anything ?

    • @darekmistrz4364
      @darekmistrz4364 1 year ago +32

      Don't blame them for reusing an old feature flag. Do you know how much a new one would cost? Their price is so high that it's obvious they reused an old one. Look how much they saved on it! /s

  • @rdil
    @rdil 1 year ago +736

    Imagine getting an email saying something is wrong and seeing this much happen and not immediately killing the system

    • @doresearchstopwhining
      @doresearchstopwhining 1 year ago +138

      Imagine relying on someone outside of your company emailing you to tell you your shit is broken and now your company is bankrupt.

    • @vikramkrishnan6414
      @vikramkrishnan6414 1 year ago +77

      Email is not an adequate alerting mechanism for shit like this.

    • @crtune
      @crtune 1 year ago +12

      The people building this software are not traders themselves, nor are they regulators, and budget limits make it hard to write protections against every possible screw-up into even the most heavily budgeted trading software. This install and uninstall problem clearly exposes a flaw in the "parent order/child orders" design: the setup did not provide for network-wide deactivation of this capability without removing the PEG software. A better design would have made it possible to turn off PEG entirely without uninstalling it.

    • @FrankJonen
      @FrankJonen 1 year ago

      @@vikramkrishnan6414 IMAP Push is just as immediate as any ping.

    • @Rexhunterj
      @Rexhunterj 1 year ago

      @@vikramkrishnan6414 I'm amazed they didn't have an app and push notifications for this lol.

  • @shashantr.9380
    @shashantr.9380 1 year ago +270

    I myself work as a System Engineer in an HFT company. Whenever there are any discussions regarding Risk Management, almost every time we get to hear about Knight Capital's incident

    • @harleyspeedthrust4013
      @harleyspeedthrust4013 1 year ago +8

      risk management. why do they call it ovaltine

    • @boredphilosopher4254
      @boredphilosopher4254 9 months ago +3

      @@harleyspeedthrust4013 why not roundtine

    • @harleyspeedthrust4013
      @harleyspeedthrust4013 9 months ago +1

      @@boredphilosopher4254 the mug is round

    • @steffenbendel6031
      @steffenbendel6031 4 months ago +2

      I remember the Knight incident. At that time, we traded FX (currencies) with their Hotspot subsidiary. It wiped out the pensions of the people working there. I myself caused a similar incident in our company: when a client changed the amount they wanted to trade with us, that created some adjustment trades, either selling or buying more. After a change, something got switched: executions for sell trades were added to the amount to buy and vice versa. So the number of shares you still needed to trade got bigger and bigger. Luckily, I had always insisted on keeping the trading slow and not very aggressive. So it only traded back and forth for a few hours in the night before we disabled and fixed it. Still had to file an incident report with the regulators.

  • @alexandrustefanmiron7723
    @alexandrustefanmiron7723 1 year ago +153

    U never ever ever reuse the same opcode for a new functionality ever.. never ever ever ever ever!!!!!! If you remove and deprecate you throw an error of sorts... But never ever ever repurpose an API endpoint!!!!!!
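
One way to enforce the rule in the comment above is a registry that remembers retired flag names and refuses both to reuse them and to dispatch on them. This is only a sketch: the `FlagRegistry` class and the flag names are hypothetical, and a real system would have to persist the retired set somewhere every server reads it.

```python
class FlagRegistry:
    """Tracks active flags and permanently blocks retired names."""

    def __init__(self):
        self._active = {}
        self._retired = set()

    def register(self, name, handler):
        # A retired name can never be given a new meaning.
        if name in self._retired:
            raise ValueError(f"flag {name!r} is retired and must not be reused")
        if name in self._active:
            raise ValueError(f"flag {name!r} is already registered")
        self._active[name] = handler

    def retire(self, name):
        # Remove the handler but keep the name on the blocklist.
        self._active.pop(name, None)
        self._retired.add(name)

    def dispatch(self, name, *args):
        # Fail loudly instead of silently running old or repurposed code.
        if name in self._retired:
            raise LookupError(f"flag {name!r} was retired")
        if name not in self._active:
            raise LookupError(f"unknown flag {name!r}")
        return self._active[name](*args)


registry = FlagRegistry()
registry.register("old_feature", lambda order: f"old path for {order}")
registry.retire("old_feature")
registry.register("new_feature", lambda order: f"new path for {order}")
print(registry.dispatch("new_feature", "order-1"))  # prints new path for order-1
```

With this shape, a message that still carries the old flag raises an error on every server, whether or not that server received the new code.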

    • @tryfinally
      @tryfinally 1 year ago +25

      their code was probably cursed and adding a new flag was too difficult

    • @astronemir
      @astronemir 1 year ago +10

      How cursed of a codebase do you think they had to reuse the same flag code? I bet they only removed the dead code for the express purpose of reusing its code, but they didn’t even separate the deployment into two. Remove first, make sure everything is working, then reuse it (don’t ever do it), but if you’re doing it, don’t do it in the same step.

    • @grimsas
      @grimsas 1 year ago +9

      As a bounty hunter I love when there's some obsolete API leftovers accessible in a codebase. Makes life much more exciting 😉

    • @kneesnap1041
      @kneesnap1041 8 months ago

      Unfortunately, it's not considered bad practice to use single character command line flags.
      Most likely it was something like reusing -p. I can point to tons of unix programs that are using pretty much any single letter command line flag they could possibly use. I think it's a lesson that in critical applications, --power-peg should probably be used instead of just -p

  • @nexovec
    @nexovec 1 year ago +245

    What's the worst thing that could happen...
    *Siphons the entire AWS account

    • @awillingham
      @awillingham 1 year ago +19

      I thought that’s where this was going. “Our load balancer was configured incorrectly and we allocated 500,000 instances which logged 5,000,000,000 errors and crashed cloudwatch and our s3”

    • @monad_tcp
      @monad_tcp 1 year ago +7

      @@awillingham I thought they did something wrong and fired up 1 billion instances, and that cost $500M USD in stupid AWS charges. But it was even more amusing.
      At least if that was AWS you could immediately cancel your entire bank account, claim hack, make a public PR storm and never pay it back.

    • @FourOneNineOneFourOne
      @FourOneNineOneFourOne 5 months ago +3

      @@monad_tcp Believe it or not, AWS has built in triggers to prevent anything like that from happening.

    • @monad_tcp
      @monad_tcp 5 months ago +3

      @@FourOneNineOneFourOne That means it already happened at least once, someone did exactly that and they implemented stoppers to prevent it.

  • @neoplumes
    @neoplumes 1 year ago +252

    Article: "DevOps is broken!" Also Article: "Nothing about this is DevOps!"

    • @composerkris2935
      @composerkris2935 1 year ago +76

      It's actually a perfect example of why DevOps is needed.

    • @monad_tcp
      @monad_tcp 1 year ago +13

      it's like people who say the Windows firewall doesn't work (it's turned off most of the time, or someone put a too-broad rule there)

    • @Waitwhat469
      @Waitwhat469 1 year ago +9

      @@composerkris2935 Right! Like engineering sent that shit over the wall and they deployed it. No one seemed responsible for it in prod, it was just allowed to be. No one monitoring their new feature, no one designing automated deployments, backout plans, rollouts, etc. Just one big bureaucratic process shoving new code in one end, and getting these results on the other.

    • @TosiakiS
      @TosiakiS 1 year ago +23

      It's actually "lack of devops" when they say "because of devops." Yes it's a confusing title.

    • @hbp_
      @hbp_ 1 year ago +1

      I'm still not convinced that an automated deployment was what would have prevented this disaster. Surely if it was built perfectly, perhaps written in Haskell.

  • @rain_deer
    @rain_deer 1 year ago +207

    I worked at a bank a few years ago, and the team I was on had a completely manual deployment process.
    We had a round robin table of who would be in charge that week and they would have to go around through all the sub teams and collect a list of manual instructions. And this was never prepared ahead, you really had to fight tooth and nail to get those instructions.
    We'd then wrap it all up and send it off to the 1 and only 1 prod support guy that had been on the team longer than I was in the industry.
    Eventually that guy turned 56 and it scared the hell out of management, and they blocked the entire team from deploying anything.
    I was leaving the bank at that time so I don't know how it worked out, but every now and then I wake up in a cold sweat thinking about it.

    • @Nil-js4bf
      @Nil-js4bf 1 year ago +43

      Somewhat surprised that banks don't have better automation for this stuff. But then again, banks still rely on ancient code doing batch processing on csv files...
      If it ain't broke, don't fix it. The problem is, when it does break, it can be catastrophic.

    • @monad_tcp
      @monad_tcp 1 year ago +4

      That guy must have won a very sweet severance package when he retired

    • @monad_tcp
      @monad_tcp 1 year ago +7

      @@Nil-js4bf "If it ain't broke, don't fix it. The problem is, when it does break, it can be catastrophic." If you think like that you deserve everything that happens because of complacency.

    • @WinterMute_df
      @WinterMute_df 1 year ago +4

      Man, it sounds almost like the bank I'm working at (one of the biggest). I'm getting a deployment to do every other month without even a clear understanding of how all components of the platform work (because of 'need to know' and such). Without me calling this ONE GUY when something seems odd, everything would have gone up in flames on Monday twice already.

    • @Tekner436
      @Tekner436 1 year ago +5

      @@Nil-js4bf "oh shit it broke, call ted"
      "uhhh ted died 6 years ago."

  • @vikramkrishnan6414
    @vikramkrishnan6414 1 year ago +192

    At that stage, I would just walk into the server room with a flame thrower and burn it down after around 10 minutes of this

    • @ea_naseer
      @ea_naseer 1 year ago +36

      remove the plug from the wall😂😂😂

    • @vikramkrishnan6414
      @vikramkrishnan6414 1 year ago +16

      @@ea_naseer I want to be extra sure. Also, are there some copper ingots you would like to sell me?

    • @ardnys35
      @ardnys35 1 year ago +2

      i was thinking sledgehammer. or a stick explosive or flooding it for the extreme measures.

    • @oleg4966
      @oleg4966 1 year ago +2

      @@ardnys35 Just nuke the entire site from orbit.

    • @monad_tcp
      @monad_tcp 1 year ago

      First 5 minutes I would go to the server room and press the red button and kill the power

  • @ab-nb5xg
    @ab-nb5xg 1 year ago +51

    Normally you do have to pay good money for a power pegging but $400 million is probably a bit steep

    • @_Lumiere_
      @_Lumiere_ 1 year ago +6

      Entering findom territory there

    • @SaHaRaSquad
      @SaHaRaSquad 1 year ago

      Very low pegging on investment. Especially as it only ran on 1/8 of the capacity.

    • @NoComGaming-uq1oq
      @NoComGaming-uq1oq 5 months ago

      Steep?! it's effing vertical

  • @demolazer
    @demolazer 1 year ago +108

    I have a close family member who works high up in data architecture at a major bank. You would not believe how batshit their dev processes are.

    • @larbiishak1974
      @larbiishak1974 1 year ago +1

      surely they heard and learned from this story

    • @mcspud
      @mcspud 1 year ago +6

      I have been there.
      Can confirm.

    • @emptydata-xf7ps
      @emptydata-xf7ps 1 year ago +3

      Blame it on the CIO. The stakeholders are the pushers and the CIO needs to make it known to them how costly some mistakes are.

    • @adriangodoy4610
      @adriangodoy4610 1 year ago +11

      The only way to improve a bank system is: create a completely new bank, make it grow until it's bigger than the original, absorb the original. You can't really touch that COBOL in a meaningful way.

    • @stevezelaznik5872
      @stevezelaznik5872 1 year ago +10

      I used to work at an insurance company. However bad you think it is, it's worse.

  • @Gruby7C1h
    @Gruby7C1h 1 year ago +71

    I remember reading a story about it a few years ago, the real lesson here is: never "repurpose a flag".

    • @harrytsang1501
      @harrytsang1501 1 year ago +10

      I think it's ok as long as you space out the deployment between dead code removal and feature flag on
      But yes, lack of separate feature flag, lack of kill switch, lack of knowledge to even kill the server

    • @mennovanlavieren3885
      @mennovanlavieren3885 1 year ago +3

      Why is this not in the lessons learnt? It would not pass my review. NEVER repurpose a flag. There is zero cost to a new flag, and if that is not the case, then that is a problem in itself.

    • @tylerbreau4544
      @tylerbreau4544 5 months ago +4

      I think it's foolish to only take a single lesson.
      There's also killswitch, automated deployments, accurate rollbacks (how did the rollback not stop the power peg system from running?), etc.

  • @chrisalexthomas
    @chrisalexthomas 1 year ago +28

    I remember hearing about this when I was working at a finance company back in the day and I couldn't believe it. Every time I see this article I still read it, despite knowing the history already because it's just so damn funny. Who doesn't love a story with a protagonist called power peg?

  • @arcanernz
    @arcanernz 1 year ago +135

    Infinite loop + high speed trading, what could go wrong? I think the problem was they didn’t have any anomaly detection and mitigation.
    A bad deployment caused this but the rollback made it worse. Sometimes you can’t test for every scenario, which is why you need anomaly detection and kill switches.
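
A rough sketch of what anomaly detection plus a kill switch could look like: a breaker that counts orders in a sliding time window and, once a limit is exceeded, refuses all further orders until a human resets it. The class name, the limits, and the idea of passing timestamps in explicitly are assumptions made for illustration, not a description of any real trading system.

```python
from collections import deque


class OrderRateBreaker:
    """Trips when more than `max_orders` are sent within `window_seconds`."""

    def __init__(self, max_orders, window_seconds):
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._timestamps = deque()
        self.tripped = False

    def allow(self, now):
        """Return True if an order may be sent at time `now` (in seconds)."""
        if self.tripped:
            return False
        # Drop timestamps that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_orders:
            # Latch off: stays tripped until a human calls reset().
            self.tripped = True
            return False
        self._timestamps.append(now)
        return True

    def reset(self):
        self._timestamps.clear()
        self.tripped = False


breaker = OrderRateBreaker(max_orders=3, window_seconds=1.0)
decisions = [breaker.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 10.0)]
print(decisions)  # prints [True, True, True, False, False]
```

The breaker latches rather than recovering on its own, so a runaway loop cannot resume trading once the window slides past the burst.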

    • @andrewyork3869
      @andrewyork3869 1 year ago

      HFT is so fast I challenge what good an anomaly detector would actually be.

    • @darekmistrz4364
      @darekmistrz4364 1 year ago

      @@andrewyork3869 number of trades per minute? number of type of trades per minute?

    • @arcanernz
      @arcanernz 1 year ago

      Better than nothing

    • @CoderDBF
      @CoderDBF 10 months ago

      @@andrewyork3869 About 400 million worth at this point. The system would have halted a lot sooner.

  • @anthonyisensee
    @anthonyisensee 1 year ago +78

    As a DevOps engineer, this story shows exactly why you need good DevOps, or at the very least good engineers that can do good DevOps.

    • @Veri7a
      @Veri7a 1 year ago +13

      if not, you auto deploy PegOps

    • @mlc1610
      @mlc1610 1 year ago +4

      DevOooooops

    • @wormius51
      @wormius51 1 year ago

      I never worked on something this big but when I work on something, the guy just wants the thing to do a thing so I am pretty much doing all the front and back ends, devops, testing, etc. Do you have dedicated people for each thing in the stuff you worked on?

    • @cornoc
      @cornoc 1 year ago

      @@wormius51 yes that's how things usually work when a company gets big enough.

    • @ugh_dad
      @ugh_dad 1 year ago +6

      for sure, this story seems like the title should be Bankrupt in 45 minutes from Lack of DevOps

  • @raulsantana163
    @raulsantana163 1 year ago +153

    "The code that that was updated repurposed an old flag that was used to activate the Power Peg functionality"
    The article covers the deployment part but this is the craziest thing about it. For such impactful functionality, they should have just deleted everything and reused nothing.

    • @khatdubell
      @khatdubell 1 year ago +34

      The last place I worked I made deleting unused code my religion.
      I deleted millions of lines of code.
      When I left there was still plenty of unused code that needed deleting.

    • @catgirl_works
      @catgirl_works 1 year ago +40

      This one really stuck out to me too. Do not EVER reuse flags. If, for whatever reason, you absolutely have to reuse a flag, do not repurpose that flag in the same release that removes the old code. That is a disaster waiting to happen. The old code should be completely removed from the system long before you even think of reusing a flag.

    • @jasondoe2596
      @jasondoe2596 1 year ago +6

      @@catgirl_works Exactly, I'm surprised Prime didn't mention this.

    • @oleksiistri8429
      @oleksiistri8429 1 year ago +1

      @@khatdubell you shite a lot, sir 😊

    • @tymak_cz
      @tymak_cz 1 year ago +2

      Yeah... what could possibly go wrong if you try to repurpose code that was 8 years dead.

  • @danielvaughn4551
    @danielvaughn4551 1 year ago +56

    DevOps is shorthand for "job security"

    • @theangelofspace155
      @theangelofspace155 1 year ago +2

      Isn't that SecOps? As in DevSecOps?

    • @cornoc
      @cornoc 1 year ago +2

      @@theangelofspace155 it's all just words in the ether

  • @faithful451
    @faithful451 8 months ago +1

    Realistically their most inexcusable failing was not having a post deployment review to make sure everything was good (all servers in expected state), etc.
    There are always gonna be suboptimal processes, and things that are manual that shouldn't be, and sometimes not enough staff on the team, or management won't pay for X tool, or whatever, but the one thing you can ALWAYS do is a proper checklist of what was supposed to be done, and making sure it got done.
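
As a sketch of the checklist idea above, a post-deployment check can compare a fingerprint of what each server is actually running against the fingerprint of the release that was supposed to go out. The server names and artifact bytes below are made up, and how the fingerprints get collected from real hosts is left out on purpose.

```python
import hashlib


def fingerprint(artifact):
    """Return the SHA-256 hex digest of a build artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


def find_stale_servers(expected, deployed):
    """Return servers whose deployed fingerprint differs from `expected`."""
    return sorted(name for name, fp in deployed.items() if fp != expected)


# Hypothetical fleet: seven servers got the new release, one was missed.
new_release = fingerprint(b"release-2")
old_release = fingerprint(b"release-1")
deployed = {f"srv{i}": new_release for i in range(1, 8)}
deployed["srv8"] = old_release

stale = find_stale_servers(new_release, deployed)
print(stale)  # prints ['srv8']
```

The deployment would only be signed off when the returned list is empty; anything else blocks enabling the new feature.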

  • @salvatoreshiggerino6810
    @salvatoreshiggerino6810 1 year ago +27

    I knew it was a scam when I was on a team where they had hired a DevOps specialist who didn't know how to code so nothing was automated, deploying just meant copying individual files to the server and restarting.

    • @adriangodoy4610
      @adriangodoy4610 1 year ago +13

      Devops -> devs do operations. Companies: so I will hire a team of non-devs, put them in operations, and call them devops.

    • @randomdude5430
      @randomdude5430 1 year ago +4

      I would never hire a devops who has not been a dev before he switched. I saw a lot of ops guys jumping on the devops train and having no clue about what they are doing.

    • @salvatoreshiggerino6810
      @salvatoreshiggerino6810 1 year ago +6

      @@randomdude5430 You're not using your brain. You hire a rando off the street who vaguely knows how to turn on a computer, pay him accordingly, then you sell him to your clients at full [meme role du jour] rates and then you laugh all the way to the bank.

    • @georgerogers1166
      @georgerogers1166 7 months ago

      @@salvatoreshiggerino6810 It's called having ethics.

    • @georgerogers1166
      @georgerogers1166 7 months ago

      @@salvatoreshiggerino6810 But in that case you aren't hiring, your clients are.

  • @pohjoisenvanhus
    @pohjoisenvanhus 1 year ago +49

    Being able to roll back on a moment's notice seems to be pretty important, huh.

    • @khatdubell
      @khatdubell 1 year ago

      The rollback fucked them harder.

    • @fulconandroadcone9488
      @fulconandroadcone9488 1 year ago +16

      Let alone roll back, just shutting the thing down would be impressive.

    • @katrinabryce
      @katrinabryce 1 year ago +5

      But in this case, the roll-back made things worse.

    • @vikramkrishnan6414
      @vikramkrishnan6414 1 year ago +4

      Rollback made this worse

    • @pohjoisenvanhus
      @pohjoisenvanhus 1 year ago +5

      @@katrinabryce Yes, a rollback as badly botched up as the roll out was. It sounds like their fundamental problem was having both poorly understood legacy code and a legacy server in the mix.

  • @TheEVEInspiration
    @TheEVEInspiration 1 year ago +6

    Automating deployment can also automatically deploy errors, introduce new errors, or be done in an environment whose state no longer represents the tested state in a critical way.
    Mistakes anywhere along the process can always happen and human supervision is always required to make sure things are going right and if not, to react to the unforeseen/unhandled situation promptly.

    • @autohmae
      @autohmae 8 months ago +2

      Yeah, automated deployment of the wrong thing is DEFINITELY a huge problem, but part of the idea of DevOps, especially GitOps, is that you can make it a pull/merge request and have it reviewed.

  • @shoooozzzz
    @shoooozzzz 1 year ago +15

    Best outro yet. "The name is.... The PowerPegeagen"

  • @Refresh5406
    @Refresh5406 1 year ago +14

    "Bankrupt In 45 minutes from every single solitary individual in our company being a monumental idiot"

    • @adriangodoy4610
      @adriangodoy4610 1 year ago +8

      I bet a lot of people involved were saying openly to management that it was a bad idea. But management wasn't having any of those complaints.

  • @JackDespero
    @JackDespero 5 months ago +4

    There is always a kill switch. It is called forcefully unplugging the 8th computer with Power Peg from the electricity net.

  • @Tony-dp1rl
    @Tony-dp1rl 1 year ago +8

    "repurposed a flag" ... WTF would you do that!? lol

  • @sorcdk2880
    @sorcdk2880 1 year ago +8

    Having worked with software where mistakes could potentially cause similar sized losses, I was always a bit amazed at how small the team was (3 people) and how little management cared to take extra precautions. At least I had pushed to get some good automated tests, and we did end up putting some other procedures in place over time, but it really felt like we were just lucky that nothing too bad ended up happening before we got a safer setup in place.

    • @tylerbreau4544
      @tylerbreau4544 5 months ago

      It is also a part of the developer's job to inform management of risk and what can be done to address the risk.
      Any manager who refuses to invest in risk countermeasures within reason does not have the company's best interest in mind.
      With that said, it is important to note that risk management is a balance, hence the "within reason".
      Just because a potential problem exists doesn't necessarily mean it's justifiable to spend 6 months of development time fixing it.
      And it is the team leaders' and managers' job to weigh the cost and risk and determine the best course of action - Devs explain the risk and managers decide if it's worth the cost to fix.
      If you keep a paper trail at least you can cover your own behind.

  • @vikramkrishnan6414
    @vikramkrishnan6414 1 year ago +9

    Cash equivalents = LIBOR bonds and short term US bonds (typically < 1yr), i.e. bonds of AAA rated countries with near to no interest rate risk

  • @Eagledelta3
    @Eagledelta3 1 year ago +8

    I fail to see how the title of the article matches what was discussed. DevOps has always pushed for automated deployment processes (or at least as automated as humanly possible) to limit human error. In other words, the idea has been to apply some Dev processes to the Operations process. NOT to replace operations with developers, NOR to make the operations team into a development team.
    Like Agile, the original ideas behind DevOps have been hijacked by managers and companies to get what they want from it rather than actually apply the benefits within those ideas. Nor was either of those ideas ever meant to carry a "This is the only way to do this" kind of attitude.

    • @composerkris2935
      @composerkris2935 1 year ago +3

      Yeah, everything they did wrong goes against everything I have ever been taught about DevOps. Just one giant oopsie after the next. If anything, this tale demonstrates why good DevOps practices are needed.

  • @agusaris5031
    @agusaris5031 1 year ago +32

    Imagine blaming “DevOps” when you're still copying those files manually, which itself goes against DevOps principles

    • @Tekner436
      @Tekner436 1 year ago +2

      but... the article is about why you need good devops practices... lol

  • @user-qr4jf4tv2x
    @user-qr4jf4tv2x 1 year ago +5

    even if you automate, the automation can also create its own problems.
    it's like using triggers in a database, where you eventually forget they exist.

  • @karmatraining
    @karmatraining 1 year ago +4

    I love this new "Humorous Energetic Sports Commentator, But For Obscure Coding Topics" genre

  • @KulKulKula
    @KulKulKula 1 year ago +50

    "Terraform deez nuts"
    After dealing with this M$ piece of .... every day, I cannot agree more

  • @levifig
    @levifig 11 months ago +4

    The irony here is that the issue was caused precisely because of a lack of DevOps procedures…

  • @tomkatdev
    @tomkatdev 1 year ago +3

    This is a new record for how hard i've laughed with prime. I can't even type...I may die of laughter whilst typing this on my inadequate keyboard.

    • @lashlarue7924
      @lashlarue7924 9 months ago

      slow clapping 👏 on this one for sure 😂

  • @careymcmanus
    @careymcmanus 1 year ago +1

    At my workplace we have a replica of our production environment (sandpit) which we as devs deploy to and test in before devops deploys the same changes to production. Last year the person who did the deploys to sandpit left and I took over. The process was a list of different steps that all needed to be done correctly, and as someone with ADHD I can't get that right all the time. As it was a sandpit environment the only harm it caused was the ridiculous amount of delays in getting it all working, but it drove me up the wall. I was able to convince my boss to give me the time to completely overhaul the process so that it is now just a simple one-line command. We haven't had a single deploy issue since, and the DevOps team loves me now because it made their lives easier.

  • @TheOtherNEO
    @TheOtherNEO 1 year ago +16

    Can't recall when and who, but there was some broker whose software developers didn't realize that bid and offer mean the reverse in stonk trading.

  • @bluecup25
    @bluecup25 1 year ago +25

    Imagine being the poor dude who forgot to copy paste the new files to the 8th server.

  • @ChrisOfSDUB
    @ChrisOfSDUB 1 year ago +13

    The "term of art" is change control.

    • @kiranmkota
      @kiranmkota 1 year ago +2

      Change management? Version control? Never heard of change control

  • @talkohavy
    @talkohavy 1 year ago +4

    Hey ThePrimeagen,
    I don't know if you read the comments...
    But we would definitely love a talk about how they implemented the kill switch, what it means, and how it would work in case of a real code-red situation.
    Love your content !

  • @dmurvihill
    @dmurvihill 9 months ago +1

    I worked on a high speed ad bidder around ten years ago. The kill switch was literally the first thing we built.

  • @Phaceial
    @Phaceial 1 year ago +10

    Extremely high level - Market makers are the people that exchange stocks for cash. They give the market liquidity.

    • @khatdubell
      @khatdubell 1 year ago +3

      “Market makers connect sellers with buyers” is probably a better description.

    • @DaveThomson
      @DaveThomson 1 year ago +3

      @@khatdubell It's both. There has to be liquidity in order to sustain the market.

  • @Alex_Cevi
    @Alex_Cevi 1 year ago +4

    The tism is really firing in this video .. I love it

  • @complexity5545
    @complexity5545 1 year ago +2

    It's rare for a financial brokerage system not to have a Halt (or kill switch)... really rare (and not to have a cluster backup).
    P.S. Don't forget that no matter how much you plan, even a robot can be told to do deployment wrong. You need a kill switch and a backup cluster (for rolling back).

  • @dabbopabblo
    @dabbopabblo 1 year ago +1

    Modulo on an incremental user id is such a genius way to select a deterministic subset of experiment subjects. My grug brain would have just picked a random value and stored it for every hit of a common endpoint if the user hadn't been either selected or not selected prior.
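
The modulo trick described above can be sketched in a few lines. The function name and the choice of 100 buckets are assumptions for illustration; the point is only that the same user id always lands in the same bucket, so nothing has to be stored per user.

```python
def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # The same id always yields the same bucket, so no per-user state is needed.
    return user_id % 100 < percent


print(in_experiment(105, 10))  # prints True  (105 % 100 == 5)
print(in_experiment(110, 10))  # prints False (110 % 100 == 10)
print(sum(in_experiment(uid, 10) for uid in range(1000)))  # prints 100
```

This works evenly only because the ids are sequential; ids with a skewed distribution would need hashing first.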

  • @darshanchaluvaraju
    @darshanchaluvaraju 1 year ago +2

    Automation is the core principle of DevOps, and the statement "Copying the code to the 8th server (Seems manual to me)" itself goes against DevOps principles. The symbol of DevOps, the "infinity loop", itself shouts "Automation!". Guess the "Tech"nician failed to understand that. The article seems to have been written in 2014. I won't be surprised if that's what people understood by DevOps at that time.
    And if the observability part is not taken care of, it doesn't matter if the deployments are manual or automated, it is just a ticking time bomb.

    • @garciajero
      @garciajero 1 year ago +2

      100%, the only mention of DevOps in this article was in the title. Anyways, we call it Platform Engineering now.

  • @kizhissery
    @kizhissery 1 year ago +2

    Market making is done by HFT (high-frequency trading) firms, which thereby provide liquidity.
    These systems trade billions of times a day; they buy low and sell high for something like 1 cent per trade, and that is how they make money.

  • @asdfxyz0
    @asdfxyz0 1 year ago +2

    This sounds like a lack of devops to me. Super poor observability around deployments, too many manual steps to deploy, insufficient monitoring of the system, sounds like this was inevitable tbh.

  • @sylarfx
    @sylarfx 1 year ago +1

    this flash crash story is well known in the finance/trading circles, I think there was also a book written about it and similar cases of flash crashes

  • @NebucadLaVey
    @NebucadLaVey 1 year ago +3

    CI/CD is one part of DevOps. DevOps mainly stands for automation, which is why CI/CD is mentioned so often. But it's a little more than that: the goal is also to reduce human action to a minimum, and to test so you don't deploy a crap program that a developer fucked up. Things like that...

    • @Waitwhat469
      @Waitwhat469 1 year ago

      Right, DevOps is an org and process idea; CI/CD is how you enable that at scale.

  • @patrickhighspeed
    @patrickhighspeed 1 year ago +3

    That is the greatest Story of all time!!!!

  • @isaackoz
    @isaackoz 1 year ago +2

    "No written procedures that required such a review."
    I'm sorry, but having no procedures for somebody replacing code on servers is just asking for an Office Space 2.0. The amount of power those "technicians" had....

  • @mariano.pualiu
    @mariano.pualiu 8 months ago

    Even at Pixar they were able to call the IT department and ask them to unplug the servers right now to stop the assets from being continuously deleted. It didn't help much, but they were able to do that.

  • @nikarmotte
    @nikarmotte 1 month ago

    2:13 that made me laugh for a few minutes. Man, you made my day

  • @mrchubbles
    @mrchubbles 1 year ago +4

    What's mind-boggling to me is how Knight was still able to be acquired at $1.4 billion despite this fiasco.

    • @lashlarue7924
      @lashlarue7924 9 months ago

      Well, their configuration of assets in the value chain was such that someone was still willing to pay for those assets. Future value matters. Temporary insolvency can be remediated. Also, $400 million isn't that much money on Wall St.

  • @cmoullasnet
    @cmoullasnet 1 year ago +4

    If you’re not confident enough to delete vestigial deprecated code, it means you don’t know how your code works and you’re too cheap/lazy to do anything about it.
    It’s not surprising that blackbox code did something unexpected. If it’s your personal blog then fine. But high frequency trading systems subject to the wrath of angry rich Wall Street folk?

  • @my-curiosity
    @my-curiosity 5 months ago

    I'm a DevOps engineer working for a big company. This article is actually a good example of why you have to use CI/CD / DevOps practices (call it whatever you like, but in the end it's just engineers with a specific mindset and responsibilities).
    Also, I worked for companies that had low-quality DevOps teams consisting of folks from India (in the Bay Area, CA)... and that was a disaster. They were doing everything manually on Windows servers... So even if you've got a DevOps team, it doesn't mean you have the right people with the right skills. You want to follow all the automation, DevOps, and CI/CD best practices, not just have a useless DevOps team.

  • @carltongannett
    @carltongannett 1 year ago +14

    NDC Conferences has a good talk on this, I believe

  • @Gilligan128
    @Gilligan128 1 year ago +3

    The title is very misleading: most of the way Knight handled things is counter to DevOps principles.

  • @kairon156
    @kairon156 5 months ago

    Like the kill switch. I bought a plunger for my toilet 3 years ago. Haven't used it yet but I'm ever so thankful that it's there as an emergency option.

  • @doceddie
    @doceddie 1 year ago +1

    This is low-key your best video yet. 😂

  • @alvinxyz7419
    @alvinxyz7419 1 year ago +2

    i dont get it, there is nothing to do with devops in the article

  • @khatdubell
    @khatdubell 1 year ago +10

    "the only blame on the engineers is that they didnt push hard enough"
    no.
    Obviously i have no inside knowledge here, but lets assume the worst case scenario.
    In that case:
    They are to blame for not removing unused code.
    They are to blame for repurposing an old feature flag instead of using a new one.
    They are to blame for not building in a kill switch into either of the features.
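
One way to make the "new flag instead of a repurposed one" point concrete is a registry that refuses to hand out a retired flag name again. This is only a sketch under assumptions: `FlagRegistry` and the flag name `power_peg` are hypothetical and say nothing about how Knight's system actually stored its flags.

```python
class FlagRegistry:
    """Tracks feature flag names so a retired name can never be reused."""

    def __init__(self) -> None:
        self._active: set[str] = set()
        self._retired: set[str] = set()

    def register(self, name: str) -> None:
        # A retired name may still be checked by old code on some server,
        # so giving it a new meaning is rejected outright.
        if name in self._retired:
            raise ValueError(f"flag {name!r} was retired and cannot be reused")
        if name in self._active:
            raise ValueError(f"flag {name!r} is already registered")
        self._active.add(name)

    def retire(self, name: str) -> None:
        if name not in self._active:
            raise KeyError(f"flag {name!r} is not active")
        self._active.remove(name)
        self._retired.add(name)

    def is_active(self, name: str) -> bool:
        return name in self._active


# Hypothetical usage: the old feature's flag is retired, then someone
# tries to repurpose the same name for a new feature.
registry = FlagRegistry()
registry.register("power_peg")
registry.retire("power_peg")
try:
    registry.register("power_peg")
    reuse_blocked = False
except ValueError:
    reuse_blocked = True
```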

    • @freezingcicada6852
      @freezingcicada6852 Год назад +3

      Well, initially it seemed like just a lack of communication to avoid deploying the older version.
      Ultimately I do put the blame on the managers or w/e. Wtf where they doing when such an exaggerated amount of volume in the market was being pumped out. The FIRST 15 mins after opening bell in the market is the most important. They didnt see if it was just themselves pumping the market?

    • @khatdubell
      @khatdubell 1 year ago

      ​@@freezingcicada6852 Yes, that too.
      I forgot about that.
      where was the monitoring?

    • @catcatcatcatcatcatcatcatcatca
      @catcatcatcatcatcatcatcatcatca 1 year ago +3

      In my view, the price-tag alone can determine the blame. If the company goes bankrupt, the person at fault is by definition the CEO: the colossal failure was organisational risk management, as there seemingly was none at all in this case.
      While blame is never a zero-sum game, it is not very useful to blame developers or technicians for a disaster like this. While every level has lessons to learn, the lessons for leadership are much more important than the lessons for technicians and developers.
      Any harm, loss or damage disproportionate to the technicians or developers position and salary is a risk their supervisor failed to mitigate. All the way up to CEO.

    • @naniyotaka
      @naniyotaka 1 year ago +1

      @@catcatcatcatcatcatcatcatcatca Just like Chernobyl.

  • @foxglenacres
    @foxglenacres 1 year ago

    It was doomed from the beginning... the name alone makes the "horror" of the potential issues just hilarious. It is sad and horrible that this happened, but the humor in your delivery is hilarious!

  • @TobiMetalsFab
    @TobiMetalsFab 3 months ago

    I remember once asking "What's the worst that can happen?", and then proceeded to be in pain for the next two years

  • @JesseGilbride
    @JesseGilbride 1 year ago +1

    "PowerPegAgen" might be the best one yet. 😂

  • @MiguelFelipeCalo
    @MiguelFelipeCalo 4 months ago

    We're just publishing new pages from a CMS and we have checkboxes all the way through making sure nothing broke. One would think a mission critical system like this would have a deployment session similar to a rocket launch.

  • @GaryFerrao
    @GaryFerrao 1 year ago +3

    0:43 associating Continuous Delivery with Dave Farley was the best joke i’ve heard so far. 😂
    but be careful, you are becoming associated with regex licensing and some Rust things 😂

  • @Tips-r-us
    @Tips-r-us 3 months ago

    At Dairy Farm in Asia, we own most of the 7-Eleven stores, and when we release updates to the 7-Eleven servers and POS, we do the slow clap: 1 store, then 5 stores, then the stores in a region, then global. That is our procedure. We don't want to stop all the stores from taking money in a single moment, and if CrowdStrike had happened to us, it would have been a nightmare. Luckily, all our servers and POS terminals are Linux.
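
The "slow clap" described in this comment can be sketched roughly as below, assuming a rollout that halts at the first stage with an unhealthy target. The names `staged_rollout`, `deploy`, and `healthy`, and the store ids, are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, Sequence[str]]  # (stage name, targets in that stage)


def staged_rollout(
    stages: Sequence[Stage],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy stage by stage; stop at the first stage with an unhealthy target.

    Returns (names of completed stages, name of the halted stage or None).
    """
    completed: List[str] = []
    for name, targets in stages:
        for target in targets:
            deploy(target)
        if not all(healthy(target) for target in targets):
            return completed, name  # later stages are never touched
        completed.append(name)
    return completed, None


# Hypothetical example: one store in the second stage fails its health check.
stages = [
    ("one store", ["store-001"]),
    ("five stores", ["store-002", "store-003", "store-004", "store-005", "store-006"]),
    ("region", ["store-100", "store-101"]),
    ("global", ["store-900"]),
]
deployed: List[str] = []
broken = {"store-004"}
result = staged_rollout(stages, deployed.append, lambda t: t not in broken)
```

Here the rollout stops after six stores, so the regional and global stores keep running the previous version.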

  • @PCGamesAndTek
    @PCGamesAndTek 9 months ago

    This is a perfect example of why DevOps and following strict release procedures is crucial.

  • @PikaPetey
    @PikaPetey 9 months ago

    I'm so confused about how a computer program can cause that much financial destruction. It doesn't even seem real.

  • @madumlao
    @madumlao 4 months ago +1

    This article is kinda sus. The whole incident is being used to "sell" the concept of DevOps and continuous delivery, as if it was the "technicians'" (classic sysops practitioners) fault that the company went bankrupt. "Hey guys, if only you had automated your delivery, you wouldn't have gone 400 million dollars into debt!"
    The reality is, it's obvious that the software culture of that organization was painfully unhealthy. No amount of DevOps practice would save it, because they'd just end up making reliable, repeatable deployments that lost hundreds of millions of dollars.
    It is not sysops fault that
    - they used bad practice in repurposing an old flag
    - they refactor their code halfway
    - they have no ability to sanity check their deployment
    - they have no ability to verify the deployed version
    - they have no ability to turn off the said system
    The biggest red flag really is that this is a trading company with hundreds of millions in cash and apparently the people budget of their CORE SERVICE isnt big enough for a second guy. I wouldn't be surprised if these trader techbros assigned the intern to do it. Their whole job is to game the money market, and when their game is based on bad (company) mechanics its the "technicians" fault because they didnt "follow latest devops" or something. Yeah, no.

  • @MickDavies
    @MickDavies 1 year ago +1

    Love the Slow Clap, totally stealing that terminology for A/B percentage rollouts

  • @thewiirocks
    @thewiirocks 1 year ago +1

    Oh good grief. You know how there's Development and Operations? DevOps is the practice of Development and Operations working together. Techniques like CI/CD are methods by which Development and Operations can work together. *Anything else* is FAKE. Especially if you have a separate "DevOps Team".
    This isn't even a secret. Most companies are doing some bullshit they "call" DevOps but has nothing to do with DevOps. Of course it's FAKE!

  • @RenanTraba
    @RenanTraba 1 year ago +1

    I'd add a second issue: don't repurpose flags. With a new flag, the old code wouldn't have triggered, or it might even have crashed instead.

  • @Huntertje13
    @Huntertje13 1 year ago +4

    This seems related to a lack of controls like alerts / logging. Automated deployment will fix copy paste errors but not necessarily mitigate. Code will go wrong, it matters how quickly you can fix it.

  • @UNgineering
    @UNgineering 9 months ago

    I've listened to Power Peg explanation twice and still couldn't concentrate on the meaning.
    I'm naming my next project "Power Peg", it also has an excellent abbreviation.

  • @teejaded
    @teejaded 1 year ago +5

    Manually deploying code to servers isn't devops. It's just ops. Write some shit to automate that and run tests on results then we can talk.

    • @vikramkrishnan6414
      @vikramkrishnan6414 1 year ago

      Manually deploying code in 2012 isn't even ops, it is pure mental illness. At this time, CFEngine, Puppet and Chef were all being widely used by "2 guys and a laptop" style startups

    • @hg-sx5nk
      @hg-sx5nk 1 year ago +2

      Agreed. Also, I find it really hard to believe they were using the term "DevOps" at that company, or in most of finance IT, back in 2012.

  • @e_rebus
    @e_rebus 3 months ago

    There is always a kill-switch. It's called pulling the plug. If there's no plug, grab the scissors.

  • @Olodus
    @Olodus 6 months ago

    The real heroes of this story are the previous engineers who named the replaced software Power Peg, thereby setting up the perfect cherry on top of this masterpiece of ridiculous f-ups.

  • @BeamMonsterZeus
    @BeamMonsterZeus 1 year ago

    That's one of the best opening salvoes I've seen aimed at the amalgam known as YT comments

  • @chris3079
    @chris3079 1 year ago +3

    I traded that day. We were getting information from floor traders about what was happening; it was one of the craziest days I ever traded. It started even before the open. We had guys locked out because of mad margin calls on the system, $50 stocks showing for dollars, etc. It was just constant selling or constant buying, and watching it on Level 2 was crazy. The only other time I've seen that was on purpose, when MS tried to support the FB IPO, but that was supporting; this algo was selling or buying, in 75 stocks and in any direction. Just crazy. It was notable because it was the same MM and usually the same amount: just blocks of stock for sale or being bought, in the same amounts.

    • @vikramkrishnan6414
      @vikramkrishnan6414 1 year ago +2

      Question: how did this impact other traders? Did you guys have to trigger internal circuit breakers due to random price changes?

    • @dami970
      @dami970 1 year ago

      @vikramkrishnan6414 I'm curious too

    • @chris3079
      @chris3079 1 year ago

      @vikramkrishnan6414 It caused one guy to have a profit of $2 million or more, later down to about $400k. A lot of systems trade strategies outside the normal trading bands, so they did VERY well. But lots of those trades were canceled, so they didn't net what they first thought. We had human compliance, who were more confused: why is every account showing a huge negative margin balance, lol. But margin usually gets sold by end of day, so for retail it was nothing, as it was over quickly. But it was crazy. I had good intel, so we were able to profit. Intel as in we knew it wasn't geopolitical etc.; we knew who and why-ish. The intel was real time, not something in advance. It took us maybe 5 minutes or less to know what was happening, which was very quick.

    • @steffenbendel6031
      @steffenbendel6031 4 months ago

      And this was slow enough that you could watch it. Most flash crashes happen in seconds.

    • @chris3079
      @chris3079 4 months ago

      @vikramkrishnan6414 The traders who ran strategies with out-of-market orders killed it. But a lot of trades were just canceled.

  • @gbb1983
    @gbb1983 1 year ago +2

    "TerraForm deez nutz" right off the bat lmao

  • @avi7278
    @avi7278 5 months ago +1

    "pay somebody to automate it!" -- you mean like a devops engineer? 😂😂😂😂

  • @TueChristensenDK
    @TueChristensenDK 5 months ago

    Come on, a couple of things here:
    1. Deployment day and no one cares to check that everything seems OK (I mean, it's high-frequency trading, not your personal MySpace page)
    2. They took down the servers that were deployed correctly, but not the faulty one - how?
    3. Delete old code
    4. They re-used the flag that was used in the power peg code - why and couldn't you find a new better name?!
    5. You didn't check that a counter for the parent order existed? That seems like a reasonable edge case unittest

  • @earthling_parth
    @earthling_parth 1 year ago

    This was one of the funniest thing I have come across in 2023. Hilarious and scary.

  • @nekoill
    @nekoill 1 year ago

    I like how every other day Prime seems to wake up and choose violence against me specifically

  • @arsvi123
    @arsvi123 1 year ago +1

    Imagine having your company destroyed by something called the 'power peg'

  • @Machtyn
    @Machtyn 8 months ago

    I became an SRE last year. Never heard of the position before. It didn't take me a month to hate it. It took me 9 months to finally get moved back to SDET.

  • @vtduch
    @vtduch 1 year ago

    I am subscribed to the man, who introduced canary deploy before it even was mainstream 👏👏👏👏

  • @jglaab
    @jglaab 1 year ago

    The devastation when it went from a one-server pegging to an all-hands-on-deck 8-server pegging

  • @TheLummen.
    @TheLummen. 1 year ago +1

    I enjoy the informative content and comedy !

  • @RajinderYadav
    @RajinderYadav 1 year ago

    I remember when this happened, a brand new High Frequency Trading shop that blew up on day 1.

  • @ray-mc-l
    @ray-mc-l 1 year ago

    your best video title, had to see this

  • @akam9919
    @akam9919 1 year ago

    0:19 A wild arch user appears!
    Wild arch user used "I USE ARCH BTW!"...
    It was completely expected!

  • @patrickrobertshaw7020
    @patrickrobertshaw7020 6 months ago

    This development process was an obvious ticking time bomb. It went off in 2012, but if it didn't then, it would have sometime between then and now

  • @MartinMaat
    @MartinMaat 1 year ago

    I am browsing jobs right now and just encountered one that asks for experience with an array of deployment technologies and products. This tells me a lot.
    If you read the ads for Terraform/K8s and such, they tell you it is supposed to automate deployment and make it easier and more robust. If that were true, why do I see so many vacancies for people who are supposed to manage this? Regular CI/CD is never a sole requirement; it is something developers do on the side, spend a couple of days on, and then hardly ever look at again. This IaC rage seems to be turning into an industry of its own.
    It's a good story: Centrally control everything! Imagine the power! 🤣

  • @a.m.4154
    @a.m.4154 1 year ago +2

    This is why I adopt the defensive programming paradigm. The only exception is when I know every single thing that will happen in the system (which for a complex system is almost certainly impossible).

    • @danwilson5630
      @danwilson5630 1 year ago

      Any resources to start with this?

    • @diogoantunes5473
      @diogoantunes5473 1 year ago +1

      @danwilson5630 Start by building a bunker under your cave, buying canned food and barrels of whiskey.
      Oh and of course guns. Lots of guns.

    • @cornoc
      @cornoc 1 year ago

      @@diogoantunes5473 and the best defense is a good offense, so you'll really need those guns

  • @grimsas
    @grimsas 1 year ago

    That was probably the best bed time story I've ever heard.

  • @sck3570
    @sck3570 1 year ago +1

    Did Tom write the *Power Peg* functionality?