Crowdstruck (Windows Outage) - Computerphile

  • Published: 24 Nov 2024

Comments • 1K

  • @james_chatman
    @james_chatman 4 months ago +932

    I got dragged into this and I'm now at 48 hours of overtime. Thanks CrowdStrike.

    • @jklax
      @jklax 4 months ago +5

      @Nigelfarij I was about to say

    • @FrietjeOorlog
      @FrietjeOorlog 4 months ago +59

      @Nigelfarij Tell that to the taxman.

    • @sunefred
      @sunefred 4 months ago +7

      That's crazy. What's your patch rate per hour? How many machines?

    • @Artifactorfiction
      @Artifactorfiction 4 months ago

      Ujjjj😢😢😢😢😢😢😢😢😢😢😢

    • @rationalbushcraft
      @rationalbushcraft 4 months ago +16

      Did you guys get the USB tool Microsoft created to automatically fix it? What is cool is that the WinPE USB drive just boots into safe mode and runs the repair.cmd file it creates. I am keeping this, as it will be easy to change that batch file and have it do other things in the future if I want to.

  • @luicecifer
    @luicecifer 4 months ago +775

    "Well, well, well. Tell me, young gentlemen, why is it always you two when something bad happened??"

    • @throwaway6478
      @throwaway6478 4 months ago +19

      Because we rule the world, and a one in a billion chance is next Tuesday for us.

    • @SubTroppo
      @SubTroppo 4 months ago +4

      I am reminded of Cheech & Chong, - but high on technology. I mean man, what can you do?

    • @reallyWyrd
      @reallyWyrd 4 months ago +6

      "It's a gift." -- the 4th Doctor

    • @Nicolas-L-F
      @Nicolas-L-F 4 months ago

      Well put

    • @nahco3994
      @nahco3994 4 months ago +6

      That's a bit unfair, isn't it? Crowdstrike managed to crash tons of Linux systems with the exact same software this April. Same software (Falcon), same problem (kernel panic). Only nobody made a big deal about it back then. Dr. Begley even mentions it briefly in the video.

  • @leighhaynes
    @leighhaynes 4 months ago +224

    McAfee did something similar several years ago. A bad definition quarantined core system files. The McAfee CTO from that era is now CEO at Crowdstrike.

    • @somethinglikethat2176
      @somethinglikethat2176 4 months ago +74

      To borrow a comment from elsewhere "real men test in production on a Friday"

    • @acrazydurian
      @acrazydurian 4 months ago +18

      A fine example of "failing up"

    • @alvintollah
      @alvintollah 3 months ago +7

      1 time is a mistake to be learned from. 2 times are a pattern of behaviour, signalling deeper flaws.

  • @TheAnonymmynona
    @TheAnonymmynona 4 months ago +301

    So there were 3 separate failures from CrowdStrike.
    1. The kernel driver didn't have proper input validation.
    2. The channel file was broken.
    3. The testing was so abysmal that they didn't notice before sending the update out to customers.
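
Point 1 can be illustrated with a minimal sketch of validating a data file before acting on it. The toy file layout (a 2-byte count followed by 1-byte indices) and the names `parse_channel_file`, `InvalidChannelFile` and `TABLE_SIZE` are hypothetical, invented for illustration; this is not CrowdStrike's actual format or code.

```python
import struct

TABLE_SIZE = 20  # hypothetical: number of slots the consumer actually has


class InvalidChannelFile(ValueError):
    """Raised when a file fails validation, instead of crashing later."""


def parse_channel_file(data: bytes) -> list[int]:
    """Parse a toy file: 2-byte little-endian count, then that many 1-byte indices."""
    if len(data) < 2:
        raise InvalidChannelFile("file too short for header")
    (count,) = struct.unpack_from("<H", data, 0)
    body = data[2:]
    # The header must agree with what is really in the file.
    if count != len(body):
        raise InvalidChannelFile(f"header claims {count} entries, found {len(body)}")
    # Every index must point inside the table before anyone uses it.
    for position, index in enumerate(body):
        if index >= TABLE_SIZE:
            raise InvalidChannelFile(f"entry {position} has out-of-range index {index}")
    return list(body)
```

The point of the design is that a malformed file produces a rejected update and an error report rather than an invalid memory access at boot.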

    • @torbjornlindh5108
      @torbjornlindh5108 4 months ago +38

      It’s quite scary that they get their kernel driver signed, despite it not meeting the standard of validating all input! That’s a systemic problem with their entire solution! (Well, so is the third, but testing is not how you build quality into the system, so I think the first is the fatal flaw.)

    • @jbird4478
      @jbird4478 4 months ago +33

      4. They didn't even notice that every client that updated went down, or at least they didn't respond. How that is even possible is beyond me. Their entire product is based on monitoring systems, but it took them hours to respond, and that was after Google had called them out for the chaos everywhere.

    • @SkandiaAUS
      @SkandiaAUS 4 months ago +16

      I think #3 is the worst and why their share price is tanking. Such an utter lack of responsibility to Yolo this into prod.

    • @ReverendTed
      @ReverendTed 4 months ago +18

      It does call into question the WHQL testing that allowed the driver to be signed, which does push some degree of responsibility back to Microsoft.

    • @jimfoye1055
      @jimfoye1055 4 months ago +4

      @@ReverendTed Bingo.

  • @IstasPumaNevada
    @IstasPumaNevada 4 months ago +74

    "As I said online, you should just go outside and enjoy the sunshine."
    Okay, but what are people in the U.K. supposed to do?

    • @QuantumHistorian
      @QuantumHistorian 4 months ago +16

      Shots fired. But not seen in the UK, because of the dense cloud cover.

    • @blucat4
      @blucat4 4 months ago +1

      😄

  • @wcmatthysen
    @wcmatthysen 4 months ago +175

    The problem is rolling out an update (that might not have been tested so well) TO EVERYONE ON THE PLANET AT THE SAME TIME. I can't believe CrowdStrike is operating like this. If you did a phased roll-out to a couple of smaller customers initially, and then monitored whether the updates had any glaring issues, this whole situation could have been averted.
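
A phased roll-out of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `staged_rollout`, the `deploy` callback (which returns whether a host is healthy after updating) and the 1% failure threshold are all hypothetical, not any vendor's real mechanism.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # hypothetical threshold
) -> list[str]:
    """Deploy ring by ring and stop once a ring's failure rate exceeds the threshold.

    Returns the hosts that were updated successfully.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the bad update
    return updated
```

With rings ordered from a small canary group up to everyone, an update that breaks every machine takes down only the first ring, because the loop stops before touching the rest.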

    • @ChrisM541
      @ChrisM541 4 months ago +26

      That's the nuts & bolts of it. Zero QC/QA before release. In an unregulated industry, this is damningly the norm.

    • @lever2k
      @lever2k 4 months ago +9

      I can't believe huge customers don't have a tiered approach to allowing patches to be deployed.

    • @Jai-xj7vy
      @Jai-xj7vy 4 months ago +6

      @@lever2k What company do you work at that tiers endpoint protection updates? Never heard of such a thing. CrowdStrike may not even offer that capability.

    • @rolfs2165
      @rolfs2165 4 months ago +8

      @@lever2k That's assuming the software even allows tiered deployment and doesn't expect _everything_ (including the main server) to be working on the same version - and any machine that isn't updated yet can only connect to update.

    • @TjPhysicist
      @TjPhysicist 4 months ago +10

      @@lever2k Based on what I've been hearing from others online: a lot of companies **do** have a tiered approach for updates, including CrowdStrike, but this update, deemed by CrowdStrike to be very critical, ignored ALL such settings and was deployed unilaterally to everything.

  • @oourdumb
    @oourdumb 4 months ago +404

    The real worry is the lack of QA at Enterprise companies. A state actor infiltrating one of these orgs would be absolutely devastating.

    • @SuperWolfkin
      @SuperWolfkin 4 months ago +44

      The real issue and worry is a monoculture. This sort of problem will always happen. Someone is always going to be affected, and there's always going to be a cohort of people who are unfairly affected by things that are out of their control. The problem is that the cohort here happens to be extremely big, because there's a monoculture of this type of software. Monopolies lead to monocultures, and monocultures lead to unique weaknesses. This unique weakness was able to take out millions of computers all around the world because everyone was using this software. We need more companies in this space. Even now, after this has happened, everyone basically has to look to CrowdStrike because that's who everyone uses. It sounds like there's no competitive alternative.

    • @vincei4252
      @vincei4252 4 months ago +2

      It has been and still is devastating. We didn't need the bogeyman to show this.

    • @BongoBaggins
      @BongoBaggins 4 months ago +5

      If you can think of it, someone has already done it.

    • @NoahSpurrier
      @NoahSpurrier 4 months ago

      There are probably already some bad actors out there. Just look at the catastrophic instances of espionage inside the CIA. See Robert Hanssen and Aldrich Ames.

    • @sandwich2473
      @sandwich2473 4 months ago +6

      Agile!!!!!!!!!
      I love Agile development practices!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  • @solimm4sks510
    @solimm4sks510 4 months ago +431

    Heh the BSOD at 0:40 is cool
    "For more information about this issue and possible fixes, do not ask us"

    • @DailyFrankPeter
      @DailyFrankPeter 4 months ago +38

      But it's about as helpful as a genuine one!

    • @T_GingerDude5416
      @T_GingerDude5416 4 months ago +26

      also LEET% complete

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 4 months ago +8

      @@T_GingerDude5416 All hail 1337!

    • @telebubba5527
      @telebubba5527 4 months ago +3

      Haven't come across that for years. Had totally forgotten what it looks like.

    • @crazymonkeyVII
      @crazymonkeyVII 4 months ago +1

      Could've been a genuine message from M$ then!

  • @adityavardhanjain
    @adityavardhanjain 4 months ago +60

    I was waiting for this video with extreme excitement for the last 2 days. I jumped on YouTube as soon as I saw the notification.

  • @BruceAngus
    @BruceAngus 4 months ago +35

    I was stuck in Atlanta's airport because of this. It was absolute madness, and everyone that talked about it, either from the airline or passengers, said it was a Microsoft issue. That's all most people are going to remember.

    • @0LoneTech
      @0LoneTech 4 months ago +7

      That's not entirely wrong. Microsoft did bless this software with the privileges to do whatever it likes to the entire system. They're in turn blaming this on the EU, but the EU only mandated that they give security software access at the same level their own has; it's Microsoft's choice to make that access this risky. Then there's the trust placed in CrowdStrike; they were likely selected for being a known name, never mind that they ran a previous company into the ground in this particular manner. It's like the hotel manager decided to install an entry counter in their front door and nobody asked why it's also a guillotine.

  • @bilalsadiq1450
    @bilalsadiq1450 4 months ago +113

    If Dr Bagley and Dr Pound had a podcast, I'd definitely listen to them talk for hours lol.

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 4 months ago +9

      "The IT podcast with Bagley and Pound" Does that sound interesting to you?

    • @learningCodingWithMe
      @learningCodingWithMe 4 months ago

      @@paulmichaelfreedman8334 oh yeah it does

    • @Turbo3032
      @Turbo3032 4 months ago +4

      A Computerphile podcast as a sister podcast to the Numberphile Podcast would be amazing!

    • @whathappenedman
      @whathappenedman 4 months ago +3

      Fr. I like listening to them speak

    • @scottydawg1234567
      @scottydawg1234567 4 months ago +3

      @@paulmichaelfreedman8334 Yes, actually.

  • @jeraldbottcher1588
    @jeraldbottcher1588 4 months ago +9

    This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment, patch deployment, software deployment, the whole thing, on both workstations and servers. We tested our patches extensively before pushing them out to the entire population of the environment. This first included a sandbox environment, then a select user / system environment, and then we would stage our patches out over several hours, so that if something happened we could back out before catastrophe struck. And honestly, sometimes we would find problems with the patches, and we would be able to immediately stop, suspend and even back out.
    Yes, we would use 3rd party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix: normal business. We would never shirk our procedures to test first, then deploy. To me this is a total failure of IT governance and a failure to maintain standards. (IT governance is setting and maintaining standards and policies for the IT infrastructure.)

  • @LunarcomplexMain
    @LunarcomplexMain 4 months ago +233

    I swear this is only the beginning for tech companies that are losing valued senior staff over the many, many decades...

    • @DoubleOhSilver
      @DoubleOhSilver 4 months ago +30

      Honestly I see why. This career is mostly miserable and the pay seems to be going down.

    • @kaseyboles30
      @kaseyboles30 4 months ago +55

      Senior staff that in this case probably cautioned against allowing code to run in kernel space before it's tested on a test system, because that's a fast track to exactly what happened. Senior staff likely tired of their expertise being ignored by suits who cannot comprehend that anything outside their niche might matter.

    • @vincei4252
      @vincei4252 4 months ago

      Losing? They think they can do things cheaper elsewhere and AI can replace everyone. I wish them luck in the wars to come. Yes, this was a fun career, and all I've seen is degradation of quality of life on a massive scale, where everything is micromanaged by 100% non-technical types. I don't miss it at all.

    • @vincei4252
      @vincei4252 4 months ago +24

      @@DoubleOhSilver YouTube censored my comment. Wanted to say that I totally concur with the sentiment. Not only is it miserable, the hiring process that is adopted across the board seems to be nonsensical hazing rituals that do not map to real world problems or realistic development tasks and activities. The golden age is well and truly over.

    • @Abdega
      @Abdega 4 months ago +16

      Especially the ones who are losing senior staff who know the ins and outs of the product, and replacing them with “Business guy who does business things and doesn’t need to know how the technology works”

  • @vincei4252
    @vincei4252 4 months ago +200

    In the modern version of Battlestar Galactica, Admiral Adama absolutely refused to have Galactica networked to other systems and ships in the fleet because of the risks to their critical systems. Yet here we are, allowing a rootkit to operate unconstrained on millions of machines. Fun times ahead.

    • @MrJegerjeg
      @MrJegerjeg 4 months ago +2

      Wow, I thought exactly the same! 😃

    • @evannibbe9375
      @evannibbe9375 4 months ago +4

      A lot of the computers that businesses give out to employees (such as ATM screens and point-of-sale devices) are so cheap that they become completely useless without a network connection (like a Chromebook), and so the system was working “correctly enough” in that it detected a problem in those (theoretically) cheap end computers and cut them off from the network. The failure was that the wrong thing was found to be a threat, and all those end computers were cut off.

    • @rolfs2165
      @rolfs2165 4 months ago +2

      @@evannibbe9375 "Oops, it's all malware."

    • @thefrub
      @thefrub 4 months ago +18

      @@evannibbe9375 I'm amazed, literally everything you just said in that comment is wrong. It's like I just watched Calvin's dad explain computers

    • @ivonakis
      @ivonakis 4 months ago +1

      And kernel level anticheat is a thing ...

  • @era_s
    @era_s 4 months ago +37

    "If you put everything on the cloud, and then the cloud's not there, you've got nothing."

    • @kevinmcfarlane2752
      @kevinmcfarlane2752 4 months ago

      The clouds have multiple redundancies though, depending on how much the customer is willing to pay.

    • @tadeob_
      @tadeob_ 3 months ago +2

      What if the cloud and its redundancies were affected? 😮

  • @BigMcLargeChungus
    @BigMcLargeChungus 4 months ago +34

    I think it's important to point out that Crowdstrike did the same thing back in April but it affected Linux machines (causing kernel panic).

    • @Techmagus76
      @Techmagus76 4 months ago +10

      But there wasn't much talk about that, probably because nearly all distros have a rollback mechanism: you can boot a previous working kernel.

    • @heinzk023
      @heinzk023 4 months ago +3

      Maybe CrowdStrike's management thinks and acts like Boeing's?

    • @nosuchthing8
      @nosuchthing8 4 months ago +1

      Really??

    • @ChrisM541
      @ChrisM541 4 months ago

      And they've caused a massive c*ck-up a few years ago. Seems they are 'too big' to fail.

    • @sinaghaderi9184
      @sinaghaderi9184 4 months ago

      @@Techmagus76 Because no one installs an antivirus on Linux.

  • @piranniayt
    @piranniayt 4 months ago +132

    Perfect storm: no fuzz testing of the driver code, no staged deployment, no OS blue/green boot partition

    • @Ash_18037
      @Ash_18037 4 months ago +4

      No, not really. A perfect storm implies the issue was due to various timing / bad luck factors, i.e. it lessens the culpability of ClownStrike. Each of the issues you mention was just plain incompetence.

    • @baumkuchen6543
      @baumkuchen6543 4 months ago +1

      I am afraid there was no testing at all in this mess. Everything points to that...

    • @draoi99
      @draoi99 4 months ago +4

      Third Party apps operating in kernelspace... FFS

    • @colinhobbs7265
      @colinhobbs7265 4 months ago

      @@draoi99 All operating systems do this. If you are saying FFS about that, you don't know how computers work. Yes, including macOS.

  • @minxythemerciless
    @minxythemerciless 4 months ago +101

    The guilty in this instance are both CrowdStrike and their Customer Security Managers.
    CrowdStrike has a history of shipping stuff that breaks systems, most recently their Linux product.
    The Customers said: Yes CrowdStrike just put whatever you want on our systems without monitoring. And by the way, we have no adequate disaster recovery plan.
    As a corollary, letting CrowdStrike put stuff on your systems also allows bad people to compromise CrowdStrike and deliver unlimited hurt.
    If I was a baddie I'd spend my every effort to subvert CrowdStrike!

    • @ipadista
      @ipadista 4 months ago +3

      There will most likely be a lot of QA positions opening at CrowdStrike in the aftermath of this. Bad actors just need to get one of "their guys" in through that recruitment process.

    • @LimitedWard
      @LimitedWard 4 months ago +12

      @@ipadista I'd sooner expect more attorney positions to open up before QA

    • @justgame5508
      @justgame5508 4 months ago +2

      What an awful take

    • @haqvor
      @haqvor 4 months ago +6

      @@justgame5508 welcome to the corporate mindset. Protection against liability is more important than delivering a working product. Who do you think the company is prepared to pay the most, the lawyers or the engineers? That reflects how they value their respective services.

    • @jbird4478
      @jbird4478 4 months ago +1

      @@lintfordpickle Yeah, but when our security software screws up it will a) first crash the test machine which would block the rest from receiving the update, and b) if that somehow fails our system would allow us to reboot with a previous system snapshot. To see these massive and vital organizations not have _any_ backup plans while putting full trust in an external company is mind boggling.

  • @CheddarKungPao
    @CheddarKungPao 4 months ago +97

    When talking about this incident it's worth remembering that hospitals were affected and people may have died because of this. So it's all well and good to say that when everything goes down, you should go outside and touch grass. But we also need to think seriously about whether we're doing enough to ensure software safety. We take it way less seriously than, for example, car safety. When a new model of car comes out it has to go through all kinds of testing to ensure its safety. But we are doing nothing to ensure software safety; we are just 100% trusting the vendors. I've been a software engineer professionally for 25 years and have long thought that the current approach is madness, and incidents like this one only make me more sure that we need standards that all critical system software meets in its development, deployment and implementation.

    • @Nadia1989
      @Nadia1989 4 months ago +10

      Someone left a message in a Spanish dev stream saying their aunt had a miscarriage and couldn't be operated on because all the hospital computers had BSOD'ed. She had an emergency procedure hours later.

    • @SuperWolfkin
      @SuperWolfkin 4 months ago +13

      100% true. It's definitely a big deal that this incident took down not just school computers or corporate businesses but hospitals that need them to keep people alive. People were missing their medications; for some people, like me, missing medication means you end up throwing up for a couple of nights, but for other people the consequences can be much more dire.
      At the end of the day, as technology begins to run more and more of our lives, I do agree there's nothing you can do to prevent hospitals from being part of the affected class. These things will happen, and hospitals will be affected just like any other computerized business. The problem is that we don't need to have so many hospitals affected in a single incident. That is purely the result of a monoculture, which is the result of monopolistic practices, which is a result of the form of capitalism that we have in North America and its effects around the world.
      And that's just on a philosophical level, without even approaching all the specific problems that could have been prevented in this case.

    • @mohammednazir3249
      @mohammednazir3249 4 months ago

      bro is secretly working for the government

    • @jismeraiverhoeven
      @jismeraiverhoeven 4 months ago +10

      While I agree with your statement, digitalization also played a huge role in this. Nowadays everything needs to be "smart", even things where it doesn't make sense, like refrigerators. If those hospitals had alternatives to the computers they used (for example, paper copies of documents alongside digital versions), this would have hurt them far less. We are too dependent on digital computers.

    • @tyrand
      @tyrand 4 months ago +6

      Anyone using this horseshit on hospital computers needs sacking

  • @kaseyboles30
    @kaseyboles30 4 months ago +28

    The fix is simple: do not push untested code onto live systems where it will run as part of a must-run-to-boot, kernel-level driver. Run it on a test system first. And never trust a 'security company' who says you should do otherwise (except in rare cases, such as a very bad zero day being exploited, where it's a gamble either way). If they allowed this for a run-of-the-mill, non-emergency update, then they don't know cyber security and safety well enough to protect a home gaming system, let alone major systems. This goes past gross incompetence to the point where I wouldn't blame anyone for suspecting malice. Though I personally think it was "we don't screw up, we stop screw ups" level hubris.

    • @ChrisM541
      @ChrisM541 4 months ago +6

      EXACTLY!
      Unfortunately, this braindead policy of offloading all QC/QA onto the end user is being practiced by an increasing majority of devs... all thanks to/empowered by The Internet. Software development is the most uncontrolled, unregulated industry in existence. Governments MUST act... before it really is too late!

    • @haqvor
      @haqvor 4 months ago +4

      I quote Grey's law: "Any sufficiently advanced incompetence is indistinguishable from malice."
      It doesn't really matter if CrowdStrike did it out of malice or just cut corners to cheap out on development costs. They sell a product that is obviously not robust enough to be used on mission critical systems, and they have made the decision to risk their customers' business to make more money for themselves.
      In turn, Microsoft allows their OS to hard crash due to a faulty third party driver. That cannot be tolerated on mission critical systems, so a large part of the blame goes to them as well. The end users seem to be pretty naive as well; they have hopefully learnt the expensive lesson on how not to build infrastructure.

    • @BillAnt
      @BillAnt 4 months ago

      There's also a small chance that the files got corrupted during the transfer to a CDN which served the corrupted update to millions of computers. We shall see....

  • @wily_rites
    @wily_rites 4 months ago +25

    Software running in the kernel pretending to be a driver, when in reality it is a parser, what could go wrong?

  • @mfaizsyahmi
    @mfaizsyahmi 4 months ago +4

    Seeing two academicians discuss this issue is so refreshing. So many ideas thrown back and forth.

  • @Arthur-1337
    @Arthur-1337 4 months ago +169

    The frowny face is absolutely necessary

    • @user-yv6xw7ns3o
      @user-yv6xw7ns3o 4 months ago +4

      Yes I agree. Absolutely necessary, even if not strictly so :(

    • @ICanDoThatToo2
      @ICanDoThatToo2 4 months ago

      I dunno, I'm starting to like 😉👍

    • @phizc
      @phizc 4 months ago +2

      ​@@ICanDoThatToo2 any of these would work too:
      🤪 🤯 🥳 🥶 😱 💀 💩 🍐 🌋 🆘️ 🏳
      Or an animation:
      🤣
      😂 🔫
      😅 🔫
      🥺🔫
      🤯💥🔫
      🧠💀

    • @blucat4
      @blucat4 4 months ago

      If Mike Pound says it, it must be true. Therefore you are wrong! 😁

  • @daanwilmer
    @daanwilmer 4 months ago +5

    Thanks for being the first source I found that actually explains what CrowdStrike is and what went wrong here, and nice to hear some nuance and perspective as well.

    • @IceMetalPunk
      @IceMetalPunk 4 months ago

      If you want a little more detail: apparently, the definition file they pushed out left some index entries uninitialized, so some memory addresses that were meant to hold pointers ended up with junk data that, when dereferenced, pointed to invalid memory locations.

    • @Tahgtahv
      @Tahgtahv 4 months ago

      @@IceMetalPunk Thanks, this is the best explanation I've heard so far. IMNSHO, the software should have been written in such a way such that the definitions don't directly map to memory. Then when you create data structures in memory, they always point to something valid. But nobody asked me.
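
The design suggested in this reply, where values from a definition file never map directly to memory, can be sketched as an indirection through a checked lookup table. The handler names and the functions `resolve` and `load_definitions` are hypothetical placeholders; nothing here reflects how Falcon is actually written.

```python
from typing import Optional

# Hypothetical rule handlers known to the program.
HANDLERS = ["match_process_name", "match_pipe_name", "match_registry_key"]


def resolve(raw_index: int) -> Optional[str]:
    """Map an untrusted index from a definition file to a known handler, or None."""
    if 0 <= raw_index < len(HANDLERS):
        return HANDLERS[raw_index]
    return None  # junk input is rejected instead of being dereferenced


def load_definitions(raw_indices: list[int]) -> list[str]:
    """Build the in-memory rule list, skipping entries that do not resolve."""
    resolved = (resolve(i) for i in raw_indices)
    return [name for name in resolved if name is not None]
```

For example, `load_definitions([0, 2, 999, -5])` keeps the two valid entries and silently drops the junk ones; a real system would more likely log the bad entries or reject the whole file.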

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago

      @@Tahgtahv I think what you're talking about is Rust. But apparently there were numerous cracks in the program even before then that were caused by the same QA issues that caused this current crash; the crash was just everything finally falling apart.

  • @blenderpanzi
    @blenderpanzi 4 months ago +23

    Windows can in fact boot with the failing driver automatically disabled the next time, except for drivers that are marked as absolutely necessary for booting itself, and this driver is marked as such.

    • @irql2
      @irql2 4 months ago

      Nah, it wasn't marked as boot critical; common talking point though. Doesn't change anything though: unless you get to a desktop, Windows considers it a failed boot. Do that 3x and you end up in the recovery console.

    • @grokitall
      @grokitall 4 months ago +1

      @@irql2 Yes it was, but the decision as to whether it can be downgraded should be Microsoft's.
      Just because they want it to prevent booting if it cannot start does not mean that Windows cannot start without it.

    • @irql2
      @irql2 4 months ago

      @@grokitall Stop parroting talking points and go look at how the driver is configured in the registry. People are super confident about things and won't even verify when it's very easy to do.

    • @grokitall
      @grokitall 4 months ago +4

      @@irql2 According to retired Microsoft engineer Dave Plummer, they had it marked as boot critical, according to his sources.
      I have no reason to doubt his statement.
      Despite how unimpressed I am with various choices Microsoft has made, I have no reason to doubt the quality of their engineers. That is why I am sure they are capable of determining whether it is actually boot critical when the driver is being signed.
      I am also sure that they are capable of writing code which will use that determination to downgrade the driver and disable it if it is too broken to boot, and to check if it is stuck in a boot loop.
      For any OS, as long as you can get to startup and use the net, you can fix the driver with an update without having to manually log in to all the locked-down machines.
      The fact that they have not bothered to implement such a measure when this has happened before is disappointing.

    • @irql2
      @irql2 4 months ago

      @@grokitall Thanks for confirming you won't even go look and you'll just parrot whatever anyone says. David is wrong too, and he would admit it if he looked. We're human, it happens... He probably doesn't have a dump to go and check.
      And honestly, it doesn't matter.
      What's more concerning is how confidently wrong people are, and that they have no interest in learning anything that wasn't hand-delivered to them by some source they consider trustworthy. This is a huge problem, and our political climate is evidence enough of this.
      If you had asked "How do I verify this?", since you obviously don't know or even care to, I would have shared that information with you so that you could be more informed on the topic... but nah, Polly wants a cracker instead.
      For those that are interested in learning, csagent's Start value is set to 1, meaning it's just another driver; it's not special in regards to booting. If it were, you'd get a 7b on boot. This entire interaction is disappointing. What happened to the days when people went "Oh yeah? Show me"?
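
For readers who want to follow the "go look at the registry" suggestion: a Windows service or driver stores its start type in a `Start` DWORD under `HKLM\SYSTEM\CurrentControlSet\Services\<name>`. The sketch below only decodes the standard values; it takes the commenter's claim about csagent at face value and does not read any real machine. The function names are hypothetical.

```python
# Standard Windows service start types (the "Start" registry value).
START_TYPES = {
    0: "Boot (loaded by the boot loader)",
    1: "System (loaded during kernel initialization)",
    2: "Automatic (started by the Service Control Manager)",
    3: "Manual (started on demand)",
    4: "Disabled",
}


def describe_start(value: int) -> str:
    """Return a human-readable description of a Start value."""
    return START_TYPES.get(value, f"unknown ({value})")


def is_boot_start(value: int) -> bool:
    """Only Start == 0 marks a boot-start driver."""
    return value == 0
```

Under this decoding, a `Start` of 1 is a system-start driver rather than a boot-start one, which is the distinction the comment is drawing. On a Windows machine the value could be read with Python's `winreg` module, or simply inspected in regedit.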

  • @Moose_33
    @Moose_33 4 months ago +10

    Yesssssss, twas waiting for this. You beautiful channel you. The dynamic duo returns

  • @stco2426
    @stco2426 4 months ago +3

    Enjoyed this. Glad I watched the recent 'Dave's Garage' video where he explained the problem. Here I got a good understanding of the wider consequence management. Well worth watching both, I think.

  • @3Ppaatt
    @3Ppaatt 4 months ago +4

    Working for a Bank we had drills where we simulated losing our systems for a few hours and had to do everything (and I mean every conceivable thing we might be asked to do in a normal day) without any computers. Including driving physical records to central processing locations.

  • @WilliamLeeSims
    @WilliamLeeSims 4 months ago +106

    The CrowdStrike bug was what Y2K wished it could be.

    • @ZiggyGrok
      @ZiggyGrok 4 months ago +28

      Fortunately we fixed Y2K before it could cause this chaos. If we had done nothing, it would've been far far more devastating.

    • @davidmcgill1000
      @davidmcgill1000 4 months ago +4

      @@ZiggyGrok Y2K only affected those that were too lazy to add 2 more characters to their dates. If your code was vulnerable, it was terrible code to begin with.

    • @nosuchthing8
      @nosuchthing8 4 months ago +8

      The world was not as interconnected then too.

    • @AySz88
      @AySz88 4 months ago +10

      @@davidmcgill1000 You realize that non-programmers use two digits for years too? A lot of it was a (lack of) standards issue, not just code.

    • @davidioanhedges
      @davidioanhedges 4 months ago +15

      @@davidmcgill1000 Too lazy... No, they were using software originally designed when memory was small and expensive, and saving two characters per entry won them pay rises.
      There were huge and expensive efforts put in to check and update systems to get around the issues many years later, and so next to nothing happened, but that doesn't mean there wasn't a problem.
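
The two-digit-year problem discussed in this thread is easy to show in a few lines. This is an illustrative sketch only: the function names are made up, and the pivot of 50 used for "windowing" (one common remediation, which guesses the century from the two digits) is an assumption, since real systems picked their own pivots.

```python
def age_two_digit(birth_yy: int, current_yy: int) -> int:
    """Naive arithmetic on two-digit years, as many legacy records stored them."""
    return current_yy - birth_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing: treat 00..pivot-1 as 20xx and pivot..99 as 19xx."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return (2000 if yy < pivot else 1900) + yy


def age_windowed(birth_yy: int, current_yy: int) -> int:
    """Same calculation after expanding both years to four digits."""
    return expand_year(current_yy) - expand_year(birth_yy)
```

Someone born in '65 is 34 in '99 by the naive calculation, but in '00 the same code gives -65, while the windowed version gives 35. Windowing fixes the arithmetic without widening the stored field, at the cost of breaking again once dates fall outside the chosen window.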

  • @lenwe33
    @lenwe33 4 months ago +65

    13.37% complete... ISWYDT 🙃

    • @blackholesun4942
      @blackholesun4942 4 months ago +1

      What does that mean

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago

      @@blackholesun4942 I see what you did there

    • @playground2137
      @playground2137 4 months ago +6

      @@blackholesun4942 I am not sure which part you didn’t get. The custom blue screen of death (BSOD) is something they fabricated. 1337 is often used in gamer culture to mean LEET (or rather, elite), usually indicating something like highly skilled (a 1337 player, for instance). ISWYDT: I see what you did there. So it is used a bit ironically here, because it was of course not a skilled update. Hope that helps.

    • @jeremytrees7266
      @jeremytrees7266 4 месяца назад

      ​@@blackholesun4942 🏴‍☠️

    • @JonBrase
      @JonBrase 4 месяца назад +2

      @@playground2137 TBF, 1337 is specifically turn-of-the-millennium gamer culture (late Gen X, elder millennial). I'm not sure I've even seen younger millennials using it, let alone Gen Z.

  • @sunefred
    @sunefred 4 месяца назад +31

    Falcon uses definition files which are NOT part of the WHQL process, even though the Falcon driver itself obviously is! I don't know how this works on Linux or Mac, but maybe Windows driver makers should not be allowed to deliver _anything_ to the kernel that does not go through WHQL certification.

    • @roippi3985
      @roippi3985 4 месяца назад +15

      This is the part that’s wild for me. WHQL is supposed to be this Highest Level Of Scrutiny thing, and somehow WHQL reviewed this workaround to inject arbitrary runtime behavior without requiring WHQL recertification and said F It Ship It.

    • @IceMetalPunk
      @IceMetalPunk 4 месяца назад +11

      My only suspicion is that someone, somewhere thought requiring WHQL for definition files could delay definitions too long when new vulnerabilities are discovered and need to be monitored. Like, "if we do WHQL on every definition, by the time it gets released, so many people could be affected by this exploit!"

    • @sunefred
      @sunefred 4 месяца назад

      @@IceMetalPunk I think that's the reason, and I can't say I have any insights in the WHQL process to tell you how long the process normally is. Would be interested to know though, do you know? I would imagine most of it is automated.

    • @playground2137
      @playground2137 4 месяца назад

      Yeah that is an important part that they didn’t mention, I think.

    • @bierrollerful
      @bierrollerful 4 месяца назад +2

      Maybe definition files do not contain any code and are thus exempt from WHQL process? It could be that the definition file was simply corrupted and unreadable and the kernel driver crashed when trying to read it.

  • @Vospi
    @Vospi 4 месяца назад

    Very enjoyable format of two people discussing. Sounds less monotonous, too. Great job.

  • @PE4Doers
    @PE4Doers 4 месяца назад +5

    I am a recently retired Cyber Security Compliance Officer (though having been heavily involved in computer security for over 30 years, and a software developer for 20 years before that, I prefer the traditional names of Computer or Systems Security). Although the systems I monitored were involved with critical infrastructure and not open to regular users of business systems, they were still peripherally dependent on many such systems. Since I was a stickler for avoiding the Cloud and third-party security products, my former employer has taken steps to ensure I never know if they were severely affected by the CrowdStruck (accepting the pun) event.
    The real issue is something you two gentlemen mentioned but did not go deeply into. What if there were malicious embeds (i.e. spies) working for that organization, or for Windows System development? We would not be facing just a bad day or so; it could have been lights-out until every critical system was completely rebuilt and data backups restored. I can understand why discussion of that scenario would be avoided, but should it be avoided? If I were a critically ill patient in the hospital I would want to know so I could prepare for the aftermath.

  • @mythofechelon
    @mythofechelon 4 месяца назад +2

    As someone who led the deployment of EDR and EPP to 18,000+ endpoints last year, agents are absolutely installed on Windows servers, yes. Updates like this that don't go through change control are a calculated risk taken in exchange for more up-to-date protections. The problem is that the risk mitigation relies on the vendor testing and releasing competently.

  • @DragoniteSpam
    @DragoniteSpam 4 месяца назад +3

    A number of years ago Tom Scott did a fun talk called "Single Point of Failure." I think about that sometimes.

  • @vincentfiestada
    @vincentfiestada 4 месяца назад +1

    Finally, FINALLY, some informed and cogent commentary on this issue that isn't just "Tech influencer says Windows is a mess and this would never happen in Linux or macOS"

  • @m4rt_
    @m4rt_ 4 месяца назад +34

    The new update to CrowdStrike falcon included some corrupted channel files (they contained just zeroes instead of the intended data), and because the core driver that loaded the channel files didn't do enough input validation, it continued on using the messed up channel files, and this revealed a bug that likely had been there for a while. The bug caused the driver to attempt to dereference a null pointer, which caused the BSOD.

    • @David-bi6lf
      @David-bi6lf 4 месяца назад +7

      Yeah, and CrowdStrike probably hasn't fixed the bug because that would require a new release of the driver, which would have to go through the Microsoft WHQL signing process again, which is exactly what the use of these channel files seeks to avoid.

    • @MatthijsvanDuin
      @MatthijsvanDuin 4 месяца назад +6

      Note that this corruption claim is afaik coming from one random twitter user and has been denied by Crowdstrike who says there was a logic error in the updated rules file that caused the problem. It seems extremely unlikely to me that crowdstrike does no validation on these files given that they're being updated frequently on a huge number of machines and are therefore liable to get corrupted (due to power failures and such) on a regular basis.

    • @MatthijsvanDuin
      @MatthijsvanDuin 4 месяца назад +3

      I found a twitter post from someone that the problematic channel file was _not_ zero-filled on any of the systems he had to manually fix that day.

  • @tocsa120ls
    @tocsa120ls 4 месяца назад +19

    CrowdStrike did more harm to its clients, and to the Western world, than it could ever have possibly prevented for the entire duration of its existence as a company. How they ONLY lost 20% of their share value is mind-boggling.

    • @AlBoulley
      @AlBoulley 4 месяца назад +1

      Love the point you've made.

    • @nicostigliano6393
      @nicostigliano6393 4 месяца назад

      You said the most obvious thing

    • @Valgween
      @Valgween 4 месяца назад

      robot movie pfp

    • @tocsa120ls
      @tocsa120ls 4 месяца назад

      @@nicostigliano6393 nobody's saying it out loud tho

  • @eructationlyrique
    @eructationlyrique 4 месяца назад +26

    Linux has a feature that allows the sandboxing of channel updates using eBPF, although CrowdStrike doesn't use it yet. In theory, that could have prevented the BSODs had Windows had a similar feature.
    Also, I don't necessarily agree that Windows is blameless here. While CrowdStrike is definitely at fault, Microsoft did certify their driver, and that validation somehow didn't include testing for corrupted or invalid channel files. There's no reason the driver should blindly trust those files without validation.

    • @reybontje2375
      @reybontje2375 4 месяца назад +1

      Yeah, Microsoft also allows eBPF, but it's in an alpha, very early state. Also, the people opining that "this isn't a Windows' issue" are right to a degree, but when you realize that there are design deficiencies around how Microsoft handles drivers, it can only be said, "they're right to a degree," especially when you can specify kernel command line options to disable drivers that are acting bad, or have a fallback initramfs that doesn't load the CrowdStrike driver, which Windows doesn't really allow.
      I believe that CrowdStrike is also on the eBPF design foundation alongside some other industry giants like Apple, Google, Microsoft, etc. I think CrowdStrike also uses eBPF for Linux in their newer agent after the debacle back in March/April with Debian.

    • @JonBrase
      @JonBrase 4 месяца назад +2

      My understanding is that CrowdStrike does use some type of interpreted code in their definition files, which would imply that there was some bug in the interpreter (or code downstream of it) that allowed a null-pointer dereference through (or made a null pointer dereference on its own).
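
      A rough illustration of that idea, as a Python sketch and not anything resembling CrowdStrike's actual driver: the opcodes, the rule layout, and the `run_rule` name are all hypothetical. The point is only that an interpreter for rule bytecode should check every index and operand it is handed before using it, and reject the rule instead of reading memory it does not own.

```python
from typing import List, Sequence, Union

# Hypothetical opcodes for a toy rule language (not a real format).
OP_LOAD_FIELD = 0x01    # operand: index into the fields table
OP_LOAD_PATTERN = 0x02  # operand: index into the patterns table
OP_CONTAINS = 0x03      # pop pattern, pop value, push (pattern in value)
OP_HALT = 0xFF          # stop and return the boolean verdict on the stack


class RuleError(Exception):
    """Raised for a malformed rule so the caller can skip it safely."""


def run_rule(code: bytes, fields: Sequence[str], patterns: Sequence[str]) -> bool:
    stack: List[Union[str, bool]] = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op in (OP_LOAD_FIELD, OP_LOAD_PATTERN):
            if pc + 1 >= len(code):
                raise RuleError("truncated instruction")
            index = code[pc + 1]
            table = fields if op == OP_LOAD_FIELD else patterns
            # In C, skipping this check would be an out-of-bounds read.
            if index >= len(table):
                raise RuleError(f"index {index} out of range")
            stack.append(table[index])
            pc += 2
        elif op == OP_CONTAINS:
            if len(stack) < 2:
                raise RuleError("stack underflow")
            pattern, value = stack.pop(), stack.pop()
            if not (isinstance(pattern, str) and isinstance(value, str)):
                raise RuleError("CONTAINS needs two strings")
            stack.append(pattern in value)
            pc += 1
        elif op == OP_HALT:
            if len(stack) != 1 or not isinstance(stack[0], bool):
                raise RuleError("rule did not produce a verdict")
            return stack[0]
        else:
            raise RuleError(f"unknown opcode {op:#x}")
    raise RuleError("missing HALT")
```

      A caller would wrap `run_rule` in a `try`/`except RuleError` and log the bad rule rather than let it take the host down. An all-zero buffer, for instance, is rejected at the first byte as an unknown opcode.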

    • @TheFPSPower
      @TheFPSPower 4 месяца назад +5

      @@reybontje2375 Windows does have self-recovery functions for bad acting drivers, but they do not work on boot drivers and Crowdstrike's driver is a boot driver so the system is not allowed to boot if it crashes by design unless you use safe mode.

    • @JonBrase
      @JonBrase 4 месяца назад +1

      @@forbidden-cyrillic-handle Lol. Your username.

    • @sinaghaderi9184
      @sinaghaderi9184 4 месяца назад

      But who would install this on Linux? I have never seen a Linux server with anti-virus or EDR. It sounds dumb.

  • @Scum42
    @Scum42 4 месяца назад +1

    Every time there's some outage, or bug, or virus big enough to get in the news, I get excited about the inevitable computerphile video explaining it.

  • @zhandanning8503
    @zhandanning8503 4 месяца назад +19

    when the computer goes down, that is a sign to photosynthesize, nice

    • @Abdega
      @Abdega 4 месяца назад +1

      It’s thunderstorming where I’m at so I’d have to wait

  • @jjdawg9918
    @jjdawg9918 4 месяца назад +2

    I can't find one RUclipsr talking about proper sysadmin practices at the enterprise level that would have caught this before it got rolled out. I have never worked at a company where PCs weren't locked down from software installs and every update (even ones from MS) wasn't tested by local QA before being rolled out to the enterprise PCs. Unbelievable that airlines are being run this way. Unless CrowdStrike installed some rootkit that bypasses all these processes, I'm shocked at the state of sloppiness in IT.

    • @egria
      @egria 4 месяца назад +1

      I am trying to voice the same thing, but not even tech guys understand. CS Falcon updates bypass everything, but I still don't understand how admins allow live updates on supposedly closed systems like airports, banks, POS etc. And the loophole seems to be the same Windows update server being used for both live and testing, or just a plain network connection to the outside world so that CS Falcon updates can prevent zero-day security issues. It is just absurd!

  • @Ny_babs
    @Ny_babs 4 месяца назад +27

    My local pub went down.. no fish and chips for me..

    • @jklax
      @jklax 4 месяца назад +3

      No cash in hand?

    • @Abdega
      @Abdega 4 месяца назад +20

      “This was a phishing attack and a chip level attack?”
      “No, no… the cash register system is down thanks to broken Windows update”
      “They broke your windows and stole your cash?!”
      “No, the money is still here!”
      “Okay, I’ll just pay you in cash then”
      “I can’t do that! The register is locked unless the computer tells it to open! Besides, each purchase is required to update the inventory as well”
      “I don’t see what the Tories have to do with anything in this case”
      “… I don’t have time for your Monty Python shenanigans”
      “I’d think this stuff would be programmed in C and not Python”
      “GET OUT!”

    • @paulmichaelfreedman8334
      @paulmichaelfreedman8334 4 месяца назад +4

      @@Abdega 😂

    • @KarimY-119
      @KarimY-119 4 месяца назад

      in my local pub i can order by sending a SMS to their fax. cash-only place

    • @dhillaz
      @dhillaz 4 месяца назад +5

      ​@@Abdega When the best comment is buried in a thread

  • @TS6815
    @TS6815 4 месяца назад +1

    These IT disasters always have the upside of flushing Dr Pound and Dr Bagley out of whatever else they’re up to, to give us these great explanations!

  • @bbellefson
    @bbellefson 4 месяца назад +21

    Typical "Management Bug?" A CrowdStrike engineer or two urges more testing before release. Some executive then pounds the conference table and shouts, "No more f**king EXCUSES! I want that update NOW gawdammit!"

    • @wcmatthysen
      @wcmatthysen 4 месяца назад

      Yeah, and I want it rolled out to everyone, NOW!!! Phased roll-outs are for pussies!

    • @aixtom979
      @aixtom979 4 месяца назад +13

      Especially seeing that the CEO of CrowdStrike *now* was the CTO at McAfee back *then*, when McAfee brought down XP machines by deleting Windows core files in 2010. The common factor is the manager.

  • @HopliteSecurity
    @HopliteSecurity 4 месяца назад +1

    Computerphile is amazing!
    I love your content and calm but casual demeanor. Your explanations and ability to break things down is superb!
    Keep it up 🙏🙏🙏🙂❤️

  • @lis6502
    @lis6502 4 месяца назад +7

    Crowdstruck? We gave this overtime event a codename of 'clownstrike'

  • @lachlantula
    @lachlantula 4 месяца назад +1

    that os/house/hotel analogy was really good!

  • @phasm42
    @phasm42 4 месяца назад +17

    Crowdstrike sounds like a nickname for Mustangs 😅

  •  4 месяца назад +1

    Oooh boy, you guys are back. Finally!! ❤

  • @rooboy69
    @rooboy69 4 месяца назад +3

    CrowdStrike didn't do any validation (or not enough) in their driver to check the .sys file before running it, to confirm it wasn't just full of null values etc.

  • @alexandrecolautoneto7374
    @alexandrecolautoneto7374 4 месяца назад +2

    13:06 totally agree, we just need to develop our own technology. But we see how the US monopolizes every technological area and bans any real competitor...

  • @sunefred
    @sunefred 4 месяца назад +3

    It's going to be very interesting to see what CrowdStrike learns from this. One thing they didn't seem to use is a canary or blue/green deployment scheme. Hoping for some enlightening blog posts on the topic eventually.

    • @vincei4252
      @vincei4252 4 месяца назад +4

      nothing. The guy in charge oversaw something exactly similar when he was at McAfee

    • @spartanj2957
      @spartanj2957 4 месяца назад +1

      Microsoft,CS,Black rock the WEF and more are tied together .was no accident

  • @HubrisInc
    @HubrisInc 4 месяца назад +2

    Never fails, something big happens in the field of cybersec, we can guarantee that we'll get a Computerphile video starring Dr Bagley &/or Dr Pound :)

  • @tubehellcat
    @tubehellcat 4 месяца назад +5

    😂 the example bluescreen at around 0:36 , 13.37% 😂 love it 😁

  • @DamonWakefield
    @DamonWakefield 4 месяца назад +2

    I'VE BEEN WAITING FOR THIS!!!

  • @pnwlady
    @pnwlady 4 месяца назад +3

    Are there no standards for deploying updates that run in the kernel?

  • @paultasker7788
    @paultasker7788 4 месяца назад

    Finally, a really good explanation of crowdstrike and what it does and what went wrong.

  • @TechSY730
    @TechSY730 4 месяца назад +3

    UPDATE: Thanks to tma2001 for letting me know the zero file was not the cause. In fact there is validation in place; the error was somewhere else.
    So the below is inaccurate.
    Seems it was a lack of input validation.
    Apparently the root cause of the crash was that one of the files in the definition update was just a file filled with zeros for whatever reason, leading to a null pointer dereference (which always crashes, by design).
    But that makes me go like: input validation, anyone?! Does CrowdStrike Falcon fail to at least make sure the definition file makes sense as a definition file before blindly following its directions?

    • @necuz
      @necuz 4 месяца назад +1

      Everyone who is even remotely competent knows to put headers on files, network packets and the like. A magic byte or two and some metadata goes a long way when validating.
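
      As a minimal Python sketch of what such a header check can look like: the `CHNL` signature, the field layout, and the function names below are made up for illustration and are not CrowdStrike's real channel-file format. The loader refuses anything whose magic bytes, version, declared length, or CRC32 do not match before it looks at the payload at all.

```python
import struct
import zlib

MAGIC = b"CHNL"                    # hypothetical 4-byte signature
SUPPORTED_VERSION = 1
# Little-endian header: magic, version, payload length, CRC32 of payload.
HEADER = struct.Struct("<4sHII")


def pack_channel_file(payload: bytes) -> bytes:
    """Build a file with a header describing the payload that follows it."""
    header = HEADER.pack(MAGIC, SUPPORTED_VERSION, len(payload), zlib.crc32(payload))
    return header + payload


def parse_channel_file(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the file is not trustworthy."""
    if len(blob) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, version, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version {version}")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("declared length does not match payload")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload
```

      A zero-filled or truncated file fails at the first or second check. Note the limit of this technique: it catches corruption, not a well-formed file whose contents are logically wrong.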

    • @tma2001
      @tma2001 4 месяца назад +1

      no that was a red herring - for some people it wasn't all zeros and CS confirmed in a technical blog post that null bytes in the channel file were not the cause. There are many possible reasons why it was a file of zeros for some folks - pre-allocated ahead of time before updated or wiped clean as a post processing step for security.
      Valid channel files have a magic signature at the beginning and they actually contain code in the form of byte code for a VM interpreter in the actual kernel driver. The logic error was in the byte code. Of course this means the actual driver can have gone through WHQL but is actually a dynamic entity.

    • @TechSY730
      @TechSY730 4 месяца назад

      @@tma2001 Ooh, thanks for the correction. I hadn't heard any technical detail updates since the original 0'ed file finding

    • @tma2001
      @tma2001 4 месяца назад +1

      @@TechSY730 you were not alone - I too was confused by what little folks had to go on initially. None of it made any sense!
      There is a full explanation by the Cloud Architect B Shyam Sundar on the Medium website that breaks it down.

  • @cappaculla
    @cappaculla 4 месяца назад +1

    We can probably thank Dave Plummer for making sure guys like this actually know how to explain the issue.

  • @rodolphenemr9064
    @rodolphenemr9064 4 месяца назад +3

    Been waiting for this 🍿

  • @Asidders
    @Asidders 4 месяца назад +1

    I love listening to these engaged guys 😁

  • @rubenreyes2000
    @rubenreyes2000 4 месяца назад +17

    You didn't mention that in order to install kernel drivers, the code needs to be submitted to Microsoft to be tested, approved and digitally signed. As you mentioned, the bug was not present in the main kernel, but in the "channel files" that are updated without following that same process. It is not clear to me if those "channel files" are code or just configuration, but maybe Microsoft is partially at fault here for allowing these channel files in the first place, or for not sufficiently checking that the kernel driver had the necessary logic to fail gracefully without taking down the entire system.

    • @throwaway6478
      @throwaway6478 4 месяца назад +6

      Clownstrike apparently uses a P-code interpreter to sneak unsigned code into their driver. You'd be a millionaire by Saturday if you invented a heuristic that can reliably detect a P-code interpreter and/or the P-code itself (which of course can be in any format the writer desires) running in kernel mode.

    • @nosuchthing8
      @nosuchthing8 4 месяца назад +1

      As I understand it, if something fails in ring zero or kernel mode, the entire OS goes down.

    • @TheFPSPower
      @TheFPSPower 4 месяца назад +1

      @@throwaway6478 In this case it's not that hard, it's a new file getting loaded from system32, the kernel knows every file you open so you could absolutely block unsigned files in system folders from loading, but as they said it would interfere with competing products so they can't do that, they signed an agreement to allow kernel drivers to work.

    • @ChrisM541
      @ChrisM541 4 месяца назад

      There are exceptions to requiring to get your code MS Certified - code that needs to respond to Day 0 attacks don't need certified, for obvious speed reasons. Fortunately/unfortunately.

    • @irql2
      @irql2 4 месяца назад

      The "bug" was in csagent.sys; that's the driver that was referencing an invalid memory address. Important to note that.

  • @fatonaoladimeji9697
    @fatonaoladimeji9697 4 месяца назад +1

    I would have listened to these guys talk about it for an hour

  • @miravlix
    @miravlix 4 месяца назад +3

    Not seeing much understanding of administration. A system I was administering involved testing updates before they got installed on the live environment, and with this many computers you don't install an update on all of them at the same second: you install it in segments and don't continue until you have successfully restarted the first batch of computers.
    This is all about GREED in administration. They didn't want to pay for doing it properly. My way of administering was developed in the 19xx's; we have INTENTIONALLY dropped security to save money.
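
    The segment-by-segment approach described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: `staged_rollout`, `apply_update`, and `is_healthy` are made-up names, and a real deployment tool would add waiting periods, reboots, and rollback. The core idea is just that a later batch never starts until every host in the earlier batch has been confirmed healthy.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    batch_sizes: Iterable[int],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Update hosts in batches and halt at the first batch with an unhealthy host.

    Returns (hosts that received the update, whether the rollout completed).
    """
    remaining = list(hosts)
    updated: List[str] = []
    # After the explicit early batches, one final batch takes whatever is left.
    sizes = list(batch_sizes) + [len(remaining)]
    for size in sizes:
        batch, remaining = remaining[:size], remaining[size:]
        if not batch:
            continue
        for host in batch:
            apply_update(host)
        updated.extend(batch)
        # Health check (e.g. "did it come back after a restart?") gates the next batch.
        if not all(is_healthy(host) for host in batch):
            return updated, False
    return updated, True
```

    With ten hosts and early batch sizes of 1 and 3, a broken update touches one machine and stops, instead of reaching all ten at once.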

    • @egria
      @egria 4 месяца назад

      Yep, admin practices are the key, not a particular bug. Live updates in a closed system are a big NO, no matter what the sweet voice of the software vendor tells you. And the most common phrase nowadays is "it is for your security", be it for the people or the machines.

    • @egria
      @egria 4 месяца назад

      Some companies had staging environments, but they use the same Windows update server for both live and staging/testing, so this update just bypassed software-enforced policies and went live. Those are my speculations, gathered from admins sharing their cases. There is no in-depth public case analysis yet. Hush practice for reputation.

  • @PowerShellWizard
    @PowerShellWizard 4 месяца назад

    As an Ex MS employee and one that worked at Windows, I appreciate what was said at 7:42 :)

  • @michipeka9973
    @michipeka9973 4 месяца назад +7

    "Dave's Garage" a former microsoft software engineer just did a video about what he thinks happened about this. Very comprehensive and very clear.
    He also speaks extensively that this was possible because Crowdstrike works in kernel mode.

    • @murzilkastepanowich5818
      @murzilkastepanowich5818 4 месяца назад +2

      why would anyone want to watch that scammer?

    • @cidercreekranch
      @cidercreekranch 4 месяца назад +3

      @@murzilkastepanowich5818 WTF?

    • @michipeka9973
      @michipeka9973 4 месяца назад +4

      @@murzilkastepanowich5818 Sorry, I am not aware about any of that or don't even know what you are talking about. Just found about it yesterday, the video in question seems fine and basically makes some of the same points as this one, but is a bit more detailed.

    • @murzilkastepanowich5818
      @murzilkastepanowich5818 4 месяца назад +2

      @@cidercreekranch your wholesome 100 big le epic reddit content creator aint that wholesome 100 eh?

    • @Razzy_D9111
      @Razzy_D9111 4 месяца назад

      @@murzilkastepanowich5818 take your meds

  • @taz9609
    @taz9609 4 месяца назад

    I mean crowdstrike issues aside, WHAT A CHANNEL! Thank you

  • @spookycode
    @spookycode 4 месяца назад +3

    Honestly I would have called it crowdstroke :p

  • @ianflint4610
    @ianflint4610 4 месяца назад +1

    The wider issue is that, while Windows acts in a way to mitigate the consequences of a malicious act (which this failed update mimicked), there has seemingly been no thought given to how to manage, contain and recover from such a problem when it is happening at scale on massive numbers of endpoints at a very rapid rate. The 'infection' spreads far faster than it can be contained. Microsoft's kernel code policy on top of CrowdStrike's error has exacerbated the problem.
    The impact isn't a theoretical one, it is real with potentially life threatening consequences (like the Highways Agency being unable to control Smart motorways when their displays were not reflecting what signs were saying and they couldn't change them - that left people in Refuges being unable to rejoin live motorway lanes). It has exposed many weaknesses.

  • @SyphistPrime
    @SyphistPrime 4 месяца назад +10

    It also doesn't help that Microsoft took away the key combo to tell the OS to boot into safe mode on startup. If that was a thing I'm sure this would've been at least a bit smoother.

    • @throwaway6478
      @throwaway6478 4 месяца назад +3

      It amazes me how many of you don't know about bootmenupolicy legacy.

    • @SyphistPrime
      @SyphistPrime 4 месяца назад +4

      @@throwaway6478 because I don't specialize in the black box that is Windows. Also why should I have to dig through layers of archaic settings to change this when it's a sensible default?

    • @throwaway6478
      @throwaway6478 4 месяца назад +4

      @@SyphistPrimeYou use an operating system where you have to edit dotfiles to configure your mouse. 🤣

    • @irql2
      @irql2 4 месяца назад +4

      @@SyphistPrime oh stop it, you're not reading the source code for linux to figure out how something works, no one does that... you "can" do it, but thats not a thing an average person does. You're reading documentation just like people do with windows. Stop it.

    • @SyphistPrime
      @SyphistPrime 4 месяца назад +3

      @@irql2 The documentation on Linux is leagues better than Windows. There's so many undocumented and hidden features in Windows where as with Linux it's all out in the open. Also I have read bits of source code when AUR packages failed to compile. I've very much used that to help fix issues with PKGBUILDs and compiler errors. It's not usually necessary to read source code because all the documentation is out in the open, unlike Windows.

  • @MrSpeedyAce
    @MrSpeedyAce 4 месяца назад

    The #1 video I’ve been most looking forward to!!!

  • @lambda653
    @lambda653 4 месяца назад +9

    8:42 It can happen and indeed DOES happen on mac and particularly linux machines but the difference is those operating systems have safety mechanisms in place so that mass IT outages like the kind that just occurred can't fail to the point of individually booting every single device into safe mode and deleting a driver file. As you said, there was a kernel panic error on clownstrike's linux distributions, yet it didn't crash the world's infrastructure because the error was handled correctly. So microsoft should be at fault in some part for not providing these error handling systems.

    • @Formalec
      @Formalec 4 месяца назад +4

      This could be exactly as bad for linux machine if the driver is at ring 0.

    • @ipadista
      @ipadista 4 месяца назад

      @@Formalec the x86 family supports four rings, but for reasons Linux didn't continue the tradition used in VMS and some other contemporary mini computer operating systems, where kernel is ring 0, drivers are ring 1 and shared libraries are in ring 2. Choosing to do the same as NT did, skipping rings 1 & 2 only leaving kernel and user processes. Since essentially nothing uses more than ring 0 & 3 nowadays most new CPU designs only implement 2 rings

    • @JonBrase
      @JonBrase 4 месяца назад +1

      Linux allows you to specify a kernel command line from the bootloader, and you can blacklist individual drivers in the kernel command line, so recovery would be simpler.

    • @ipadista
      @ipadista 4 месяца назад +2

      @@JonBrase Same as with BSoDs, you would still need some techie typing in the fix at the Console. On cloud servers, it could be automated, same as with BSoD fixes, but I doubt it could be done on standalone machines

    • @genehenson8851
      @genehenson8851 4 месяца назад +2

      Mac has not allowed kernel level access since Big Sur.

  • @kentslocum
    @kentslocum 4 месяца назад

    This was a fantastic conversation! 😊

  • @NoahSpurrier
    @NoahSpurrier 4 месяца назад +3

    The cure was worse than the disease.

  • @m4rt_
    @m4rt_ 4 месяца назад +1

    "Anything that can go wrong will go wrong.."
    - Murphy's Law
    Another one I like is the variation of Murphy's law from Interstellar:
    "Anything that can happen will happen."

    • @ChrisM541
      @ChrisM541 4 месяца назад

      Murphy also says...
      "Remove QC/QA and you're f*d !!"

  • @paranic7
    @paranic7 4 месяца назад +4

    There is a bottle of water under the desk !

  • @daryx.langdale
    @daryx.langdale 4 месяца назад

    Big cyber-cockup
    (a beat)
    "Crowdstruck (Windows Outage) - Computerphile"
    ------
    Right on time, thank you

  • @akashaabeysundara8454
    @akashaabeysundara8454 4 месяца назад +13

    1:13 if that hotel is like linux then the guests would carry their own air conditioners 😂

    • @SanderEvers
      @SanderEvers 4 месяца назад +4

      and smart guests will build their own hotel next to the original, with only a small difference.

    • @davidioanhedges
      @davidioanhedges 4 месяца назад +4

      Linux can run CrowdStrike, and had a worryingly similar issue a few weeks ago, since it was in the kernel there was nothing Linux could do either... But only on a couple of distros and only if you had installed Falcon CS ...

    • @dhillaz
      @dhillaz 4 месяца назад +2

      Room key is not in the sudoers file. This incident will be reported.

    • @timsmith2525
      @timsmith2525 4 месяца назад +1

      And to get your room cleaned, the instructions would be, "Run make, look for any errors, and correct them."

  • @feldmanovitch
    @feldmanovitch 3 месяца назад

    Really amazing video, like always!

  • @stefanreindel9888
    @stefanreindel9888 4 месяца назад +11

    Wondering how it got past QA?
    Seems like installing the update on a docker instance or vm would have found this bug.

    • @ytechnology
      @ytechnology 4 месяца назад +6

      Also, how was rollout conducted? Normally it would be tiered / staggered to minimize damage from faulty code. I haven't found any confirmation, but this looked like a "big bang" release.

    • @Tahgtahv
      @Tahgtahv 4 месяца назад +4

      @@ytechnology It sounded like from the video, what they pushed out was definition files, and not code per se? Normally I would not expect that kind of thing to cause a kernel panic, so maybe they didn't either. Hopefully, this incident will make them take a hard look at how they do/deploy things in the future, no matter what it is.

    • @MrThebigcheese75
      @MrThebigcheese75 4 месяца назад

      Friday update before the holidays strikes. Just like Friday built cars. Just push into production and go down the pub, will deal with problems when we get back.

    • @muhdiversity7409
      @muhdiversity7409 4 месяца назад

      QA is a cost center. Everyone is getting rid of it. Why not have the devs responsible for QA, oh, and for deploying the stuff to the customers and datacenters? The above is not a joke; I've lived it for 5 years now.

    • @ChrisM541
      @ChrisM541 4 месяца назад

      "Wondering how it got past QA?" - there was none. This industry is unregulated. The mentality is "push now, patch later". Maybe governments will finally wake up to the certainty of more timebombs.

  • @SiljCBcnr
    @SiljCBcnr 4 месяца назад

    Thanks for explaining it so well. I love this channel!

  • @johnhudson9167
    @johnhudson9167 4 месяца назад +4

    Loving how social media is making comp sci lecturers get trendy haircuts and dress properly 😂

    • @AlanCanon2222
      @AlanCanon2222 4 месяца назад

      Never, I say! NEVER! *puts on sandals over socks*

  • @---ox1lg
    @---ox1lg 4 месяца назад +34

    "There's no problem with Microsoft. There's no problem with Windows."

    • @shiroyasha_007
      @shiroyasha_007 4 месяца назад

      Perhaps 😢

    • @ChuckleDuck
      @ChuckleDuck 4 месяца назад +2

      lol, lmao even.

    • @yurisebastiao1872
      @yurisebastiao1872 4 месяца назад +8

      It's actually right .... only those windows machines with Crowd strike software were affected by such zero day attack (self attack actually, more like a buggy one:😂)

    • @yurisebastiao1872
      @yurisebastiao1872 4 месяца назад +1

      They've created their own zero day attack by not testing pieces of codes in their software update release. 😂

    • @titaniummechanism3214
      @titaniummechanism3214 4 месяца назад +2

      nothing wrong...
      ...other than the usual stuff

  • @Tomcat-rj5tp
    @Tomcat-rj5tp 4 месяца назад +1

    My school's coding club was faster to respond than our IT helpdesk, and they were more helpful too. They posted a document with detailed step-by-step instructions, while IT just said "come see us." Thankfully I got rid of Falcon at the end of spring semester, as we're not required to have it over summer break.

  • @tscoffey1
    @tscoffey1 4 месяца назад +5

    Apple has the luxury of being able to force changes to their OS like that because only a minuscule percentage of the world infrastructure relies on it. Microsoft must remain backwards compatible as best they can with their OS upgrades precisely because they aren't a tiny player in this arena.

  • @mukulnag1578
    @mukulnag1578 4 months ago +1

    As someone whose network jump box had this... a very bad day for the IT guy at my company.

  • @steveftoth
    @steveftoth 4 months ago +6

    "Sorry Elon"? Never apologize to that man.

  • @kahnfatman
    @kahnfatman 4 months ago +1

    What an amazing Anti-advertisement! Now we all know what CrowdStrike is and how to avoid it like the plague 😂

  • @choleralul
    @choleralul 4 months ago +3

    Thanks Lord Targaryen

  • @squid13579
    @squid13579 3 months ago

    Best resources/books:
    Windows Internals (Parts 1 & 2): usually takes about a year to complete.
    The Art of Memory Forensics (Wiley): for understanding NT AUTHORITY and kernel objects, though this is also covered in the previous books.
    Both are amazing books 🔥🔥🔥

  • @TimothyWhiteheadzm
    @TimothyWhiteheadzm 4 months ago +6

    "They may have implemented something badly, we don't know". Yes, we do know. It happened, therefore they implemented something badly. This sort of thing is why we have canary deployments. Apparently they have the infrastructure for that, and allow customers to configure which computers get updates first in order to validate them, but they also have some updates that simply ignore those settings, and this was one of them. Yes, they 'implemented something badly'.

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago +2

      it was definition files not the drivers themselves that broke so it's held under less scrutiny

    • @TimothyWhiteheadzm
      @TimothyWhiteheadzm 4 months ago +2

      @@alazarbisrat1978 'Held under less scrutiny' by whom? The reality is that it crashed computers, and this isn't the first time similar updates by CrowdStrike have caused crashes (including on Linux). The fact that they knew this was a possibility but failed to implement proper testing before pushing out to everyone means they 'implemented something badly'.

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago

      @@TimothyWhiteheadzm they didn't know that would happen, sorta how this ever got out in the first place. but companies always neglect QA, it's just how it is. and also definition files themselves couldn't do any of this without a huge screw-up so they're not as important to defend, but had they tested it there would be no problem. some programmers just prefer to test after failure tho, just a complete miss

    • @0LoneTech
      @0LoneTech 4 months ago

      @@alazarbisrat1978 What makes this remarkable is that the entire purpose of this product and company is to address that QA neglect. They've demonstrated they're among the worst at the one thing they're claiming to do better.

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago

      ​@@0LoneTech not really, most companies do that, just that this one was widespread and broke something fundamental. they just got unlucky with their neglect and this slip-up got all the way and broke everything. legend has it that there have been many other issues in their code over time that went totally unnoticed and only now caused catastrophic failure

  • @lollycopter
    @lollycopter 4 months ago

    8:40 The point of using not-Windows isn't that the other OSs are impervious, but rather the fact that diversification *is* redundancy. Instead, the current landscape is still heavily Windows-centric and that is a bad thing if we're talking resiliency.

  • @IsYitzach
    @IsYitzach 4 months ago +8

    12:50 don't apologize to Elon. He deadnames one of his kids. If he can do that, you can deadname his company. The best he's going to get out of me is ex-Twitter.

    • @spht9ng
      @spht9ng 4 months ago +2

      And then uses his child as a culture war pawn publicly. Gross

  • @steevf
    @steevf 4 months ago +2

    It's ironic that a bit of software intended to prevent a system from getting taken out ends up taking out the system.

  • @dgo4490
    @dgo4490 4 months ago +3

    It's been obvious for a while now - MS does NOT DO software testing, nor Crowdstruck evidently. They are delegating the testing straight to the end user. They pushed a bad binary to an "on-the-fly" update, and after the updated binary was first touched, it crashed the system. That's criminal negligence, brought to you by industry's greatest security providers.

  • @OneAndOnlyMe
    @OneAndOnlyMe 4 months ago

    There is a much simpler and more pragmatic approach that I've used in places: simply don't allow updates to critical IT infrastructure (DC, DNS, etc.) until the update has gone out to a smaller group of endpoints first. Permit the update on 10% of the end-user compute estate before permitting it on all of it.
    Kernel-mode drivers should go through a rigorous testing regime (WHQL, for example). The other problem was that CrowdStrike configured their driver as a boot-start driver; otherwise people could have used safe mode easily.
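    A minimal sketch of that gating rule, assuming hypothetical names (`rollout_bucket`, `may_install`) and a made-up "canary is healthy" signal. It only illustrates the idea of holding back critical hosts until a small ring has validated the update; it is not how CrowdStrike or any real update agent works.

```python
import hashlib


def rollout_bucket(host_id: str) -> int:
    """Map a host to a stable bucket in the range 0..99 (hypothetical helper)."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def may_install(host_id: str, is_critical: bool,
                canary_percent: int, canary_healthy: bool) -> bool:
    """Decide whether a host may take the update right now.

    canary_healthy is an assumed signal meaning the first ring ran the
    update for long enough without crashing.
    """
    if canary_healthy:
        # The small ring validated the update, so release it everywhere.
        return True
    if is_critical:
        # DCs, DNS servers and similar never go first.
        return False
    # Only the chosen fraction of ordinary endpoints goes first.
    return rollout_bucket(host_id) < canary_percent


if __name__ == "__main__":
    for host in ["laptop-17", "laptop-18", "dc01"]:
        critical = host.startswith("dc")
        print(host, may_install(host, critical, 10, canary_healthy=False))
```

    Hashing the host name keeps the first ring stable between runs, so the same machines act as canaries each time. Picking the ring by hand would work just as well.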

  • @minigunnboy21
    @minigunnboy21 4 months ago +42

    This is like 9/11 for computerphile

    • @Elesario
      @Elesario 4 months ago +2

      Not sure how you're making that comparison. The issue wasn't even malicious. I'd hesitate to compare an act of extreme violence causing the loss of so many lives, so much pain and misery, to a technical mistake that at worst is very expensive and financially damaging to many, but is mostly just at the level of a strong inconvenience.

    • @alazarbisrat1978
      @alazarbisrat1978 4 months ago +2

      @@Elesario the hospitals tho

  • @sensecurities
    @sensecurities 19 days ago

    00:03 Windows machines experienced widespread blue screens due to an operational error.
    01:55 Windows utilizes safety mechanisms like blue screens to protect against critical failures.
    03:43 Kernel-level code in Windows can cause serious errors if not managed properly.
    05:32 Kernel mode software failures can severely disrupt essential services.
    07:25 Microsoft's Windows systems faced critical issues due to a specific bug.
    09:04 Mitigating system failures through advanced update mechanisms.
    10:56 A genuine mistake led to significant issues, but damage could have been far worse.
    12:42 Cloud dependency poses risks for individuals and organizations during outages.
    14:24 Exploring advanced image recognition capabilities.

  • @dahla1973
    @dahla1973 4 months ago +3

    This was not very well informed, with a lot of info lacking and some facts clearly missing. Much better videos already out there. That being said, normally a fan ❤️