5:42 - Yes! It took me a while in the IT space to find the confidence to look my boss straight in the face and say "If you see me working like crazy, or in a panic... something is very wrong. It's being handled, don't bog me down with meetings and superfluous communication. If you want to help, I'll show you what you need to do. Otherwise, leave me alone and let me work." Now as a lead, I am the wall. If you see my guys working hard or in a panic, you don't bother them. You talk to me.
It's folks like you who learn from experience, then put that experience to ensuring the new generation of workers at that level can do their jobs as best they can who are the best for management positions. Hiring people up from their existing positions, when they are able, will ensure that things work smoothly. Too often do we have managers who don't understand the smallest of nuances of a role, demanding outrageous shit as if it's normal.
"Maxim 2: A sergeant in motion outranks a lieutenant who doesn't know what's going on. Maxim 3: An ordnance technician at a dead run outranks _everybody_." - The 70 Maxims of Maximally Effective Mercenaries. In an emergency you *always* defer to the person who actually understands what's going on, irrespective of the normal chain of command. Always.
@jacobbissey9311 Tell that to management who cannot handle anybody of a lower grade within the company trying to correct them. There are sadly too many egotistical people in charge of things they do not fully understand.
The thought that a fossilized judge who can't use Excel will be presiding over a massive case like this, needing the lawyers to bring out the crayons to explain even the basics of network design, makes me feel like it will be a complete coin flip on how it goes.
Umm... Not to give you nightmares, but the overwhelming majority of all politicians and executives are old AF, tech illiterate Luddites. Our ENTIRE WORLD is ruled and governed by Boomers. THAT should scare you far more than ONE JUDGE.
It's going to be less about understanding network architecture and more about digging through decades of case law to determine who has what percentage of fault. A judge doesn't need to understand how a program works, they just need to understand what previous courts have said about programs not working. Outages have happened all the time and there's well-established case law about outages, so we're not looking at novel case law here. That will make it pretty straightforward for the judge.
There are good judges and bad judges. I remember in the Google v. Oracle case, the trial judge actually taught himself Java in order to properly understand the case (and then the appeals court promptly screwed it all up, but at least somebody in the system was trying).
It was both hilarious and heartbreaking to watch the catastrophe ensue in the Rittenhouse case, where they had to discount "pinch to zoom" on iPhones, without anyone realising how badly that fucks up basically every court case involving digital video. - Kinda tempted to file an amicus brief explaining that video compression throws out 90%+ of the information in a video, so none of it can be trusted, and just watch what happens.
Thor hit the nail on the head with the IT industry. If everything works some exec is saying "What are we paying those guys for?" and if anything goes wrong there's more than one exec saying "What are we paying those guys for?!?!"
This is why we don't allow day zero updates from external sources. Also medical devices are isolated and do not get updates. Can't risk an update breaking a critical medical system.
Likely sales trying to get a nice big fat number of devices installed, and hospital administration checking off checkboxes that their "medical devices" are all secured.
Well that's where the big fuck up came for crowdstrike, many companies that did not choose to have zero day updates got pushed the faulty definitions update anyways. To me this is 100% on Crowdstrike because they fucked up on so many levels and Microsoft has perfectly documented the dangers of using kernel drivers at boot time.
@MachineWashableKatie Well you still need to get the images off the computer into your patient records, but it shouldn't be talking to the internet. That's why they are run on a Medical Device VLAN and only get very specific access granted to reach out.
For the bitlocker issue: some people figured out how to manipulate BCD (Windows Bootloader) to put the system in something approximating safe mode - safe enough that crowdstrike doesn't load, but not so safe that Windows doesn't ask the TPM for the key. Probably 95% of the bitlockered machines can be recovered this way (my estimate).
@phobos258 Most systems have the bootloader on an unencrypted partition, and if you can get Windows to try booting into a recovery environment by failing boot 3x, you run a handful of bcdedit commands (most importantly, setting safeboot to minimal on the {default} entry) and reboot. BCD should be able to pull the key from the TPM because nothing important changed (no BIOS settings, no system file checksums) and boot into safe-ish mode. Then you can delete the bad file, change safeboot back to the default setting, and restart.
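For anyone who wants that sequence written down, here is a rough sketch in Python that only builds the list of steps and prints it; it does not run anything. The two bcdedit invocations are the standard way to set and clear the safeboot value. The file glob is an assumption taken from publicly circulated remediation notes, not from the comment above, and `recovery_plan` is a made-up helper name.

```python
# Sketch only: lists the recovery steps in order, nothing is executed.
# Assumption: the glob below comes from public remediation notes, not from
# this thread; the drive letter may differ inside a recovery environment.
BAD_FILE_GLOB = r"C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"


def recovery_plan(bad_file_glob: str = BAD_FILE_GLOB) -> list[tuple[str, str]]:
    """Return (where, action) pairs describing the safeboot workaround."""
    return [
        # Force the next boot into minimal safe mode.
        ("recovery environment", "bcdedit /set {default} safeboot minimal"),
        ("recovery environment", "(manual) reboot"),
        # In safe mode the faulty channel file can be removed.
        ("safe mode", f'del "{bad_file_glob}"'),
        # Clear the safeboot value so the machine boots normally again.
        ("safe mode", "bcdedit /deletevalue {default} safeboot"),
        ("safe mode", "(manual) reboot"),
    ]


if __name__ == "__main__":
    for where, action in recovery_plan():
        print(f"[{where}] {action}")
```

Whether the TPM releases the BitLocker key on that safe-mode boot depends on the machine, so treat this as a checklist to try, not a guaranteed fix.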
What this issue showed us at my emergency service center is that we don’t have robust enough plans for operation without computers. It’s helped us improve our systems and we are to a point now where we can totally operate without any computer or internet systems. We’re more prepared than ever now.
My dad is a doctor and for years he joked that "computers are a fad and they're going to go away" and haaaated their involvement in his work, which just required him to do twice the paperwork most of the time... looks like, at least where you are, he could end up being right xD and I think it's better that way. Computers shouldn't be blindly trusted to securely hold such sensitive and private information, especially when the things being put on the computers are often things that sales and admin want to make money and profits off of, as is pretty much anything that crosses their paths.
Having worked in IT, I always had the MBAs find out how much their departments cost to run per minute, and then account for how long a 3rd party IT support company would take to respond. Now we can just point to this...
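The arithmetic behind that pitch is simple enough to sketch. The numbers below are invented purely for illustration (a department costing 250 per minute, a 15 minute in-house response versus a 4 hour third-party response), and `downtime_cost` is a hypothetical helper.

```python
def downtime_cost(cost_per_minute: float, response_minutes: float) -> float:
    """Cost of a department sitting idle while it waits for support."""
    return cost_per_minute * response_minutes


# Hypothetical figures, not taken from any real company.
COST_PER_MINUTE = 250.0
in_house = downtime_cost(COST_PER_MINUTE, 15)          # on-site team
third_party = downtime_cost(COST_PER_MINUTE, 4 * 60)   # outsourced, 4h response

if __name__ == "__main__":
    print(f"in-house wait:    {in_house:,.0f}")
    print(f"third-party wait: {third_party:,.0f}")
    print(f"difference per incident: {third_party - in_house:,.0f}")
```

Plug in your own per-minute cost and the response times from the support contract; the point is only that the gap scales linearly with how long people sit idle.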
I enjoy your approach. Pointing out what time is worth and what downtime costs in order to advocate for keeping your support team around & equipped to support you seems obvious, but plenty of people seem to need it pointed out to them.
As someone who has QE'd... there is also "we told you, but the business never thought this edge case was important". This appears to be the everyday, common "that isn't our failure" type of design failure that no one solves as there is no ROI on pre-solving these instances.
I see parallels in other industries. The owners need to balance putting out a perfect product and a profitable product. At no point is software ever "done." There is always another edge case that needs to be addressed, a new exploit discovered. At some point a product needs to go out the door, otherwise there is no profit and no one has a job. You can see the same thing in construction when everything is fine, until it's not. Sure, a job is better with 10 guys on site, but that's not feasible, so it's done with 5 guys. Timelines are rushed, safety gets overlooked here and there, but most of the time it's fine. Then one day a bridge collapses. A tower crane falls over. An investigation will show where they went wrong, but if you ask the guys, they will tell you, "It was only a matter of time before this happened."
According to the RCA CS put out, they had multiple points of failure in the process:
- Simulated/mocked examples of these updates relied on wildcards.
- The next update didn't allow them (but the test with wildcards passed and no one caught it).
- Parameter out of bounds (was this in the kernel, or in CrowdStrike's sensor? Not clear on that).
- They call it a content update, but it relies on a regex engine. Who hasn't seen regex hose you when something seemingly minor changes?
There's more, I'm sure.
The channel Dave's Garage (Dave is a retired Microsoft developer from the MS-DOS and Win 95 days) did a good breakdown of one of the last questions Thor asked: why there was no check before the systems blue screened with a driver issue. CrowdStrike apparently had elevated permission to run at kernel level, where if there is a problem in the kernel, Windows must blue screen to protect itself and its files from corruption. Dave's video will do a better job explaining it than I could ever hope to, so I grabbed the link to it: ruclips.net/video/wAzEJxOo1ts/видео.html
There is also the aspect that CrowdStrike doesn't put those changes through the whole WHQL process, in order to go faster. This is purely CrowdStrike's failure to validate input in kernel-level code, plus the fact that they didn't test properly. If they had done even one install test they would have seen that it tried to access an address that didn't exist and failed. At that point Windows has no option but to fail. There are plenty of things to talk about with how Windows has issues, but this is not one of them. The Microsoft update basically had nothing to do with the failure, so I just hope this VOD is late to the party, because many of the points they raise about whether Microsoft contributed to the failure are basically counter to reality. If you play in kernel-level code without WHQL validation and fail to validate your input, you fail. Even CrowdStrike's PIR basically says "Stress testing, fuzzing and fault injection and Stability testing". As someone who works heavily in the industry: they basically just admitted they don't do proper validation.
@jimcetnar3130 He knows what he's saying. If you use Windows, you are using his work all the time. In fact Task Manager was HIS program and he thought about selling it as a 3rd party tool (he had a clause that allowed this) but decided to donate it to Windows. He's also responsible for the format dialog, the 32GB FAT32 limit, and the shell extension that made it possible to view zip files just like any other folder. He's talking from knowledge.
For those who want to blame Windows: the same thing happened with Crowdstrike a few weeks earlier on Linux. There just weren't as many devices damaged, so no one cared.
Right, but the affected Linux machines were able to recover _much_ quicker because of the way Linux handles kernel modules (and the fact that a kernel panic gives a hell of a lot more information about what _actually_ happened than a BSOD probably helped too).
@AQDuck They were able to recover quickly because you could see which driver failed and while the system still crashed you could blacklist the driver at startup.
Dude, if Windows + Crowdstrike = huge problem, and Linux + Crowdstrike = small problem, then Windows is the main cause. (Edit: can't believe people took this seriously. It's obvious Crowdstrike is the problem.)
@wesley_silva504 Most Linux systems don't use encrypted drives, and encrypted drives are what made the problem MUCH worse on Windows. Linux has also dealt with bad drivers for a long time and provides a way to blacklist drivers to keep them from loading.
This is much like what the NTSB goes through when an airline or train disaster happens. You won't know a fault or failure point, whether it's human, digital, or mechanical, until an event happens. That's why it takes them months to years to solve: there are so many factors to look at and potentially blame.
Yep, Crowdstrike was doing fucky shit with their EXTREMELY SENSITIVE BOOT CRITICAL drivers. Even if a Windows update broke the driver, it broke the driver because Crowdstrike was doing fucky shit they weren't supposed to be doing with their driver.
Why isn't there any blame on the companies themselves? I work in IT, and my previous company used to run all Windows updates and software updates through a 48hr test before allowing them to be pushed out to the rest of the systems. The current company does not do this and was hit with the Crowdstrike issue. My PC was not affected because I disable update pushes on my system and do them manually. I was advocating for smoke tests before update pushes well before this happened... NOW, after half the systems went down, they decided to add it to the process.
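A soak gate like that 48hr test can be sketched in a few lines. This is only an illustration under assumptions: `ready_for_rollout` is a hypothetical helper, the 48 hour default mirrors the comment, and it only applies to updates whose timing you actually control.

```python
from datetime import datetime, timedelta


def ready_for_rollout(
    test_started: datetime,
    now: datetime,
    smoke_failures: int,
    soak: timedelta = timedelta(hours=48),
) -> bool:
    """Release an update to the fleet only after a clean soak period."""
    if smoke_failures > 0:
        return False  # any failed smoke test blocks the rollout outright
    return now - test_started >= soak


if __name__ == "__main__":
    started = datetime(2024, 7, 17, 9, 0)
    print(ready_for_rollout(started, datetime(2024, 7, 18, 9, 0), 0))  # 24h in
    print(ready_for_rollout(started, datetime(2024, 7, 19, 9, 0), 0))  # 48h in
    print(ready_for_rollout(started, datetime(2024, 7, 19, 9, 0), 2))  # failures
```

The dates in the demo are arbitrary; in practice the failure count would come from whatever test suite runs against the test group.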
The perception here is since it's /security/ software explicitly used to handle close to realtime issues, 48 hours is 48 hours vulnerable. If it was anything else this wouldn't fly. Nor would it have such a low level access to bugger it up in the first place.
You can't do that with Crowdstrike. The entire purpose of Crowdstrike is that you are paying them as a customer for that kind of "due diligence". I work in IT. If I have to start managing Crowdstrike like OS patching with staggered roll-outs then they suddenly become a lot less appealing to pay for. Crowdstrike is all about being as fast as possible for Threat Analytics. You can't tolerate a lag because once an exploit hits... you need to get in front of it as soon as you can. Being this close to the "bleeding" edge is just going to carry that risk.
@CD-vb9fi Ah, but my good sir, you just ruled out your own statement. You stagger the rollout of OS patching; if that had been done with this OS patch then the hit would not have been as bad, as they would have caught the boot loop / OS update issue with Crowdstrike's latest patch. As stated in this episode, it only failed when coupled with the latest OS patch, not the initial release. This would mean the due diligence was on the companies themselves for not verifying that the latest OS patch did not have any conflicts. This is why Microsoft also does rolling OS patch deployments. The problem here is that not only did MS and Crowdstrike fail QC, but so did the companies' IT QC processes. Crowdstrike should be testing on preview builds of OS deployments as well; if they are a partner, they have access to all builds of future releases. All in all, this could have been easily avoided if the industry had better practices as a whole.
@jasam01 You guys seem to miss the nuance of the OS patch. CROWDSTRIKE WORKED, then a new patch released, then bluescreen/boot loop. If you smoke test with configs for 24 to 48 hours with your common user test suite before auto-releasing OS patches to your IT infrastructure, you would have been fine, as you would know that hey, something happened after this OS patch. There is a reason Microsoft lets IT manage the patches: not all software plays nice with every PATCH.
@vipast6262 You said OS updates AND software updates. We were talking about the latter, because Crowdstrike has a relatively unique position compared to the standard expectations. It's worth noting that no amount of smoke testing the OS patch would have saved anyone if Crowdstrike were to do even worse and push a bluescreening update that hit the current OS version.
Oh wait what? This is the first time I've not seen a comment on a video like this lol. This really puts into perspective how bad this was for many people. I, fortunately was not as affected by it, but many people were. I'll be praying that the IT people can get a break/ are appreciated more as a result of this.
That's where IT reservists could be a good governmental program: you vet local IT professionals and get them familiar with emergency services and critical infrastructure systems. Then they can jump in to support critical places that don't have enough IT staff to face a big attack or a catastrophe like this.
I know of a 150B+ market cap business which was doing full Windows recovery rollback to a version before the update was pushed out. The BitLocker-locked machines had to have a person on site accompanying the remote admins. Their billed time and losses are in the hundreds of thousands.
The thing is... it was all avoidable if they did disaster recovery right. But who does that? I have been doing IT for over a couple of decades... two things are never taken seriously but always claimed to be serious... Security and DR.
This is why monopolies are bad. There's a VERY good reason the old adage of "Don't Put All Your Eggs In One Basket" has been around for as long as we've had chickens and baskets... it only takes ONE PROBLEM OR ACCIDENT to ruin everything.
Crowdstrike isn't even close to a monopoly, they have ~14% of their market. Computer operating systems are largely a natural monopoly/duopoly... Developers don't want to create programs for multiple platforms, so only the popular platforms get the apps and the unpopular platforms die (see Windows Phone, WebOS, OS/2, etc....). Linux has been trying to break in for years. It's arguably fairly complete, but no one buys in, because the platform support from app developers isn't there.
@cloudyview I believe he is referring to Windows. It is true that some systems tend to become a standard, and Windows is one, but the space has 2 others. Monopolies might be good for the consumer at the start, but they quickly turn sour, and for more than just the consumer. In an ideal world, we would have diversification with strong standards. But this isn't the ideal world.
@fred7371 Problem is, multiple different OSs either wouldn't change much between each other, in which case there's no real point or difference in which OS is used anyway so this would still likely occur, or there would be many different OSs with extremely different structures and nothing would be compatible between any of them. It's already a massive pain developing for Windows, Linux and Mac, and a pain dealing with every single possible configuration of base hardware that can be mixed and matched. Could you imagine adding another 30 different OSs? Security-critical devices would still all homogenize to some extent and one OS and security program would still reign as king.
@duo1666 Correction: a pain for Windows and Linux; Mac and Linux share the same structure. Yes, you're right, I am aware of the conflict of standards that could arise from multiple OSs. But I also pointed out that if you enforce a baseline of standards, this is less of an issue. I added that for those who know what I am saying; you can see this in many industries. It is also less of an issue if you have to up the competition instead of imposing your way of doing things (funny, that's where we are at currently). We saw that with USB ports and the fact that Apple was trying to get rid of them, or the charger debacle in the EU. That's just some; there's Google recently. Of course it won't be easy, but that's the spirit of competition, to try and do better. That's why monopolies are awful: they ruin everything, from the people creating the product to those left with no options.
@fred7371 Monopolies are bad in capitalism. A centralized system to handle things on its own isn't bad. And capitalism and competition aren't exactly good either. Monopolies in capitalism exist because you can take the entire pool of cash then constantly roll back the expenses that ensure a quality product, because the investment to start up is large enough that you can make a lot of money before that happens, and then bankrupt the company, buy out the competition, and do it all over again. Realistically, the only real issue here is that everything auto-updated, so everyone was hit all at once, whereas the problem would be more localized if everything didn't update at literally the same time.
10:00 - "It was a blind spot" makes sense from a QA perspective, but... clearly we, as a society, can't be having software systems with billion-dollar costs attached to them.
My personal feeling, being within the IT/Development space is that liability is also shared by the end users. Of course it sucks if an update brings your system down, I've been in those kind of situations when it was just an update for some small but important software we were using. But you need to have adequate backup plans in place to quickly recover, don't always force the latest updates immediately without testing, and have a system down plan in place where you can. I feel bad for everybody that has suffered because of this, but in terms of strengthening processes going forwards, this was an important lesson. We've handed our lives over to the concept of an always available infrastructure that can be brought down within minutes with very few alternatives in place.
This right here, is why a monopoly is bad. CrowdStrike has a monopoly on the IT security market when it comes to locking and managing systems. And it broke. Whether it was Microsoft's fault or Crowdstrike, it doesn't matter. Something broke that made systems with CrowdStrike go down, and the world stopped for a day because of it.
Not entirely sure, but wasn't there a "SolarWinds" or some such that Google had a problem with a year and change ago? I remember a similar problem to this happening (though not nearly as severe) because ALL of Google's services went down for a few hours and things basically stopped for the day as they removed it from their systems.
Even if you have multiple providers of this kind of service, any given organisation is going to use one or the other. You're not going to have half the hospital using CrowdStrike and the other half using StrikeCrowd unless you know that each has advantages in its niche that outweighs the added hassle of using two different service providers.
Crowdstrike isn't a monopoly, they have (had?) ~14% of the market. It's a substantial share, which is why this was so widespread, but it's not even close to a monopoly
Please don't talk when you don't have any idea about the IT security landscape, okay? Crowdstrike is as far from a monopoly as McDonald's is from healthy food. Microsoft's own Microsoft Defender XDR / Sentinel has way more share than Crowdstrike has; heck, even Kaspersky has more market share than Crowdstrike.
it would be helpful in the future to have a date marker for when these conversations occurred. This feels like it occurred shortly after the wake of crowdstrike's outage, but it doesn't appear stated anywhere (that i can see at least. I could be missing it).
About hand-written airplane tickets: About a decade ago my intercontinental flight had to be rerouted and they issued me hand-written tickets for the three legs it would have. At the time I expected I would get stranded in Bangkok - thought no way that ticket will be recognized as valid. But lo, there was no problem in Bangkok - in fact Thai Airways staff was waiting at the gate and hustled me to the connecting flight. On my next stop in Singapore they again had staff waiting for me at the arrival gate, let me bypass the entire 747 queue at the x-ray machine and drove me directly to the departure gate. I thought that was peak organization by the airlines involved (all Star Alliance). They even did not lose my baggage (which is more likely if connecting times are short).
He is wrong. The fault was 100% Crowdstrike. They changed a function to take 21 arguments but only gave it 20, and it wasn't coded to handle this error, so it read out of bounds and crashed. Since it was running in the kernel, Windows stops to prevent data corruption, and the Crowdstrike driver is a boot driver, which means Windows won't boot if the driver doesn't load. The stupidest part of this is not checkpointing a working config and automatically reverting to it.
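To make the "checkpoint a working config and revert to it" idea concrete, here is a minimal sketch. It is not how the Falcon sensor is built; `ConfigStore` and `validate` are hypothetical names, and the only detail taken from the thread is the count of 21 expected fields. The idea is to check the field count before using a new config, and to keep the last known-good one active when the check fails.

```python
EXPECTED_FIELDS = 21  # the count mentioned in the comment above


def validate(fields: list[str]) -> None:
    """Reject a config whose field count differs from what the reader expects."""
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )


class ConfigStore:
    """Keeps the last known-good config active when a new one is invalid."""

    def __init__(self, initial: list[str]) -> None:
        validate(initial)
        self.active = list(initial)

    def apply(self, candidate: list[str]) -> bool:
        try:
            validate(candidate)
        except ValueError:
            return False  # keep running on the checkpointed config
        self.active = list(candidate)
        return True


if __name__ == "__main__":
    store = ConfigStore(["*"] * 21)
    print(store.apply(["*"] * 20))  # short config is refused
    print(len(store.active))        # still the 21-field config
```

In Python a short list would raise an IndexError instead of reading stray memory, so the explicit length check stands in for the bounds check that kernel code has to do by hand.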
When was this interview, and when did the information you mentioned become available? Because I've been seeing clips from what looked like this interview for weeks, dunno how the timeline actually works out
Linux handled it fine and it was able to easily recover from a defective kernel module. Microsoft still has some blame for bricking windows if the kernel module fails.
I was there working the night shift for our EMEIA branch. shift's from 10pm to 7am. I left at like 1pm that day. first hours were frustrating because we had no instructions. then instructions were found in reddit, took a few more hours to approve and implement them. and yes, it lasted weeks because people were on PTO/vacation that day so they came back to work with this issue in their laptop. fortunately affected people were nice.
9:06 It's like the line in shooters and strategy games. "It's never a warcrime the first time." Warfare with chemical weapons wasn't an issue before chlorine gas filled the trenches. Posing as/attacking dedicated field medics wasn't ever a problem to be considered before the red cross came to be. And now, these IT-problems were too obscure for the end user and original developer to notice.
This whole process about prevention Thor describes reminds me of episodes of Air Crash Investigations where something goes wrong in a niche case that in hindsight is completely preventable with the smallest change to something...but you never knew it needed to be changed because nothing like it had ever happened before.
10:50 There was a saying in Germany from a computer expert during the introduction of De-Mail: "For every technical problem there is a judicial solution". The law that was created basically stated that messages which are encrypted in transit are not to be considered in transit while they sit on a server, for the purpose of decrypting them there. (Otherwise the law would be in breach of another law on data safety.) And the next sentence was "No government is stupid enough to give their people a means of communication that can't be spied on".
I was one of dozens of Field techs doing contract work for The Men's Wearhouse. 200 of their stores had just had pinpads upgraded from USB to ethernet server connections. That server was affected by this. I personally went to the 6 stores I had upgraded and got them all fixed that same day. They could still do sales, but it was 15 minutes per customer until I got the server fixed. Once I got the bitlocker key for each server, it was a breeze, but for the first 3 stores, I had to wait 45 minutes to 90+ minutes PER store in the queue to get the bitlocker key. Was easy to fix, but whew....that was a fun Saturday.
I assume most of these machines are managed in an Active Directory. From that AD you can assign something to run at startup, and use that to deliver an update that breaks the boot loop.
I work for my local county government and we use CrowdStrike. Thankfully we only had about 1400 desktops and about 500 laptops to fix. We were back up to 80% fixed by the end of the day Friday when this happened. That was with all 4 techs, plus the radio comms techs (2 more), plus 2 sysadmins, all going to each and every PC in the county. Some departments have offices 60-80 miles away from our main office.
In this case it was actually 100% preventable by proper processes, which we do not do due to costs... Any system that is this important should have a clone where updates get tested before they get deployed to the actual machine. This process is widely known as DTAP: Development/Test/Acceptance/Production. The patch should be automatically deployed on Test and tested in an automated manner to see if it still runs; in this case it would've failed to even boot. Then in Acceptance you check whether it still functionally works. Once you've done this you go to Production. If you say this, almost any manager nowadays will tell you, "yeah, but what are the chances!" Well, not that high, but hey, we've proven once again that the impact is disastrous! Honestly, I'd say the fault is how we approach stuff like this in the first place. We have companies creating imploding submarines, we have Boeing, etc... At some point you have to ask yourself: is this entity at fault, or is it all of us for allowing them to be this faulty for the benefit of a few people's personal wealth? Because that is what it comes down to in the end.
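The DTAP gate described above can be sketched roughly like this. Everything here is illustrative: the update is modelled as a plain dict, and `boots` and `works_functionally` are hypothetical stand-ins for real automated tests on a cloned system.

```python
from typing import Callable

Update = dict[str, bool]
Check = Callable[[Update], bool]


def boots(update: Update) -> bool:
    """Stand-in for 'deploy to the Test clone and see if it still boots'."""
    return update.get("boots", False)


def works_functionally(update: Update) -> bool:
    """Stand-in for the functional checks run in Acceptance."""
    return update.get("functional", False)


# Ordered gates between Development and Production.
GATES: list[tuple[str, Check]] = [
    ("Test", boots),
    ("Acceptance", works_functionally),
]


def promote(update: Update, gates: list[tuple[str, Check]] = GATES) -> str:
    """Return the stage where the update stopped, or 'Production' if it passed."""
    for stage, check in gates:
        if not check(update):
            return stage
    return "Production"


if __name__ == "__main__":
    print(promote({"boots": False, "functional": True}))  # stops at Test
    print(promote({"boots": True, "functional": True}))   # reaches Production
```

An update that cannot boot never gets past the Test gate, which is the whole argument of the comment: the failure would have been caught on a clone instead of on the live fleet.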
I work in IT. We don't use Crowdstrike, but this kind of issue is not unique in this space. We use CarbonBlack and we've discovered that the people who control its configuration have INSTANT access to all configs on any PC that uses it. One time someone caused the software to BLOCK critical software we use and we could not run the business for 20 minutes until someone turned off whatever setting they had used. Currently, they have blocked specific browsers, but we can't do anything with them, not even uninstall them. So they are broken pieces of software stuck on the PC. This problem is probably more related to a lack of knowledge and communication among those who control the backend. And we are moving to the cloud in a year or two. I absolutely HATE the state of IT right now; it's damn scary to have uninvested people in charge of our infrastructure.
Also, from working in the tech industry for that data recovery company... it was *very preventable*, we ended up finding out, for those who want an update on it. They didn't QA the update that was pushed on Crowdstrike's side. The only computer that would have received a notification of an issue wasn't even a QA person's, but one of the devs', and that dev's computer was locked while he was out on PTO. If they had properly QA'd the update *and* had proper notification channels set up to a QA person, it probably wouldn't have been as catastrophic.
Back during the summer there was a different outage that affected the auto industry, CDK. What was "great" was going into a management role a month after that outage, being told they were still recovering from it, and that my pay was going to take a hit because they were still fixing sales numbers and recovering from a loss.. like cool, didn't even work here yet, I have an agreed upon salary, but sure pay me less.
I am a senior systems engineer and thankfully we only had 30 servers affected because we primarily use Carbon Black. We did have a few servers in Azure and it is a pain in the neck to fix those. We unmounted the disks from the affected VM and attached them to another VM to get the file deleted, then had to move them back.
as someone who works in i.t. at a medical office, there is a reason why i have the system setup not to update for a minimum of three days. That being said, we also have a few systems that haven't been updated in years, to be fair they're isolated but the point is that in the medical industry from everything i've seen (at least at smaller offices) updates are resisted and only done out of necessity.
I remember a couple years ago someone hacked into a program that a hospital in my state used and in turn it infected all emergency services that used it in the whole state somehow, they loaded the service with some ransomware and it crippled the services for a couple months or so... hospitals, police, fire services everything connected to the service was completely locked out. My mother who's a nurse manager in a home health office here said they had to scramble to break out old paper files that hadn't been touched in years, pull others out of storage, or try to get more recent physical paperwork from other hospitals and care offices because no one could log into the online services... equipment like ekgs and central computers just stopped working
I was half involved in our company's recovery (I didn't perform the fix, but I prepped server restores from pre-update). It seems like "blame" should be easy to determine through contracts. I don't know if a jury or judge would need to be an IT person, they could literally just sit through a Thor-style MS Paint session, since the problem is logical in nature.
I thought I had to manually boot machines into safe mode and delete the update on around 200 devices. Fortunately a lot of the machines were off when the update went through so it ended up only being around 25 machines.
For those wondering, this was from July. On Aug 6th Crowdstrike put out a blog post called "Channel File 291 Incident: Root Cause Analysis is Available" where they admitted they were 100% responsible. This actually is not the first time it happened. It's not even the first time *this year*.
This is misinformation. No part of that report admits to 100% of the responsibility. Out of the 6 described issues, only #6 (staged deployment) is something that is solely within Crowdstrike's domain. Based on the report alone, other issues can easily depend on external factors such as Microsoft-defined APIs that are not expected to suddenly change. The described mitigations can be seen as _additional_ precautions, not as something that is required of a Windows kernel driver. In fact, they explicitly mention passing WHQL certification, and it isn't specified if Microsoft's or Crowdstrike's update broke a specified API standard. Maybe Crowdstrike relied on something that wasn't formally defined but practically remained unchanged for a long time. Maybe Microsoft failed to specify which APIs are changing with the update and communicate it to partners on time. Maybe both messed up. While it does look (at first glance) like Crowdstrike's regex shenanigans are to blame, I can't help but remember the displeasure of dealing with Microsoft updates shutting down production due to server and client protocol updates being released simultaneously without a deprecation announcement, effectively killing all clients at once until the server gets updated (while provider admins are napping) or we blacklist a sudden Windows update. Extremely minor incident in comparison (~2h downtime), but Microsoft also breaks stuff from time to time, and it's entirely possible that Microsoft released something that was incompatible with their own WHQL certification. We really need more information before we blame Crowdstrike for absolutely everything. They definitely made mistakes, but it's possible it isn't entirely their fault. And so far they haven't admitted it's solely their fault.
@@MunyuShizumi When you say "it isn't specified if Microsoft's or Crowdstrike's update broke a specified API standard", here's literally the first line of the Root Cause Analysis: "On July 19, 2024, as part of regular operations, CrowdStrike released a content configuration update (via channel files) for the Windows sensor that resulted in a widespread outage. We apologize unreservedly."
You said: "other issues can easily depend on external factors such as Microsoft-defined APIs that are not expected to suddenly change"
Microsoft wasn't involved at all. All six issues were wholly within Crowdstrike's domain. If you read the RCA, you would know that the root cause was "In summary, it was the confluence of these issues that resulted in a system crash: the mismatch between the 21 inputs validated by the Content Validator versus the 20 provided to the Content Interpreter, the latent out-of-bounds read issue in the Content Interpreter, and the lack of a specific test for non-wildcard matching criteria in the 21st field." I'm not sure where you're getting that a Microsoft API changed anywhere in that document. It was Crowdstrike software attempting to read an out-of-bounds input in a Crowdstrike file, sending Windows into kernel panic.
You said: "Maybe Crowdstrike relied on something that wasn't formally defined but practically remained unchanged for a long time"
Both the Falcon sensor and Channel Update file 291 are Crowdstrike software, not Microsoft. Issues 1, 2 and 4 describe what Crowdstrike's Falcon sensor didn't do. Issues 3 and 5 are gaps in their test coverage (gaps is an understatement). Issue 6 is that they didn't do staged releases, leading to a much more widespread issue.
You said: "and it's entirely possible that Microsoft released something that was incompatible with their own WHQL certification."
The WHQL certification is only certifying the Falcon sensor, not the update files, so it's irrelevant to the root cause. The issue isn't that they changed the sensor software, since that would require a new WHQL certification testing process; it's that they changed what the sensor was ingesting. It's like saying I have a certified original Ford car, but then I'm putting milk in the gas tank and wondering why the engine is bricked.
You said: "We really need more information before we blame Crowdstrike for absolutely everything. They definitely made mistakes, but it's possible it isn't entirely their fault. And so far they haven't admitted it's solely their fault."
They literally did. What more information do you need? They released the full RCA. There's no more information to be had. Crowdstrike pushed a buggy update that bricked millions of systems, resulting in billions in damages, and almost certainly led to deaths (emergency services and hospitals were offline for many hours).
@@MunyuShizumi No, it isn't misinformation. Go reread the RCA. Literally the first line is "On July 19, 2024, as part of regular operations, CrowdStrike released a content configuration update (via channel files) for the Windows sensor that resulted in a widespread outage. We apologize unreservedly." Microsoft isn't involved in the incident besides the fact that it's a Windows machine. It was Crowdstrike software trying to access an out-of-bounds input from a bad config file that somehow passed all of Crowdstrike's "testing". All aspects involved are Crowdstrike. The certification is irrelevant. That's only for the Falcon sensor itself, not the inputs. That's like getting a certified new car from Ford and pouring milk in the gas tank and blaming Ford for making a bad car. All issues are Crowdstrike's fault. None of the issues in the RCA have anything to do with Microsoft. 1, 2, and 4 are what the Falcon sensor failed to do. 3 and 5 are (enormous) gaps in test coverage. 6 is the lack of staged releases. What more info do you need? There's not going to be any. This is it. This is literally the document that says what happened and why. Microsoft didn't change anything, Microsoft isn't really mentioned besides the certification and the pipes.
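For anyone who wants to see the shape of the bug quoted from the RCA (21 fields validated, 20 inputs supplied, no bounds check on the read), here is a minimal Python sketch. It is only an illustration: the names `TEMPLATE_FIELDS`, `SUPPLIED_INPUTS`, `validate_template`, `read_field_unchecked` and `read_field_checked` are made up, and this is not CrowdStrike's code. In Python the bad read raises an exception; in kernel-mode native code the same mistake reads invalid memory and takes the OS down.

```python
# Hypothetical illustration of a validator/interpreter field-count mismatch.
TEMPLATE_FIELDS = 21   # fields the template defines (what the validator checked)
SUPPLIED_INPUTS = 20   # input values actually handed to the interpreter


def validate_template(field_count, inputs):
    """Accepts content only when the field count matches the inputs supplied."""
    return field_count == len(inputs)


def read_field_unchecked(inputs, index):
    """Mimics the latent bug: trusts the index without checking the length."""
    return inputs[index]  # raises IndexError when index is past the end


def read_field_checked(inputs, index):
    """Adds a runtime bounds check: returns None instead of reading past the end."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


inputs = ["*"] * SUPPLIED_INPUTS
last_field = TEMPLATE_FIELDS - 1   # index 20, i.e. the 21st field
```

Either guard alone would have been enough here: `validate_template(TEMPLATE_FIELDS, inputs)` rejects the content before it ships, and `read_field_checked(inputs, last_field)` refuses the read at runtime.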
Quality assurance. Especially testing for enterprise-level security software, because of how hacky the way it does what it does is, but also gradual rollouts of updates, so that if something goes wrong it doesn't go wrong on every machine at once and you can catch it and stop the update.
My understanding of the cause from Crowdstrike's end (I admittedly haven't looked into this in a while) was that a file was empty that shouldn't have been. Another part of the software tried to read from that file and crashed with a Null Reference exception. It's possible the file itself was fine when tested, but something went wrong in their release process which corrupted the file, and it isn't something that could have been directly caught by QA at that time. That being said, it seems like QA or Dev should have caught the "bad file causes null reference" problem, as a null check should always be done before trying to access the reference, no matter how sure you are that it can never happen. It may be ok to crash loudly in dev, but the prod release should always handle it gracefully.
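The "check before you touch it, and degrade gracefully in prod" point can be sketched in a few lines of Python. This assumes the empty/zeroed-file scenario described in the comment above, not the RCA's actual root cause, and `parse_content` and `load_or_keep_previous` are hypothetical names for illustration only.

```python
def parse_content(blob: bytes):
    """Hypothetical parser: refuses unusable content instead of crashing on it."""
    if not blob:
        return None                      # empty file
    if blob.count(0) == len(blob):
        return None                      # file contains nothing but null bytes
    return blob.decode("utf-8", errors="replace").splitlines()


def load_or_keep_previous(new_blob: bytes, previous_rules):
    """Falls back to the last known-good rules when the new content is unusable."""
    parsed = parse_content(new_blob)
    return previous_rules if parsed is None else parsed
```

The design choice is that a bad update leaves the machine running on yesterday's rules, which is a worse security posture for a few hours but a much better outcome than a boot loop.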
I was working 911 that night. We had no Computer, CAD or anything except phones. No text to 911, half our services were crippled. You should see how difficult it is trying to obtain people's locations during a traumatic event with them screaming mad at US that the systems were down. I was there with a pen and paper and google maps on my phone. Nothing we can do except deal with it until shift ends.
i would like to see where Thor got his accounts of windows pushing an update that crashes the configuration of named pipe execution since that's what crowdstrike claims they did (updating the channel files)
I was working that night at my hospital when everything rebooted and BSOD'd at 2-2:30 am. We thought it was a cyber attack and acted immediately. Our entire team went all hands on deck, and stayed all Friday and weekend to recover critical servers and end devices to keep the hospital running.
The blame could lie in the forced updates that every company seems to want to do nowadays. Back in Win7 updating was optional, and so far it still is with most software. I've been with Thor on his short about the "I am the Administrator, you are the machine" bit, as I HATE getting updates shoved on me, because it seems every few updates something breaks and they insist on shoving phone games and bloatware onto your device. We would be well served by making updates optional by default again, and you can schedule an IT guy to check every week or month or whatever their policy is.
Crowdstrike has requirements for using their systems by mandating updates. They do this because if you don't keep everything up to date, you are exposing your services to unnecessary risk that they may not be able to defend against, at least the non updated versions of their software can't defend against, and because of that, wouldn't be held liable if something happened.
@justicefool3942 but we come back to Windows updates are notorious for breaking random ass shit for an update that I could guarantee you didn't need to happen.
Forcing updates is a blessing for everyone, you don't value it until you meet a company that blocked updates and every machine is still on Windows 10 Redstone 2. You'll curse so many people the day you have to update all those machines because software doesn't work anymore and now 1 out of every 3 machines starts bluescreening or locking up because the drivers are completely fucked up and outdated.
@@devinaschenbrenner2683 remember wannacry, only out of date systems were affected. the broad attack came in may, but the fix was already out in March. most private machines were safe, because of auto updates, but a lot of companies were screwed, because they never got the fix.
not really. large companies use enterprise versions of windows, you can stop/start/schedule updates whenever you want. but crowdstrike needs the updates for security or else what's the point in having crowdstrike (or any other similar software) with an os full of holes.
If I had to point to a likely blame, I would lean toward Windows. macOS prevents issues like kernel panics with features such as System Integrity Protection (SIP); Windows doesn’t have the same safeguards in place. I'm not saying macOS is inherently better, but it shows that it's possible to protect against these types of vulnerabilities.
To access the file they are talking about at my company we had to use the bitlocker key first, then if it worked we could make the changes. If bitlocker did not work we reimaged the device. We had 12-hour shifts going for 7 days to fix over 6 thousand PCs.
First stories: Looking at the actions people took with the benefit of hindsight. This typically attributes blame to individuals and not systems. Second stories: Looking at the system and seeing WHY mistakes were made. This approach makes the reasonable assumption that everyone is a rational actor and makes the best decisions they can with the information they have. Using this approach typically allows organizations to grow and improve, tackling the real root causes of problems instead of taking the easy route of blaming individuals. Karin Ray (on a discussion by Nickolas Means)
Does microsoft use crowdstrike and if they do how do they not test their own OS updates against their own systems? Pushing updates straight to production does not sound like something OS used for military etc purposes should be doing.
Different field (home inspections), but "If nothing happens, nothing happens" is very much on point here. We get new safeguards only after something damaging happens. This failure was predictable and I'm sure you can find plenty of people who predicted it. But it's extremely rare for action to happen because a very bad outcome is theoretically possible.
It's also entirely possible that Crowdstrike and Microsoft tested the latest update in isolation and it worked, but broke when pushed to live end-users. Really curious to see how this unfolds
For any kind of large scale upgrade, you test the update on 1 machine running whatever apps are important to you to validate the update. Only when that testing is successful do you ever update the rest of your stuff. You also always have a method of rolling back if it fails, one that is tested and verified to work. So a failed update should never disrupt anything for longer than a few minutes. Also terminal servers exist. Anything I can do consoled directly into a machine I can do just as well from 3000 miles away, other than maybe having someone physically power cycle it so I can interrupt the boot sequence to get in. There's no way I could do my job if I had to fly all around the country constantly to physically console into stuff.
With Thor talking about the courts deciding who's at fault, I just think of the saying "I didn't say you're the one who's responsible, I'm just saying I'm blaming you."
The crazy thing is in the 90s, as computers became incorporated into industry, industries had crash kits so they could continue to operate if the network went down: credit card rollers to process payments, paper pads with forms that would normally be inputted into a computer. Hell, in the restaurant industry they still have manual overrides. Ovens have computer boards that can change the temp, add steam, add time all on their own, but there is a manual override inside a panel for if the motherboard fails.
I love how we had probably the 2 worst cyber incidents within a month of each other. In June we had the CDK cyber attack that took out about half of dealerships and cost billions, only to be one-upped by the Crowdstrike incident that crashed nearly the entire world.
In Canada Twice Rogers went down nation wide and they had to do some sort of "if our systems are down we'll allow other infrastructure to be used" Sort of situation. Canada isn't the whole world but it's pretty similar to the CrowdStrike situation in that it halted a lot of stuff in Canada for a day not once but two separate times. So it's not like this sort of thing never happened it's that our world is too connected via software and a few wrong forced updates can wreck a nation.
crowdstrike is to blame: their bad update definition caused their kernel mode driver to crash. there was no microsoft update that set this off. the issue occurred when the machines rebooted for a routine windows update, at which point the crowdstrike kernel mode driver ate its own tail and took the os with it. if microsoft has ANY fault it's that they should have put a stop to kernel mode drivers as a whole years ago. but the 3rd party AVs balked at that and cried antitrust and now people are dead because of it
This comment matches my understanding of the situation. I'm confused as to why this video is still up, as I can find no source relating to a related Windows update that caused, influenced or even corrected (after the event) this matter. If anyone has such a source, please share it. I'm aware of an unrelated Azure bug in the US Central region that day.
I believe the other issue was how Crowdstrike deployed the update to bypass the Microsoft verification, the Kernel program got the code from outside the Kernel, so that code didn't need the verification. I also believe that they bypassed the staging options that people had setup, which would have reduced this to virtually no impact.
The most difficult check in QA, speaking as a former QA video game tester myself, is CROSS CHECK VERIFICATION. It's pretty clear that this probably was a situation where the update properly ticked EVERY box but the cold boot restart. Because 99% of the time the most easily explained answer in testing is the real issue.
If you don’t have a way to switch immediately to manual record keeping for at least a temporary span, then these things will shut down your business. Any retail store can keep written records of transactions and process them later like we used to do 20-30 years ago-even with credit cards or checks if you take down the info properly and have some level of trust with that customer.
I also worked EMS in a rural area, and legally it’s fine to go without digital records as long as you have a way to digitize them eventually.
I think this is going to push for a lights out out of band management for every single laptop / desktop the same we do for servers. We can get into any server via iLO / iDrac / BMC / etc to fix this. But every desktop and laptop, we have to go hands on for.
I built a custom setup for our org that you just boot to the boot media put it in safe mode then run a bat with local admin once you get into windows. it still took over a week to really clean up with a solid 72 hours of pain trying to touch all of our 100k+ devices.
well *hopefully* this means that airports finally update their infrastructure. all of aviation is incredibly behind the times but especially so our computer infrastructure, it's ridiculous edit: yeah that's the thing. processes and procedures often get written in blood, and I suspect this will be one such case. there are many like this, 14CFR is chock full of them, and they will continue to happen as long as technology improves
I think how it's gonna play out is along the lines of "once a certain amount of people depend on you you need to start announcing your rollouts a certain amount of time in advance" Certainly from a liability standpoint MS gets to say "well we told them we were rolling this out 2 weeks ago or whatever, they should have checked"
i read somewhere that crowdstrike has some in's with regulatory bodies so companies just use it by default to comply because otherwise they'd have to jump through hoops, that means people 'forcing' them to do this made the scale of this vulnerability much more of an issue and should also be held liable in some way
crowdstrike is crazy because i was stranded in Connecticut while i live in IL while my mom was slowly having a stroke (we didn't know until we got home 2 days later)
Our checker machines died off, only could do cash. People turned into almost panic mode like "omg i need to get my cash out of the bank!" it was going to spiral out of control on retail, but ..i can see this having long lasting effects. Been a while since ive seen the public panic.
What they haven't discussed and is going to be equally relevant is that because this is/was a global issue, not a national one, SCOTUS is ultimately just one Court. There are lawsuits being filed in the UK, in France, in Germany, in Australia, Japan, China, South Africa, Argentina - they're not all going to be resolved by one court, and SCOTUS' ruling does not set a precedent outside of the US (though I imagine the verdicts of some of the courts that resolve this soonest will be used as part of the argument of whichever party the ruling favours). The investigation will take months. The legal fallout? Years. Closer to decades.
When was this recorded? I just went back and reread CrowdStrike's post mortem and they don't mention anything about a windows configuration changing, just a regex issue.
This is why i hate windows forcing updates on systems. You cant test stability on an isolated box before the entire office gets an automatic update pushed on every box
In the enterprise space, you can set when updates happen. So you can roll out an update, or test updates in a VM. The issue is that CrowdStrike has the ability to ignore that for some reason. Something about that particular update was considered a critical update. So even if your IT team was doing their due diligence, Windows and CrowdStrike fucked it up, cause as you said, they forced it to happen. It's fucking stupid that all users are essentially beta testers.
Immutable Linux is perfect for enterprise stability. It's basically impossible to break, and if you do manage to break it, simply revert to the previous system image.
@@FFXfever Even in an enterprise space you still cannot prevent windows from performing updates tagged "critical", unless you configure the OS to use a custom windows update server.
so comfy to choose a company that does all these things for you, automatic so you don't have to wait too long and reduce your exposure to potential attacks.
Just from a diagnostic perspective, if the only place where the update Microsoft released caused issues on machines that were also running Crowdstrike software, I would have thought that who is responsible for this problem is pretty clear, logically speaking. It's like a "not all cars are Toyotas, but all Toyotas are cars" kind of deal. I guess though it's more about what protocols there are for dealing with how well different pieces of software play with each other, and the responsibility for that is a bit unclear.
To somewhat counter what Thor said, this was preventable for the companies affected. At least for the larger ones. If you have critical infrastructure and you need to update the software you test that in a dev environment first. As Thor said, the software providers won't test every possible combination and you, as an end user, don't know if your particular concoction of software will cause a compatibility issue with a new update. So you use a separate test machine, load the update, and monitor the stability. If it's all good THEN you push the update to all your machines. Like how is Crowdstrike or Microsoft gonna know that Burger King's order queuing software crashes with the new update? They won't, and if Burger King just blindly pushes that update to all their machines no one will be getting any Whoppers. In that example it would be up to a Burger King IT employee, likely at the HQ, to first load the new update onto a test machine and see if it all works well together before releasing the update to all their machines. Clearly a lot, and I mean a LOT, of companies weren't doing this.
Oh... I guarantee you, most of the liability is going to be thrown at Microsoft at first. And it has nothing to do with their real liability or how much evidence there is of fault... it's that Crowdstrike has a net worth of $65 billion and Microsoft has a net worth of over $3 TRILLION... people are going to sue the company that's likely to throw them the most money to go away.
Can Microsoft argue that they were forced to create an unsafe environment by the EU, who mandated they allow third party companies to have kernel-level access for the purposes of anti-malware and the like? Microsoft had a plan in place to allow third party antivirus to just hook into the same kernel-level driver they use for Windows Defender, but that was ruled monopolistic by the EU.
I had the assumption that releasing an update globally at the same time wasn't a good practice. I work in a large company and our application updates are always gradually rolled out to users. There is a cost and a tech challenge because several versions need to coexist and keep working. But CrowdStrike is a security company with major clients and ultimately lives at stake. And preventing this kind of damage (from malicious intent) is literally their business.
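A gradual rollout like the one described above can be sketched roughly as follows. This is a toy Python model, not anyone's real deployment system: `staged_rollout`, `apply_update`, `is_healthy`, the stage fractions and the 2% failure threshold are all assumptions picked for illustration.

```python
def staged_rollout(hosts, apply_update, is_healthy,
                   stages=(0.01, 0.10, 1.0), max_failure_rate=0.02):
    """Updates hosts in growing waves and halts when a wave looks unhealthy.

    Returns (updated_hosts, halted).
    """
    updated = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        if not wave:
            continue
        for host in wave:
            apply_update(host)
        updated.extend(wave)
        # Check the wave that was just updated before going any wider.
        failures = sum(1 for host in wave if not is_healthy(host))
        if failures / len(wave) > max_failure_rate:
            return updated, True
    return updated, False
```

With 1000 hosts and an update that breaks every machine it touches, this stops after the first wave of 10 hosts instead of all 1000. The cost is the one mentioned in the comment: old and new versions have to coexist while the waves go out.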
I would have loved to catch this live. I watch a number of lawyers on youtube and from what I've learned, If one side is found say 25% responsible then they can also be held 25% liable.
Can someone point to me to something that explain what update Microsoft pushed? Reading anywhere I always read about crowd strike pushed an update of their data and nothing else..
Microsoft didn't release any update related to this. The actual episode is over a month old, and I suspect Thor was speaking to what he thought was reliable information at the time. The actual issue was entirely on CrowdStrike.
That's what I thought.. but afaik it was almost immediately clear that Windows had nothing to do with it.. when was this video recorded? I just read the official crowdstrike report again and there is no reference to MS updates: www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf
It hit us hard. With encrypted devices I got into a pretty good rhythm. But even with the tool I made it still took about 1.5 minutes per machine. I had thought of using something a bit better, but with so many people down I just didn't have the time to craft it. Still, touching as many machines as I did was awful. I set up a triage center and just had people bring their machines to me. It was the fastest way to recover things rather than going from place to place.
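To put the 1.5 minutes per machine in perspective, here is a back-of-the-envelope estimate in Python. The function name `fleet_recovery_hours` is made up, and the 6000-machine fleet and the 20 parallel stations are illustrative figures borrowed from other comments in this thread, not this commenter's numbers.

```python
import math


def fleet_recovery_hours(machines, minutes_per_machine, parallel_stations=1):
    """Wall-clock hours to touch every machine with stations working in parallel."""
    batches = math.ceil(machines / parallel_stations)
    return batches * minutes_per_machine / 60


one_person = fleet_recovery_hours(6000, 1.5)          # 150.0 hours
twenty_wide = fleet_recovery_hours(6000, 1.5, 20)     # 7.5 hours
```

This ignores walking time, BitLocker key lookups and machines that need a reimage, so real numbers will be worse; it only shows why parallelism (a triage center, several people, several boot sticks) matters more than shaving seconds off the script.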
There's a funny side effect to poverty. The provincial airport I had to fly from had no outage because they could not afford CrowdStrike. Malwarebytes and machines that can be swapped out in 15 minutes. It's crazy how this is the "defense" against that monopoly. You need to break up CrowdStrike and limit how much critical infrastructure gets "protected" by one company.
The real issue is that vital machines and critical infrastructure are updated without testing the updates on an offline machine/test bench before updating the whole system, to ensure everything still works after the update. If everyone did this the problem would never have happened because no one would roll out that update to their systems.
I work for a medical device company and while none of our devices were directly impacted by this issue (none of them are Windows machines themselves or require a Windows machine for them to be used), the work computers utilized by the vast majority of the company WERE impacted (very few Mac/Linux users because you need to receive special permission for IT to issue one). Among my team of software developers only about 10% of them had functional computers the morning it all went down. I was "lucky" enough to be one of them and all the others in a similar position had one thing in common - the night before when we left the office all of our computers were disconnected from the company network and instead connected on our own small personal networks without internet access used to separate off prototype devices. Everyone with their computers connected to the main company network had the updates pushed to their machines automatically that night. I don't fault the IT department for that at all because honestly some users' computers would never get updated otherwise, but it was the one instance where a near-universally beneficial IT policy happened to have an unexpected outcome that turned it into a liability instead.
I'm sure someone's already said it USB rubber ducky. You've still got to go to every device but if you've got 20 of them you can do 20 devices at the same time.
Can someone please, for the love of god, help me figure out what the drawing software that Thor uses is? People say it's MS paint, but I have no idea how to get that configuration on it.
Bad idea to push this dedicated clip well after Crowdstrike released their RCA confirming they're at fault. Really should take this down or preface the clip with something indicating this was filmed within days of the outage occurring.
so a process-fix to avoid such issues would be gradual rollouts of updates, but that in itself will make it harder to make sense of what's going on, and is a potential extra point of failure
5:42 - Yes! It took me a while in the IT space to find the confidence to look my boss straight in the face and say "If you see me working like crazy, or in a panic... something is very wrong. It's being handled, don't bog me down with meetings and superfluous communication. If you want to help, I'll show you what you need to do. Otherwise, leave me alone and let me work."
Now as a lead, I am the wall. If you see my guys working hard or in a panic, you don't bother them. You talk to me.
The greats manage both.
With respect
It's folks like you who learn from experience, then put that experience to ensuring the new generation of workers at that level can do their jobs as best they can who are the best for management positions. Hiring people up from their existing positions, when they are able, will ensure that things work smoothly. Too often do we have managers who don't understand the smallest of nuances of a role, demanding outrageous shit as if it's normal.
Good egg, need more people like this as leads
"Maxim 2: A sergeant in motion outranks a lieutenant who doesn't know what's going on.
Maxim 3: An ordnance technician at a dead run outranks _everybody_ ."
-The 70 Maxims for Maximally Effective Mercenaries
In an emergency you *always* defer to the person who actually understands what's going on, irrespective of the normal chain of command. Always.
@@jacobbissey9311 Tell that to management who cannot handle anybody of a lower grade within the company trying to correct them. There are sadly too many egotistical people in charge of things they do not fully understand.
"This is the worst outage we've ever seen in our lifetimes".
This is the worst outage we've seen in our lifetimes, *so farrrr* .
😂
😮
and we're still in 2024
who knows what 2025 has waiting for us
Yeah man, we've is past tense, you can't see the future......
Really thought you did something there
@@2o3ief humor receptor not detected, please return to facility for further equipment.
The thought of a fossilized judge who can't use Excel will be presiding over a massive case like this, needing the lawyers to bring out the crayons to explain even the basics of network design makes me feel like it will be a complete coin flip on how it goes.
Umm... Not to give you nightmares, but the overwhelming majority of all politicians and executives are old AF, tech illiterate Luddites.
Our ENTIRE WORLD is ruled and governed by Boomers.
THAT should scare you far more than ONE JUDGE.
Well, at least SCOTUS still has Chevron deference
Oh wait
It's going to be less about understanding network architecture and more about digging through decades of case law to determine who has what percentage of fault. A judge doesn't need to understand how a program works, they just need to understand what previous courts have said about programs not working. Outages have happened all the time and there's well-established case law about outages, so we're not looking at novel case law here. That will make it pretty straightforward for the judge.
There are good judges and bad judges. I remember in the Google v. Oracle case, the trial judge actually taught himself Java in order to properly understand the case (and then the appeals court promptly screwed it all up, but at least somebody in the system was trying).
It was both hilarious and heart breaking to watch the catastrophe ensue in the Rittenhouse case, where they had to discount "pinch to zoom" on iPhones, without anyone realising how badly that fucks up basically every court case involving digital video. - Kinda tempted to make an amicus brief, explaining that video compression throws out 90%+ of information in a video, so none of it can be trusted, just watch what happens.
Thor hit the nail on the head with the IT industry. If everything works some exec is saying "What are we paying those guys for?" and if anything goes wrong there's more than one exec saying "What are we paying those guys for?!?!"
Just keep downsizing the team till they can barely cope with the minimum workload, because surely nothing will go wrong ever.
To be fair, asking the question what are we paying these guys for? Is literally the executives job.
@@michaeldeats328 It should be a tiny part, not the majority of their job
This is why we don't allow day zero updates from external sources. Also medical devices are isolated and do not get updates. Can't risk an update breaking a critical medical system.
Yeah, why would an MRI machine need to be networked
Same for a dialysis machine, i know davita was up because my dad could still go on friday
likely sales trying to get nice big fat number of devices install, and hospital administration checking off checkboxes that their "medical devices" are all secured
Well that's where the big fuck up came for crowdstrike, many companies that did not choose to have zero day updates got pushed the faulty definitions update anyways. To me this is 100% on Crowdstrike because they fucked up on so many levels and Microsoft has perfectly documented the dangers of using kernel drivers at boot time.
@@MachineWashableKatie Well you still need to get the images off the computer into your patient records, but it shouldn't be talking to the internet. That's why they are run on a Medical Device VLAN and only get very specific access granted to reach out.
For the bitlocker issue: some people figured out how to manipulate BCD (Windows Bootloader) to put the system in something approximating safe mode - safe enough that crowdstrike doesn't load, but not so safe that Windows doesn't ask the TPM for the key. Probably 95% of the bitlockered machines can be recovered this way (my estimate).
I love someone created something that sounds like a hacker breaking in through the window, in order to repair the damage caused by the homeowner
That is some sort of next level Chinese-pot-balancing circus trick
IF you have the key this might work.
@@phobos258 most systems have the bootloader on an unencrypted partition and if you can get Windows to try booting into a recovery environment by failing boot 3x, you run a handful of bcdedit commands (most importantly, setting safeboot to minimal on the {default} entry) and reboot. BCD should be able to pull the key from the TPM because nothing important changed (no bios settings, no system file checksums) and boot into safeish mode. Then, you can delete the bad file, change safeboot back to the default setting, and restart.
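The sequence described above, written out as a checklist. This Python sketch only builds the command strings and runs nothing; `safeboot_recovery_steps` is a hypothetical helper name. The `bcdedit /set ... safeboot minimal` and `bcdedit /deletevalue ... safeboot` commands are standard BCDEdit usage, and removing the `safeboot` value is assumed here to be what "back to the default setting" means. The file-deletion step is left generic because the comment doesn't name the file.

```python
def safeboot_recovery_steps(entry="{default}"):
    """Returns the ordered recovery steps as (description, command) pairs.

    Commands are strings to type at the recovery prompt; nothing is executed.
    A command of None marks a manual step.
    """
    return [
        ("Boot into minimal safe mode on the next start",
         f"bcdedit /set {entry} safeboot minimal"),
        ("Reboot, then delete the bad file", None),
        ("Return to normal boot",
         f"bcdedit /deletevalue {entry} safeboot"),
        ("Restart", "shutdown /r /t 0"),
    ]


for description, command in safeboot_recovery_steps():
    print(f"{description}: {command or '(manual step)'}")
```

Forgetting the third step leaves the machine booting into safe mode forever, which is why it is worth having the whole sequence written down before touching a few thousand devices.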
Windows just updated this " vulnerability and it auto updates even on computers that are in a boot loop"
What this issue showed us at my emergency service center is that we don’t have robust enough plans for operation without computers.
It’s helped us improve our systems and we are to a point now where we can totally operate without any computer or internet systems. We’re more prepared than ever now.
My dad is a doctor, and for years he joked that "computers are a fad and they're going to go away" and haaaated their involvement in his work, which just required him to do twice the paperwork most of the time... Looks like, at least where you are, he could end up being right xD and I think it's better that way. Computers shouldn't be blindly trusted to securely hold such sensitive and private information, especially when the things being put on the computers are often things that sales and admin want to make money and profits off of, as is pretty much anything that crosses their paths.
Having worked in IT, I always had the MBAs find out how much their departments cost to run per minute, and then account for how long a third-party IT support company would take to respond. Now we can just point to this...
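The arithmetic behind that argument is simple enough to sketch. The function name and the numbers below are hypothetical, picked only to show the shape of the calculation: cost per minute times the minutes you are down while waiting for a responder and then for the fix.

```python
def downtime_cost(cost_per_minute: float,
                  response_minutes: float,
                  fix_minutes: float = 0.0) -> float:
    """Cost of an outage: per-minute running cost times total minutes down."""
    if min(cost_per_minute, response_minutes, fix_minutes) < 0:
        raise ValueError("all inputs must be non-negative")
    return cost_per_minute * (response_minutes + fix_minutes)


if __name__ == "__main__":
    # Hypothetical department: $250 per minute, 4-hour third-party response.
    print(downtime_cost(250, 4 * 60))
```

With those made-up figures, the wait for the responder alone comes to 60,000 before anyone has touched a keyboard.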
I enjoy your approach. Pointing out what time is worth and what downtime costs in order to advocate for keeping your support team around & equipped to support you seems obvious, but plenty of people seem to need it pointed out to them.
@@AndyGneiss sadly, sometimes the snake in the grass is only obvious to you after it’s bit you
As someone who has QE'd... there is also "we told you, but the business never thought this edge case was important".
This appears to be the everyday, common "that isn't our failure" type of design failure that no one solves as there is no ROI on pre-solving these instances.
My thoughts exactly
I see parallels in other industries. The owners need to balance putting out a perfect product and a profitable product. At no point is software ever "done." There is always another edge case that needs to be addressed, a new exploit discovered. At some point a product needs to go out the door, otherwise there is no profit and no one has a job.
You can see the same thing in construction when everything is fine, until it's not. Sure a job is better with 10 guys on site, but that's not feasible, so it's done with 5 guys. Time lines are rushed, safety gets over looked here and there, but most of the time it's fine. Then one day a bridge collapses. A tower crane falls over. An investigation will show where they went wrong but if you ask the guys, they will tell you, "It was only a matter of time before this happened."
According to the RCA CS put out, they had multiple points of failure in the process:
Simulated/mocked examples of these updates relied on wildcards.
The next update didn't allow them (but the test with wildcards passed and no one caught it).
Parameter out of bounds (was this in the kernel, or in CrowdStrike's sensor? Not clear on that).
They call it a content update, but it relies on a regex engine. Who hasn't seen regex hose you when something seemingly minor changes?
There's more, I'm sure.
The channel Dave's Garage (Dave is a retired Microsoft developer from the MS-DOS and Win 95 days) did a good breakdown on one of the last questions Thor asked: why there was no check before the systems blue-screened with a driver issue.
CrowdStrike apparently had elevated permission to run at the kernel level, where, if there is a problem, Windows must blue-screen to protect itself and its files from corruption.
Dave's video will do a better job explaining it than I could ever hope to so I grabbed the link to it: ruclips.net/video/wAzEJxOo1ts/видео.html
There is also the aspect that CrowdStrike doesn't validate those changes through the whole WHQL process to go faster. This is purely CrowdStrike's failure to validate input in kernel level code and the fact that they didn't test properly. If you had done even one install test you would have seen that it tried to access an address that didn't exist and it failed. At that point Windows has no option, but to fail. There are plenty of things to talk about with how Windows has issues, but this is not one of them.
Like, the Microsoft update basically had nothing to do with the failure, so I just hope this VOD is late to the party, because many of the points they discuss about whether Microsoft contributed to the failure run counter to reality. If you play in kernel-level code without WHQL validation and fail to validate your input data, you fail. Even CrowdStrike's PIR basically says "Stress testing, fuzzing and fault injection and Stability testing". As someone who works heavily in the industry: they basically just admitted they don't do proper validation.
So a guy who collects a pension from Microsoft says it isn't microsoft's fault. Sounds about right lol
@@jimcetnar3130 He knows what he's saying.
If you use Windows, you are using his work all the time. In fact Task Manager was HIS program, and he thought about selling it as a third-party tool (he had a clause that allowed this) but decided to donate it to Windows. He's also responsible for the format dialog, the 32 GB FAT32 limit, and the shell extension that made it possible to view zip files just like any other folder. He's talking from knowledge.
For those who want to blame Windows: the same thing happened with Crowdstrike a few weeks earlier on Linux. Back then not many devices were damaged, so no one cared.
Right, but the affected Linux machines were able to recover _much_ quicker because of the way Linux handles kernel modules (and the fact that a kernel panic gives a hell of a lot more information about what _actually_ happened than a BSOD probably helped too).
@@AQDuck They were able to recover quickly because you could see which driver failed and while the system still crashed you could blacklist the driver at startup.
Dude if
Windows + Crowdstrike = huge problem
And
linux + Crowdstrike = small problem
Then Windows is the main cause
(Edit: can't believe people took this seriously. Is obvious Crowdstrike is the problem)
@@wesley_silva504 Most Linux systems don't use encrypted drives and that made the problem MUCH worse. Linux has also dealt with bad drivers for a long time and provides a way to blacklist drivers to keep them from loading.
@@wesley_silva504 Yeah this is where im at with it.
This is much like what the NTSB goes through when an airline or train disaster happens. You won't know a fault or failure point, whether it's human, digital, or mechanical, until an event happens. That's why it takes them months to years to solve. There are so many factors to look at and potentially blame.
According to Dave’s Garage the issue was 100% Crowdstrike. They sent an empty package and the driver couldn’t handle the problem.
Yeah, this was recorded early in the debacle, first couple days, and info was not great.
Yep, Crowdstrike was doing fucky shit with their EXTREMELY SENSITIVE BOOT CRITICAL drivers.
Even if a Windows update broke the driver, it broke the driver because Crowdstrike was doing fucky shit they weren't supposed to be doing with their driver.
who wrote the driver with no error handling? MS
@ Crowdstrike bypassed WHQL testing in the way they wrote the driver.
9:15 safety regulations are written in blood. Often times you don't know what you need to prepare for until after it happens.
Why isn't there any blame on the companies themselves? I work in IT, and my previous company used to run all Windows updates and software updates through a 48-hour test before allowing them to be pushed out to the rest of the systems. The current company does not do this and was hit with the Crowdstrike issue. My PC was not affected because I disable update pushes on my system and do them manually. I was advocating for smoke tests before allowing update pushes... before this happened. NOW, after half the systems went down, they decided to add it to the process.
The perception here is since it's /security/ software explicitly used to handle close to realtime issues, 48 hours is 48 hours vulnerable. If it was anything else this wouldn't fly. Nor would it have such a low level access to bugger it up in the first place.
You can't do that with Crowdstrike. The entire purpose of Crowdstrike is that you are paying them as a customer for that kind of "due diligence". I work in IT. If I have to start managing Crowdstrike like OS patching with staggered roll-outs, then they suddenly become a lot less appealing to pay for. Crowdstrike is all about being as fast as possible for threat analytics. You can't tolerate a lag, because once an exploit hits... you need to get in front of it as soon as you can. Being this close to the "bleeding" edge is just going to carry that risk.
@CD-vb9fi ah.. but my good sir, you just ruled out your own statement, you stagger roll out of OS patching, if this was done with the OS patching then the hit would not have been bad as they would have caught the boot loop / OS update issue with crowdstrike latest patch.. as stated in this episode it only failed when coupled with latest OS patch not initial release. This would mean the due diligence would have been on the companies themselves for not verifying the latest OS patch did not have any conflicts. This is why Microsoft also does rolling OS patch deployments. The problem here is not only did MS and Crowdstrike fail QC, but so did the Companies IT QC processes. Crowdstrike should be testing on preview builds of OS deployments as well, if they are a partner, they have access to all builds future releases.. all in all this could have been easily avoided if the industry had better practices as a whole.
@jasam01 You guys seem to miss the nuance of the OS patch. CROWDSTRIKE WORKED, then a new patch was released, then bluescreen/boot loop. If you smoke test with your configs for 24 to 48 hours with your common user test suite before auto-releasing OS patches to your IT infrastructure, you would have been fine, because you would know that something happened after this OS patch. There is a reason Microsoft lets IT manage the patches: not all software plays nice with every PATCH.
@@vipast6262 You said OS Updates AND software updates, we were talking about the latter, because Crowdstrike has a relatively unique position from the standard expectations.
It's worth noting that no amount of smoke testing the OS patch would have saved anyone if Crowdstrike were to do even worse and push a bluescreening update that occurred on the current update.
Oh wait what? This is the first time I've not seen a comment on a video like this lol. This really puts into perspective how bad this was for many people. I, fortunately was not as affected by it, but many people were. I'll be praying that the IT people can get a break/ are appreciated more as a result of this.
That's where IT reservists could be a good government program: you vet local IT professionals and get them familiar with emergency services and critical infrastructure systems. Then they can jump in to support critical places that don't have enough IT staff to face a big attack or a catastrophe like this.
I know of a 150B+ market cap business which was doing a full Windows recovery rollback to a version before the update was pushed out. The BitLocker-locked machines had to have a person on site accompanying the remote admins. Their billed time and losses are in the hundreds of thousands.
Where I work we have test machines for updates. Only after the updates get tested they are released to the other machines (Apple and Windows alike)
@@vollkerball1 For sure. That makes sense. The place I'm referring to can't be named, it's that huge. It's also managed by pleb suits.
The thing is... it was all avoidable if they did disaster recovery right. But who does that? I have been doing IT for over a couple of decades... two things are never taken seriously but always claimed to be serious: security and DR.
This is why monopolies are bad. There's a VERY good reason the old adage of "Don't Put All Your Eggs In One Basket" has been around for as long as we've had chickens and baskets... it only takes ONE PROBLEM OR ACCIDENT to ruin everything.
Crowdstrike isn't even close to a monopoly, they have ~14% of their market
Computer Operating Systems are largely a natural monopoly/duopoly... Developers don't want to create programs for multiple platforms, only the popular platforms get the apps, the unpopular platforms die (see windows phone, WebOS, OS/2, etc....)
Linux has been trying to break in for years, it's arguably fairly complete, but no one buys in, because the platform support from app developers isn't there.
@@cloudyview I believe he is referring to Windows. It is true that some systems tend to become a standard, and Windows is one, but the space has 2 others. Monopolies might be good for the consumer at the start, but they quickly turn sour, and for more than just the consumer. In an ideal world, we would have diversification with strong standards. But this isn't an ideal world.
@@fred7371 Problem is multiple different OSs either wouldnt change much between each, in which case theres no real point or difference in which OS is used anyway so this would still likely occur, or there would be many different OSs with extremely different structures and nothing would be compatible between any of them.
Its already a massive pain developing for windows, linux and mac, and a pain dealing with every single possible configuration of base hardware that can be mixed and matched. Could you imagine adding another 30 different OSs? Security critical devices would still all homogenize to some extent and one OS and security program would still reign king.
@@duo1666 Correction: a pain for Windows and Linux; Mac and Linux share the same structure. Yes, you're right, I am aware of the conflict of standards that could arise from multiple OSs. But I also pointed out that if you enforce a basic set of standards, this is less of an issue. I added that for those who know what I am saying; you can see this in many industries.
It is also less of an issue if you have to up the competition instead of imposing your way of doing things (funny, that's where we are at currently). We saw that with the USB ports, and the fact that Apple was trying to get rid of them, or the charger debacle in the EU. Those are just some examples; there's Google recently.
Ofc it won't be easy, but that's the spirit of competition, to try and do better. That's why monopolies are awful, they ruin everything from the people creating the product to those left with no options.
@@fred7371 Monopolies are bad in capitalism. A centralized system to handle things on its own isnt bad. And capitalism and competition isnt exactly good either. Monopolies in capitalism exist because you can take the entire pool of cash then constantly roll back expenses that ensure a quality product because the investment to start up is large enough that you can make a lot of money before that happens, and then bankrupt the company, buy out the competition, and do it all over again.
Realistically, the only real issue here is everything auto updated, so everyone was hit all at once, where as the problem would be more localized if everything didnt update at literally the same time.
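To make the "not everything at literally the same time" idea concrete, here is a minimal sketch of splitting a fleet into update waves. The function names, the ring count, and the hostnames are all hypothetical; this is not how any particular vendor does it. Each host is hashed into a stable ring, so the same machine always lands in the same wave and a bad update only reaches the first wave before someone can stop it.

```python
import hashlib
from typing import Iterable, List


def ring_for(host: str, ring_count: int) -> int:
    """Map a hostname to a stable ring number in [0, ring_count)."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ring_count


def rollout_waves(hosts: Iterable[str], ring_count: int = 4) -> List[List[str]]:
    """Group hosts into waves; wave 0 updates first, the last wave updates last."""
    if ring_count < 1:
        raise ValueError("ring_count must be at least 1")
    waves: List[List[str]] = [[] for _ in range(ring_count)]
    for host in sorted(set(hosts)):
        waves[ring_for(host, ring_count)].append(host)
    return waves


if __name__ == "__main__":
    fleet = [f"pc-{n:03d}" for n in range(12)]  # made-up hostnames
    for number, wave in enumerate(rollout_waves(fleet, ring_count=3)):
        print(f"wave {number}: {wave}")
```

The delay between waves is the part that costs you exposure time, which is the trade-off the replies about security software keep coming back to.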
10:00 - "It was a blind spot" makes sense from a QA perspective, but... clearly we, as a society, can't be having software systems with billion-dollar costs attached to them.
My personal feeling, being within the IT/Development space is that liability is also shared by the end users. Of course it sucks if an update brings your system down, I've been in those kind of situations when it was just an update for some small but important software we were using. But you need to have adequate backup plans in place to quickly recover, don't always force the latest updates immediately without testing, and have a system down plan in place where you can. I feel bad for everybody that has suffered because of this, but in terms of strengthening processes going forwards, this was an important lesson. We've handed our lives over to the concept of an always available infrastructure that can be brought down within minutes with very few alternatives in place.
This right here, is why a monopoly is bad. CrowdStrike has a monopoly on the IT security market when it comes to locking and managing systems. And it broke. Whether it was Microsoft's fault or Crowdstrike, it doesn't matter. Something broke that made systems with CrowdStrike go down, and the world stopped for a day because of it.
A monopoly in and of itself isn't a bad thing, but with any monopoly there needs to be regulation and accountability.
Not entirely sure, but wasn't there a "SolarWinds" or some such that Google had a problem with a year and change ago? I remember a similar problem to this happening (though not nearly as severe) because ALL of Google's services went down for a few hours and things basically stopped for the day as they removed it from their systems.
Even if you have multiple providers of this kind of service, any given organisation is going to use one or the other. You're not going to have half the hospital using CrowdStrike and the other half using StrikeCrowd unless you know that each has advantages in its niche that outweighs the added hassle of using two different service providers.
Crowdstrike isn't a monopoly, they have (had?) ~14% of the market. It's a substantial share, which is why this was so widespread, but it's not even close to a monopoly
Please don't talk when you don't have any idea about the IT security landscape, okay? Crowdstrike is as far from a monopoly as McDonald's is from healthy food. Microsoft's own Microsoft Defender XDR / Sentinel has way more share than Crowdstrike has; heck, even Kaspersky has more market share than Crowdstrike.
Crowdstrike is a sample of what we thought Y2K was going to be.
it would be helpful in the future to have a date marker for when these conversations occurred. This feels like it occurred shortly after the wake of crowdstrike's outage, but it doesn't appear stated anywhere (that i can see at least. I could be missing it).
About hand-written airplane tickets: About a decade ago my intercontinental flight had to be rerouted and they issued me hand-written tickets for the three legs it would have. At the time I expected I would get stranded in Bangkok - thought no way that ticket will be recognized as valid. But lo, there was no problem in Bangkok - in fact Thai Airways staff was waiting at the gate and hustled me to the connecting flight. On my next stop in Singapore they again had staff waiting for me at the arrival gate, let me bypass the entire 747 queue at the x-ray machine and drove me directly to the departure gate.
I thought that was peak organization by the airlines involved (all Star Alliance). They even did not lose my baggage (which is more likely if connecting times are short).
He is wrong. The fault was 100% Crowdstrike. They changed a function to take 21 arguments but only gave it 20, and it wasn't coded to handle this, so it read out of bounds and crashed. Since it was running in the kernel, Windows stops to prevent data corruption, and since Crowdstrike is a boot driver, Windows won't boot if the driver doesn't load.
The stupidest part of this is not check pointing a working config and automatically reverting to it.
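A minimal sketch of what "checkpoint a working config and automatically revert" could look like. The `ConfigStore` class and its health check are hypothetical and run in ordinary user-space Python; a real boot-time driver would also have to persist the checkpoint and detect a failed boot across restarts, which this does not attempt.

```python
from typing import Callable, Dict


class ConfigStore:
    """Keeps the active config plus the last one that passed a health check."""

    def __init__(self, initial: Dict[str, object]) -> None:
        self.active = dict(initial)
        self.last_known_good = dict(initial)

    def apply(self, candidate: Dict[str, object],
              health_check: Callable[[Dict[str, object]], bool]) -> bool:
        """Try the candidate; keep it if healthy, otherwise revert."""
        self.active = dict(candidate)
        try:
            healthy = health_check(self.active)
        except Exception:
            # A crashing check counts as unhealthy.
            healthy = False
        if healthy:
            self.last_known_good = dict(self.active)  # new checkpoint
            return True
        self.active = dict(self.last_known_good)      # automatic revert
        return False


if __name__ == "__main__":
    store = ConfigStore({"version": 1, "fields": 21})
    ok = store.apply({"version": 2, "fields": 20},
                     health_check=lambda cfg: cfg["fields"] == 21)
    print(ok, store.active)
```

In the example the candidate fails the (made-up) check, so the store falls back to version 1 instead of staying broken.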
When was this interview, and when did the information you mentioned become available?
Because I've been seeing clips from what looked like this interview for weeks, dunno how the timeline actually works out
Linux handled it fine and it was able to easily recover from a defective kernel module. Microsoft still has some blame for bricking windows if the kernel module fails.
@@gljames24 It was Crowdstrike that marked their driver as required for boot.
@@huttj509 The original podcast was on July 21st. So well over a month ago.
Great. So your info could be moments old. And this was broadcast the day of? Day after? Of course no one had all the info.
I was there working the night shift for our EMEIA branch. The shift runs from 10pm to 7am. I left at like 1pm that day. The first hours were frustrating because we had no instructions. Then instructions were found on Reddit, and it took a few more hours to approve and implement them. And yes, it lasted weeks, because people were on PTO/vacation that day, so they came back to work with this issue on their laptops. Fortunately, affected people were nice.
our help line queues were MASSIVE that morning and did not relent, as expected.
9:06 It's like the line in shooters and strategy games.
"It's never a warcrime the first time."
Warfare with chemical weapons wasn't an issue before chlorine gas filled the trenches. Posing as/attacking dedicated field medics wasn't ever a problem to be considered before the red cross came to be.
And now, these IT-problems were too obscure for the end user and original developer to notice.
2:04 can confirm in AZ the entirety of the fire/pd dispatch system was back to radios and manual calls.
This whole process about prevention Thor describes reminds me of episodes of Air Crash Investigations where something goes wrong in a niche case that in hindsight is completely preventable with the smallest change to something...but you never knew it needed to be changed because nothing like it had ever happened before.
10:50 There was a saying in Germany from a computer expert during the introduction of De-Mail: "For every technical problem there is a judicial solution". The law that was created basically stated that messages sitting on a server are not to be considered in transit for the purpose of decrypting them. (Otherwise the law would have been in breach of another data-protection law.) And the next sentence was "No government is stupid enough to give their people a means of communication that can't be spied on".
I was one of dozens of Field techs doing contract work for The Men's Wearhouse. 200 of their stores had just had pinpads upgraded from USB to ethernet server connections. That server was affected by this. I personally went to the 6 stores I had upgraded and got them all fixed that same day.
They could still do sales, but it was 15 minutes per customer until I got the server fixed. Once I got the bitlocker key for each server, it was a breeze, but for the first 3 stores, I had to wait 45 minutes to 90+ minutes PER store in the queue to get the bitlocker key. Was easy to fix, but whew....that was a fun Saturday.
I assume most of these machines are managed in Active Directory. From AD you can assign something to run at startup, so you could use that to deliver an update that breaks the boot loop.
I work for my local county government and we use Crowd Strike thankfully we only have about 1400 desktops and about 500 laptops we had to fix. We were back up to 80% fixed by the end of the day Friday when this happened. That was with all 4 techs plus the radio comms techs which was 2 more and 2 sysadmins all going to each and every PC in the county. Some departments have offices 60-80 miles away from our main office.
In this case it was actually 100% preventable by proper processes, which we do not follow due to costs... Any system that is this important should have a clone where updates get tested before they get deployed to the actual machine. This process is widely known as DTAP: Development/Test/Acceptance/Production. The patch should be automatically deployed on Test and tested in an automated manner to see if it still runs; in this case it would've failed to even boot. Then it goes to Acceptance to see if it still works functionally. Once you've done this, you go to Production.
If you say this, almost any manager nowadays will tell you, "yeah, but what are the chances!" Well, not that high, but hey, we've proven once again that the impact is disastrous!
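The DTAP flow described above can be sketched as a simple gate: an update only reaches Production if every earlier stage's automated check passes. The stage names come from the acronym; the `promote` function and the check callables are hypothetical stand-ins for whatever boot and functional tests a real pipeline would run.

```python
from typing import Callable, Dict, List, Optional, Tuple

STAGES = ("development", "test", "acceptance", "production")


def promote(update: str,
            checks: Dict[str, Callable[[str], bool]]
            ) -> Tuple[List[str], Optional[str]]:
    """Walk the update through the stages in order.

    Returns (stages passed, stage that blocked it or None).
    A stage with no check registered blocks the update (fail closed).
    """
    passed: List[str] = []
    for stage in STAGES[:-1]:
        check = checks.get(stage)
        if check is None or not check(update):
            return passed, stage
        passed.append(stage)
    passed.append("production")
    return passed, None


if __name__ == "__main__":
    checks = {
        "development": lambda u: True,
        "test": lambda u: False,       # e.g. the test clone never boots
        "acceptance": lambda u: True,
    }
    print(promote("content-update", checks))
```

In the example the update stops at Test and never gets near Production, which is the whole point of paying for the clone.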
Honestly, I'd say the fault is how we approach stuff like this in the first place. We have companies creating imploding submarines, we have Boeing, etc... At some point you have to ask yourself: is this entity at fault, or is it all of us for allowing them to be this faulty for the benefit of a few people's personal wealth? Because that is what it comes down to in the end.
You can restore it... but you need to change the SSD. Also, you can get the key from the account linked to the machine.
I work in IT. We don't use Crowdstrike, but this kind of an issue is not unique in this space. We use CarbonBlack and we've discovered that the people who control its configuration have INSTANT access to all configs on any PC that uses it. One time someone caused the software to BLOCK critical software we use and we could not run the business for 20 minutes until someone turned whatever setting they used off. Currently, they have blocked specific browsers, but we can't do anything with them, not even uninstall them. So they are broken pieces of software stuck on the PC.
This problem is probably more related to lack of knowledge and communication to those that control the backend, and then we are moving to the cloud in a year or two, I absolutely HATE the state of IT right now, it's damn scary to have uninvested people in charge of our infrastructure.
Also, from working in the tech industry for that data recovery company... it was *very preventable* we ended up finding out, for those who want an update to it. They didn't QA the update that was pushed on Crowdstrike's side. The only computer that would have received a notification of an issue wasn't even a QA person, but one of the devs, and that dev's computer was locked while he was out on PTO.
If they had properly QA'd the update *and* had set up a proper notification channel to a QA person, it probably wouldn't have been as catastrophic.
A friend of mine had to courier his laptop back to his office to get it fixed. In total it was 5 days of time wasted
My company pulled all the bitlocker codes and put them in an excel doc and did each laptop one at a time. Wild times.
Back during the summer there was a different outage that affected the auto industry, CDK. What was "great" was going into a management role a month after that outage, being told they were still recovering from it, and that my pay was going to take a hit because they were still fixing sales numbers and recovering from a loss.. like cool, didn't even work here yet, I have an agreed upon salary, but sure pay me less.
I am a senior systems engineer and thankfully we only had 30 servers affected because we primarily use Carbon Black. We did have a few servers in Azure and it is a pain in the neck to fix those. We unmounted the disks from the affected VM and attached them to another VM to get the file deleted, then had to move them back.
as someone who works in i.t. at a medical office, there is a reason why i have the system setup not to update for a minimum of three days. That being said, we also have a few systems that haven't been updated in years, to be fair they're isolated but the point is that in the medical industry from everything i've seen (at least at smaller offices) updates are resisted and only done out of necessity.
I remember a couple years ago someone hacked into a program that a hospital in my state used and in turn it infected all emergency services that used it in the whole state somehow, they loaded the service with some ransomware and it crippled the services for a couple months or so... hospitals, police, fire services everything connected to the service was completely locked out. My mother who's a nurse manager in a home health office here said they had to scramble to break out old paper files that hadn't been touched in years, pull others out of storage, or try to get more recent physical paperwork from other hospitals and care offices because no one could log into the online services... equipment like ekgs and central computers just stopped working
I was half involved in our company's recovery (I didn't perform the fix, but I prepped server restores from pre-update). It seems like "blame" should be easy to determine through contracts. I don't know if a jury or judge would need to be an IT person, they could literally just sit through a Thor-style MS Paint session, since the problem is logical in nature.
I thought I had to manually boot machines into safe mode and delete the update on around 200 devices. Fortunately a lot of the machines were off when the update went through so it ended up only being around 25 machines.
inb4 its the system temp folder change that caused this
For those wondering, this was from July. On Aug 6th CrowdStrike put out a blog post called "Channel File 291 Incident: Root Cause Analysis is Available" where they admitted they were 100% responsible. This actually is not the first time it happened. It's not even the first time *this year*.
This is misinformation. No part of that report admits to 100% of the responsibility. Out of the 6 described issues, only #6 (staged deployment) is something that is solely within Crowdstrike's domain. Based on the report alone, other issues can easily depend on external factors such as Microsoft-defined APIs that are not expected to suddenly change. The described mitigations can be seen as _additional_ precautions, not as something that is required of a Windows kernel driver.
In fact, they explicitly mention passing WHQL certification, and it isn't specified if Microsoft's or Crowdstrike's update broke a specified API standard. Maybe Crowdstrike relied on something that wasn't formally defined but practically remained unchanged for a long time. Maybe Microsoft failed to specify which APIs are changing with the update and communicate it to partners on time. Maybe both messed up.
While it does look (at first glance) like Crowdstrike's regex shenanigans are to blame, I can't help but remember the displeasure of dealing with Microsoft updates shutting down production due to server and client protocol updates being released simultaneously without a deprecation announcement, effectively killing all clients at once until the server gets updated (while provider admins are napping) or we blacklist a sudden Windows update. Extremely minor incident in comparison (~2h downtime), but Microsoft also breaks stuff from time to time, and it's entirely possible that Microsoft released something that was incompatible with their own WHQL certification.
We really need more information before we blame Crowdstrike for absolutely everything. They definitely made mistakes, but it's possible it isn't entirely their fault. And so far they haven't admitted it's solely their fault.
@@MunyuShizumi
When you say "it isn't specified if Microsoft's or Crowdstrike's update broke a specified API standard." here's literally the first line of the Root Cause Analysis:
"On July 19, 2024, as part of regular operations, CrowdStrike released a content configuration update (via channel files) for the Windows sensor that resulted in a widespread outage. We apologize unreservedly."
You said: "other issues can easily depend on external factors such as Microsoft-defined APIs that are not expected to suddenly change"
Microsoft wasn't involved at all. All six issues were wholly within Crowdstrike's domain. If you read the RCA, you would know that the root cause was "In summary, it was the confluence of these issues that resulted in a system crash: the mismatch between the 21 inputs validated by the Content Validator versus the 20 provided to the Content Interpreter, the latent out-of-bounds read issue in the Content Interpreter, and the lack of a specific test for non-wildcard matching criteria in the 21st field.". I'm not sure where you're getting that a Microsoft API changed anywhere in that document. It was Crowdstrike software attempting to read an out-of-bounds input in a Crowdstrike file, sending Windows into a kernel panic.
You said: "Maybe Crowdstrike relied on something that wasn't formally defined but practically remained unchanged for a long time"
Both the Falcon sensor and Channel Update file 291 are Crowdstrike software - not Microsoft. Issues 1, 2 and 4 described what Crowdstrike's Falcon sensor didn't do. Issues 3 and 5 are gaps in their test coverage (gaps is an understatement). Issue 6 they didn't do staged releases leading to a much more widespread issue.
You said: " and it's entirely possible that Microsoft released something that was incompatible with their own WHQL certification."
The WHQL certification only certifies the Falcon sensor, not the update files - thus it's irrelevant to the root cause. The issue isn't that they changed the sensor software, as that would require a new WHQL certification testing process; it's that they changed what the sensor was ingesting. It's like saying I have a certified original Ford car, but then I'm putting milk in the gas tank and wondering why the engine is bricked.
You said: "We really need more information before we blame Crowdstrike for absolutely everything. They definitely made mistakes, but it's possible it isn't entirely their fault. And so far they haven't admitted it's solely their fault."
They literally did. What more information do you need? They released the full RCA. There's no more information to be had. Crowdstrike pushed a buggy update that bricked millions of systems, resulting in trillions in damages, and almost certainly led to deaths (emergency services and hospitals were offline for many hours).
@@MunyuShizumi No, it isn't misinformation. Go reread the RCA. Literally the first line is "On July 19, 2024, as part of regular operations, CrowdStrike released a content configuration update (via channel files) for the Windows sensor that resulted in a widespread outage.
We apologize unreservedly."
Microsoft isn't involved in the incident besides the fact that these are Windows machines.
It was Crowdstrike software trying to access an out-of-bounds input from a bad config file that somehow passed all of Crowdstrike's "testing". All aspects involved are Crowdstrike.
The certification is irrelevant. That's only for the Falcon sensor itself, not the inputs. That's like getting a certified new car from Ford and pouring milk in the gas tank and blaming Ford for making a bad car.
All issues are Crowdstrike's fault. None of the issues in the RCA have anything to do with Microsoft. 1, 2, and 4 are what the Falcon sensor failed to do. 3 and 5 are (enormous) gaps in test coverage. 6 is the lack of staged releases.
What more info do you need? There's not going to be any. This is it. This is literally the document that says what happened and why. Microsoft didn't change anything, Microsoft isn't really mentioned besides the certification and the pipes.
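For anyone who wants the root cause from the RCA in concrete terms, here is a minimal Python sketch. It is not CrowdStrike's code (the real sensor is a kernel-mode driver, and every name here is hypothetical); only the 21-versus-20 field counts come from the RCA summary quoted above. It shows why a wildcard in the 21st field hides the bug, and how a runtime bounds check turns an out-of-bounds read into a rejected rule.

```python
# Hypothetical illustration only: the field counts come from the RCA summary
# (21 validated vs 20 supplied); all names are made up for this sketch.
VALIDATED_FIELDS = 21   # what the Content Validator checked against
SUPPLIED_INPUTS = 20    # what was actually passed to the Content Interpreter


class RuleRejected(Exception):
    """Raised instead of reading past the end of the input array."""


def read_field(inputs, index):
    # Runtime bounds check: refuse the rule rather than read out of bounds.
    if not 0 <= index < len(inputs):
        raise RuleRejected(
            f"field {index + 1} requested, only {len(inputs)} inputs supplied"
        )
    return inputs[index]


def evaluate(criteria, inputs):
    """Return True if every non-wildcard criterion matches its input field."""
    for index, pattern in enumerate(criteria):
        if pattern == "*":  # wildcard: the field is never read
            continue
        if read_field(inputs, index) != pattern:
            return False
    return True
```

With 21 wildcard criteria the 21st field is never touched, so tests pass; the first non-wildcard criterion in field 21 triggers the read. In kernel-mode native code without the check, that read is what takes the machine down.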
Quality assurance. Especially testing, since enterprise-level security software is so hacky in how it does what it does, but also gradual rollouts of updates, so that if something goes wrong it doesn't go wrong on every machine at once and you can catch it and stop the update.
My understanding of the cause from Crowdstrike's end (I admittedly haven't looked into this in a while) was that a file was empty that shouldn't have been. Another part of the software tried to read from that file and crashed with a Null Reference exception. It's possible the file itself was fine when tested, but something went wrong in their release process which corrupted the file and it isn't something that could have been directly caught by QA at that time.
That being said, it seems like QA or Dev should have caught the "bad file causes null reference" problem, as a null check should always be done before trying to access the reference, no matter how sure you are that it can never happen. It may be ok to crash loudly in dev, but the prod release should always handle it gracefully.
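A minimal sketch of the "check before you access, degrade gracefully in prod" idea in Python. The file format, the `MAGIC` header, and the function names are all assumptions made up for illustration, not anything from CrowdStrike's software.

```python
import logging
from pathlib import Path
from typing import Optional

MAGIC = b"CFG1"  # hypothetical file header, purely for illustration


def load_content(path: Path) -> Optional[bytes]:
    """Return the payload, or None if the file is missing, empty, or malformed."""
    try:
        data = path.read_bytes()
    except OSError as exc:
        logging.warning("content file unreadable: %s", exc)
        return None
    if len(data) <= len(MAGIC) or not data.startswith(MAGIC):
        logging.warning("content file %s rejected (empty or bad header)", path)
        return None
    return data[len(MAGIC):]


def apply_update(path: Path, current: bytes) -> bytes:
    # Keep running on the last known-good content instead of crashing.
    new = load_content(path)
    return current if new is None else new
```

The design choice is that a bad file costs you one update, not the machine: the caller always gets back something usable.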
I was working 911 that night. We had no Computer, CAD or anything except phones. No text to 911, half our services were crippled. You should see how difficult it is trying to obtain people's locations during a traumatic event with them screaming mad at US that the systems were down. I was there with a pen and paper and google maps on my phone. Nothing we can do except deal with it until shift ends.
i would like to see where Thor got his accounts of windows pushing an update that crashes the configuration of named pipe execution since that's what crowdstrike claims they did (updating the channel files)
I was working that night at my hospital when everything rebooted and BSOD'd at 2-2:30 am. We thought it was a cyber attack and acted immediately. Our entire team went all hands on deck, and stayed all Friday and weekend to recover critical servers and end devices to keep the hospital running.
The blame could lie in the forced updates that every company seems to want to do nowadays. Back in Win7 updating was optional, and so far it still is with most software. I've been with Thor on his short about "I am the Administrator, you are the machine", as I HATE getting updates shoved on me, because it seems every few updates something breaks and they seem to insist on shoving phone games and bloatware onto your device. We would be well served by making updates optional by default again, and you can schedule an IT guy to check every week or month or whatever their policy is.
Crowdstrike has requirements for using their systems by mandating updates. They do this because if you don't keep everything up to date, you are exposing your services to unnecessary risk that they may not be able to defend against, at least the non updated versions of their software can't defend against, and because of that, wouldn't be held liable if something happened.
@justicefool3942 but we come back to Windows updates are notorious for breaking random ass shit for an update that I could guarantee you didn't need to happen.
Forcing updates is a blessing for everyone, you don't value it until you meet a company that blocked updates and every machine is still on Windows 10 Redstone 2.
You'll curse so many people the day you have to update all those machines because software doesn't work anymore and now 1 out of every 3 machines starts bluescreening or locking up because the drivers are completely fucked up and outdated.
@@devinaschenbrenner2683 Remember WannaCry: only out-of-date systems were affected. The broad attack came in May, but the fix was already out in March. Most private machines were safe because of auto updates, but a lot of companies were screwed because they never got the fix.
Not really. Large companies use enterprise versions of Windows, where you can stop/start/schedule updates whenever you want. But CrowdStrike needs the updates for security, or else what's the point of having CrowdStrike (or any similar software) on an OS full of holes?
If I had to point to a likely blame, I would lean toward Windows. macOS prevents issues like kernel panics with features such as System Integrity Protection (SIP); Windows doesn't have the same safeguards in place. I'm not saying macOS is inherently better, but it shows that it's possible to protect against these types of vulnerabilities.
To access the file they are talking about at my company, we had to use the BitLocker key first; then, if it worked, we could make the changes. If BitLocker did not work, we reimaged the device. We had 12-hour shifts going for 7 days to fix over 6 thousand PCs.
First stories: Looking at the actions people took with the benefit of hindsight. This typically attributes blame to individuals and not systems.
Second stories: Looking at the system and seeing WHY mistakes were made. This approach makes the reasonable assumption that everyone is a rational actor and makes the best decisions they can with the information they have. Using this approach typically allows organizations to grow and improve, tackling the real root causes of problems instead of taking the easy route of blaming individuals.
Karin Ray (on a discussion by Nickolas Means)
Does microsoft use crowdstrike and if they do how do they not test their own OS updates against their own systems? Pushing updates straight to production does not sound like something OS used for military etc purposes should be doing.
Different field (home inspections), but "If nothing happens, nothing happens" is very much on point here. We get new safeguards only after something damaging happens. This failure was predictable and I'm sure you can find plenty of people who predicted it. But it's extremely rare for action to happen because a very bad outcome is theoretically possible.
It's also entirely possible that Crowdstrike and Microsoft tested the latest update in isolation and it worked, but broke when pushed to live end-users. Really curious to see how this unfolds
For any kind of large scale upgrade, you test the update on 1 machine running whatever apps are important to you to validate the update. Only when that testing is successful do you ever update the rest of your stuff. You also always have a method of rolling back if it fails, one that is tested and verified to work. So a failed update should never disrupt anything for longer than a few minutes. Also, terminal servers exist. Anything I can do consoled directly into a machine I can do just as well from 3000 miles away, other than maybe having someone physically power cycle it so I can interrupt the boot sequence to get in. There's no way I could do my job if I had to fly all around the country constantly to physically console into stuff.
With Thor talking about the courts deciding who's at fault, I just think of the saying "I didn't say you're the one who's responsible, I'm just saying I'm blaming you."
The crazy thing is in the 90's as computers became incorporated into industry.. industries had crash kits. so they could continue to operate if the network went down. credit card rollers to process payments, paper pads with forms that would normally be inputted into a computer.
Hell, in the restaurant industry they still have manual overrides. Ovens have computer boards that can change the temp, add steam, and add time all on their own, but there is a manual override inside a panel for if the motherboard fails.
I love how we had probably the 2 worst cyber incidents within a month of each other. In June we had the CDK cyber attack that took out about half of dealerships and cost billions, only to be one-upped by the Crowdstrike incident that crashed nearly the entire world.
In Canada, Rogers went down nationwide twice and they had to set up some sort of "if our systems are down we'll allow other infrastructure to be used" arrangement.
Canada isn't the whole world, but it's pretty similar to the CrowdStrike situation in that it halted a lot of stuff in Canada for a day, not once but two separate times.
So it's not like this sort of thing never happened; it's that our world is too connected via software, and a few wrong forced updates can wreck a nation.
CrowdStrike is to blame: their bad update definition caused their kernel-mode driver to crash. There was no Microsoft update that set this off. The issue occurred when the machines rebooted for a routine Windows update, at which point the CrowdStrike kernel-mode driver ate its own tail and took the OS with it. If Microsoft has ANY fault, it's that they should have put a stop to kernel-mode drivers as a whole years ago. But the 3rd-party AVs balked at that and cried antitrust, and now people are dead because of it.
This comment matches my understanding of the situation. I'm confused as to why this video is still up, as I can find no source relating to a related Windows update that caused, influenced or even corrected (after the event) this matter. If anyone has such a source, please share it. I'm aware of an unrelated Azure bug in the US Central region that day.
I believe the other issue was how Crowdstrike deployed the update to bypass the Microsoft verification, the Kernel program got the code from outside the Kernel, so that code didn't need the verification.
I also believe that they bypassed the staging options that people had setup, which would have reduced this to virtually no impact.
What program does Thor use to draw on. Is it just paint? My paint doesn't look like that?
The most difficult check in QA, speaking as a former QA video game tester myself, is CROSS CHECK VERIFICATION. It's pretty clear that this was probably a situation where the update ticked EVERY box properly except the cold boot restart. 'Cause 99% of the time the most easily explained answer in testing is the real issue.
If you don’t have a way to switch immediately to manual record keeping for at least a temporary span, then these things will shut down your business.
Any retail store can keep written records of transactions and process them later like we used to do 20-30 years ago-even with credit cards or checks if you take down the info properly and have some level of trust with that customer.
I also worked ems in a rural area and legally it’s fine without digital records as long as you do have a way to digitize them eventually
I think this is going to push for a lights out out of band management for every single laptop / desktop the same we do for servers. We can get into any server via iLO / iDrac / BMC / etc to fix this. But every desktop and laptop, we have to go hands on for.
I built a custom setup for our org that you just boot to the boot media put it in safe mode then run a bat with local admin once you get into windows. it still took over a week to really clean up with a solid 72 hours of pain trying to touch all of our 100k+ devices.
well *hopefully* this means that airports finally update their infrastructure. all of aviation is incredibly behind the times, but especially so our computer infrastructure, it's ridiculous
edit: yeah that's the thing. processes and procedures often get written in blood, and I suspect this will be one such case. there are many like this, 14 CFR is chock full of them, and they will continue to happen as long as technology improves
I think how it's gonna play out is along the lines of "once a certain amount of people depend on you you need to start announcing your rollouts a certain amount of time in advance"
Certainly from a liability standpoint MS gets to say "well we told them we were rolling this out 2 weeks ago or whatever, they should have checked"
May I ask for the source on where Microsoft pushed the second update?
i read somewhere that crowdstrike has some in's with regulatory bodies so companies just use it by default to comply because otherwise they'd have to jump through hoops, that means people 'forcing' them to do this made the scale of this vulnerability much more of an issue and should also be held liable in some way
Has there been a follow up on this
CrowdStrike is crazy because I was stranded in Connecticut while I live in IL, while my mom was slowly having a stroke (we didn't know until we got home 2 days later)
Our checker machines died off, only could do cash. People turned into almost panic mode like "omg i need to get my cash out of the bank!" it was going to spiral out of control on retail, but ..i can see this having long lasting effects. Been a while since ive seen the public panic.
why does an MRI machine even need to be connected to the internet and receive updates
What they haven't discussed and is going to be equally relevant is that because this is/was a global issue, not a national one, SCOTUS is ultimately just one Court. There are lawsuits being filed in the UK, in France, in Germany, in Australia, Japan, China, South Africa, Argentina - they're not all going to be resolved by one court, and SCOTUS' ruling does not set a precedent outside of the US (though I imagine the verdicts of some of the courts that resolve this soonest will be used as part of the argument of whichever party the ruling favours).
The investigation will take months.
The legal fallout? Years. Closer to decades.
It seems this has largely blown over and isn’t being talked about any more. Were any decisions made about blame/liability?
When was this recorded? I just went back and reread CrowdStrike's post mortem and they don't mention anything about a windows configuration changing, just a regex issue.
This was recorded pretty soon after it happened so the information was not fully out yet, at this point we know the fault was fully on crowdstrike.
This is why I hate Windows forcing updates on systems. You can't test stability on an isolated box before the entire office gets an automatic update pushed to every box
In the enterprise space, you can set when updates happen. So you can roll out an update, or test updates in a VM.
The issue is that CrowdStrike has the ability to ignore that for some reason. Something about that particular update meant it was considered a critical update.
So even if your IT team was doing their due diligence, Windows and CrowdStrike fucked it up, 'cause as you said, they forced it to happen.
It's fucking stupid that all users are essentially beta testers.
ever heard of wsus?!?!
@@FFXfever Everyone is a Beta Tester, spot on!
Immutable Linux is perfect for enterprise stability. It's basically impossible to break, and if you do manage to break it, simply revert to the previous system image.
@@FFXfever Even in an enterprise space you still cannot prevent windows from performing updates tagged "critical", unless you configure the OS to use a custom windows update server.
so comfy to choose a company that does all these things for you, automatic so you don't have to wait too long and reduce your exposure to potential attacks.
Just from a diagnostic perspective, if the only place where the update Microsoft released caused issues was on machines that were also running Crowdstrike software, I would have thought that who is responsible for this problem is pretty clear, logically speaking.
It's like a "not all cars are Toyotas, but all Toyotas are cars" kind of deal. I guess though it's more about what protocols there are for dealing with how well different pieces of software play with each other, and the responsibility for that is a bit unclear.
To somewhat counter what Thor said, this was preventable for the companies affected. At least for the larger ones. If you have critical infrastructure and you need to update the software you test that in a dev environment first. As Thor said, the software providers won't test every possible combination and you, as an end user, don't know if your particular concoction of software will cause a compatibility issue with a new update. So you use a separate test machine, load the update, and monitor the stability. If it's all good THEN you push the update to all your machines.
Like how is Crowdstrike or Microsoft gonna know that Burger King's order queuing software crashes with the new update? They won't, and if Burger King just blindly pushes that update to all their machines no one will be getting any Whoppers. In that example it would be up to a Burger King IT employee, likely at the HQ, to first load the new update onto a test machine and see if it all works well together before releasing the update to all their machines.
Clearly a lot, and I mean a LOT, of companies weren't doing this.
It's probably going to be shared liability between CrowdStrike and Microsoft, but it's what percentage each gets that is up in the air.
Oh... I guarantee you, most of the liability is going to be thrown at Microsoft at first. And it has nothing to do with their real liability or how much evidence there is of fault... it's that Crowdstrike has a net worth of $65 billion and Microsoft has a net worth of over $3 TRILLION... people are going to sue the company that's likely to throw them the most money to go away.
How is it Microsoft’s fault?
I don do that crystal math, but jus 100% both of em xD
Can Microsoft argue that they were forced to create an unsafe environment by the EU, who mandated they allow third-party companies to have kernel-level access for the purposes of anti-malware and the like? Microsoft had a plan in place to allow third-party antivirus to just hook into the same kernel-level driver they use for Windows Defender, but that was ruled monopolistic by the EU.
@@goomyman23 How is it not Microsoft's fault? Is a better question.
I had the assumption that releasing an update globally at the same time wasn't a good practice.
I work in a large company and our application updates are always gradually rolled out to users.
There is a cost and a tech challenge because several versions need to coexist and keep working.
But CrowdStrike is a security company with major clients and ultimately lives at stake. And preventing this kind of damage (from malicious intent) is literally their business.
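For reference, a gradual rollout like the one described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (the function name, the percentage stages, and the `deploy` / `is_healthy` callbacks are all hypothetical), not how any particular vendor does it.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    stage_percents: Sequence[int],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy in growing waves; halt after the first wave with an unhealthy host.

    Returns the hosts that received the update and whether the rollout
    reached every host.
    """
    updated: List[str] = []
    done = 0
    for percent in stage_percents:
        # Cumulative target for this stage, always advancing by at least one host.
        target = min(len(hosts), max(done + 1, len(hosts) * percent // 100))
        wave = hosts[done:target]
        for host in wave:
            deploy(host)
            updated.append(host)
        done = target
        if not all(is_healthy(host) for host in wave):
            return updated, False  # stop here: later waves never get the update
    return updated, done == len(hosts)
```

With stages of 1%, 10% and 100% over 100 machines, an update that breaks every host it touches stops after a single machine instead of all 100. The cost mentioned above is real though: while a rollout is in flight, old and new versions have to coexist.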
I would have loved to catch this live. I watch a number of lawyers on youtube and from what I've learned, If one side is found say 25% responsible then they can also be held 25% liable.
Can someone point me to something that explains what update Microsoft pushed? Everywhere I look, I only read that CrowdStrike pushed an update of their data and nothing else..
Microsoft didn't release any update related to this. The actual episode is over a month old, and I suspect Thor was speaking to what he thought was reliable information at the time. The actual issue was entirely on CrowdStrike.
That's what I thought.. but afaik it was almost immediately clear that Windows had nothing to do with it.. when was this video recorded? I just read the official CrowdStrike report again and there is no reference to MS updates:
www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf
It hit us hard. With encrypted devices I got into a pretty good rhythm. But even with the tool I made it still took about 1.5 minutes per machine. I had thought of using something a bit better, but with so many people down I just didn't have the time to craft it. Still, touching as many machines as I did was awful. I set up a triage center and just had people bring their machines to me. It was the fastest way to recover things, rather than going from place to place.
There's a funny side effect to poverty. The provincial airport I had to fly from had no outage because they could not afford CrowdStrike. Malwarebytes and machines that can be swapped out in 15 minutes. It's crazy how this is the "defense" against that monopoly. You need to break up CrowdStrike and limit how much critical infrastructure gets "protected" by one company.
When was this? I’ve never heard of this
The real issue is that vital machines and critical infrastructure are updated without testing the updates on an offline machine/test bench before updating the whole system, to ensure everything still works after the update. If everyone did this the problem would never have happened because no one would roll out that windows update to their systems.
I work for a medical device company and while none of our devices were directly impacted by this issue (none of them are Windows machines themselves or require a Windows machine for them to be used), the work computers utilized by the vast majority of the company WERE impacted (very few Mac/Linux users because you need to receive special permission for IT to issue one).
Among my team of software developers only about 10% of them had functional computers the morning it all went down. I was "lucky" enough to be one of them and all the others in a similar position had one thing in common - the night before when we left the office all of our computers were disconnected from the company network and instead connected on our own small personal networks without internet access used to separate off prototype devices.
Everyone with their computers connected to the main company network had the updates pushed to their machines automatically that night. I don't fault the IT department for that at all because honestly some users' computers would never get updated otherwise, but it was the one instance where a near-universally beneficial IT policy happened to have an unexpected outcome that turned it into a liability instead.
I'm sure someone's already said it: USB Rubber Ducky. You've still got to go to every device, but if you've got 20 of them you can do 20 devices at the same time.
Can someone please, for the love of god, help me figure out what the drawing software that Thor uses is? People say it's MS paint, but I have no idea how to get that configuration on it.
Bad idea to push this dedicated clip well after Crowdstrike released their RCA confirming they're at fault.
Really should take this down or preface the clip with something indicating this was filmed within days of the outage occurring.
Here in Egypt nothing shut down because we use paper, now I appreciate the dictator XD
so a process-fix to avoid such issues would be gradual rollouts of updates, but that in itself will make it harder to make sense of what's going on, and is a potential extra point of failure