This one went public; ours didn't. CrowdStrike had a Linux kernel memory leak that killed applications with heavy network usage, like MPI apps. It didn't leak enough on light-use hosts, so it looked like our software had suddenly developed a hidden bug.
The recent Blue Screen of Death incident triggered by a CrowdStrike update, along with the scam involving a fake fix file, highlights the critical need for strong cybersecurity measures. Healthcare facilities encountered significant challenges as well. Hospitals depend on electronic health records and other IT systems for patient care, and the BSOD issue disrupted these systems. This disruption had the potential to jeopardize patient safety and treatment by delaying access to essential information.
The side effect was that many companies told workers to log off and finish early on the first day so engineers could deal with the problem. Just under 9 million machines may have been directly affected, and you can triple that for the number of people who could not work the first few days. At a restaurant on Sunday everyone had to pay in cash, because of the panic that had spread.
Outsourcing QA is such a typical move from companies that don't understand QA. The idea behind outsourcing is that you don't need to pay QA during most of development; you just get people to test against the spec once most of it is done. The problem is, if you hire competent testers, they often find problems with the spec, or at least with its scope. And if you bring them in late, they either start raising big flags about problems or have absolutely no idea what is actually needed and basically work blind. The most useful thing QA does during development is ask a lot of questions.
"you don't need to pay QA during most of development", these definition 'updates' are a constant process, so there's no "most of development", if they are done with one, the next is already ready to test. Thus also the question if this can be done by a human QA team.
Also, cost isn't always the only reason; it can also be an issue of competency. In the last 30 years there have been times when you couldn't spit without hitting a jobless IT person, and times when you couldn't get an IT person unless you were fishing with literal gold (way-over-the-top salaries). So if company X on the other side of the planet does service Y and has the skills to do Y, why not outsource it there, if you can't do it yourself or can only get the people no one else wanted (often for very good reasons) to do the job?
I've worked for companies where outsourcing was fantastic, horrible or just normal, sometimes even within the same company. For example, 10+ years ago we outsourced our Cisco networking to another company: a fantastic group of people that delivered absolute quality! On the other hand, we've outsourced simple IT tasks (ones that could be done outside of office hours) that were done so horribly wrong and incompetently that it took more time to fix everything the next day than if I had done it all myself from the get-go. Outsourcing can be great, but it is often horrible when done for the wrong reasons.
In most corporate offices, the $10 ubereats gift card would get you the sum total of one (1) cup of coffee. The delivery fee (assuming no tip!) itself would cost more than the coffee. You might also get a bottle of water or a drink like lemonade or soda (say, Coca Cola). You could not buy any food at all with that meager amount. And regarding all of those drink options...the vast majority of corporate offices in the USA already have their own coffee stations and bottled water / soda vending machines. The ubereats gift card is a stupid decision, for an insulting amount, and utterly transparent as the cheap ploy that it is.
May the lawsuits fly fast, ferocious, and furious against CS. And its toxic CEO.
Allowing automatic updates to your corporate systems verges on criminal negligence. As a system maintainer, apply updates to a test system and test them before pushing updates to production systems.
Regardless of CrowdStrike's negligence, the companies that suffered this failure share responsibility for their lack of due diligence.
As a former network technician/software developer, I couldn't agree more with your astute observation. CS have failed the ultimate security test, and one they had a direct hand in. Businesses fully indoctrinated into the digital fantasy will never feel 100% secure after this. Hold and use cash as often as you can.
As a maintainer, how quickly could you protect against a zero-day in SMB (for example) and find an alternative means of file sharing? Would it be more or less effective than an automatically updated security product?
How exactly are you managing the definition updates for your AV/EDR solution? And which solution is that exactly? Are you doing a full test suite for each definition update? For each specific piece of hardware/software combination? If so, who is the CEO that's paying for all that, maybe they're interested in a piece of moon real estate? ;)
And who gets lined up for execution when you're a couple of days late on a definition update rollout and your whole infra gets ransomwared, while that definition update would have protected you from it? MS Defender does 50-60 definition updates per week, for example. Do you test them all every week before rolling them out?
Couldn't agree more.
@@vasiliigulevich9202 Even if you ran basic tests that took an hour (e.g. does a test system still run after this update?) before shipping these updates to all production systems, it would be better than blindly rolling out new software, where a failure is inevitable in the long term, versus the small chance of being an hour behind on updates and some bad actor getting in during that hour.
As one of the last manual testers in my company, I agree with you 100%. For some time now I have seen that all companies (including the one I'm working for) are laying off manual testers and looking for automation testers. The worst part is that many companies don't even try to hire junior QAs and train them, so it seems that in the long term we will have only senior QAs and not enough junior or mid-level ones to fill the gaps.
This is the issue with the monopolization of sectors. Crowdstrike should not be so powerful that it can cause that much damage 🤦🏽♀
After a lifetime working in the UK IT industry: US companies make the worst serious software "mistakes". Why? It all comes down to profit at all costs. Proper testing costs money; therefore, let's skimp on it.
Even a release on Friday can be necessary, because you do not want critical systems to be vulnerable for 3 days until your workers finish their weekend. Some parts of the world have different working days, and hackers don't care.
This was so bad. It's good that someone wants to sort it out, because there were so many failures. I hope they have to pay big, because otherwise they will do it again.
A lot of companies are learning the hard way what the people they got rid of actually did. Short term “line go up” failed thinking
I'm in no way trying to defend the Crowdstrike clowns, but it would be interesting to know just how many Rapid Updates they have pushed that did not result in catastrophe. By responding to risks very quickly, they inevitably fall into the "low frequency/high severity" scope of hazard management, as would any player in this market.
I can just imagine a swarm of middle-management types quickly throwing Power BI pie charts together showing that "Actually... we've done really well, because we've only had 1 multi-catastrophic event for every XXXX Rapid Updates, so, please can I have my bonus now...?"
Clownstrike! A suitable name.
They did temporarily improve security by denying the vast majority of users access to the computers... and anyone who has worked on an IT Helpdesk knows just how damaging the average user can be!
That whole postmortem reads like an attempt at BS-ing themselves out of an apology and of admitting how much they screwed up.
Yeah, that might be the bigger issue. But on the other hand, it's an American company, and admitting fault is a very bad idea in America, especially when everyone's first thought was "Can we sue!?!?". What exactly happened isn't to be found in the analysis of YouTubers, bloggers or the company statement. Only an independent audit by experts would give a clear picture, and even then you have to ask yourself how much the company/employees were able to hide to make themselves look better. And often those independent audits aren't shared publicly.
It is also a huge failure of all the clients that they use a security product with such big problems in its release process. It is understandable for small companies, but if your IT department is dozens of people, they should know the software they are using better.
The video repeats some unprofessional opinions from media:
- 6:12 The file itself was never the problem, nor was it claimed to be. The problem was a null dereference when parsing a valid file.
- 8:13 Phased rollout and deferred deployment allow malicious parties a time window to circumvent the security product.
Other information seems to be correct.
No one at CrowdStrike installed this on ONE of their machines? Maybe they don't use their own software.
I think you're missing how things work. A company would never install its software on its own computers before pushing to customers, precisely because if it goes wrong/doesn't work they won't be able to work themselves, and that would be bad. Always better to use someone else's s/w. :)
@@beentheredonethat Unfortunately you would ALWAYS install it on at least ONE of your machines to check that it works.
It should be noted that Crowdstrike had a similar event happen earlier *this year,* although that was a bug which disrupted systems running certain versions of Linux. Apparently they reacted to that by shrugging, instead of by re-examining their QA procedures. The video from Dave Plummer which is linked to in the info of this video mentions that incident and Dave is very informative.
I'd also note that this was released very early on a Friday morning (sometime around 2am ET). I work in a computer center which is available 24x7, and have worked in that kind of environment for over 30 years. We do have a rule about not releasing changes on a Friday, but what we really mean by that is "No major changes after 11am". It's fine to make changes which will be finished before 8am _(as long as other IT staff are aware of the change before it happens),_ because the company should have plenty of staff around for at least the early part of Friday. The fallout from *this* specific change would not have been all that much better if they had done it at 2am ET on a Tuesday.
I'd also mention that I'm fine with adding in automated QA procedures. It isn't the automated QA *per se* which is bad, it's when you then jettison your experienced QA employees because you expect the mindless automated QA to catch any and all problems. And while we might laugh and point fingers at the management of a *security* company for doing this, I'll note that right now we're seeing many organizations racing to do the same damn thing while salivating over the profit potential of using AI for more of the organization's key operations. This may have been the biggest IT failure in history, but I expect we'll see even bigger failures in the next few years.
FWIW, for us all important servers were fixed and back to normal operation by maybe 4pm that Friday. There were a number of desktop machines which were still broken, but most of those were because the person who worked at that desktop was on vacation so they weren't using their computer anyway. I know of other organizations which weren't back to 100% normal operation until Wednesday of the following week.
From a different perspective: you are undoubtedly correct, and very instructive. This is precisely the problem I point out with the agile methodology as implemented today: no human testing, supposedly to save time and money. Many thanks!
The post-mortem(s) must have been exciting.
C-suite 👉 PO 👉 👈 Lead 👉 👈 Dev 👈💻
HR be getting tons of QA resumes.
That "windows kernal" around 2:30 made me smile. Nice sideswipe.
I don't care what the 'official testing process' is: if you write a piece of software, or a new input file for an existing piece of software (e.g. Falcon Sensor), *you run it on your machine or a test machine at your desk first*. If it crashes your test machine, you *fix it before* you release it to your 'official testing process', let alone releasing it into the wild. All that said, I completely agree with your 'fish rots from the head' comments. I spent 30 years in the ICT industry, and I've spent much of it cleaning up messes caused by people/companies who made little mistakes that caused major disasters because they were trying to 'save money' or 'optimise operational efficiency' or some other corporate BS phrase that equates to mindless penny-pinching.
This would be a Microsoft level of effort. Remember, to test this code in an environment close to production you have to reboot the test host multiple times.
What you describe is Testing 101. This appears to be a dying art/skillset. As for the rollout, I can't understand the mindset that thinks it's best practice to deploy everywhere in a big-bang approach. Even if testing was clean, you need to plan for the worst-case scenario. Also, as a customer, why let it go live without checking that it actually works?
The CrowdStrike business model, and its primary differentiator from all of its competition, is their automated testing. This is why we adopted them as our corporate standard, along with our primary vendor Rockwell Automation. Other companies like Symantec impose a 24-hour vetting process, if not more, which leaves zero-day attack vulnerabilities open. CrowdStrike was not cutting corners by using automated testing! We do not want them to start using people for the QA process. We just want CrowdStrike QA people to review their automated processes.
Oh come on now. How much human time would it take for a human to try the update before it is sent out and installed on millions of customer machines? Nothing having to do with computer security needs to be done so fast that a 30 minute delay would make any difference.
CrowdStrike may be relatively large in terms of market cap and market share, but their yearly revenue is less than $3.5 billion. Compare that to a company like Microsoft with ~$245 billion in revenue.
Anyway, any ONE company should not have this large of an impact on critical businesses.
Thanks for the video!
Crowdstrike's CEO should face a congressional hearing like that of Boeing's CEO
The biggest issue to me is that the client software tried to use a file filled with zeros! It is common in Windows for the first four bytes to uniquely identify a file type, and they didn't check this! It also shows that there are NO checksum or validation checks.
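To illustrate the kind of check being described, here is a minimal Python sketch of a loader that refuses a definition file whose magic bytes or checksum don't match. The magic value, file layout and function name are hypothetical, not CrowdStrike's actual format:

```python
import hashlib

MAGIC = b"CSDF"  # hypothetical 4-byte magic number for a definition file

def load_definition(path: str, expected_sha256: str) -> bytes:
    """Read a definition file, rejecting it if the magic bytes or checksum are wrong."""
    with open(path, "rb") as f:
        data = f.read()

    # A file of all zeroes (or any truncated/corrupt file) fails here immediately.
    if len(data) < 4 or data[:4] != MAGIC:
        raise ValueError(f"{path}: bad or missing magic bytes, refusing to load")

    # A checksum shipped alongside the file catches corruption in transit or on disk.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"{path}: checksum mismatch, refusing to load")

    return data
```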
In my opinion they didn't need an actual human tester or QA team for this update. What they needed was an actual test to see if the update worked. If not (BSOD or any other error) this would have been caught even with an automated test environment. They obviously didn't test it.
I love that "local developer testing" was on their list of things to improve haha
Manual testing and quality assurance is fundamentally incompatible with releasing at speed, which is required in this market.
This is what CI/CD is for. The problem here is that their software is split into three parts, and only the core kernel module gets proper testing. Their rapid updates deliberately skip being run with the core module, and the template file is not tested either; only the signature file is checked by the validator. Then the template and signature file are bundled together and deliberately shipped to everyone all at once with no further testing.
The actual error was that the template file had a bug where the last parameter did not get collected, so when the correct signature-file regex passed the test, this was not caught, as the template was never checked for it. As it also skipped the integration tests with the core kernel module, they just assumed it would recover when something went wrong, but again this was never tested.
Aggravating this was their communication to customers. Their software offered an option to use older versions, especially n-1 and n-2, but they failed to inform customers that this only applied to the core kernel module. They also told customers that the kernel module was designed to auto-recover from a bad update. This led the tech teams at customers to believe that the live updates were exhaustively tested, and that if anything went wrong, the machines set to use older versions would not be caught in it and could auto-recover.
Much of this turned out to be outright lies based on wishful thinking at CrowdStrike, without the testing needed to back it up. This is basically mis-selling, and negligence, which is why the lawsuits are coming and why they will probably lose. Even worse, they made statements to shareholders bragging about how good the testing was, thereby inflating confidence and causing shareholders to underestimate the risk. There is a good chance this might breach some of their legal requirements as well, especially if any board members sold stock between talking to shareholders and the outage.
@@grokitall Also, kernel modules should validate their inputs and not segfault/NPE on an invalid input, like, say, a file full of zeroes.
@@miknrene Not just kernel modules. It has been standard practice for functions to validate their input for decades.
@@miknrene This is the first actual point made in both the video and the comments. Thanks!
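To make the input-validation point concrete, here is a rough Python sketch (the real code is C in kernel space, and the field layout and counts here are made up): a parser that checks its bounds and reports failure instead of dereferencing a parameter that isn't there.

```python
from typing import List, Optional

FIELD_SIZE = 8  # hypothetical fixed width per parameter

def read_params(blob: bytes, expected_count: int) -> Optional[List[bytes]]:
    """Split a parameter blob into fixed-size fields, refusing to read past the end."""
    if len(blob) < expected_count * FIELD_SIZE:
        # The unsafe version would index the missing field anyway and dereference
        # garbage; here we just signal a malformed entry.
        return None
    return [blob[i * FIELD_SIZE:(i + 1) * FIELD_SIZE] for i in range(expected_count)]

# Example: an entry that promises 21 parameters but only carries one.
params = read_params(b"\x00" * FIELD_SIZE, expected_count=21)
if params is None:
    print("malformed template entry: skip and log it instead of crashing")
```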
It looks like CEOs of different companies live in some sort of MBA bubble and don't learn from each other's mistakes. I've seen multiple software projects go down because of unneeded cost-cutting to save pennies.
They do learn. But what they learned from, for example, Boeing is that cutting quality control increases profit in the short run and massively increases CEO bonuses. When it comes back to bite and, in the long run, costs the company magnitudes more than what they saved, the CEO hopes to be at some other company with their ill-gotten gains, leaving the mess for others to clean up, or to get a fat golden parachute for leaving so others can do so.
Your assumption that a CEO would care for the company, shareholders or employees is in many cases unfounded.
@@cynic7049 yeah, makes sense.
As a software tester in a prior life, the thing that immediately stood out to me was that they CLEARLY were not doing any "golden image" testing - one test install of what they intended to publish would have shown it was bad immediately. That would have been possible even if they automated that stage.
The whole song and dance over how this was all the fault of an error in an automated verification tool tells me that such tool had no real oversight. They just trusted it but didn't follow up with a "verify". This is really concerning in any of these low-level tools that run with high privileges.
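For what it's worth, a golden-image gate can itself be fully automated. Here is a rough sketch of the idea in Python, where the `vmctl` commands are placeholders for whatever VM tooling a vendor actually uses (restore a known-good snapshot, stage the candidate update, confirm the machine boots back to a healthy state):

```python
import subprocess
import sys

def golden_image_check(update_path: str) -> bool:
    """Install the candidate update on a clean reference VM and confirm it still boots."""
    steps = [
        # hypothetical tooling: reset the VM to a known-good image
        ["vmctl", "restore", "golden-win11"],
        # hypothetical tooling: stage the candidate update on the VM
        ["vmctl", "copy", update_path, "golden-win11"],
        # hypothetical tooling: reboot and fail if no healthy heartbeat within 5 minutes
        ["vmctl", "reboot-and-wait", "golden-win11", "--healthy-within", "300"],
    ]
    for cmd in steps:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if golden_image_check(sys.argv[1]) else 1)
```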
Yeah, I'm not taking my client's website out of test mode for payments until I make them test every subset of features and approve. I feel like how you do small projects defines how you do your bigger ones.
CrowdStrike has increased the security to the ultimate level. Nobody could hack the updated systems :D
u r really bold to make such content. thank u so much and keep it up. Bravo !
I can't believe CrowdStrike doesn't have a division that does live testing before they send out software. Of course a lot of software developers put all testing off until after they finished writing the software which makes it very hard to find where the errors are occurring and to fix them.
It seems they don't even validate the updates by updating a local server before pushing it to customers...
'ummm ackchually we value sending our customers fast codes instead of reliable tested good code. i mean it worked until now, you were not crying while it worked, yeah. We are good'
It also monitors what each "user" is doing constantly.
100% failure rate. One test would have caught it. Also, a chef tastes their own cooking before serving it to customers. Repeat two more times: permanent blacklist.
The $10 Uber Eats gift card is absolutely ridiculous. I don't know what the hell they were thinking with that.
Most corporate windows desktops are locked down even for developers, so they too would have been sitting around waiting for infra to fix their machines.
The problem isn't that the test was automated, it was that there was no smoke test - a test to sanity-check the other deployment tests. The smoke test can be totally automatic.
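One cheap way to do that, sketched below in Python: before trusting the validator's approval of the real artifact, run it against a few known-bad samples and confirm it still rejects them. `validate` here is a stand-in for whatever callable the pipeline actually uses, not a real API.

```python
from typing import Callable

KNOWN_BAD_SAMPLES = [
    b"",                  # empty file
    b"\x00" * 4096,       # all zeroes
    b"not a definition",  # wrong format entirely
]

def smoke_test_validator(validate: Callable[[bytes], bool]) -> None:
    """Sanity-check the validator itself before trusting its verdict on the real release."""
    for sample in KNOWN_BAD_SAMPLES:
        if validate(sample):
            raise RuntimeError("validator approved a known-bad sample; halt the release")
```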
Testing is a good thing but Microsoft and Crowdstrike cobbled together a design requiring kernel-level access because Microsoft floundered in creating a safe tested API. Creating a backdoor to the kernel was the root cause of this failure. Microsoft should be subject to whatever damages Crowdstrike must pay because Microsoft gave this bad design their blessing.
I wonder if all the threats blocked historically by Falcon add up to the one outage it caused.
Thanks for the video!
Crowdstrike kept everyone safe ... by crashing the machines ...
Crowdstrike gambled with their customers' computers and businesses and lost. Too bad the customers were put out of business for many hours or days as a result. Since the faulty update crashed 100% of machines, it is quite obvious that it was never tested, manually or automatically, prior to being deployed. Crowdstrike may continue as a business, but their reputation is damaged forever.
How is it that no one is talking about disaster recovery and deployment management practices? It's really bad that no one is talking about the SLA practices of normal companies, since "works on my PC" is apparently a thing.
Crowdstrike has caused multiple similar smaller failures in the past that should have warned they had huge issues with QA.
Crowdstrike cut corners, and so did all the companies that relied on them to reduce IT-related expenses. These companies could instead invest in well-staffed IT departments utilizing less intrusive security solutions, like Norton AV, which would allow them to test updates in a sandbox before deploying them company-wide.
No, they could not. Norton is not even in the same market as CrowdStrike, who do endpoint detection and response systems, which include antivirus and malware protection but a lot more as well.
There are alternatives, but not Norton.
As to sandboxing, a lot of companies believed the hype from CrowdStrike about the n-1 deployments, which did not apply to the channel file updates, so they had their systems go down, fell back to the previously working version, and had the update take that down as well due to live updating of the channel files.
A lot of these companies treat installing live-updating software as a sacking offense, so it is no surprise that CrowdStrike are now having to apply their n-1 settings to these updates, as their customers were led to believe was already happening.
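A phased rollout of content updates is not conceptually hard, either. Here is a rough ring-based sketch in Python; the ring names, soak times, threshold and the two callables are assumptions for illustration, not how CrowdStrike's pipeline actually works:

```python
import time
from typing import Callable

# Hypothetical deployment rings: internal fleet first, then a small customer
# canary slice, then everyone else.
RINGS = [("internal", 0.001), ("canary", 0.01), ("general", 1.0)]

def deploy(update_id: str,
           push_to_fraction: Callable[[str, float], None],
           crash_rate: Callable[[str, str], float]) -> bool:
    """Push an update ring by ring, halting if the crash rate spikes in any ring."""
    for ring, fraction in RINGS:
        push_to_fraction(update_id, fraction)
        time.sleep(30 * 60)  # soak time before widening the blast radius
        if crash_rate(update_id, ring) > 0.001:
            print(f"{update_id}: crash rate too high in ring '{ring}', halting rollout")
            return False
    return True
```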
Norton is garbage
WOW, this is an interesting analysis of the problem.
Dee! Finally, long time no see
It's nice to have another video to watch after finishing all the others on her channel 😂
I know, work has been busy but glad to be back!
I agree that they pushed untested code out the door. I don’t agree with your focus on the necessity of a dedicated QA team. They need a real investment in QA, but that could be in staging environments and automated tests and the infrastructure for them.
Amazon doesn’t have QA engineers. They have enormous infrastructure for staging and automated testing as part of their CI/CD.
The model used to be that developers and QA were separate departments in order for QA to be unbiased and independent.
To save money QA human roles have been eliminated and replaced by DevOps writing automated regression suites.
Hard to believe the corrupt definition file made it past the content validator. It depends on its implementation. They will probably add a regression test for the crash scenario.
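If they do add such a regression test, it can be as simple as feeding the known-bad input back through the validator and asserting it is rejected. A minimal pytest-style sketch, where `validate_channel_file` is a stand-in name rather than their real API:

```python
def validate_channel_file(data: bytes) -> bool:
    """Stand-in content validator: accept only plausibly well-formed channel files."""
    return len(data) >= 4 and any(data) and data[:4] != b"\x00\x00\x00\x00"

def test_all_zero_channel_file_is_rejected():
    # Regression test for the widely reported failure mode: a channel file full
    # of zeroes must be rejected by the validator, never shipped.
    assert validate_channel_file(b"\x00" * 4096) is False

def test_truncated_channel_file_is_rejected():
    assert validate_channel_file(b"\x00") is False
```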
Please consider doing another video on this subject starting with the premise that the Crowdstrike fiasco was done DELIBERATELY ... sabotage made to look like incompetence is a very believable scenario nowadays
Never ascribe to conspiracy that which may be explained by incompetence.
@@miknrene Hanlon's (not Occam's) razor?
Great video as always!
A question that we may never get the answer to is whether the update was supposed to counteract some immediate threat that might not have been disclosed to the public.
Not that likely but also not totally impossible. If this was the case then we would expect the lawsuits and whatnot to happen anyway to make it seem like this wasn't the case.
(Btw, I'm a conspiracy enthusiast. By that I mean that I'm open to conspiracies existing, even though I'm never convinced of any of them being true unless there is actual proof. I.e., the scientifically sound counterpart to a conspiracy theorist :D )
Clownstrike delivered the ultimate and best security: you cannot have a computer infected when you cannot use it, right?
You keep saying "updates", but it was not an update that was released on Friday. It was a content file, not a config file.
A content file had never caused any failure before (as per them), and they did automated testing and laid off manual testers.
The world should realise the importance of manual testing and sign-off, and I agree, there should have been release notes.
Thank you
Incompetence at CrowdStrike.
I suspect several members of the development team tried to shout at management.
How many times have I seen this in my career? I'm looking at you Clients 1, 2, 4, 5, 6
Number 3 was good.
It's funny to me that shareholders can sue the company considering that ultimately they are the company owners and if the people they select to run it do a bad job the financial loss is on them. But then, a lot of things related to stocks and the related economics don't make sense to me.
That's only true if someone owns 51% or more of the stock. If you own less, your vote might not have the desired effect.
The reason they can sue is that, based on company statements, the value was overestimated and the risks underestimated. When the outage happened, it became clear that the statements about risk mitigation were near-criminally misleading, causing the share price to lose 30%+ of its value as the correct information became known. This is basically how you would run a pump-and-dump scheme to maximise the value of shares the board were intending to sell.
If any board member did sell significant amounts of shares between the statements to shareholders and the outage, they could potentially be in trouble for insider trading, and for fraud in their statements to shareholders.
@@grokitall ahh, thanks
CrowdStrike. When the windows fell.
Even years ago, patching a punctured car tyre was something of an "experiment", especially if the repair was to an existing repair (maybe even to a form of itself as an existing repair! And so on, ad infinitum). However, certain "tyre engineers" developed a formal process for determining if, and only IFF (sic), a punctured tyre could be repaired and safely returned to everyday service for its normal lifetime. The process became known as "common sense", and is also known as such in much of the rest of the world.... Simples... (A.K.A. Not rokit syence) ):-)
I think it was deliberate.
9:32 That's a bingo!
I see you were featured on a flutterflow video for no code 😆
Good explanation of the outage. Good point about not relying completely on artificial and automated intelligence, and having an actual human coder do the testing for any updates. Their insurance 😮 might cover them. A good example of why not to rush things; instead, take your time and do some more testing until the testing passes! 😺👌🎥👩💻🥰⭐️💃
Reignited my dislike of the 'pointy-haired' Dilbert manager....
O.K So you are in South Africa.
This is a Microsoft failure.
Why are they allowing untested updates to code that runs in ring 0, the kernel? All updates with this level of privilege should and must go through Microsoft before being sent out.
This is why Microsoft has such a crappy reputation.
I read somewhere that CrowdStrike used some sort of loophole to skip Microsoft's checking, because it takes too long to check and verify updates that need to be sent out immediately.
They do not allow untested updates. The driver in question reads independently updated data from the filesystem. No code is changed.
@@vasiliigulevich9202 The point is Microsoft didn't test to failure. Any good programmer will test all possible cases, including edge cases, such as driver failure regardless of the reason it failed. It is a failure in their testing and a failure in their recovery code. If you invite a third party into your code base you are asking for trouble. The mantra is: don't trust user data! In this case the user is CrowdStrike.
@@andrewortiz8044 They updated a text file which the driver reads in. That doesn't excuse Microsoft from letting a third-party driver cause such a serious failure. Testing should cover driver failure.
@@DeveloperChris I've worked on products with various degrees of test coverage. The reliability you are talking about is NASA level. That's not achievable for a consumer product. And even NASA fails sometimes.
Hello Dee! Do you edit videos yourself or have you hired someone? I’m an editor and would love to discuss it if you’re interested
Nothing is as secure as a disabled system. 😂
why attack my star faced girlfriend
Abort Windows !!!