when a null pointer dereference breaks the internet lol
- Published: Nov 6, 2024
- but it may not be the devs' fault.
If you're a developer, sign up to my free newsletter Dev Notes 👉 www.devnotesda...
If you're a student, check out my Notion template Studious: notionstudent.com
Don't know why you'd want to follow me on other socials. I don't even post. But here you go.
🐱🚀 GitHub: github.com/for...
🐦 Twitter: / forrestpknight
💼 LinkedIn: / forrestpknight
📸 Instagram: / forrestpknight
UPDATE: New info reveals it was a logic flaw in Channel File 291, which controls named-pipe execution, not a null pointer dereference like many of us thought (although the stack trace indicates it was a null pointer issue, so CrowdStrike could be covering). Devs' fault 100% (in addition to having systems in place that allow this sort of thing). Updates to Channel Files like these happen multiple times a day.
This should be pinned
Thanks for the update. Have not used named pipes in a long time....
Source please?
@@anaveragehuman2937
www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
Then delete this video.
Fun fact: the null pointer dereference was also in the Linux CrowdStrike sensor; the Linux kernel just handled it like a boss
There seem to be articles saying that some time prior, maybe a month or a few months before this, CrowdStrike allegedly did the same thing to Debian 12 or Rocky Linux that we're now seeing on Windows.
Potentially because of the smaller blast radius, it went unnoticed in the media.
But I myself don't know if it's true or not, so take it with a grain of salt
well, a null pointer dereference is something that should throw an error, not be allowed and silenced without the dev explicitly handling it correctly. The problem here is: how was this ever allowed to be mass-delivered everywhere at once, with such a glaring, general-case bug that should have shown up in any sort of testing?
So Linux handling it quietly might not be the own you think it is
@@Songfugel it threw an error, it just didn't kill itself like Windows did, where the fix was to reboot your machine 15 times and hope that the network came up before the driver 💀
@@vilian9185 But a failure like that should kill the program and not let it continue, and since it was the kernel itself, it should fail to boot past it. Windows had exactly the correct reaction to this very serious error; the problem is that it should never have gotten past the first patch or tests
@@Songfugel No, you should leave it to the developer to handle a crashed driver. The end user using the driver does not care if it crashed or not, only that the program using it works, *and most importantly,* that the machine works. Fail silently (let the user boot), inform whatever's using the driver that it doesn't work and why, and let the developer handle it. There are times and places for loud failures. This is one of the occasions where it's better to silently fail and inform the developer. A crashed driver almost took down society with it.
Edit because it seems like it wasn't clear: no, I'm not saying we should dereference the null pointer. Of course not. I'm saying that we should crash the driver only, and let the system move on without it loaded. Or unload it if it crashed at runtime. If another program tries to use it, it will raise an error and will be able to find out why the driver failed. In enterprise environments it's much better to have the system running vulnerable than not running at all. A vulnerability costs millions at one company. A company-wide crash costs billions. A worldwide crash is incalculable.
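For illustration, here is a minimal C sketch of the behaviour being argued for in this thread, with every name and file path hypothetical: if the pushed content fails validation, refuse to load it and disable that component, rather than letting the fault propagate and take the whole machine down.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result of trying to load a content/definition file. */
enum load_status { LOAD_OK, LOAD_BAD_FILE };

/* Pretend loader: a real one would parse and validate the file. Here we
 * just simulate a corrupt file to exercise the failure path. */
static enum load_status load_content(const char *path) {
    fprintf(stderr, "validating %s...\n", path);
    return LOAD_BAD_FILE;   /* simulate the all-zero / corrupt case */
}

int main(void) {
    /* Hypothetical file name, standing in for a pushed content update. */
    if (load_content("content_update.bin") != LOAD_OK) {
        /* Contain the failure: disable this component, tell whoever cares,
         * and let the rest of the system keep running. */
        fprintf(stderr, "content rejected: running without it and reporting home\n");
        return EXIT_FAILURE;   /* the component fails; the machine does not */
    }
    puts("content loaded");
    return EXIT_SUCCESS;
}
```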
“You cannot hack into a brick”
-Crowdstrike, 2024
@@capn_shawn 😂
Torvalds has always held that security bugs are just bugs and should not be granted special status whereby, given the obsession of some, all functionality is lost in the name of security.
Crowdstrike just proved Torvalds is correct.
@@jedipadawan7023 He is kinda right here, but he is also often wrong, and he's just a normal person who managed to get away with mostly plagiarizing Unix (not sure if that is the right word; like Linus, I'm a Finn and not that great with English) into an extended version called Linux
but you can break it
😂😭
Crowdstrike is DEFINITELY still at fault. You never ever ever push an update out live to millions of computers without extensive testing and staged rollouts, especially when that update involves code that runs at the kernel level!
Yeah I cannot possibly comprehend how and why this was pushed to everyone so quickly. Also why didn't the clients of crowdstrike say: heyy, do we really have to update everything day 1?
They're a security company, and it becomes a necessity that they roll out security patches to everyone at the same time. A staged rollout means you leave the rest of the customers vulnerable to being compromised.
@@inderwool I agree if this was some kind of critical security update, but apparently it wasn't.
Especially code that can execute within the kernel
I am still surprised that everyone apparently just loaded the update. Surely in a big organisation, at least, you run it through your test network first. And if you really don't have one, you will know better now.
But it's still their fault for pushing it out to everything everywhere all at once.
And they did so ignoring clients' SLA/update management policies, too! Damages as a *result* of breach of contract? Crowdstrike's *done* for.
Well, in my previous role I deployed CrowdStrike for a major broadcaster, and one common misconception in all of this is that CrowdStrike can push updates to customer endpoints without their knowledge or consent. It doesn't work like that. Endpoint management is handled centrally by IT admins, and when CrowdStrike releases a new Falcon sensor version, after reviewing it we can choose whether we want to use the latest version or not. You can of course configure CrowdStrike to auto-update the sensors, but that would be ludicrous for obvious reasons.
@@marcus141 It wasn't a new version, it was just a definition file.
@@kellymoses8566 maybe I am missing something but if the driver file got updated, wouldn't the affected PCs boot into recovery only when shutdown? So, they could still in theory keep running if not shut down?
100% agree. They should've updated 5% of users and compared failures to before the update. Not send updates to everyone in one go!
Especially for such a huge company writing a critical kernel level software.
I still don’t understand how this patch didn’t brick the machines they tested it on, the idea that a company worth $70 billion didn’t catch this in CI or QA is mind blowing
They didn't run it 😅 They tested sections of code, but not the integrated product.
@@simoninkin9090 Or how they did not stage the patch on a small fraction of machines per hour and then pull it back when the BSODs happened
this company went big because of politics; they are the ones who investigated the alleged hacking of the Democrats' email server.
@@ingiford175 Right. Rolling out to the world in one fell swoop is really irresponsible. Even given the best QA in the world, mistakes will get by, that's why you stage deployments.
@@stevezelaznik5872 that's just it. They clearly did not do integration testing.
Tired of hearing that Y2K was some panic, or something that just magically fixed itself, or wasn't a big deal. It wasn't a big deal because people spent years beforehand fixing it
Yeah buddy we all watched Office Space.
@@HeeroAvaren Well, I am old enough to have seen it first-hand. I don't remember what Office Space said, but I do remember all the overtime 😅
Welcome to bad reporting by the media, and a general lack of knowledge by the layman of tech
there is another one coming in 2038, when old Unix systems' 4-byte time integer overflows. Jan 19, 2038 ought to be interesting if any of those systems haven't been fixed and are doing something critical.
I spent many months doing tests and fixing code for Y2K. That nothing happened is testament to the fact that we did our jobs well.
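For anyone who hasn't met the 2038 problem: a minimal C sketch of what that 4-byte overflow looks like. A signed 32-bit Unix timestamp runs out of room at 2038-01-19 03:14:07 UTC; one more second no longer fits and, when truncated back to 32 bits, reads as a date back in 1901.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* A signed 32-bit time_t counts seconds since 1970-01-01 UTC.
     * Its maximum value corresponds to 2038-01-19 03:14:07 UTC. */
    int32_t last_ok = INT32_MAX;              /* 2147483647 */
    int64_t one_more = (int64_t)last_ok + 1;  /* do the math in 64 bits to avoid overflow UB */

    printf("max 32-bit timestamp: %d\n", last_ok);
    if (one_more > INT32_MAX) {
        /* Truncating this back into 32 bits wraps to a negative number,
         * which old code would interpret as a date in December 1901. */
        printf("the next second (%lld) no longer fits in 32 bits\n",
               (long long)one_more);
    }
    return 0;
}
```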
This is precisely why you actually test the package that is being deployed. If you move release files around, you need to ensure that the checksums of those files match.
And you don't deploy anything on friday
@@rekko_12 Amen.
And you don’t deploy to the whole flipping world in one go.
And md5 checksum all files...
They must use some sort of cryptographic signature securing package integrity. It means they don't test the "end product" - they probably tested the compilation products, then signed the files and sent them to the whole world - and somewhere between the final test and release, one of the files was corrupted.
I bet it was something silly - for example not enough disk space :)
@@grzegorzdomagala9929 Exactly my thinking. They just skipped some critical integration tests - environment mismatch or something of the sort.
However, I don't think it got exactly "corrupted". The only reason the world got into this trouble was because they packaged the bug within the artifact.
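A minimal sketch, in C, of the kind of post-build check people are describing above: compute a checksum (CRC-32 here) of the exact artifact you tested and refuse to ship or apply anything whose checksum doesn't match. The file name and expected value are made up for illustration; a real pipeline would use a cryptographic hash plus a signature.

```c
#include <stdio.h>
#include <stdint.h>

/* Table-free CRC-32 (IEEE polynomial, reflected), chainable with crc=0. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)(-(int32_t)(crc & 1)));
    }
    return ~crc;
}

/* Returns the CRC-32 of a file (0 and an error message on I/O failure). */
static uint32_t crc32_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 0; }
    unsigned char buf[4096];
    size_t n;
    uint32_t crc = 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32_update(crc, buf, n);
    fclose(f);
    return crc;
}

int main(void) {
    /* Hypothetical: checksum recorded when the tested artifact was built. */
    const uint32_t expected = 0x1C291CA3u;
    uint32_t actual = crc32_file("channel_file_291.bin"); /* hypothetical file name */

    if (actual != expected) {
        fprintf(stderr, "artifact checksum mismatch (%08X != %08X), refusing to ship\n",
                actual, expected);
        return 1;
    }
    puts("artifact matches what was tested");
    return 0;
}
```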
Saying the root cause was a "null pointer dereference" is like saying the problem with driving into a telephone pole is that "there was a telephone pole in the way." The root cause was sending an update file that was all null bytes. The fact that the operating system executed that file and reported a null pointer dereference as a result is not the fault of the OS, and is not a root cause.
Bingo. And I can't believe the testing server was apparently the single server in the whole world not affected. I get that we don't want to make assumptions and point fingers willy-nilly, but this one is a bridge way too far.
the problem is centralized control
that word salad is a tertiary problem
Well, actually, because of this shitty OS people missed flights etc. You should never run such critical systems on Windows.
Just leave that for your employees' PCs
@astronemir so what alterations would you make on an OS level to avoid this?
The root cause is ACCEPTING null bytes. Just check for them; it's LITERALLY the programmer's fault
giving any software unlimited kernel access is just crazy to me
MSFT: "Should we do something about the kernel, or develop AI screenshot spyware?"
Let me guess. Maybe Crowdstrike recently laid off a stack of experienced developers who knew what they were doing, but were expensive, and kept the not so experienced developers who didn't know what they were doing, but were cheaper.
Then on top of that because of the reduced head count, but same workload, then under pressure the developers cut corners to rush product out.
I'm not saying that is what happened. But I have seen that happen elsewhere, and I'm sure people can come up with loads of examples from their own experiences.
Funny enough, there's a thread on Reddit (18 hours ago) that said this: "In 2023, Crowdstrike laid off a couple hundred people, including engineers, devs, and QA testers…under RTO excuse. Aged like milk." But are there any official (or at least trusted) sources?
Sounds like a lot of companies
It's true, but I think it's vice versa. They're keeping the "experienced" programmers while throwing away rookies. At least that's the trend we see with Google and Microsoft.
They want to pay less for labor as a whole, and the only way to do that without tearing the whole team apart is kicking people out.
Not at all. They skipped the first test and went directly to the cheap inexperienced suckers.
Not likely. The underlying issue was obviously introduced a long time ago but never caught. So far only valid "param" files had been pushed and parsed by the driver. The error itself is likely easy to fix once you accept that the driver also has to be able to parse invalid files without crashing.
I have no doubt they will do a thorough investigation as this was such a massive impact with millions and billions of dollars of implications.
🤔 worldwide? probably trillions 🫣
updated 24h later to add:
the peanut gallery is correct, the wikipedia entry makes it more clear that some enterprises and markets were unaffected and some were only affected for a short time 🧐 thanks everyone
@@astrocoastalprocessor Nah, it was big for sure, but i don't think you get quite how much a trillion is.
I have no doubt Crowdstrike are going to be sued into oblivion.
I have been reading the comments from employees reporting how their company's legal departments are being consulted.
@@jedipadawan7023 Just because a legal layperson is trying to find out from a lawyer if there is any legal liability doesn't mean that there actually *is* any legal liability. That doesn't mean people won't try to sue them, and that will be costly fighting them off.
The question is will anyone outside ClownStrike ever hear what actually happened?
As the name says - CrowdStrike: every device goes on strike
DoS like a boss
the name is quite fitting seeing how many people were left stranded in airports
Thank you for your insights. Man, I hope CrowdStrike does a thorough post-mortem for this one. That's the least they owe the IT professionals at this point.
I hope a third party does an investigation
It did not break "the internet". It broke a lot of companies' office computers, but those are not on the internet. In fact, the internet chugged along just fine.
I have not written operating system code, but generally code is supposed to validate data before operating on it. In my opinion, developers are very likely the cause. Even if there is bad data, the developers should write code that can handle it gracefully.
Also, this video asserted that this kind of issue could slip by the test servers. That sounds ridiculous to me. The test servers should fully simulate real world scenarios when dealing with this kind of security software. They should run driver updates against multiple versions of windows with simulated realistic data.
But, I would be surprised if a single developer was at fault. Because there should be many other developers reviewing all of the code. I would expect an entire developer team to be at fault.
It'll be interesting to learn more.
I'm just astonished that this got past testing, AND was deployed to everyone at same time. Just screams of flaws in the entire deployment process at crowdstrike.
... and also the management for not giving enough resources for testing. It's always features, features, features!
most of the backlash from developers -> ego devs who write like 2 lines of crap code a day but are (for whatever reason) extremely vocal
Narcissism is so prevalent in this profession. As with surgeons, violinists and physicists ;)
ive been bitching about crowdstrike for a long time.
@@juandesalgado Speaking as a retired violinist who now works as a software dev, I feel like physicist might be the next career I should look into!
@@FritzTheCat_1030 As a dev that started as a physicist, what are your tips to learn violin?
@@FritzTheCat_1030 lol - I hope you keep playing at home, though!
At the most fundamental level, it is obvious that CrowdStrike never tested the actual deployment package. Things can go wrong at any stage in the build pipeline, so you ALWAYS test the actual deployment package before deploying it. This is kindergarten-level software deployment management. No sane and vaguely competent engineer would voluntarily omit this step. No sane and vaguely competent manager would order engineers to omit this step. Yet the step was definitely omitted. I hope we get an honest explanation of how and why this happened.
Of course, then you get into the question of why they didn't do incremental deployments, which are another ultra-basic deployment best practice. I am beginning to form a mental image of the engineering culture at CrowdStrike, and it's not pretty.
First rule of patch management is you don't install patches as soon as they are available.
If I know that, then why some of these massive companies don't is beyond me. It seems that IT management has forgotten the fundamentals.
Also technically it can be done remotely if it's a virtual machine or remote management is enabled.
The problem is that these patches are automated OTA (Over the Air) patches.
Which was marketed to businesses as there would be less administrative work in installing patches, since these patches come directly from CrowdStrike the trusted vendor. Thus, they wouldn't need to hire as many qualified IT people for cybersecurity tools patch management.
It was like a SaaS service handled by the vendor that they didn't need to worry about.
Little did anyone realize that there was no proper isolated testing done before pushing this out to production globally.
The lack of testing and slow gradual rollout + Windows OS architectural design flaws
Combined, it created a single point of failure.
Didn't help that Azure AD was also down, so anyone trying to log in via Active Directory to remediate the issue and get BitLocker keys was also screwed. 😅
The update was more like a virus definition data file. The actual scanning engine driver file was not updated. These types of updates are apparently pushed multiple times a day as new “threats” are encountered. It is astonishing to me, that the Falcon driver cannot handle or prevent garbage data being loaded into it.
Also it’s the poor architecture of Windows that driver crashes bring down the OS. Additionally possibly a bad architectural decision by CS to embed their software so deeply into Windows that the OS will crash if the Falcon driver misbehaves.
yeah, for the physical machines, I hope they have vPro set up; if they don't, I bet they're really wishing they had done it sooner. Lol.
@@Lazy2332 Virtual machines proved even more unrecoverable than physical machines - you need a physical keyboard connected to enter safe mode (assuming you actually have the BitLocker keys).
A guy on Twitter theorized that maybe it was some sort of incomplete write - like when the filesystem records space for a file but stops before copying any data, leaving a hole of just zeroes. If something like that happened on the distribution server or whatever it's called and didn't manifest during testing, well, kaboom!
Would be even "funnier" if it turns out to be a bug in the file system.
@@TheUnkow I'm still suspicious of Windows. I work enterprise IT at a big defense contractor. I see drivers fail A LOT in Windows, and most of my job now is just updating drivers. I see memory management, inaccessible boot device, and nvpcl.sys crashes, all related to drivers that get rolled back/corrupted by Windows updates. I'm just not good enough yet to find it and expose it.
@@sirseven3 As a developer myself, I know it's sometimes the weirdest bugs that cause the issue ... just a slightly incorrect offset and any file or code may become totally useless ... sometimes even hazardous.
I haven't been using Windows for a while because of being unable to determine what causes some of the issues. I know that using an alternative is not always an option ... but debugging closed source is a really challenging process.
Just because we get a set of APIs or other functionality from Microsoft to use ... no one guarantees they are free of bugs, security and privacy problems, or memory leaks. Even if they were 100% OK, on the next update (such as in this case) an issue may be introduced and we will have trouble again.
Note that Linux and other software aren't fully clear of these issues either; for example, just recently there was the regreSSHion bug, which was also introduced in an update and enabled a very serious security vulnerability.
Still, I would say the transparency of open source makes such issues easier to overcome and harder to introduce.
The easier life with closed source has its downs, not just ups; we must take precautions against that. Glad to hear some people like yourself are serious about it.
As a developer, I always checked if a pointer was null before dereferencing it.
Doesn’t macOS fail gracefully when a kext misbehaves? If so, you can still technically blame Windows for not handling that situation well
I don't know about later iterations, but from my experience on Big Sur and earlier, kexts can still cause kernel panics, at least when invoking an instruction that raises an uncaught/unhandled CPU exception; in my case I was trying to access a non-existent MSR register on my system. The thing is, whether it's kexts/drivers/modules on macOS/Windows/Linux doesn't really matter, because at that point you're in ring 0 and the code has as much privilege as the kernel. The only safeguards at this level are rudimentary CPU exception handling, hence why kernel panics and BSODs always seem so CRUDE, with just a few lines of text: at that point everything has halted, the CPU has unconditionally jumped to a single procedure, and nothing else seems to be happening ...
@@samyvilar CrowdStrike does not have kernel level permissions on new Macs, because Apple has been pushing people to move away from kernel extensions, so CrowdStrike runs as a system extension instead which is run outside of kernel.
The system files on Mac are mounted as read-only in a separate partition and you need to manually turn SIP off and reboot in order to be able to even write/modify them.
Good API design encourages your developers to adopt more secure practices. CrowdStrike isn't intentionally malicious here, but lax security design in Windows, stemming from the good old Win32 days, allowed such a failure to happen.
I doubt it because MacOS is like Windows, it does in place upgrades to software.
Some versions of Linux and ChromeOS employ blue/green or atomic updates that allow for automated rollbacks if a boot failure occurs.
@@k.vn.k I was under the impression CrowdStrike was Windows-only; for as long as I can remember, enterprise seemed to shy away from macOS, given Apple's exorbitant price on its REQUIRED hardware. macOS's Darwin kernel is significantly different from Windows (and Linux, for that matter). CrowdStrike may or may not need kernel-level privileges for feature parity across the two platforms, but make no mistake, anything requiring ring 0 does!
What this shows me is that it's a bloody miracle that any computer works at all.
Any *Windows computer. The Linux systems that run crowd strike weren’t affected :).
@@nathanwhite704 they were in April ;)
Friends don't let friends write C++
All parties need to fix this broken system:
- Security companies cannot ever force-push without testing.
- OS vendors (especially MS) need to improve all aspects of this scenario, with lots of new, well-documented automated testing/check tools for multiple steps in the process.
- Essential companies cannot blindly trust updates without basic checks, and Windows should not be the only OS running if you want to make sure that you're online all the time.
We need software better built for failure, especially for essential companies that cannot stop. If companies do not fix this at all levels, it can open a new door for failure.
Is this man speaking into a cactus in a vacation setting? He's crazy. Subscribed!
Thanks for the detailed explanation of why I am spending the first 4 days of my vacation at the airport. Honestly.
He hardly said anything
I am so sorry
It's called deploying to 1% of customers at a time... maybe starting on a Monday at 6 PM.
Bingo.
The internet was not broken. Not sure why people kept saying it was
Right. If it was a network-type problem, the IT folks could have just applied a fix across their network from the comfort of their cubicles and then gone home at 5:00 on Friday. Instead some of them had to run around to individual machines and boot them to safe mode while others had to try to remember where the bitlocker keys were last seen.
Yeah, just a clickbait title, but I guess everyone is just surprised by the scale of this. Most people couldn't do their jobs, and CrowdStrike probably caused billions in financial losses to these companies, airlines, etc.
"tHe SiTuaTiOn tHaT bRoKe tHe ENtiRe InTernEt"
Instant downvote.
AI usage is going to make such occurrences common in coming decade.
That is why I like PiKVM on my servers. Although I don't use Crowdstrike or even Windows. But it can happen on Linux or Apple as well.
There is something you can do better for the case of a bug after testing, when you're going to push an update to a massive population of systems. Unless it's an emergency update that needs to be pushed NOW, you do a phased push. Push to 1% of the systems, and wait an hour (or longer); then push to 5%, and wait 5 hours; then push to 10%, and wait 12 hours; finally push it out globally. While you are waiting each time, you monitor support activity closely and/or look for any abnormal telemetry such as high rates of systems reporting errors, going offline, etc.
You can also split the application between kernel and user space, so that you have a minimal footprint in kernel space and do the more complicated work in user space. In that model, the kernel code can be hardened and shouldn't change on a regular basis; and the high frequency updates are then to the user space code, which is much less likely to take out the entire system due to bad data.
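For the phased-push idea above, here is a rough sketch of one common way to pick the cohorts (my own illustration, all names invented): hash each machine's ID into a stable bucket from 0-99 and only serve the update to machines whose bucket falls below the current rollout percentage, so each wave is a superset of the previous one.

```c
#include <stdio.h>
#include <stdint.h>

/* FNV-1a hash: gives each machine ID a stable, evenly spread value. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* A machine is in the rollout if its stable bucket (0-99) is below the
 * current rollout percentage. Buckets never change between waves, so the
 * 1% wave is a strict subset of the 5% wave, and so on. */
static int in_rollout(const char *machine_id, unsigned percent) {
    return (fnv1a(machine_id) % 100u) < percent;
}

int main(void) {
    const char *machines[] = { "host-0001", "host-0002", "host-0003", "host-0004" };
    unsigned waves[] = { 1, 5, 10, 100 };   /* phased rollout percentages */

    for (size_t w = 0; w < sizeof waves / sizeof waves[0]; w++) {
        printf("wave at %u%%:", waves[w]);
        for (size_t i = 0; i < sizeof machines / sizeof machines[0]; i++)
            if (in_rollout(machines[i], waves[w]))
                printf(" %s", machines[i]);
        printf("\n");
    }
    return 0;
}
```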
The first use of ‘Gnarly Event’ to describe a world wide catastrophe. Well done.
As I understand it, AssClownStrike has a "secure" conduit to copy anything into the System32 folder, and Windows happily runs any driver file there during startup. (Reworded for the pedantic.)
Windows is designed to bugcheck (bluescreen) on any driver problem. Always has.
Having the ability to send trash to over a billion computers' System32 folders with one command is the real problem.
Yeah, I’ve heard Windows channel files mentioned. Sounds like a similar process to what MS uses to distribute new Defender signature files.
It wasn't a driver fault - it was a bad configuration file the driver downloaded automatically.
@@allangibson8494 But couldn't it have been a bad pull from WSUS to the clients? The checksum wasn't verified on the client side, but was verified before distribution
@@allangibson8494 that still sounds like the driver's fault for not gracefully handling that bad file configuration
@@reapimuhs Yes. The file error checking seems sadly deficient. A null file check at least would seem to have been warranted.
good explanation. one additional note that on Windows at least, a null ptr deref is basically a special case of an access violation… the first page of the process is marked as unreadable and any access attempt (like the 9c in this case) causes an access violation and any access violation in the first page is assumed to be a null ptr deref.
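A tiny illustration of what that looks like in C, assuming (as the reported stack traces suggest) a small faulting address like 0x9c: a field at offset 0x9c inside a struct reached through a NULL pointer resolves to address 0x9c, which sits inside the never-mapped first page, so the access faults. The struct layout here is invented purely for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical layout, only to show how a tiny faulting address arises. */
struct channel_entry {
    char         header[0x9c];  /* filler up to the interesting field        */
    unsigned int flags;         /* this field lives at offset 0x9c from base */
};

int main(void) {
    struct channel_entry *e = NULL;  /* e.g. a lookup that found nothing */

    /* The field's offset alone tells the story: */
    printf("offsetof(flags) = 0x%zx\n", offsetof(struct channel_entry, flags));

    /* Reading e->flags would dereference address 0 + 0x9c = 0x9c, inside the
     * unreadable first page, so the CPU raises an access violation; in kernel
     * mode Windows turns that into a bug check (blue screen). */
    /* unsigned int v = e->flags;   <-- the faulting access */

    (void)e;
    return 0;
}
```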
i’m really surprised people aren’t talking more about why this went out to millions of computers all at once. why aren’t they doing a phased roll out? i bet they will now 😂
other comments seem to suggest this wasn't an actual update but rather a faulty definition file that was downloaded, the real problem is why they were not validating the integrity of these files and gracefully handling corrupted ones.
@@reapimuhs that makes sense, but regardless of what they call it, imho even config files are part of the software and require testing and rollout procedures just as if code had been updated.
So, it wouldn’t show up on the testing server, but it would show up on millions of servers all over the rest of the world? I can’t say that makes sense to me.
Yeah, this is.......... very dubious.
we have known how to write software so this does not happen since the moon landings. we have even better tools now. the only way for this to happen is for everyone including microsoft to ignore those lessons.
This can not make any sense unless you are delusional. So you're good. The man stated absolute nonsense, like he has no idea what he's talking about.
@@JamesTSmirk87 of course, if you canary-release to the test servers first, then to your own machines, and only then to the rest of the world, it would have been caught.
3:20 don't get me wrong, a bit could even have been flipped by a solar flare, but saying something happened after CI/CD and testing still sounds like it should have been implemented better haha. Edit: WTF, I had never heard of the 2038 bug; makes sense tho, I always found Unix time to be limiting.
That's just an extra little thing for us all to worry about for 14 years! Good night, lol.
@@sadhappy8860😱
Enterprise gear uses hardware ECC RAM with a separate parity chip for error checking and correction to prevent that. Even if it flipped the same bit in two chips at the same time perfectly, the resulting file would have failed a CRC integrity check and the build should have failed in the pipeline. A failing disk or storage controller in a busy data center is not going to pick on one file and would be eating enough data to set off alarms. This was either the perfect storm of multiple human errors, or sabotage.
If the disk corruption occurred before checksumming, how come it wasn't caught in the CI pipelines? If the corruption happened after the CI pipelines, why didn't they verify the checksum before distributing it?
but that assumes that they have a decent testing and deployment strategy, despite all the evidence to the contrary.
to paraphrase terry pratchett, in the book raising steam, you can engineer around stupid, but nothing stops bloody stupid! 😊
As a software developer with over 30 years of experience I must say... you couldn't be more wrong. There's no way in the world this wasn't a developers fault. Software developers are responsible for testing the actual thing they're going to ship against the thing they're going to ship it on. If they don't do that, it's on them.
The devs did no doubt create the problem and wrote code that is prone to failure. The devs must take some blame for sure.
The dudes in charge of deploying the code around the world are also to blame. Why on earth would you not deploy this to a percentage of your clients first until it is proven to be reliable? Deploying to everyone at the same time is not a devs fault. It is still with Crowdstrike tho, very irresponsible of them.
I knew something bad would happen like this after I retired lol.
@@JeanPierreWhite Surely there's enough failure here to spread some blame around. Devs should check for & handle null pointers. Test suites should find bad channel files. Engineering department should properly fund & staff. Fortune-100 companies should be wary of all deploying the exact same EDR solution. etc etc.
Crowdstrike really had the crowd strike!!!!
OSs like openSUSE, Fedora Silverblue, macOS, and Chrome OS use automatic rollback mechanisms to revert to a stable state if an update or configuration change causes a system failure, preventing widespread issues.
the thumbnail says "not the devs fault". wrong, totally the devs fault. previous update broke future update, classic.
Pdf weeb
@@soko45 You have a right to cry about a drawn picture on the internet.
I agree: it's test, test, and more test. They should have rolled the test out to some PCs, not all, to check whether it fully worked. You'd think that with airlines, that part would get the most testing. The code didn't change, just the OS it runs on; clearly it was not tested for that OS.
That's why there are two phases in product staging: a development phase, and a master or production phase. Even if there is an error after all the merge requests come together and introduce a bug, you would catch it during the development phase and fix it. You wouldn't release it right away. This failure is not justifiable.
I guess that doing a validation test on a limited scale before pushing to the world just never occurred to anyone at CrowdStrike or MS?
The scale at which it happened is 100% their fault and was preventable. Period.
Fr. The buck must stop somewhere. I get protecting people from a mob but...
@@krunkle5136 A single dev isn't to blame; the whole company is. Nonetheless, there can always be situations like this, which is why it's shocking they don't have rolling updates to minimize the damage. If, for example, they had only rolled it out to places like McDonald's kiosks first, then slowly to more critical clients, they would have known about the issue before it became a huge clusterfuck
Crowdstrike will be sued out of existence due to this.
Microsoft also has some blame, only their OS was affected by a bad Crowdstrike release. Their system is too fragile and has no automated recovery or rollback.
@@JeanPierreWhite Hopefully that will bring some good to Windows for a change of pace
@@JeanPierreWhite The issue is not that they had no recovery mechanism in place; it is that "go into safe mode and fix it yourself" does not work with locked-down machines.
It is the fault of a kernel driver being fed a file that had no program code in it. The new file provided by CrowdStrike was filled with null characters, and the kernel has no safeguards, so a null memory reference causes failure!!! So it IS the fault of whoever scheduled this file to go out in updates!!! It IS ALSO the FAILURE OF CORPORATE consumers for not testing kernel updates in a limited fashion before a complete rollout!!!
Pointers from a file, that is nuts. 😂
Well if your CI/CD and testing allow for file corruption afterwards then they are just set-up wrong. The update files should be signed and have checksums and you should perform your tests on the packaged update. Any corruption afterwards would result in the update simply not being applied. The fact that the update rolled out shows they either package and sign after testing (which is bad) or don't test properly at all (which is even worse and probably the case).
Imagine pushing an update
Then everywhere you look bsod
Even the news
Guess that's why you shouldn't always auto update
Run the update on a simulated network
So that way bugs like this don't happen unless the end user has some weird-ass computer
Like the guy who crashed Myspace back when
Because back then they allowed you to use custom HTML and scripts on your profile
And the guy put some HTML on his page, but due to sanitization issues the servers interpreted it as something else, causing them to overload and DDoS themselves
Microsoft Windows is one patch on top of another patch. There's a reason why Linux and Apple software is preferred by developers.
🙄
It's also the reason they have fewer apps and programs they can run, because it would mean a whole rewrite of the majority of apps that have no issues.
Also, workarounds such as Wine are not the answer.
Yeah, because Unix is a monopoly, you simply can't get good alternatives on Windows (jk)
The billion dollar mistake just went nuclear
Respect for mentioning the 2038 problem. Gosh, I hope the code I wrote in 1999 isn’t still running by then😂
Even a simple 32 bit CRC would have detected that the file was corrupt. So incompetent.
Crazy how I can understand how one line of assembly code caused everything to just die
When parsing input data, especially from a kernel driver, one needs to be VERY defensive.
Validation should happen and the validation stage should not be able to crash on any input, especially empty or all zero files.
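A minimal sketch of the kind of defensive validation being described, assuming a hypothetical channel-file layout (magic number, version, entry count): reject anything that is too short, has the wrong magic (which also catches an all-zero file), or claims more entries than the buffer can hold, before any pointer is ever built from its contents. Every name and constant here is invented.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical on-disk header, invented for illustration only. */
struct channel_header {
    uint32_t magic;        /* expected constant identifying the format */
    uint32_t version;
    uint32_t entry_count;  /* number of fixed-size entries that follow */
};

#define CHANNEL_MAGIC   0xC0FFEE91u   /* made-up value */
#define ENTRY_SIZE      64u           /* made-up fixed entry size */

/* Returns 1 if the buffer looks like a sane channel file, 0 otherwise.
 * The parser must never be reached with data that fails these checks. */
static int channel_file_valid(const uint8_t *buf, size_t len) {
    if (buf == NULL || len < sizeof(struct channel_header))
        return 0;                                   /* too short to hold a header */

    struct channel_header hdr;
    memcpy(&hdr, buf, sizeof hdr);                  /* avoid alignment assumptions */

    if (hdr.magic != CHANNEL_MAGIC)
        return 0;                                   /* catches the all-zero file too */

    /* entry_count must fit in what was actually read, with no overflow. */
    if (hdr.entry_count > (len - sizeof hdr) / ENTRY_SIZE)
        return 0;

    return 1;
}

int main(void) {
    uint8_t zeros[4096] = {0};                      /* the infamous all-zero file */
    printf("all-zero file valid? %s\n",
           channel_file_valid(zeros, sizeof zeros) ? "yes" : "no");
    return 0;
}
```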
But doesn't this mean that whatever's reading this file isn't checking if the file has valid data? So there is a bug in the code and it was just sitting there for who knows how long? Have they not tested how their code handles files filled with zeros or other invalid data?
probably they might have laid off the qa who could have caught it
And the deployment team who would deploy to say 1% of their customers first to be double dog sure.
Look, it's 2024. Both at Microsoft and CrowdStrike you need to assume this can happen and that the impact will be huge. Don't tell me nobody ran a "what if" scenario.
At best both Microsoft and CrowdStrike could have done way more to allow some sort of fail-safe mode.
For example: you detect your driver was not started or stopped correctly 2 times in a row after a content update > let's try something different and load a previous version or no content at all > check with the mothership for new instructions.
Which would still be bad, but only "resolve itself after 2 reboots" bad...
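A rough sketch of that fail-safe idea, with every file name and constant invented: persist a counter of consecutive failed starts after a content update, and once it crosses a threshold, fall back to the previous known-good content instead of loading the new file again.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical state file tracking consecutive bad starts since the last
 * successful load of freshly updated content. */
#define STATE_FILE     "failed_starts.txt"
#define MAX_FAILURES   2

static int read_failures(void) {
    FILE *f = fopen(STATE_FILE, "r");
    int n = 0;
    if (f) { if (fscanf(f, "%d", &n) != 1) n = 0; fclose(f); }
    return n;
}

static void write_failures(int n) {
    FILE *f = fopen(STATE_FILE, "w");
    if (f) { fprintf(f, "%d\n", n); fclose(f); }
}

int main(void) {
    int failures = read_failures();

    const char *content = "channel_update_new.bin";   /* freshly pushed content */
    if (failures >= MAX_FAILURES) {
        /* Two bad starts in a row: stop trusting the new content and
         * load the previous known-good version instead. */
        content = "channel_update_previous.bin";
        fprintf(stderr, "falling back to %s after %d failed starts\n",
                content, failures);
    }

    /* Assume this attempt fails until proven otherwise; a confirmed healthy
     * run (or watchdog check-in) resets the counter below. */
    write_failures(failures + 1);

    printf("loading %s\n", content);
    /* ... load and validate `content` here; once confirmed healthy: */
    write_failures(0);
    return 0;
}
```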
Yes. This is why corporations should not use Windows in mission-critical systems. It's too fragile, with no resiliency or automated rollback built in.
Even my lowly chromebook can revert a bad OS update automatically by switching partitions at boot. Microsoft should have provided some level of resilience after all these decades.
Why should the null bytes in the file have anything to do with it? If you deref a nullptr, you crash in C++.
Dude, it is still the driver developer's fault. What happened to using MD5 or SHA checksums to validate the contents of a critical file? If the driver did the one simple step to do checksum validation, it would have noticed that the contents of the data file is not valid, and could have refrained from loading the file and could then have issued an alert instead of BSODing. It would be a very simple step to also add the checksum and do validation during the CI/CD pipeline and the installation process.
There is no in-person repair in most of these cases.
Servers in data centers usually do not use disks directly but go through storage network technologies.
They can access the file systems of the affected machines remotely.
Mac/Linux users: whistling and looking around.
Headline should be "level 1 techs save the world."
Maybe, but you don't push an update to a kernel driver to all your clients at the same time. Kernel drivers are serious business; you don't want to mess with them.
You have testing environments for that.
If you don't push it to everyone at the same time, you could be sued for giving some customers priority over others (ergo discrimination, or downright theft), as some security updates may be essential.
A bug of this caliber simply should not have been allowed on live, it was a most basic and serious mistake for a security company.
@@TheUnkow BS.
Most large corporations have change management in place to prevent production software from being updated without going through all the necessary quality steps. Crowdstrike updated clients' systems without their knowledge or permission.
In addition some customers are OK being a beta customer, those are the ones you target first, then the majority and finally the customers who say they want to be one release behind.
Deploying all at the same time is clearly irresponsible and highly dangerous as evidenced by this disaster.
Making an update available for download by everyone is fine, but pushing said release to everyone at the same time is irresponsible and will be the basis for a slew of lawsuits against Crowdstrike.
@@JeanPierreWhite If someone screws up, they can always be sued.
If the option of having beta testers is included in the contract, then that is a kind of software model; those rules were agreed upon and that is fine.
BUT as a big business I would not want my software provider just deciding that my competitors get some security updates before me because they had some kind of additional arrangement, or because the provider simply deemed someone should get them first (even if it was round-robin).
And yes, almost everyone tries to skip steps until a thing like this happens. Skipping saves capital, and because the first priority of companies is capital, not the lives of the humans in hospitals who depend on their software, things like this will continue to happen; they are not right just because they are being done by many of the corporations.
No code reviews, unit tests or integration testing at CrowdStrike? If a bug occurs, it should show in the lower development/staging environments. Then, via CICD, all of the existing code is merged with the new updates for the pending release, where it goes through user acceptance and quality assurance testing. Any bugs are cherry picked and go into a future release once the snapshot is more stable. I don’t see how this couldn’t be recreated prior to pushing to the main update channel, and I don’t know why they don’t do split testing. I’m struggling to understand the way that a dev and QA engineer isn’t at fault here.
Dev, QA and deployment manager.
No matter how good your QA and devs are mistakes will slip through. Deploying to everyone at the same time is super dumb and dangerous and bypassed all customers change management. WTF?
In c++ you have to check if data is valid!
Yup, I have seen this failure mode with solid-state flash memory. What happens is that the part of the flash that records the file allocation is OK, but the part that contains the data is broken and all the actual data is null. Potentially this could happen in any device along the chain in the CD pipeline. What would be really interesting is whether the hash/CRCs for the file were calculated and provided with the update package; that would always be best practice.
It was the Fisher Price dev tools they used!
You COULD prevent that by signing before testing, so the signature guarantees what you're shipping is what you've tested.
I am amazed by the fact that it is 2024 and we're still writing software ( especially operating systems and drivers ) in non-memory safe languages and without formal verification. Crazy. That's what we should be using AI for IMHO.
No, I disagree. They should be testing the updates on Windows, Mac, etc. by rolling out the updates to those machines, performing restarts, and then running a full Falcon scan to check that the application behaves as expected. Also, an engineer can check whether a value is null before dereferencing it, so I think this is an engineering/testing issue by CS for sure. But hey, live and learn; I guess their testing processes will be updated to catch these types of bugs going forward.
A security company of that scale should have their testing and update pipeline figured out. Learning basics at that size is just unacceptable.
This is what happens when Gen Z social media types who never had professional experience pretend to act like they have professional experience. This guy totally has no idea what he’s saying but just saying it confidently.
True, he completely lacks understanding of what he is talking about.
SO so true.
NULL dereference is always a dev's fault, because it is a lack of simple error handling.
A pointer's validity (i.e. not being a null reference) can always be checked, so the dev that wrote that does hold some accountability. But what's more important is that code is run in small execution blocks that never take down the whole system when an exception of any kind occurs.
rubbish. this is kernel level code, and the wrong type of bug will crash the kernel on any operating system.
the problem is it should never have been deployed, should have immediately stopped deployment when the machines started crashing, and windows needed a better response to a broken driver than just put your locked down machine into safe mode and fix it yourself.
This guy is on drugs. All they had to do is test it on one PC.
He has 0 clue what he's saying. You can do plenty of null pointer dereference checks, similar to typical null checks.
Crowdstrike was anticipating a major worldwide attack by a never before heard of hacks so the development team decided to put Windows into "Super Safe Mode".
I won't finger-point at the devs. I'd point at releasing production changes on a working day, especially with CS being an agent that sits at the kernel level of the OS. This means zero testing was done.
This was truly hilarious, I hope we see more like this.
A simple Zero Initialization would have prevented this. ZII (zero is initialization)
Even with a basic tech understanding, there seem to be way too many obvious errors involved in this incident. ZII, staggered rollout, local testing...
This is not how a professional software security company operates.
The CIO of CS should be fired with all of the dev and testing team. Plus hospitals and airlines should never use Windows for critical activities.
Agreed. This proves beyond any shadow of doubt that Windows is not fit for purpose.
Thank you & agreed, I greatly appreciate the insight you provided into the matter. Whoever gave the go-ahead to push the file out to all of their clients (the sysadmin, the manager, whoever) and then very quickly afterward watched half the world come to a screeching halt as a result is not likely having a good weekend...
IF? If it had only happened to certain computers, with certain OS versions, then I'd maybe believe that testing might not have caught it. But with this many computers, all at the same time, CrowdStrike's pre-delivery testing on a deployment box should have broken too! So deployment testing was not done properly, if at all.
I like that property. Not being creepy, just a compliment, looks nice.
"It's not the devs fault." Replace the devs.
Microsoft are definitely partially to blame. They have a monopoly on the desktop OS market and have been asleep at the wheel on changes to Windows.
If both MacOS and Linux have facilities to run Crowdstrike in user space - there's no reason Windows should run it in kernel space.
With Microsoft deprioritizing Windows because it now makes up less than 10% of their revenue - this event shows us the fragility of relying on one private os vendor for 90% of global computing
Bingo.
All CTO/CIO's are now on notice to move away from Windows.
Wouldn't it be funny if Crowdstrike used their own security product (I know, right?) and had bricked their own computers as well?
Test your final deliverable *as a customer*
Clearly Crowdstrike bypassed all the change management that corporations employ. There is no way that all corporations worldwide decided not to do change management last Thursday. Crowdstrike updated customers' production systems, bypassing all change management. Bad, bad, bad. They will be sued out of existence.
Yes, that is a nasty one. Thanks for this video and explanation. I don't suppose there is much exception handling possible at the kernel level. Everything has to be small and tight. Back in the old days when Windows didn't have compartmentalized memory management, we sometimes saw the 'Blue Screen of Death' when writing C code. We could write code for an application that would tromp on the memory of another running application. Bad things happened, LOL. Basically the operating system would barf.
I don't accept your comment that this could have been introduced after release. This bug looks to be reproducible, and their release process should involve dev-to-QA-to-prod testing (both unit and functional). This is not an 'edge case' bug.
I thought Windows allowed you to roll back to the last known working settings on a blue screen, especially before updates.
IT has done it again, so much for AI and the star trek technology we claim we have.
Even I know yer not spozed to dereference a null pointer... How could it not be the devs?
I wish everyone knew what really happened and would stop blaming Microsoft and even the government in those affected countries.
It's aliens.
It's always aliens.
but the machines staying down is directly due to microsoft.
if they had looked past safe mode and implemented something to detect and recover from bad driver updates, then it would have been a simple case of turning the machine off and on again and letting it recover.
@@grokitall Yep. Microsoft have a lot of 'splaining to do.
You kind of downplay the actual largest concurrent global IT outage in history, dude.
Just tech people typically downplaying issues and avoiding accountability.
Just imagine the number of Jira tickets and story points within CrowdStrike right now... Non-dev folks can leverage this and micromanage all devs moving forward lol
Of course I have never forgotten to check the pointer, never happened in the past, totally impossible👀.
Those of us who aren't code geeks read:
Crowdstrike has way too much power.
Truth.
@@JeanPierreWhite Even those of us who are code geeks believe that.
The idea that security somehow involves installing a remotely controlled agent that can potentially go full "Smith" on critical servers is the problem.
Yes.
Fr. Security should never require critical code that can CRASH the kernel to be continuously deployed so easily.
AV software is clearly very risky. The industry for some reason seems obsessed with it. Customers keep asking for it on their servers, I keep saying no we have other ways to handle this hazard. Why oh why are you letting a vendor push kernel changes to all your domain controllers at the same time? Why are you in a position where you feel you need this kind of software on critical servers? Are you letting your administrator browse the Web from your file server?
AVs are risky, users are riskier. No, riskiest.
@@kxjx then what happens if a bad actor manages to exploit your server in some way to get their malware onto it and running? Without an AV on the server to help catch it, surely it would be able to run loose for far longer than it would have if an AV were present, would it not? What other, better ways do you have to "handle this hazard" without something present on the server to try to identify and deal with it?
This could have been avoided if CrowdStrike used a null safe programming language.
No it couldn't; it seems like it was read from a file, so no memory-safe language would have helped.
It’s the developers fault. They didn’t check that the struct pointer was NULL before referencing it.
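For illustration, a minimal C sketch of the check that comment describes, with all names hypothetical: verify the pointer before touching any field through it, and bail out with an error instead of faulting.

```c
#include <stdio.h>

/* Hypothetical structure parsed out of a content update. */
struct rule {
    unsigned int id;
    unsigned int flags;
};

/* Hypothetical lookup that can legitimately come back empty. */
static struct rule *find_rule(unsigned int id) {
    (void)id;
    return NULL;   /* simulate the "nothing found" / bad-data case */
}

/* Returns 0 on success, -1 if the rule is missing: NULL is never dereferenced. */
static int apply_rule(unsigned int id) {
    struct rule *r = find_rule(id);
    if (r == NULL) {
        fprintf(stderr, "rule %u not found, skipping\n", id);
        return -1;                     /* report and recover, don't crash */
    }
    printf("applying rule %u with flags %u\n", r->id, r->flags);
    return 0;
}

int main(void) {
    return apply_rule(291) == 0 ? 0 : 1;
}
```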