What Went Wrong With Crowdstrike?

  • Published: 15 Oct 2024
  • On Ask the Tech Guys, Leo Laporte and Mikah Sargent recap the BSOD catastrophe of July 19, 2024. What happened at CrowdStrike that caused them to release such a widespread buggy update?
    Full episode at twit.tv/atg2034
    You can find more about TWiT and subscribe to our full shows at podcasts.twit.tv/
    Subscribe: twit.tv/subscribe
    Products we recommend: www.amazon.com...
    TWiT may earn commissions on certain products.
    Join our TWiT Community on Discourse: www.twit.commu...
    Follow us:
    twit.tv/
    #crowdstrike #informationtechnology #windows
    About us:
    TWiT.tv is a technology podcasting network located in the San Francisco Bay Area with the #1 ranked technology podcast This Week in Tech hosted by Leo Laporte. Every week we produce over 30 hours of content on a variety of programs including Tech News Weekly, MacBreak Weekly, This Week in Google, Windows Weekly, Security Now, and more.

Comments • 48

  • @richardbrekke3289 2 months ago +1

    Early last year you might remember CrowdStrike basically laying off a few hundred employees under the cover of return-to-office mandates. In other words, a lot of people with the talent to easily find another job simply left, presumably leaving behind less experienced and/or less qualified workers, who one might further assume would also have to carry the extra burden of whatever workload those individuals had been doing up to that point. The combination of inexperience and overburdening can easily cause a cultural drift toward cutting process corners to meet due dates. Also note that while the current Windows outage got publicity due to its massive blast radius, CrowdStrike has done this several times recently, taking down Debian and Rocky Linux. There appears to be a pattern here, and I would not be surprised to learn that this effect arose out of that stealth layoff from last year.

  • @JeanPierreWhite 2 months ago +2

    Leo. There is so much to be learned from this event. To say there is nothing to learn resigns oneself to repeating this over and over. We can only get better by learning.

    • @mallninja9805 2 months ago

      Yeah, but we won't. The real lessons, like "Hey, maybe we shouldn't outsource security. Maybe we shouldn't centralize critical services. Maybe we shouldn't rush software into production," will be brushed aside. Instead, this event will be used to justify further walled gardens and tighter limits on which applications can be published or installed.

    • @JeanPierreWhite 2 months ago

      @mallninja9805 I agree many companies won't realize their own stupidity. However, I believe almost all Fortune 500 corporations already have a robust change-management and release-cycle process in place that they use religiously for their data center operations. Endpoints are a blind spot for many organizations, seen only as a security risk, not as an asset to be protected by change management. The smart ones will bring critical endpoints into their change-management procedures. Dumb corporations will carry on and simply blame CrowdStrike and Microsoft.

  • @gslim7337 2 months ago +6

    Hey Microsoft, I'm on the road between Melbourne and Perth, stopping at some very remote 24hr diners that on day 3 still have the BSOD and are handwriting all orders and only taking cash. Can you charter a jet along with a fleet of helicopters with a bunch of these USBs? Just send the bill to George over at CrowdStrike. He'll know what it's about.

  • @loup754 2 months ago +6

    Why wasn't the update sandbox-tested at CrowdStrike? Why aren't critical infrastructure servers staging updates from their vendors? That is what is done in my workplace, because we do not trust vendors to properly test their own software. Protecting against zero-days needs to be balanced with uptime. (See the staging sketch after this thread.)

    • @joelrobert4053 2 months ago +1

      They must’ve deployed that update as an emergency CO that didn’t call for testing

    • @haroldcruz8550 2 months ago

      Companies shouldn't rely solely on third parties for zero-day protection; that's how you prevent poorly tested updates like these.

    • @tibtrader 2 months ago

      Cuz it was friday 😂
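    A minimal sketch of the staging gate @loup754 describes: hold a vendor-pushed update in a small test ring, let it soak, and only promote it to production if the test hosts stay healthy. The ring names, timings, and the deploy/health functions are all hypothetical, not any vendor's real mechanism.

        import time

        TEST_RING = ["test-host-01", "test-host-02"]      # sacrificial canary endpoints
        PROD_RING = ["pos-kiosk-17", "web-01", "web-02"]  # everything that must stay up
        SOAK_SECONDS = 4 * 60 * 60                        # let the update run for a few hours first

        def deploy(update_id: str, hosts: list[str]) -> None:
            # Stub: in a real environment this would call your endpoint-management tooling.
            print(f"deploying {update_id} to {hosts}")

        def host_is_healthy(host: str) -> bool:
            # Stub: in a real environment this would query monitoring for crashes/BSODs.
            return True

        def staged_rollout(update_id: str) -> None:
            deploy(update_id, TEST_RING)
            time.sleep(SOAK_SECONDS)                      # soak period before any promotion
            if all(host_is_healthy(h) for h in TEST_RING):
                deploy(update_id, PROD_RING)
            else:
                print(f"holding {update_id}: test ring unhealthy, not promoting")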

  • @eddy2561 2 months ago +1

    I've been dealing with microcomputers since 1979 (yes, I'm that old..LOL) and this has been going on forever and will continue.....forever!

  • @tdbnz123 2 months ago +9

    This is what Y2K wishes it was 😂

    • @davidew98 2 months ago +3

      Yep, this ended up being what Y2K was hyped to be!

  • @grokitall 2 months ago +1

    The failure mode is that any kernel-level code can cause this, and every driver is kernel-level code.
    Every kernel is vulnerable to this, and it will cause the kernel to crash; the potential difference is in how they handle it.
    The solution is to report success when the next driver is asked to be installed, and, when the system reboots, to just disable the driver that crashed.
    That makes the fix a simple power cycle of the machine; after that it can tell the OS and driver vendors that the driver broke.
    Mandatory automatic updates are a bad idea, because you then cannot test them on a canary machine.
    However, this whole thing could have been avoided even with automatic updates.
    First, they could have done continuous integration, creating a checksum file after those tests were done and making sure the driver update code checks the checksum.
    Then you deploy to test machines, preferably using continuous delivery.
    At this point the files are on both the test machines and the CI/CD server, and you can verify they match before pushing them to the public.
    Then you do a canary release cycle, gradually releasing to more and more people.
    When the software update goes out, the client-side update software also checks that the file signatures match, blocking the rollout if they do not, so it does not break the kernel and it stops the canary release.
    Finally, the OS can track which driver it is loading, and after the kernel panic it can just block it, so you only need a reboot to recover.
    None of this was done, or it could not have happened, so the blame belongs squarely to CrowdStrike for shipping the broken driver, and to Microsoft for not fixing the recovery model after McAfee did exactly the same thing.
    None of this is new tech, so the only lesson to learn is to actually learn how to do your job, and then do it.
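    A minimal sketch of the client-side checksum gate described above, assuming the CI system publishes a digest manifest alongside each content file. The file names, manifest format, and function names are invented for illustration; this is not how CrowdStrike's actual sensor validates updates.

        import hashlib
        import json
        from pathlib import Path

        def sha256_of(path: Path) -> str:
            return hashlib.sha256(path.read_bytes()).hexdigest()

        def verify_against_manifest(update_file: Path, manifest_file: Path) -> bool:
            # The manifest maps file names to the digests computed after CI tests passed,
            # e.g. {"channel-update.bin": "ab12..."}.
            manifest = json.loads(manifest_file.read_text())
            expected = manifest.get(update_file.name)
            if expected is None:
                print(f"refusing {update_file.name}: not listed in the CI manifest")
                return False
            if sha256_of(update_file) != expected:
                print(f"refusing {update_file.name}: digest mismatch, halting the canary rollout")
                return False
            return True  # only now is the file handed to whatever loads it into the kernel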

  • @An.Individual 2 months ago +3

    I thought there was a weakness in the way Windows handles kernel extensions?

    • @JeanPierreWhite 2 months ago

      Bingo

    • @mikaelstrom1114 2 months ago

      It's a consequence of monolithic kernels. Tanenbaum was perhaps right :)

  • @aaronstevens9937 2 months ago +1

    So home users really don't need AV? Is Microsoft Defender sufficient?

  • @mjmeans7983 2 months ago

    The USB key doesn't, and can't, work on computers that use MS BitLocker drive encryption where the encryption key is not available. Many IT departments don't record the BitLocker recovery key for end-user systems due to security concerns over what could happen if those keys were exfiltrated from the company. They instead opt to discard the recovery keys so that nobody can access the hard drives, and implement a device replacement policy while mirroring any user data on the company's servers. It would be logical for all kiosk systems and all secure remote employee systems to be managed with this approach.
    Apparently, CS doesn't (or didn't in this case) implement a fail-safe strategy such as a staged update, or utilize Windows System Restore to be able to revert to the last known good state. Logically, however, if they had, it might have given hackers another vector to attack CS-protected machines.
    Will IT departments learn to manage BitLocker recovery keys for critical systems better? Will CS implement some kind of fast recovery that doesn't create new vectors of attack? Could some kind of client-side config update validation be implemented that doesn't create a new vector for attack? Will CS hire Steve Gibson to direct a new reliable, secure and fail-safe sensor in assembly language? Only time will tell.
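    One way to picture the fail-safe the comment wishes CS had: keep the previous content file as a "last known good" copy and put it back automatically if the machine crash-loops right after an update. The paths, threshold, and crash counter here are invented for illustration.

        import shutil
        from pathlib import Path

        ACTIVE = Path(r"C:\ProgramData\ExampleSensor\channel.bin")           # hypothetical paths
        BACKUP = Path(r"C:\ProgramData\ExampleSensor\channel.lastgood.bin")
        CRASH_LIMIT = 3

        def apply_update(new_file: Path) -> None:
            shutil.copy2(ACTIVE, BACKUP)    # preserve the last known good copy first
            shutil.copy2(new_file, ACTIVE)

        def on_boot(crashes_since_update: int) -> None:
            # If the machine has crash-looped since the update, restore the old file
            # instead of letting it fault the kernel driver yet again.
            if crashes_since_update >= CRASH_LIMIT and BACKUP.exists():
                shutil.copy2(BACKUP, ACTIVE)
                print("reverted to last known good channel file after repeated crashes")

    As the comment notes, any such rollback path would itself have to be protected so attackers can't abuse it to force a downgrade.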

  • @PerryGrewal 2 months ago

    We need more competition and choice in the commercial operating system market.

  • @jaygreentree4394 2 months ago

    If there is anything I've learned from my short time in IT, it's never to change anything on a Friday or a Monday.

  • @HitnRunTony 2 months ago

    I work remotely for a company where most people work hybrid, so I had to walk people through the process.

  • @alexrodasgt 2 months ago

    I don't know about "not using AV software". I decided to pay for Avast Premium and the folder-monitoring feature saved my bacon. Turns out the solution I was using to get around Microsoft's terrible Start menu (RocketDock) tried to access a folder with bank statements, and my AV caught it and asked me if I wanted to block it.
    Sure, I know it's old software, but it doesn't have online features, so I didn't think much of it. Now if it was part of a daisy-chained attack, I guess I'm compromised elsewhere and done for anyway?
    EDIT: I also feel 3rd party solutions are always faster than the solution included in Windows.

  • @jameslarosa2396 2 months ago +3

    Sounds like something that could have easily been tested.

    • @AAEmohawk 2 months ago

      Given that not all systems crashed, it would have been tested and passed. I know my work laptop rebooted but came back up.
      But more than half the fleet didn't, and most of the servers.
      But hey, just think of all the overtime we are getting paid to fix it. 😅

    • @jameslarosa2396 2 months ago

      @AAEmohawk That just tells someone who was a software developer for forty years that the test plan was weak.

    • @AAEmohawk 2 months ago

      @jameslarosa2396 Well, you would know you can't test every possible system configuration that people might have.
      I'm sure they will do better moving forward and will not let this happen again.
      We are not going to move away from the product, as it is damn good.

  • @cuebal 2 months ago

    Someone on Reddit photoshopped the Las Vegas Sphere one.

  • @ProfessionalBirdWatcher 2 months ago

    CrowdStrike is a billion-dollar company, with a B. They're trusted by critical government, public, and private services, and they shafted each and every one of them. The lack of outrage from our authorities is infuriating!

  • @JanRademan 2 months ago

    I assume that is Windows NT 3.51 server and not the desktop Windows 3.1. Two completely different things.

  • @davelogan77 2 months ago

    It's like they should have had a test layer in place before pushing the updates live to machines globally...??? 🙂

  • @neiltsubota4697 2 months ago

    Does the US Defense Department use CrowdStrike?

  • @JeanPierreWhite 2 months ago +2

    I don't agree with Microsoft that it's not their problem. It most certainly is, because they have built an OS that is so fragile and open to failure, with no easy recovery tools, as some of the IT guys on the front line of this disaster found out.
    In addition to Microsoft, the companies running CrowdStrike are also to blame. They have a system that auto-updates with no change management or testing before the update affects production systems. Corporations don't think desktop PCs are that important, but clearly they are, and the department in most organizations that does spend time on desktop systems is the security team, who see them only as a threat and therefore throw out all sensible release methodology in order to be "more secure". There is such a thing as too much security.

    • @AAEmohawk 2 months ago

      There is testing.
      But hey, they are a security company that is supposed to prevent hacking.
      Try hacking a brick. 😂😂😂

    • @GrimYak 2 months ago +1

      What would you have the OS do? This was a kernel panic; any other system that runs into this panic will do the same.

    • @JeanPierreWhite 2 months ago

      @GrimYak OK, let me explain.
      The OS boots core drivers only, so that it can read memory and disks.
      It increments the boot counter on disk.
      If the boot counter > 3, flip the A/B boot-partition switch and reboot; the system boots using the prior version of the OS and drivers and zeroes the boot counter after a successful boot.
      Otherwise, boot loading all drivers (which may fail and cause another reboot).
      This presupposes that the OS keeps two copies of itself in separate partitions and has a boot mechanism that can select one or the other, depending on a boot-loop detection algorithm.
      ChromeOS and some Linux distros have this capability. Windows does not.
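    The same boot-loop logic as a small, self-contained simulation (the A/B slot scheme ChromeOS and some Linux distros use). The slot names and threshold are illustrative; nothing here is Windows code.

        BOOT_LIMIT = 3

        class BootState:
            def __init__(self) -> None:
                self.boot_counter = 0    # persisted on disk by a real bootloader
                self.active_slot = "A"   # which copy of the OS and drivers gets booted

        def early_boot(state: BootState) -> str:
            state.boot_counter += 1
            if state.boot_counter > BOOT_LIMIT:
                # Too many failed boots: flip to the other slot and start clean.
                state.active_slot = "B" if state.active_slot == "A" else "A"
                state.boot_counter = 0
                return f"boot loop detected, falling back to slot {state.active_slot}"
            return f"booting slot {state.active_slot}, attempt {state.boot_counter}"

        def boot_succeeded(state: BootState) -> None:
            state.boot_counter = 0       # a clean boot resets the loop detector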

  • @thomj78 2 months ago

    Leo, you're talking as if the CTO himself did this? It was the people and processes under him; though it's his responsibility, he doesn't see or do everything that someone below him did. The bigger questions are how the QA process was streamlined, what sort of testing was done, how this was missed, and how we avoid repeating it. There are lessons to be learned. Also, service-affecting devices shouldn't be running on Windows; we all know the holes and problems Windows brings. So why put ourselves in that situation, running infrastructure services on Windows? It's a good desktop platform, since that's what it was originally intended to be. Today it was CrowdStrike and unintentional; tomorrow it can be someone else, so do you really want to run into the same BSOD situation again? MS says they are not responsible, and each of us says we are not responsible. We all have some responsibility in all of this, whether we like it or not.

  • @davidew98 2 months ago

    I don't think Southwest uses Win 3.1. I heard they use a Commodore 64!

  • @frankruss4501 2 months ago

    The leading edge is all too often the bleeding edge.

  • @donjaksa4071 2 months ago

    Time to dust off that NIST 800-53 Contingency plan

  • @JeanPierreWhite 2 months ago

    These 15 million endpoints need to be running something other than Windows. If the endpoints are this critical, Windows is clearly not resilient enough to boot failures.
    Immutable Linux with atomic updates would be much safer and capable of automated recovery.

  • @Spitfire_Cowboy 2 months ago

    Nothing like telling our enemies what security platform many critical infrastructure organizations and companies are using. That data leakage is a cybersecurity issue in itself, as threat actors can now tailor their attacks to bypass that system. APT40 and APT28 are watching closely.

  • @davidc5027 2 months ago

    This too shall pass.

  • @bernardsimsic9334 2 months ago

    And they all hire each other back because they are so good at this stuff; so what if they accidentally crash the world every ten years!

  • @nocturnus009 2 months ago

    BSOD? Too bad the O was not silent 🐡

  • @CrazyWhiteBoomer 2 months ago

    This is the FIX in case Kamala tanks in November.