I am a network systems engineer that had to deal with this for 14 hours that day. This was one of the most informative videos I have ever seen. You helped simplify Windows OS in 15 minutes in a way that hours of reading hasn't. Something about real world scenarios to tag the concept with in my memory really helps. Thanks!
The company I work at got bought by a bigger one. They required us to install Crowdstrike on all servers. We found a memory leak, that Crowdstrike still hasn't fixed after 6 months so I have refused to install it until then. I was on vacation when I saw all URGENT emails from other divisions. Thank you Crowdstrike for not fixing your memory leaks, it saved my vacation. =P
@@elta6241 I work for a school district in L.A. We purchased it for our computers. I'm guessing the company has a strong influence with government institutions.
I've been passing on purchasing crowdstrike at my org every year since 2016 as they left a sour taste in my mouth for claiming that Russia hacked the DNC servers and then being unable to provide proof. Haven't trusted them since then.
When I was in high school I had a teacher that had a way of explaining things to you that temporarily elevated you to a fraction of his level of understanding. Today I got to experience that again. Thank you Dave! 🤯
Amen. Today's computational underpinnings are somewhat opaque to me, a 74-year-old whose first computing challenge was to code a very simple program into machine language (not assembly!) and put it onto punched paper tape to run on an old machine (which predated "big iron") in the University's basement.
While this is technically what crashed machines it isn't the worst part. CS Falcon has a way to control the staging of updates across your environment. Businesses who don't want to go out of business have an N-1 or greater staging policy and only test systems get the latest updates immediately. My work for example has a test group at N staging, a small group of noncritical systems at N-1, and the rest of our computers at N-2. This broken update IGNORED our staging policies and went to ALL machines at the same time. CS informed us after our business was brought down that this is by design and some updates bypass policies. So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers' staging policies which would have prevented this type of widespread damage. Unbelievable.
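To make the N / N-1 / N-2 idea above concrete, here is a minimal sketch of how such a staging policy could pick a version per machine. It is not CrowdStrike's API or configuration format: the group names, the `STAGING_LAG` table, and the version numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    group: str  # hypothetical group names: "test", "noncritical", "production"


# Hypothetical policy: how many releases behind the newest each group runs.
STAGING_LAG = {"test": 0, "noncritical": 1, "production": 2}


def version_for(host: Host, releases: list[str]) -> str:
    """Pick the release a host should run; releases are ordered oldest to newest."""
    lag = STAGING_LAG[host.group]
    # Clamp so a short release history never indexes past the oldest release.
    index = max(len(releases) - 1 - lag, 0)
    return releases[index]


releases = ["7.14", "7.15", "7.16"]  # made-up version numbers
fleet = [
    Host("lab-01", "test"),
    Host("print-02", "noncritical"),
    Host("db-03", "production"),
]
plan = {host.name: version_for(host, releases) for host in fleet}
```

With this policy only `lab-01` would receive the newest release. The complaint in the comment is that the bad update was delivered outside of any such per-group selection.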
yea this - this was exactly why I held off on many key systems for CS deployment - I was not happy they would just override all staging - nothing should EVER be allowed to do that and I was going to have a zoom call with their team about this.
@@zippythechicken The CS product is one of the best available. It can monitor and alert for far more 'risky' behavior than most other in the industry. There is a reason why their install base is so huge.
@@zippythechicken Because ransomware is a big risk and companies want to have insurance and the insurance companies require insured companies to run software like CrowdStrike because it can and has stopped ransomware infections.
I guess if we look at it from the media's perspective, they need to explain it in a way so everyone who is not IT savvy can understand it. But for those of us who know a bit about IT, he does a very good job explaining it.
to be fair, he's doing it in more than 10 minutes which i'm sure is well over the limit of any "media outlet" - or even a dedicated tech program (for the masses - such as BBC Click). still, one cannot deny the clarity of his explanation which very carefully avoids hinting at responsibility of the error and just states the basic facts of what happened. i don't work in IT and only know cursory programming in HTML and some basic C++ but Dave's explanation made sense to me without having to understand the technical details ! very commendable and earns a Subscribe from me even - just because that's all he is in it for (!) - he deserves as many likes as i can give him for such a beneficial message to society !!
What a great explanation! No bull crap. No conspiracy theories. No badmouthing. Just plain facts. Even me… who rarely uses a computer anymore understands, and follows Dave’s explanation and walks away a little more knowledgeable. Thanks Dave😊
Back in those days , I was 19 years old when I decided to buy a Windows 95 book to try to find out how Operating Systems worked. Could not understand a thing but I do vaguely remember something about Ring 0 and Ring 1. Excellent video deserves a multiple like feature. 😄
@@rembautimes8808 I learned to code at 14 on an Atari 400 early 80's, and teaching my computer teacher. Always thought I'd be CEO of IBM but then I hit 17 and wanted to be a rock star. 😆 I remember some of this stuff and found I could follow the logic trail, more or less. He pulls it all together so well I feel in 10 minutes I have a way better picture than I have over the last 40 years.
On Friday everybody also got a crash course in Change Management (or how NOT to do it) too. There are normally multiple barriers in the way of something catastrophic like this happening, and they all got skipped.
They really are the best laptop, and I can do Final Cut on it. On the desktop, I've also got a PC with Windows (and Ubuntu via WSL2). But for anything video related, it's Mac, so...
@@thomasbrotherton4556 Microsoft employees also use MacBook in Microsoft. They're a software company. They don't care what you use if you do your work. There are teams inside Microsoft that use slack instead of teams.
"If you cannot explain it to an 8 year old child, you do not know it well enough yourself." Some Scientist said (possibly Einstein) but my brain is a vast relational database of broken links so don't trust it!
What I've learned so far is that every OS has a big boss and that big boss ensures everyone follows the rules and as soon as someone gets out of line the big boss shuts the party down before the looting begins. In all seriousness this is a great video. Subbed!
I completely agree and totally appreciate how Dave does get straight to the point. I'm sure many other content creators start with useless background simply to "pad out" the video.
Absolutely agree. It was a pain to see so many "experts" around the globe talking so much while not explaining anything at all, except that there is nothing that could be done, while as an interested professional you knew that a business could build better systems and architectures (like a few, that were not impacted did) and these people were just talking heads not knowing what was actually going on.
Crowdstrike convinced my company I work as a Network Engineer for to swap over to them and we did around a month and a half ago... The person who made that decision didn't have to wake up super early in the morning on Friday while panicking.
Tbh there is no guarantee that the other companies have their updater utility made in a safer way, at least cs will pay more attention to that part now. But overall that and their wokeness is something that gives off a bad vibe about the company
Well, to recognize a domain expert as such you also need a community of engineers who know at least enough to tell one apart from a quack. Those are becoming thin on the ground as well. 😅
the reason they are scarce is because "unfettered truth" is bothersome to parties with vested interests, i'm sure many media outlets (big enough to be legal targets) avoid explaining it totally for fear of upsetting the "wrong people".
Dear God! I’ve been out of the IT world for 15 years now, and I still understood his explanations. I’m VERY IMPRESSED by Dave’s clear and concise presentation and astounded by the fact that I remembered enough of this “stuff” to finish some of his sentences! Until today, I was convinced that a benevolent universe had purged all that out of my head to make room for important stuff (like cocktail recipes).
Yes, exactly how I reacted. Like you, I am a retired IT consultant, retired for nearly 20 years. Dave's presentation could not be clearer. It made me wonder whether the Crowdstrike P-code interpreter creates another vector for introducing viruses, malware and rootkits.
@imairt Completely agree with you, and for pretty much the same reasons. I'd bet YouTube is awash with channel hosts mumbling and umm-ing their way through this issue right now! (I'd have a look but I don't think I could bear it!)
Agreed! Well explained, and included depth I haven't heard from anyone else so far. Microsoft shares some blame, because Windows is easily broken, gives very little useful diagnostic info, etc. That said, CrowdStrike: I wonder how many people learned something about how to behave one's self whilst in kernel mode. LOL
As a predominantly IBM Mainframe Sysprog (retired) I am heartened that I actually understood everything explained in this hugely informative explanation. Thank you!
Yes same here... One of the differences is that mainframes use storage protect keys in addition to supervisor/user mode in the PSW. And yes, I've had to fix my share of hard waits due to program checks in the supervisor code :(
As a retired analyst / programmer on IBM mainframes and various minis I've spent plenty of time investigating core dumps, particularly on DOS/VSE. It's one thing to locate the failed instruction, (invariably a decimal exception where a packed decimal field has an invalid value), but tracking what happened up until that point is the fun part. Performing stack chaining and linkage chaining through called subroutines gets very complex and a bit tedious, (especially when called in to work at 2.00am while attending a party on Friday night).
This was incredibly precise and VERY easy to understand. Fortunately my employer doesn't use Crowdstrike so I got to sit back and watch some of my friends scramble. Thank you for putting this out.
This is THE BEST explanation of the Crowdstrike-related outage!! In fact, so many other videos are not even explanations but mere rehashing of 'what' went wrong, instead of 'how' & 'why' it went wrong...
And this is the type of video (or own investigation) I hope government agencies do for the incident. The actual root problem needs to be addressed, not slaps on the wrists or finger pointing. Crowdstrike needs to be punished, but it needs to be understood that another bad actor can do this again or Microsoft themselves and beyond that that this isn't just a windows issue. Apple and Linux don't allow deep kernel level access like this, but theoretically they could still cause themselves a similar issue. We need better regulation over something so ingrained in our lives than the promise that it won't happen again.
@@vullord666 Hmm... I'd be careful in assigning blame...Accountability is fine, but culpability is another ball game altogether... Also, reg. more/better regulation, well, more regulation always has a trade-off of less freedom & less privileges...So, one should be careful what one wishes for... I understand your point; am just saying let's not be reactionary or have a knee-jerk reaction to this incident/issue...
This CEO looked like Don Knotts on speed during his TV appearance. 🤣🤣🤣🤣🤣🤣🤣🤣 He is the absolute Faceplant Champion of Computers. If he was an athlete, we would retire his number.
Just a classic case of a tech company taken over by non-tech leadership who doesn't really understand the intricacies of software development, cutting costs in the wrong places.
Rarely do you encounter a technical subject presented in a manner that effortlessly transcends a wide range of listeners' understanding or experience levels. This video conveys core concepts in an easy-to-understand and memorable way. Dave achieves this without forced analogies or a condescending tone. I learned something today that I will retain. Thanks, Dave, for the great content! 💯
3 days ago no one outside of IT had ever heard of Crowdstrike. Yeah.. idk about that. Everyone at my job knows what Crowdstrike is; and they are not in IT. There's a lot of people that knows about it (who uses a computer, and works in a office).. but not the average joe who's working on a machine or something. But not every work place used Crowdstrike.
@@joester4life Yeah, pretty much everyone who works in or with IT security knows Crowdstrike. As for the users though, the original assumption might very well be true.
I'm a SWE who uses a mac. I knew about crowdstrike falcon from looking at my activity monitor to see what was causing my fan to sound like a jet engine. Falcon was consistently at 800% CPU usage. Complained to the security people to no avail. Fortunately for me this borked update did not seem to affect the mac version. Hopefully my company ditches this junky software.
I think you overestimate how much the user side of this matters. Even juniors have little input on what their companies use, especially in large corporations, so what matters more is what the big wigs think of it
Hi Dave, I’m also a retired Windows developer. It was fun listening to you talk about all those old system components that used to be part of our daily life experience. I was impressed that I remembered enough to understand what you were talking about! Thanks a lot for your explanation. I confess that I feel kind of angry at the CrowdStrike developers for taking such liberties with the kernel code. Seems kind of arrogant. No doubt someone thought they were being super clever by defining their code as something required to run when the kernel starts up. Imagine if the CrowdStrike developer had just arranged a meeting with a Windows kernel expert at Microsoft to discuss what they were planning to do. A whole lot of suffering could have been avoided.
To be fair, I'm sure that if they didn't take that liberty, malware would all just be adapted to shut it down before it executed its malicious instructions. It may potentially have been necessary to continue to define itself as security software.
Sounds like MS needs a new 'must run' list/level so that only their own stuff is on there permanently and then the next level is sw like crowdstrike so its 'necessary' but can be turned off if it breaks the kernel.
@@89qwyg9yqa34t If you were really worried about malware you would be advocating for diversity among computing platforms in all businesses. Farmers know not to plant every acre with the same crops because it's not secure! Today, we really only have 3 platforms: Unix/Linux, Mac, and Windows. Three is not enough, but it would be a start if business settled on roughly 33% of their systems running each, and then began looking for a couple more.
I browsed YouTube trying to find a good explanation about the Crowdstrike outage. I found this one to be the best... Thanking the author for such a great explanation. Excellent job
Hi Dave, thanks for the explanation and bringing back some good old memories. I joined the Windows NT dev team in '93 and was at MSFT until 2011 so I'm sure we crossed paths. For all the talk about AI etc., kernel mode is still kernel mode, pointers are still pointers, and all drivers - I've written my share - should be developed with extreme care by people who understand that every line of code could cause a blue screen and heartache. "Move fast and break things" won't cut it.
@@ruk2023--skirting Microsoft driver certification procedures and low resilience code is very much a case of "move fast". The "break things" is simply a natural result.
@@ruk2023-- Kernel level QC takes time. That contradicts quick deployment. Hence the work around. The lack of stress testing and thus resilience is also a symptom of trying to just get things out the door as quickly as possible. QC takes time.
Extremely well and clearly described, Dave. As a former kernel developer (at Tandem Computers), we didn't allow such back doors, but then we were being deployed as a 24x7 hardware/software fault-tolerant server system and did not have millions using our systems, developing third-party drivers (or attacking them). Yes. Multiple failures at Crowdstrike. Someone wrote that driver code without the requisite error checking, no one caught it in reviews/inspections (if they do that, and if they don't...don't even want to go there), no one in QA thought to test for it or ran the test, someone in the release chain submitted that file (or failed to substitute the correct one if the default is an all zeroes file), etc. I don't expect today's developers/QA to think like we did (what could be corrupted if the processor/driver/adapter/etc fails between this instruction and the next and how can I prevent that corruption). Too time consuming and non-agile. But...apparently no one considers the consequences of not doing so and the damage to customers and the company it causes, or the bean-counters dismiss it as too unlikely and worth the risk.
15 years of industry experience in IT has my spidey sense tingling DIRECTLY toward the bean counters and either using poorly vetted outsourced devs or insufficient funding for enough QA staff or both.... ironic that it took down so many airlines as Boeing's bean counters did the same with the 737-MAX.
As a .NET developer whose work does not involve much around system functions, but higher level abstractions. I appreciate this breakdown of what's happening at the lower levels. Very clear and concise.
@@christophsiebert1213 i did too, some 20 years ago, Bil payed 50.000us$ and changed the code to his liking and called it dos. Pure waste of time, Ubuntu is better and more secure.
@@christophsiebert1213 Why not C/C++? than your software runs pretty much everywhere, you're wasting your time learning things one company decides or changes rather than a committee.
Dave, as a layperson I really appreciated your video. While I did not understand all the language, I found your explanation thorough and informative. I now have a better understanding of why the Crowdstrike crash was so disruptive. Thank you.
@@joelpichette that was my first thought also.. actually I remembered seeing that name for the first time just before shutting down my work laptop, and wondering about such a name. I haven't switched that laptop on since, and probably should remember this video here with Dave's instructions as to what to delete in safe mode, if it won't start next time
"They have a bug they don't protect against" is the key line. CrowdStrike added kernel drivers, but did not make them robust enough. Kernel code, especially when running such complex functionality, should be able to take more abuse from user code without causing a BugCheck. Very disappointing. Great explanation!
I wrote real-time kernel code communicating with a satellite base station via various PCI interfaces. Every friday I'd boot my system in a torture test where I'd intentionally try to crash my interface with malformed requests, out of order requests, logic errors and whatnot. I didn't want the customer's configuration scripts or user mode applications to be able to trigger a kernel panic in any way.
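A weekly torture test like the one described above can be approximated, in miniature, by throwing random bytes at a parser and checking that it only ever fails in a controlled way. This is a user-mode Python sketch under stated assumptions, not kernel code: the wire format, `parse_request`, `VALID_OPCODES`, and `torture_test` are hypothetical names invented for the example.

```python
import random
import struct


class MalformedRequest(ValueError):
    """Raised for any request the parser refuses to handle."""


VALID_OPCODES = {1, 2, 3}  # hypothetical command set


def parse_request(data: bytes) -> tuple[int, bytes]:
    """Parse a made-up format: 1-byte opcode, 2-byte big-endian length, payload."""
    if len(data) < 3:
        raise MalformedRequest("header truncated")
    opcode, length = struct.unpack(">BH", data[:3])
    if opcode not in VALID_OPCODES:
        raise MalformedRequest(f"unknown opcode {opcode}")
    payload = data[3:]
    if len(payload) != length:
        raise MalformedRequest("length field does not match payload")
    return opcode, payload


def torture_test(rounds: int, seed: int = 0) -> int:
    """Feed random byte strings to the parser; return how many were rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(rounds):
        size = rng.randint(0, 16)
        blob = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            parse_request(blob)
        except MalformedRequest:
            rejected += 1
        # Any other exception type propagates and fails the whole run.
    return rejected
```

The point of the design is that `MalformedRequest` is the only acceptable failure. An `IndexError` or `struct.error` escaping from `parse_request` would be the user-mode analogue of the kernel panic the commenter was trying to rule out.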
Thanks Dave, found myself to be on the spectrum just a few years ago, at 53. Changed everything! Thanks for your extremely lucid, helpful and complete lessons on this channel! 🙏🏻
I've been a professional software developer for 25 years, and I started out on my C64 when I was 8 years old, moved on to low level DOS 3d graphics programming and later into desktop business software and the web. Your explanations make perfect sense and I'm extremely impressed with the depth of your knowledge. I'll tip my virtual hat to you, sir.
@@DouglasLancy I once saw a guy in the subway who looked exactly like me, but that was years ago. Based on your profile picture our current resemblance isn't very strong, but age changes people
Hi Dave, as an IT student, your explanation of how the operating system works is soo good! I wish my professors were as clear as your videos. Thanks for explaining this to us. Your shirt is awesome!
Would it make sense to have a little spider like checker-tester buddy? Runs along the lines of code and executes all commands and whatnot in a VM so you can see it's all working properly and fix what isn't? That's all I can visualise we need nowadays.
Our IT guy separated our process and business LAN for security purposes and forgot to assign ethernet addresses to all the process modules. He took the whole plant down
@@Roadent1241 You are exactly right. If crowdstrike simply used virtual machines running in a sandbox (or multiple sandboxes) to test their update, none of this would've happened. I have a feeling if we ever know exactly WHO was responsible for this work I could bet their age within a +/- 5 year window. (sorry people over 50, but you know damn well what I mean)
@@lylechipperson3407 Weird that your reaction is to a 50+ yo man ( who once was a Windows developer ) who explains about the bug AND tells how to remove it to let Windows start up again as it should be.
From 40 years in the SW business as programmer, tech doc, project facilitator, I say: you make this difficult stuff understandable and digestible for everybody. Nice job! Thanks!
I am really amazed how you can have the ability to explain such deep subject in such a clear way. First time I've encountered this channel and I am now subscribed to it. Really good work!!
With 47 years in Systems Administration and Systems Programming, in Windows, Unix/Linux, and embedded systems, I've seen a lot of things go awry over this period of time, but this Crowdstrike Falcon situation was one of the most scary from the standpoint of having such a huge impact on IT services across the planet. Your description of the situation was perfect...technically spot-on, but also explained in such a way that it was understandable by just about anyone with any concept of the need to control access to device drivers, memory managers, and resource schedulers through kernel services. Very skillfully crafted, as well as calmly stated and with a subtle injection of humor that made it very engaging to listen to through the end...even for a crusty old IT guy like myself. It all goes back to the early days of computers that had "Privileged Mode" and "User Mode" to enable multi-tasking (so that multiple user-mode programs couldn't step on each other or the operating system) and timesharing (creating virtual environments for multiple users that isolate them from the hardware). Even my old PDP 8/e system has a "Timeshare and Memory Expansion" board in it that adds "User Mode" that traps the execution of certain instructions (HALT, for example, as well as JMP, JSR(Jump to Subroutine), IOT (I/O Transfer) instructions, and of course, the instructions that change between user-mode and privileged mode). When such instructions are encountered when in User Mode, the instruction is not executed, and an interrupt is triggered, which turns on Privileged Mode, and vectors to an interrupt service routine that emulates the execution of the instruction(s) that triggered the interrupt, then sets the mode back to User Mode, and returns to the user program.
This trapping has a consequence of slowing down the system a bit, as the CPU has to emulate the instruction(s) that triggered the trap (for example, an instruction that checks the status of a Serial I/O board to see if a character is ready to be transferred), but it was worth it because of the ability to isolate user programs from the hardware. That's early 1970's computing technology. Even then, there were folks that figured out how to trick the system to be able to subvert the protections and crash the primitive multi-user timeshared systems that ran on the PDP 8/e (TSS/8). Such features existed in various forms in computers long before the PDP 8/e came out, dating back to the 1950's. Just change the names from "Privileged Mode" to Ring 0, and "User Mode" to Ring 3, and the concepts are much the same. It's a bit more complicated today, with all the stuff like multiple CPUs, look-ahead, caching, user and kernel memory spaces, and speculative execution, but distilled down to the base functionality, very similar. Crowdstrike is a widely-deployed solution, as it instantly became clear with outages in a huge number of systems that directly affected the public. The place I work for uses it, and we had a number of servers BSOD as a result of the update. The fix was simple as you described, except that a few had Bitlocker set up, which added an additional layer of complexity, but fortunately, the keys were all printed, and locked up in the ubiquitous very beefy and heavily fire-rated IT Department safe. It caused some downtime of a number of applications, and certainly hassles for IT to get things back up and running as quickly as possible, but it was caught very quickly and the agents shut down on other machines before it could spread across all of the servers and end-user systems.
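The trap-and-emulate flow described above can be sketched with a toy machine. This is a minimal illustration in Python, not PDP 8/e code: the `ToyMachine` class, its instruction names, and the reduced set of trapped instructions are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Reduced, illustrative set of instructions that trap when run in user mode.
PRIVILEGED = {"HLT", "IOT"}


@dataclass
class ToyMachine:
    user_mode: bool = True
    acc: int = 0
    halted: bool = False
    output: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def run(self, program):
        for op, arg in program:
            if self.halted:
                break
            if self.user_mode and op in PRIVILEGED:
                self.trap(op, arg)  # the instruction itself is not executed
            else:
                self.execute(op, arg)

    def trap(self, op, arg):
        self.log.append(f"trap {op}")
        self.user_mode = False  # the interrupt switches to privileged mode
        self.emulate(op, arg)   # the service routine emulates the instruction
        self.user_mode = True   # then control returns to the user program

    def emulate(self, op, arg):
        if op == "IOT":
            self.output.append(arg)  # I/O done on the user's behalf
        elif op == "HLT":
            self.halted = True       # ends this user program only

    def execute(self, op, arg):
        if op == "ADD":
            self.acc += arg
        else:
            raise ValueError(f"unsupported instruction {op}")


machine = ToyMachine()
machine.run([("ADD", 2), ("IOT", "A"), ("ADD", 3), ("HLT", None), ("ADD", 100)])
```

In this sketch the user program never touches the `output` device directly; every I/O request passes through `trap`, which is where a real supervisor would apply its checks.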
The worry I have about all of this is that bad actors will inevitably go after the Crowdstrike kernel driver with Ghidra and other such tools and will figure out the instruction set of the p-Code interpreter, as well as finding ways to trick any security/validation wrappers put around p-Code submissions to validate them, and thusly could write their own p-Code routines to wreak havoc on systems that use Crowdstrike. Depending on what kinds of operations that the p-Code engine can perform, the consequences of someone putting together a user-mode program that loads a malicious p-Code program into the engine that causes irreparable damage would make the incident that occurred look tame in comparison. To me this says that Crowdstrike had better get cracking on A) fixing their release chain so faulty updates have much less chance (e.g., very closely approaching zero) of slipping through; B) seriously hardening the methodology by which updates are validated to make forging any kind of update extraordinarily difficult, and C) completely revamping the p-Code instruction set such that any "old" p-Code routines fed to it will be trapped, as well as substantially hardening the p-Code's execution validation methodologies (e.g., making sure that the p-Code isn't trying to do something that could lead to system instability or kernel panic). If they don't do all of these things quickly I suspect a lot of customers are going to flee to other platforms out of knee-jerk reaction, which is rather sad, and won't necessarily eliminate the risks, as just about every behavioral detection engine must run in kernel mode, making such solutions potentially vulnerable. Crowdstrike's methodology is overall quite sound, and their methods of detection and analysis of emergent threats is very effective. Their "front-end" is pretty amazing, and has discovered quite a number of emergent threats and pushed out emergency updates that prevented our machines from being compromised.
Perhaps engineering got so wrapped up in the threat identification and analysis aspect of Crowdstrike that the computer agent didn't get as much continuous attention that it should have received. Having a p-Code module of all zeroes cause a kernel panic just screams of problems in the p-Code interpreter. No matter what the situation is that allowed this serious problem to occur, it is yet another example of how a borked (I use this word frequently, nice to hear someone else use it!) update (either accidental or supply-chain induced as with Solar Winds) can have massive consequences. It just goes to show just how our world-wide computing infrastructure is perhaps a bit more tenuous than one might believe, and can suffer major difficulties as a result of something innocuous, or worse, maliciously crafted. The scary part is that there are lots of independent and state-sponsored actors out there that will spend lots of money and enormous amounts of distributed time and talent to come up with a way to cause such a situation to occur with who-knows-what piece of software (I'm not necessarily saying Crowdstrike...it could be anything) that could have even worse ramifications than this Crowdstrike incident. That day will inevitably come, and when it does, I sure hope I am retired from working in IT, as it will be a very, very unpleasant time for the world at large, and even worse for anyone who is working in systems administration. Thank you, Dave, for your great channel. Even this jaded old systems guy who has been around the block way too many times learns something and frequently gets a good chuckle from your subtly-injected humor. God Bless.
Dave - this was an insanely clear, concise, and thorough explanation, which is only possible in part to your depth of experience (and in part to your eloquence, wit and dry humor, which I relate to). Thank you!
Hi Dave, I've taught operating systems for a long time at university level, so I know exactly what you're talking about. Your explanation here is excellent, short, clear and to the point, not even a little stumble or hesitation. Congrats, it was a pleasure to watch the video. I'm impressed. As a comment, I can't understand why they don't seem to have a robust test environment where they can test these updates to the hilt, the corrupted file is _also_ part of the software.
I believe that the reason here are obviously corporate rules. Cutting costs for maximum profit. Risk of huge fu'ps is calculated. Like in the car industry. Haven't you watched "Fight Club"?
My suspicion (and of course I have no evidence) is that because the distributed file contained only null values, the issue may have been after the testing farm. The update may have passed testing just fine but the file became corrupted when being transferred into the update distribution system. This is no excuse though, there are plenty of ways to easily validate that the file transferred as designed before distribution. Never just trust it. I am looking forward to the details when they are released.
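One of the "easy ways to validate" mentioned above is to record a cryptographic digest of the file that passed testing and compare it again right before distribution. A minimal sketch, assuming SHA-256 and using made-up file contents; `safe_to_publish` is a hypothetical helper, not part of any real release pipeline:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def safe_to_publish(staged: bytes, tested_digest: str) -> bool:
    """True only if the staged bytes are identical to what passed testing."""
    return sha256_hex(staged) == tested_digest


tested_file = b"example rule data v1"     # stand-in for the tested update
tested_digest = sha256_hex(tested_file)   # recorded when testing finished

intact_copy = bytes(tested_file)          # transferred correctly
zeroed_copy = bytes(len(tested_file))     # same length, but all null bytes
```

Here `safe_to_publish(intact_copy, tested_digest)` holds and `safe_to_publish(zeroed_copy, tested_digest)` does not, so a file that turned into nulls somewhere after testing would be stopped at the last gate. Whether that is what actually happened is, as the comment says, only a suspicion.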
I'm guessing they never saw this zeros/NULL filled file being distributed as a point of failure so there were no tests. It may be there is extensive testing but it never picked up a file corruption before distribution. Suffice to say, there will be a LOT more eyeballs on it now. The driver should have handled it better as well rather than just crashing the Kernel.
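"Handled it better" could look something like the following: validate the file's structure before using it, and keep the last known good definitions if validation fails. This is a Python sketch of the idea only. The `DEMO` magic marker, the header layout, and the entry size are invented and are not the real channel file format.

```python
import struct

MAGIC = b"DEMO"                  # made-up marker, not the real file format
HEADER = struct.Struct("<4sI")   # magic bytes, then a little-endian entry count
ENTRY_SIZE = 8                   # hypothetical fixed entry size in bytes


def parse_definitions(blob: bytes) -> list[bytes]:
    """Return the entries in the file, or raise ValueError if it is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    magic, count = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic marker")
    body = blob[HEADER.size:]
    if len(body) != count * ENTRY_SIZE:
        raise ValueError("entry count does not match file size")
    return [body[i:i + ENTRY_SIZE] for i in range(0, len(body), ENTRY_SIZE)]


def load_or_keep(current: list[bytes], blob: bytes) -> list[bytes]:
    """Switch to the new definitions only if they parse; otherwise keep the old."""
    try:
        return parse_definitions(blob)
    except ValueError:
        return current


good_file = HEADER.pack(MAGIC, 2) + b"A" * 8 + b"B" * 8
old_definitions = [b"OLDENTRY"]
```

Calling `load_or_keep(old_definitions, bytes(64))` with an all-zero file returns `old_definitions` unchanged, because the magic check fails before any entry is read.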
@@TC2290-wh5cb Both points you make are true. A driver with error trapping is 'one more chance' to handle an invalid definition file. But the driver executes at Ring 0. If I understand what Dave said, processes operating there cannot access user memory?
You're obviously a skilled and experienced technical powerhouse, but the writing style (sarcasm, wit, technical aptitude combination) and delivery make this more than just a "system dump" of data the viewer has to try and digest. Instead, we're treated to a bit of entertainment as we debug. Thank you for the package deal.
As someone with a computer science major and worked on software design, your definitions of kernel and user modes and how they were different and how they work were great... Better than my professors i had in college...
We still had people talking sweet to our ears like this in the 1990s the last decade before good computers and internet, i miss it so much because it's simple but somehow so stimulating i'm excited now lol. We had kids programmes like art attack and sMart that i frequently watched and somehow had an impact, and they just talked to us in such simple and kind ways and not infantilizing like even kids were little adults with a brain capable of learning. Now nearing my middle 30s i'm relating more to how people talked in older shows, i've been watching bullseye the game show and i really like Jim Bowen great man but the way he talks is just like described. I've seen a few episodes of Tomorrow's World and i felt myself lapping every word up while falling into a relaxed lull, there's just nothing better about the way things used to be explained something special about it that appeals in the right way to the brain.
I've read a bunch of stuff about this issue over the last few days and this video is, by far, the best and most understandable explanation of exactly what happened.
You crack me up, Dave. 😂 The blue screen of death shirt, the offhand reference to using a MacBook (at 0:40) to investigate. Brilliant. Of course it wouldn't be the same without your skill and technical insight to follow up with. I always enjoy hearing your perspective and learning from your expertise. Keep up the good work.
Dave - this was brilliant. Simple, direct, easy to understand, and your outlining of the solution was amazing. Well done. Good job. Thanks. It just shows that our media (newspapers, TV, online commentators) do not really communicate, and their focus is more on sensationalist news, anything that sells their channel. You have done the most amazing job of succinctly explaining exactly what went wrong and how to fix it. Your explanation is so brilliant, you deserve an award of some kind for such excellent communication and understanding. You should be on TV; you are much better than the people who talk about tech on TV. You actually know what you are talking about, and know it very, very well. Thank you.
I loved how you balanced CrowdStrike's need to ship updates swiftly by circumventing WHQL against the importance of rigorous testing to ensure the delivered updates don't compromise the integrity and stability of the underlying operating system and kernel. “With great power comes great responsibility”: software that lives in ring 0, aka kernel mode, should handle changes carefully to prevent such incidents in future. Thank you so much for the great video. As always I'm a little late to the party; the YouTube algorithm recommended this video to me 2 months after its release. I wish I had watched it in July ’24.😅😊
I'm not even a programmer, but between you and Steve Gibson, I feel like an engineer. This is by far the most clear and in-depth explanation of what happened (based on the current knowledge) that I have heard. Thank you!
If memory serves me right, it was someone using a null pointer. And the fact that the error was not caught by anyone in the chain makes me doubt the quality of their programmers... not just the ones doing the grunt work but the ones who are supposed to conduct the code reviews. Closed source software is not trustworthy.
I've been a heavy PC user forever and started with MS-DOS and IBM-DOS in 1985, 8086 etc. I've never written a line of code beyond a complicated batch file. Yet I actually followed you thru your entire presentation. I'm not that smart. You are that good.
I don't know much about IT and programming but man.. your explanation was perfect for a novice like me. Thank you Dave. Also, as a deaf person I am thankful that you spoke in calm and clear sentences, because that helped the subtitles work nearly perfectly, so thank you again.
@@Ryan-lk4pu I had not heard of P-code before. Since he included it with assembler, I just figured that it was another low-level language that is able to work directly with the hardware.
Great explanation Dave. I'm retired as well and spent the majority of my career developing microprocessors at Motorola and AMD. I would bet at this point that CrowdStrike has at least 4 lawyers for every engineer looking into this with another group of spin doctors looking at how to disclose what happened. It's not a business to be in if you have a weak stomach.
This explanation makes sense, and seems knowledgeable. I've been a systems programmer for 46 years, and I've done kernel programming on various operating systems, including windows.
We are worried about getting p0wned so we install a kernel driver, mark it as critical, and then let a supplier with a history of screwups push updates to it whenever they like with no testing or controls. Good job. Good job.
I'm here with my mouth open, amazed that this is how Windows works. What is the point of the certification process if a driver can do whatever it wants after it is certified? How is there no system in place to disable non-MS drivers that are causing kernel mode errors even if they are boot-start? I'm not sure if this is a valid concern but I'm thinking about all the Chinese computer products that install drivers on my system and what they could be doing in the background even if certified.
@@WhoTnT did you watch the video? The answer to your question is in the video - Windows does everything you said, but CrowdStrike marked its driver as critical and restricted booting without it.
@@sas408 Maybe you didn't read my comment fully. "disable non-MS drivers that are causing kernel mode errors EVEN IF THEY ARE BOOT-START" The OS of the system should not be overridden by a third party driver. The fact that the system can even be stuck in a boot loop because of a third party driver is insane.
I grew up with computers; I basically learned how to read on MS-DOS back in the Windows 3.1 era. So when I found your channel I had to subscribe, because learning about Windows and everything makes me so happy. I know this has absolutely no relationship to your video, I just wanted to share and tell you "thank you" for making this channel and taking your time to explain stuff.
Sabotage of an update probably by an intel service. Most QA guys are saying no way was this green lit for release without someone changing something post QA. Even the worst of the worst QA guy would have caught that bug so either the QA rubber stamped it, or someone changed something post sign off.
Excellent explanation and overview of kernel mode and rings 0 and 1. I am retired also. I was a C/C++ UNIX/Linux and some Windows programmer. It is refreshing to hear someone who worked on the bleeding edge and knows his stuff explain this problem so completely well. Thanks Dave!
For people like me who have no IT expertise or any particular skill for programming, I learned a lot and also made me understand the basics of operating systems. RING 0 or 1 were completely unknown to me. In short, you gave a good presentation on the subject. THANKS.
It has been a decade since I did development at this level. I have no idea if I will ever return to the field. Why am I mesmerized into keep watching this video? Dave, I think you offered a clearer presentation than any of my university CS professors.
I must admit I couldn't understand most of the terms... but no doubt you, Sir, must definitely have a LOT of experience! Please keep educating people with your videos. Even if some of us are total beginners, it's inspiring
Accurate summary. The source of the zeroed file is either a crash during writeout in the build process (a full disk or stopped VM is the likely scenario) or CDN corruption. Both would have been caught by including a checksum/manifest pair to validate that the payloads were intact. The moment they decided to have the driver bypass certification and dynamically load content to speed up the process, they should have known they needed to supplement it with a checksum manifest, but they chose not to for unknown reasons. This is sadly a VERY common outcome with CDN-mapped content due to a variety of corruption vectors, and the trust modern software places in network integrity is badly misplaced. Always verify your content is intact, regardless of how small or large it is.
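For anyone who hasn't seen a checksum/manifest pair in practice, here is a rough sketch of the idea. The file name and payload are made up, and this says nothing about how CrowdStrike actually packages its content; it only shows that a zero-filled copy of a file fails a SHA-256 comparison against a manifest recorded at build time.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of a payload."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Build side: record a digest for every file that will be shipped."""
    return {name: sha256_hex(data) for name, data in files.items()}


def verify_payload(name: str, payload: bytes, manifest: dict[str, str]) -> bool:
    """Client side: accept a file only if it is listed and its digest matches."""
    expected = manifest.get(name)
    return expected is not None and sha256_hex(payload) == expected


# Hypothetical content file, for illustration only.
shipped = {"channel_update.bin": b"\x01\x02 detection rules \x03"}
manifest = build_manifest(shipped)

intact_copy = shipped["channel_update.bin"]
zeroed_copy = bytes(len(intact_copy))  # same length, all zero bytes

print(verify_payload("channel_update.bin", intact_copy, manifest))
print(verify_payload("channel_update.bin", zeroed_copy, manifest))
```

The manifest itself has to arrive intact too, which is why it is normally signed; a plain hash only protects against accidental corruption, not tampering.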
I'm just a hobby programmer and even I would have thought to do checksum testing. It's ridiculous, frankly speaking. In a chrome extension I wrote for my personal use which modified existing functions on a page, I only replaced the functions that I tested for the checksum, and the code warned me if the underlying page has updated the JS functions, so I could update my own extension to match the update (and this worked pretty flawlessly and saved me a lot of headaches.) The scary thing is that this kernel hog doesn't even seem to have a way to vet the driver files, the program blindly trusts those files to be the real deal.
If the reason is corruption, then it is mind-boggling that they would not at least have signed their updates with their own certificates prior to running them through QA. That would act as protection against corruption, but also as an additional layer of protection against tampering. Imagine if their distribution machines were compromised and an attacker replaced the update with a malicious rootkit. I'm tempted to say that with the cavalier approach that they took to bypass quality certification by Microsoft to execute code on ring 0, if they didn't sign their updates, then they are amateurs, should lose all business, and their company should disappear. It might happen anyway, if they get sued to oblivion.
It's also an insane risk if they're blindly accepting the file. It's lucky it just ran into a zero byte file and not something created and injected via a malicious third party.
They did it on purpose so they can pretend to be a "bad actor" and insert whatever they want into systems hosting their rootkit malware for whatever purpose they want including but not limited to taking servers and services offline, hard.
@@hesido Checksum manifests are an advanced concept. The average programmer doesn't understand why files would not be what they wrote in the first place. Your description of function rehooking sounds just like multiple other good projects. Same concept: search, compare, replace/skip. Sadly, a lot of shady crap goes on in driver land. There are a lot fewer examples of good ways to do things that low level, so the expertise isn't available.
Wow, nice, thanks. As a developer for IBM starting way back in PC DOS 1.0, I understood everything you presented and appreciate your time in explaining not only what happened but how to fix it.
Our engineer dodged this one by not signing up for CS and keeping Sophos. CS charges about $30k extra for content filtering, which Sophos includes. We have computers all over the world so this would have hit us hard not being able to get to all those remote users and sites.
Sophos crashed out our distributed servers every week, sometimes every night. Since we changed to Crowdstrike we've only had this one crash, so we're staying with CS for sure.
@@franmotero so you like leaving a huge backdoor open day and night. Interesting choice. I prefer the crashes to getting hacked by opening up the kernel to third party custom code.
A little over 10 years ago I was working on a project in the corporate offices of a major bank trying to upgrade from Windows XP to Windows 7 when the geniuses in charge of software deployment decided to force an uninstall of a password vault that tied into the Windows login. The problem was the uninstall process required a reboot and connection to the network for the new software to install. RIP the over 30% of workers that were working from home. I figured out pretty quickly how to fix the problem by using Safe Mode with the Command Prompt to edit the registry but then the geniuses in charge of IT security decided to disable Safe Mode on every system. The ineptitude of that place was astounding.
It is called shooting oneself in the foot with a semi automatic with a large magazine, never stopping pulling the trigger or moving the foot. I feel for you, I have been there.
No, this is basically IT feeling threatened that someone might try to install malicious code into the network and take the whole damn bank down. So they try to make it idiot proof, with no one allowed to enter the standard back door of Windows and do things they really are not authorized to do. You may know what you're doing, but Joe Shmoe next to you doesn't, and one wrong command later can take everyone out. But honestly no security software should ever run next to the kernel for any reason.
Came into work one morning, many years ago and a Windows Engineer had run a script that accidentally started deleting domain user accounts. Within seconds before he hit Ctrl+C around 500 users were deleted. It took a couple hours to restore from backup. The help desk was pounded. Shit Happens, all the time. You learn from the mistakes and you make sure it won't happen again. Nobody would have dreamed something like this could happen and so many systems being impacted so rapidly. That's the problem with complex interactive large systems. Take Joyent and AWS, they both had an engineer accidentally reboot an entire US East data center because they typo'd on the command console. Apparently the default behavior is to execute the command on all nodes on your data center. Both Joyent and AWS fixed their command line consoles from allowing that to ever happen again. One must specify a group of nodes or individual nodes to operate against or the command will refuse to run. One day, someone at Pixar deleted the wire frames for the Toy Story 2 characters. Pixar had to shut down, have all hands on deck for weeks, practically sleeping at the office. They found a copy on a work from home artist on maternity leave with a Silicon Graphics SGI Irix workstation at home. They drove to her home, wrapped the computer in pillows and carried it on a gurney to a station wagon and drove 15 mph to the Pixar office with their hazard lights on. Then recovered the data from the drives. They still had to spend tens of thousands of hours version checking every file to rebuild what was deleted from the NFS shares that were wide open with practically zero permissions. Local restaurants delivering food to the Pixar office started dropping off food for free because they had their best month financially ever in the history of their business. Scale that scenario up to 8+ Million computers broken by Crowdstrike / Microsoft since Thursday night. Techs everywhere are exhausted beyond measure.
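The "specify nodes or the command refuses to run" fix mentioned above is easy to picture. This is only an illustrative sketch with made-up node names and a made-up `plan_command` helper, not the actual Joyent or AWS tooling: an empty target list is treated as an error instead of as "everything".

```python
def plan_command(command: str, targets: list[str], known_nodes: set[str]) -> dict[str, str]:
    """Return a node -> command plan, refusing empty or unknown targets.

    Hypothetical helper: the key design choice is that "no targets given"
    never silently expands to "all nodes in the data center".
    """
    if not targets:
        raise ValueError("refusing to run: no target nodes specified")
    unknown = sorted(set(targets) - known_nodes)
    if unknown:
        raise ValueError(f"refusing to run: unknown nodes {unknown}")
    return {node: command for node in targets}


# Made-up inventory for illustration.
known_nodes = {"east-01", "east-02", "east-03"}

plan = plan_command("reboot", ["east-02"], known_nodes)

try:
    plan_command("reboot", [], known_nodes)
    refused = False
except ValueError:
    refused = True
```

Making the dangerous case require extra typing, rather than less, is the whole point of the change.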
if only my college professors taught in this easy going and comfortable way, I would have become a better CS graduate who still had a job! Thanks Dave, you are a gem!
You made that look simple. I've seen people scrambling to fix and talk as if they are piloting the star trek enterprise. This is straight to the point.
It's simple to fix a single machine if it's sitting on your desk. It's a bit different when you have thousands of machines on racks, and have to track down and physically interact with each one.
It is not simple to fix if the disk is encrypted with BitLocker and the CrowdStrike files are protected by group policy, as is the case in most corporations. Even if you manage to get the BitLocker key to boot into safe mode, you still can't delete the files as local admin.
I am a mechanical engineer and I Know fuck all about computers on this level but damn that's the best explanation I have heard for a complex system made simple. You sir are an amazing explainer.
Aren't you the guy that created the task manager?! It's amazing that you chose to use your time to create quality informative content. I'm subscribing immediately!
@@ikategame actually, he is. Follow the link to his book, and read the preview. He specifically says "if you've ever used Task manager, you're using some of my code"
Bought your book. I thought to myself, "This guy looks a lot like me and talks a lot like me." Then you shared your book at the end of the video and I read the entire sample. Thank you for your hard work and dedication to creating such an awesome read.
Ironically, I also worked at a company hit by NotPetya and other attacks that actually used drivers to destroy the boot loader and attempt to encrypt the data drives (while technically masking searching for financial data). Crowdstrike in their early days swooped in and I spent days on the ground with their Senior engineers taking apart the malware and hacking together methods to recover it. Now almost a decade later I work at Microsoft and Crowdstrike does a driver based corruption of the boot process, and I spent Friday and Saturday identifying, tracing, and fixing that aftermath.
I've just been getting into coding in my free time for fun and have made a couple little programs and I just love how much I learn from your videos. So funny, entertaining, informative, and inspiring. Thank you, Dave.
I use computers every day for programming, but at a higher level. I work with end users explaining our complex software. Dave's explanation style, depth, and content are A1. Outstanding video saved for reference.
I made BSoD T-shirts back in the day (mid 90's) and wore one to COMDEX LV one year. The front had a small pocket logo "BSoD The OS of Choice." and the back was a graphics driver BSoD in the classic hex dump version. It went over not so well visiting the Microsoft booth...but it seemed everyone else liked it. I still have a handful of them left over and periodically wear them when working on my cars.
Booting into safe mode works great if you don't have BitLocker full disk encryption activated like most enterprises. If you didn't print out a hardcopy of the key to enter when you try to boot into safe mode and your AD servers were also impacted, you better hope you have accessible backups of that data or a lot of information is going to be lost.
Someone found a way to get into Safe Mode without the BitLocker key; many people (including myself) were happy someone found this. I had a few machines that had an old BitLocker key, and the new one hadn't been updated in AD or MBAM. From what I read, the EFI partition with BCD and Boot Manager isn't encrypted and you still need to log in with your account in safe mode.
@@joester4life bootloader can never be encrypted if you want it to be bootable by non proprietary hardware. But this still shouldn't allow you to decrypt the drives using the TPM if you manage to boot into a new boot entry crafted for this. This is ensured by additional checks in the bootloader itself.
@@山田ちゃん actually, it's not supposed to boot into safe mode without triggering BitLocker, at least that's how it works on my machine. I would love to know how they did it
Dave’s explanation of the CrowdStrike IT outage was highly informative. By focusing on the role of the kernel mode driver, he sheds light on the depth and complexity of the issue. This serves as a crucial reminder of the importance of every component in our security infrastructure and the significant consequences when things fail.
Thank you for lifting the lid on this issue. I consider myself technical and I had firsthand experience with this issue impacting my work system; the remedy was applying the fix you described. My system is a corporate Windows system with the usual corporate protections in place, so when it occurred I was prevented from peeking under the covers by the limited privileges granted to my user account. Seeing your explanation, I now realise that, as I only have a limited understanding of kernel mode and device drivers, it is unlikely I would have figured this out the way you have. I find your explanation has relieved the stress I feel when not knowing the root cause of an issue. Once again thank you for making this information both detailed and accessible to the average person. Good work.
Incredibly unlikely in that regard. Any backdoor the government can use is a backdoor malicious attackers can use. In that case, Crowdstrike would have been screwed long before this.
"It crashes, because it has to" - I feel that the Matrix would have been more complete had they worked this line into the story. Preferably spoken by Morpheus or maybe even Cypher in a serious, deliberate, and credible manner. There's so much to unpack with this line...
They kinda maxed out the playtime getting Reeves to say "There is no spoon" , otherwise Morpheus would have to monologue about the cat glitching out, and then he would've said it.
"Agile, ambitious and aggressive" the sarcasm with which this phase was uttered, wonderful.
Move fast and break everyone's things.
Their product was so disruptive that our paradigm was shifted out of pocket.
7:30 I could listen to this on repeat lol
@@Alkatross you deserve a Level 1 comment with many likes, sir.
@@EujenSandu this is just a rewrite of "The road to hell is paved with...". Followed by the clank of some fallen piece of audio equipment.
I am a network systems engineer that had to deal with this for 14 hours that day. This was one of the most informative videos I have ever seen. You helped simplify Windows OS in 15 minutes in a way that hours of reading hasn't. Something about real world scenarios to tag the concept with in my memory really helps. Thanks!
That title is the dumbest thing I've ever heard
Self-doxxing
Pain is the greatest motivator to learn.
@@JM-bl3ih what exactly is dumb about a job title that is literally just “the engineer that administers the systems in our company network”?
@@LordSwagtron Super dumb confirms that he watched this stupid shit video
The company I work at got bought by a bigger one. They required us to install Crowdstrike on all servers. We found a memory leak, that Crowdstrike still hasn't fixed after 6 months so I have refused to install it until then. I was on vacation when I saw all URGENT emails from other divisions.
Thank you Crowdstrike for not fixing your memory leaks, it saved my vacation. =P
Give this man a raise! ;)
I feel for you. No one knowingly puts this rubbish on their machines. They have a lot of 'help' in getting an installed base.
@@elta6241 I work for a school district in L.A. We purchased it for our computers. I'm guessing the company has a strong influence with government institutions.
hahahahahaha amazing xD well i guess you can say hey boss i saved our company do i get a raise xD?
I've been passing on purchasing crowdstrike at my org every year since 2016 as they left a sour taste in my mouth for claiming that Russia hacked the DNC servers and then being unable to provide proof. Haven't trusted them since then.
When I was in high school I had a teacher that had a way of explaining things to you that temporarily elevated you to a fraction of his level of understanding. Today I got to experience that again. Thank you Dave! 🤯
Absolutely the same feeling!
Amen. Today's computational underpinnings are somewhat opaque to me, a 74-year-old whose first computing challenge was to code a very simple program into machine language (not assembly!) and put it onto punched paper tape to run on an old machine (which predated "big iron") in the University's basement.
While this is technically what crashed machines it isn't the worst part.
CS Falcon has a way to control the staging of updates across your environment. businesses who don't want to go out of business have a N-1 or greater staging policy and only test systems get the latest updates immediately. My work for example has a test group at N staging, a small group of noncritical systems at N-1, and the rest of our computers at N-2.
This broken update IGNORED our staging policies and went to ALL machines at the same time. CS informed us after our business was brought down that this is by design and some updates bypass policies.
So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers' staging policies which would have prevented this type of widespread damage. Unbelievable.
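For readers unfamiliar with N / N-1 / N-2 staging, here is a toy sketch of what such a policy resolves to. The version strings and group names are invented, and this is not how Falcon implements it internally; it just shows each group lagging the newest release by a fixed number of versions, which is exactly the protection the content update bypassed.

```python
def version_for_lag(releases: list[str], lag: int) -> str:
    """Pick the release that is `lag` versions behind the newest one.

    `releases` is ordered oldest to newest. lag=0 means N, lag=1 means N-1.
    If there are not enough releases yet, fall back to the oldest available.
    """
    if not releases:
        raise ValueError("no releases available")
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    index = max(len(releases) - 1 - lag, 0)
    return releases[index]


# Hypothetical sensor versions and host groups, for illustration only.
releases = ["7.14", "7.15", "7.16"]
policy = {"test": 0, "noncritical": 1, "production": 2}

assignments = {group: version_for_lag(releases, lag) for group, lag in policy.items()}
```

With a policy like this, a bad release only reaches the test group first, and production sees it two versions later, if at all.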
well. sue them into the ground. nobody in this world needs commercial rootkits.
😯
Wow, that... is just bad? stupid? reckless? Hopefully they change at least that updating behavior, after this ... nightmare
yea this - this was exactly why I held off on many key systems for CS deployment - I was not happy they would just override all staging - nothing should EVER be allowed and I was going to have a zoom call with their team about this.
That's crazy. I can't believe they took that liability.
As a former CrowdStrike employee this is the best explanation I have heard and is 100% accurate.
Was your last day Friday? LOL (not blaming you but I'd probably quit after that HAHA)
Thanks, I really appreciate that vote of confidence! I was pretty worried about getting something wrong!
@@zippythechicken BC they don’t want to hire cybersecurity experts in house. It’s cheaper for them to use CS.
@@zippythechicken The CS product is one of the best available. It can monitor and alert for far more 'risky' behavior than most other in the industry. There is a reason why their install base is so huge.
@@zippythechicken Because ransomware is a big risk and companies want to have insurance and the insurance companies require insured companies to run software like CrowdStrike because it can and has stopped ransomware infections.
This is by far the best explanation of the problem I have seen. Way better than any media outlet. Really great job Dave. Thanks for this
I guess if we look at it from the media's perspective, they need to explain it in a way that everyone who is not IT-savvy can understand. But for those of us who know a bit about IT, he does a very good job explaining it.
@@paulhansendk Yup, even most 'technical' news media wouldn't-and shouldn't-go anywhere near this level of detail.
to be fair, he's doing it in more than 10 minutes which i'm sure is well over the limit of any "media outlet" - or even a dedicated tech program (for the masses - such as BBC Click).
still, one cannot deny the clarity of his explanation which very carefully avoids hinting at responsibility of the error and just states the basic facts of what happened.
i don't work in IT and only know cursory programming in HTML and some basic C++ but Dave's explanation made sense to me without having to understand the technical details !
very commendable and earns a Subscribe from me even - just because that's all he is in it for (!) - he deserves as many likes as i can give him for such a beneficial message to society !!
Because the internet is full of technology nerd posers that don’t actually know how a computer works
@@binsarm9026 Individual television news segments are limited to 2.5-to-3 minutes in duration.
What a great explanation!
No bull crap. No conspiracy theories. No badmouthing. Just plain facts. Even me… who rarely uses a computer anymore understands, and follows Dave’s explanation and walks away a little more knowledgeable. Thanks Dave😊
Yes, very insightful. First time on the channel. Will come back.
Never mind the fact that CS could be executing malicious code on your machine.
I just learned more about system functions in 5 minutes then I would’ve imagined. What a clear breakdown on things.
Back in those days , I was 19 years old when I decided to buy a Windows 95 book to try to find out how Operating Systems worked. Could not understand a thing but I do vaguely remember something about Ring 0 and Ring 1. Excellent video deserves a multiple like feature. 😄
There's a city under the sheets.
@@rembautimes8808 I learned to code at 14 on an Atari 400 early 80's, and teaching my computer teacher. Always thought I'd be CEO of IBM but then I hit 17 and wanted to be a rock star. 😆
I remember some of this stuff and found I could follow the logic trail, more or less. He pulls it all together so well I feel in 10 minutes I have a way better picture than I have over the last 40 years.
On Friday everybody also got a crash course in Change Management (or how NOT to do it) too. There are normally multiple barriers in the way of something catastrophic like this happening, and they all got skipped.
than*
Love that while stuck at the airport Dave opened his MacBook. A fair amount of dry humor in this vid.
I caught that too. It would make sense that a retired Windows developer would use a MacBook.
They really are the best laptop, and I can do Final Cut on it. On the desktop, I've also got a PC with Windows (and Ubuntu via WSL2). But for anything video related, it's Mac, so...
Also, “Solitaire damages your Git enlistment” - a joke so subtle, it could be placed in an episode of The Office.
@@thomasbrotherton4556 Microsoft employees also use MacBooks at Microsoft.
They're a software company. They don't care what you use if you do your work. There are teams inside Microsoft that use slack instead of teams.
Windows pcs worked fine 😂
for some reason dave's explanation was waaay easier to understand than every other video about this
To be fair, it's much easier to explain something when you understand it, which hasn't been the case in most of the media.
Because he knows how to explain things to management that has no technical skills ...he just did that for us
Isn't Bitlocker involved in this mess?
That's what happens when people know what they're talking about, ime
"If you cannot explain it to an 8 year old child, you do not know it well enough yourself." Some Scientist said (possibly Einstein) but my brain is a vast relational database of broken links so don't trust it!
What I've learned so far is that every OS has a big boss and that big boss ensures everyone follows the rules and as soon as someone gets out of line the big boss shuts the party down before the looting begins. In all seriousness this is a great video. Subbed!
With how complex operating systems are and the levels upon levels of logic gates, yeah I’d say it’s good to have a big boss in your OS
I love that you get right to the point, you don't waste time on useless background, and go full speed ahead with just the information. Thank you!
Also absent in the background: annoying music.
All of us spectrum kids can get our clear and concise information here! :)
I completely agree and totally appreciate how Dave does get straight to the point. I'm sure many other content creators start with useless background simply to "pad out" the video.
Absolutely agree. It was a pain to see so many "experts" around the globe talking so much while not explaining anything at all, except that there is nothing that could be done, while as an interested professional you knew that a business could build better systems and architectures (like a few, that were not impacted did) and these people were just talking heads not knowing what was actually going on.
I work in IT. Crowdstrike sales has been calling me trying to get us to switch to them. I don't think they'll be calling us for a bit.
Just pick up the phone and say. "No, I like our systems to be stable and without a backdoor." hang up.
Crowdstrike convinced the company I work for as a Network Engineer to swap over to them, and we did around a month and a half ago... The person who made that decision didn't have to wake up super early in the morning on Friday while panicking.
@@videogamecoverss And how was this his fault exactly?
Tbh there is no guarantee that the other companies have their updater utility made in a safer way, at least cs will pay more attention to that part now. But overall that and their wokeness is something that gives off a bad vibe about the company
@@dustojnikhummer for believing in the snake oil.
This ladies and gentleman is what an expert sounds like.
👍
Yep! It's a lonely position. Real experts are thin on the ground, wankers abound
Well, to recognize a domain expert as such you also need a community of engineers who know at least enough to tell one apart from a quack. Those are becoming thin on the ground as well. 😅
the reason they are scarce is because "unfettered truth" is bothersome to parties with vested interests, i'm sure many media outlets (big enough to be legal targets) avoid explaining it totally for fear of upsetting the "wrong people".
Hear, hear!
Yeah, Dave truly is an expert. I understand everything he talks about here, but I couldn't explain it as well as him.
Dear God! I’ve been out of the IT world for 15 years now, and I still understood his explanations. I’m VERY IMPRESSED by Dave’s clear and concise presentation and astounded by the fact that I remembered enough of this “stuff” to finish some of his sentences! Until today, I was convinced that a benevolent universe had purged all that out of my head to make room for important stuff (like cocktail recipes).
Yes, exactly how I reacted. Like you, I am a retired IT consultant, retired for nearly 20 years. Dave's presentation could not be clearer. It made me wonder whether the Crowdstrike P-code interpreter creates another vector for introducing viruses, malware and rootkits.
It must be sort of like Linux eBPF...
@imairt Completely agree with you, and for pretty much the same reasons. I'd bet YouTube is awash with channel hosts mumbling and umm-ing their way through this issue right now! (I'd have a look but I don't think I could bear it!)
I agree - he just cuts through this stuff like a hot knife through butter!
Same here, but I've only been retired 4 years.
I appreciate your straight forward, no nonsense delivery that is organized in a logical and understandable format.
I hate how other people on YouTube and social media are blaming Microsoft.
Agreed! Well explained, and included depth I haven't heard from anyone else so far.
Microsoft shares some blame, because Windows is easily broken, gives very little useful diagnostic info, etc.
That said, CrowdStrike: I wonder how many people learned something about how to behave one's self whilst in kernel mode. LOL
As a predominantly IBM Mainframe Sysprog (retired) I am heartened that I actually understood everything explained in this hugely informative explanation. Thank you!
Yes same here... One of the differences is that mainframes use storage protect keys in addition to supervisor/user mode in the PSW. And yes, I've had to fix my share of hard waits due to program checks in the supervisor code :(
As a retired analyst / programmer on IBM mainframes and various minis I've spent plenty of time investigating core dumps, particularly on DOS/VSE. It's one thing to locate the failed instruction (invariably a decimal exception where a packed decimal field has an invalid value), but tracking what happened up until that point is the fun part. Performing stack chaining and linkage chaining through called subroutines gets very complex and a bit tedious (especially when called in to work at 2.00am while attending a party on Friday night).
Just love it when the deeper technicalities are explained for the most of us to at least get a sense of the problem. No magic, just machinery
being in IT for 30 years, your video is precise, easy to follow and on point. Well done.
This was incredibly precise and VERY easy to understand. Fortunately my employer doesn't use Crowdstrike so I got to sit back and watch some of my friends scramble. Thank you for putting this out.
This is THE BEST explanation of the Crowdstrike-related outage!!
In fact, so many other videos are not even explanations but mere rehashing of 'what' went wrong, instead of 'how' & 'why' it went wrong...
And this is the type of video (or own investigation) I hope government agencies do for the incident. The actual root problem needs to be addressed, not slaps on the wrists or finger pointing. Crowdstrike needs to be punished, but it needs to be understood that another bad actor, or Microsoft themselves, can do this again, and beyond that, that this isn't just a Windows issue. Apple and Linux don't allow deep kernel level access like this, but theoretically they could still cause themselves a similar issue. We need better regulation over something so ingrained in our lives than the promise that it won't happen again.
@@vullord666 Hmm...
I'd be careful in assigning blame...Accountability is fine, but culpability is another ball game altogether...
Also, reg. more/better regulation, well, more regulation always has a trade-off of less freedom & less privileges...So, one should be careful what one wishes for...
I understand your point; am just saying let's not be reactionary or have a knee-jerk reaction to this incident/issue...
The most funny thing is that CEO of Crowdstrike was a CTO at McAfee... during their worldwide faceplant.
And that McAfee were suing Microsoft back in 2006 to ensure their software could talk to the kernel...
This CEO looked like Don Knotts on speed during his TV appearance. 🤣🤣🤣🤣🤣🤣🤣🤣
He is the absolute Faceplant Champion of Computers. If he was an athlete, we would retire his number.
Just a classic case of a tech company taken over by non-tech leadership who doesn't really understand the intricacies of software development, cutting costs in the wrong places.
Bunch of scammers across the av industry. They make money by scaremongering and they introduce at least as much risk as they claim to prevent.
Guy's got form alright.
Rarely do you encounter a technical subject presented in a manner that effortlessly transcends a wide range of listeners' understanding or experience levels. This video conveys core concepts in an easy-to-understand and memorable way. Dave achieves this without forced analogies or a condescending tone. I learned something today that I will retain. Thanks, Dave, for the great content! 💯
Wow, thanks for the kind words, I appreciate the vote of confidence!
3 days ago no one outside of IT had ever heard of Crowdstrike. Now the entire world knows the name. Reputation destroyed in an instant.
"3 days ago no one outside of IT had ever heard of Crowdstrike."
Yeah.. idk about that. Everyone at my job knows what Crowdstrike is, and they are not in IT. There are a lot of people who know about it (people who use a computer and work in an office), but not the average joe who's working on a machine or something. And not every workplace used Crowdstrike.
@@joester4life Yeah, pretty much everyone who works in or with IT security knows Crowdstrike. As for the users though, the original assumption might very well be true.
I'm a SWE who uses a Mac. I knew about Crowdstrike Falcon from looking at my Activity Monitor to see what was causing my fan to sound like a jet engine. Falcon was consistently at 800% CPU usage. Complained to the security people to no avail. Fortunately for me this borked update did not seem to affect the Mac version. Hopefully my company ditches this junky software.
When the dust settles everyone will remember the name but not how or why they know it.
Short term reputation hit, long term success.
I think you overestimate how much the user side of this matters. Even juniors have little input on what their companies use, especially in large corporations, so what matters more is what the big wigs think of it
Hi Dave, I’m also a retired Windows developer. It was fun listening to you talk about all those old system components that used to be part of our daily life experience. I was impressed that I remembered enough to understand what you were talking about! Thanks a lot for your explanation. I confess that I feel kind of angry at the CrowdStrike developers for taking such liberties with the kernel code. Seems kind of arrogant. No doubt someone thought they were being super clever by defining their code as something required to run when the kernel starts up. Imagine if the CrowdStrike developer had just arranged a meeting with a Windows kernel expert at Microsoft to discuss what they were planning to do.
A whole lot of suffering could have been avoided.
To be fair, I'm sure that if they didn't take that liberty, malware would all just be adapted to shut it down before it executed its malicious instructions. It may potentially have been necessary to continue to define itself as security software.
Sounds like MS needs a new 'must run' list/level so that only their own stuff is on there permanently and then the next level is sw like crowdstrike so its 'necessary' but can be turned off if it breaks the kernel.
@@happyzahn8031 they shouldn’t let third-party code run as part of the kernel at all.
@@happyzahn8031 They do have that, it's called Safe Mode, and that's why the fix requires going into Safe Mode.
@@89qwyg9yqa34t If you were really worried about malware you would be advocating for diversity among computing platforms in all businesses. Farmers know not to plant every acre with the same crops because it's not secure! Today, we really only have 3 platforms: Unix/Linux, Mac, and Windows. Three is not enough, but it would be a start if business settled on roughly 33% of their systems running each, and then began looking for a couple more.
Dave's wardrobe coordinator deserves a raise. Well played. 👏
I browsed YouTube trying to find a good explanation of the Crowdstrike outage. I found this one to be the best... Thanking the author for such a great explanation. Excellent job
Finally an "Air Crash Investigation" style explanation of what actually happened. I now understand WHAT, WHY and HOW. Thank you, Dave!
Hi Dave, thanks for the explanation and bringing back some good old memories. I joined the Windows NT dev team in '93 and was at MSFT until 2011 so I'm sure we crossed paths. For all the talk about AI etc., kernel mode is still kernel mode, pointers are still pointers, and all drivers - I've written my share - should be developed with extreme care by people who understand that every line of code could cause a blue screen and heartache. "Move fast and break things" won't cut it.
This is not a case of move fast and break things.
@@ruk2023--skirting Microsoft driver certification procedures and low resilience code is very much a case of "move fast". The "break things" is simply a natural result.
@@Keiranful no. Move fast means deploy quickly. This has been a problem for a looooong time. It’s a result of poor quality control
@@ruk2023-- Kernel level QC takes time. That contradicts quick deployment. Hence the work around.
The lack of stress testing, and thus resilience, is also a symptom of trying to just get things out the door as quickly as possible. QC takes time.
@@Keiranful But this was caused by a problem that was pre-existent for a long time. Did you watch the video? The definition file is just a catalyst.
Could listen to Dave explain IT all day. A natural teacher!
well put. a natural teacher. thoughtful
I concur
Extremely well and clearly described, Dave. As a former kernel developer (at Tandem Computers), we didn't allow such back doors, but then we were being deployed as a 24x7 hardware/software fault-tolerant server system and did not have millions using our systems, developing third-party drivers (or attacking them).
Yes. Multiple failures at Crowdstrike. Someone wrote that driver code without the requisite error checking, no one caught it in reviews/inspections (if they do that, and if they don't...don't even want to go there), no one in QA thought to test for it or ran the test, someone in the release chain submitted that file (or failed to substitute the correct one if the default is an all zeroes file), etc. I don't expect today's developers/QA to think like we did (what could be corrupted if the processor/driver/adapter/etc fails between this instruction and the next and how can I prevent that corruption). Too time consuming and non-agile. But...apparently no one considers the consequences of not doing so and the damage to customers and the company it causes, or the bean-counters dismiss it as too unlikely and worth the risk.
Black swan event perspective of the developers.
Want to bet someone used AI to write the code for them and then trusted it so completely they didn't check?😊😏
15 years of industry experience in IT has my spidey sense tingling DIRECTLY toward the bean counters and either using poorly vetted outsourced devs or insufficient funding for enough QA staff or both.... ironic that it took down so many airlines as Boeing's bean counters did the same with the 737-MAX.
Oh, SysGen. Good old day!
As a .NET developer whose work does not involve much around system functions, but higher level abstractions. I appreciate this breakdown of what's happening at the lower levels. Very clear and concise.
stop wasting your time with microsoft, it's f ake.
@@AnalogDude_ Idk, I do have windows and I also do program with .NET and I ALSO can run those programs. Not much fake I can see...
@@christophsiebert1213 I did too, some 20 years ago. Bill paid US$50,000, changed the code to his liking and called it DOS.
Pure waste of time, Ubuntu is better and more secure.
@@christophsiebert1213 I just clicked the x on a window and it closed. Seems to exist, but maybe only I can see it.
@@christophsiebert1213 Why not C/C++?
Then your software runs pretty much everywhere. You're wasting your time learning things that one company, rather than a committee, decides or changes.
Dave, as a layperson I really appreciated your video. While I did not understand all the language, I found your explanation thorough and informative. I now have a better understanding of why the Crowdstrike crash was so disruptive. Thank you.
it was so disruptive because it's their objective... get it ? crowd strike
@@joelpichette that was my first thought also..
actually I remembered seeing that name for the first time just before shutting down my work laptop,
and wondering about such a name.
I haven't switched that laptop on since, and probably should remember this video here with Dave's instructions as to what to delete in safe mode, if it won't start next time
"They have a bug they don't protect against" is the key line. CrowdStrike added kernel drivers, but did not make them robust enough. Kernel code, especially when running such complex functionality, should be able to take more abuse from user code without causing a BugCheck. Very disappointing. Great explanation!
I wrote real-time kernel code communicating with a satellite base station via various PCI interfaces. Every friday I'd boot my system in a torture test where I'd intentionally try to crash my interface with malformed requests, out of order requests, logic errors and whatnot. I didn't want the customer's configuration scripts or user mode applications to be able to trigger a kernel panic in any way.
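The Friday torture test described above can be sketched in a few lines. This is only an illustration, not the commenter's actual harness: `handle_request`, its two-byte header layout, and `MalformedRequest` are all made-up names standing in for a real driver interface. The point is that random garbage may be rejected, but it must never produce any failure other than a clean rejection.

```python
import random


class MalformedRequest(Exception):
    """Raised for any input the handler refuses to process."""


def handle_request(data: bytes) -> int:
    """Hypothetical handler: 1-byte opcode, 1-byte length, then the payload."""
    if len(data) < 2:
        raise MalformedRequest("header too short")
    opcode, length = data[0], data[1]
    if opcode not in (0x01, 0x02):
        raise MalformedRequest("unknown opcode")
    payload = data[2:]
    if len(payload) != length:
        raise MalformedRequest("length field does not match payload")
    return sum(payload) & 0xFF  # stand-in for the real work


def torture_test(rounds: int, seed: int = 0) -> int:
    """Feed random blobs to the handler and count clean rejections.

    Any exception other than MalformedRequest propagates and fails the run,
    which is the user-mode analogue of a kernel panic.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(rounds):
        size = rng.randrange(0, 16)
        blob = bytes(rng.randrange(256) for _ in range(size))
        try:
            handle_request(blob)
        except MalformedRequest:
            rejected += 1
    return rejected
```

Running `torture_test(1000)` on every push, rather than weekly, is cheap at this scale.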
@@brunovandooren3762 IMO such integration tests should be run with every push to repo, not just weekly.
@@brunovandooren3762 What do you mean I can't just test the happy path?! 🤔
They didn’t field test a small update that they absolutely fucked up. Someone is fired.
@@Don_Giovanni Feel-good software development
Thanks Dave, found myself to be on the spectrum just a few years ago, at 53. Changed everything! Thanks for your extremely lucid, helpful and complete lessons on this channel! 🙏🏻
I've been a professional software developer for 25 years, and I started out on my C64 when I was 8 years old, moved on to low level DOS 3d graphics programming and later into desktop business software and the web. Your explanations make perfect sense and I'm extremely impressed with the depth of your knowledge. I'll tip my virtual hat to you, sir.
Are you me? Basically same origin story. And I agree with your assessment.
Dave said a kernel panic on macOS is “pink” … uh, no, they’re dark grey, usually
I started on a TRS80 when I was 16......
@@DouglasLancy I once saw a guy in the subway who looked exactly like me, but that was years ago. Based on your profile picture our current resemblance isn't very strong, but age changes people
Crowdstrike - good name for a company which hit masses throughout the world with its product.
And Has Been for a decade +
Yeah when I heard the name I thought it was some military tech for launching missiles or something
and I don't believe in coincidence anymore..
Only better name would be “Plague” 😂
DigitalWorldNuke
People like Dave make YouTube actually useful, love it, please keep these videos coming
Hi Dave, as an IT student, your explanation of how the operating system works is soo good! I wish my professors were as clear as your videos. Thanks for explaining this to us. Your shirt is awesome!
My professor would fail us if we didn’t write a crazy amount of code to test our data structures. NOW I can appreciate why he made us do it. ❤
Would it make sense to have a little spider like checker-tester buddy? Runs along the lines of code and executes all commands and whatnot in a VM so you can see it's all working properly and fix what isn't? That's all I can visualise we need nowadays.
Our IT guy separated our process and business LAN for security purposes and forgot to assign ethernet addresses to all the process modules. He took the whole plant down
@@slipstreamvids7422 sounds like someone who would be worried about losing their job
@@Roadent1241 You are exactly right. If crowdstrike simply used virtual machines running in a sandbox (or multiple sandboxes) to test their update, none of this would've happened. I have a feeling if we ever know exactly WHO was responsible for this work I could bet their age within a +/- 5 year window. (sorry people over 50, but you know damn well what I mean)
@@lylechipperson3407 Weird that this is your reaction to a 50+ yo man (who once was a Windows developer) who explains the bug AND tells how to remove it to let Windows start up again as it should.
From 40 years in the SW business as programmer, tech doc, project facilitator, I say: you make this difficult stuff understandable and digestible for everybody. Nice job!
Thanks!
I am really amazed at your ability to explain such a deep subject in such a clear way. This is the first time I've encountered this channel and I am now subscribed to it. Really good work!!
Same as me
With 47 years in Systems Administration and Systems Programming, in Windows, Unix/Linux, and embedded systems, I've seen a lot of things go awry over this period of time, but this Crowdstrike Falcon situation was one of the most scary from the standpoint of having such a huge impact on IT services across the planet.
Your description of the situation was perfect...technically spot-on, but also explained in such a way that it was understandable by just about anyone with any concept of the need to control access to device drivers, memory managers, and resource schedulers through kernel services. Very skillfully crafted, as well as calmly stated and with a subtle injection of humor that made it very engaging to listen to through the end...even for a crusty old IT guy like myself.
It all goes back to the early days of computers that had "Privileged Mode" and "User Mode" to enable multi-tasking (so that multiple user-mode programs couldn't step on each other or the operating system) and timesharing (creating virtual environments for multiple users that isolate them from the hardware).
Even my old PDP 8/e system has a "Timeshare and Memory Expansion" board in it that adds "User Mode" that traps the execution of certain instructions (HALT, for example, as well as JMP, JSR (Jump to Subroutine), IOT (I/O Transfer) instructions, and of course, the instructions that change between user-mode and privileged mode). When such instructions are encountered when in User Mode, the instruction is not executed, and an interrupt is triggered, which turns on Privileged Mode, and vectors to an interrupt service routine that emulates the execution of the instruction(s) that triggered the interrupt, then sets the mode back to User Mode, and returns to the user program. It has a consequence of slowing down the system a bit, as the CPU has to emulate the instruction(s) that triggered the trap (for example, an instruction that checks the status of a Serial I/O board to see if a character is ready to be transferred), but it was worth it because of the ability to isolate user programs from the hardware. That's early 1970's computing technology.
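For anyone who finds the trap-and-emulate idea easier to see in code, here is a toy model of it. It is a sketch under loud assumptions: the `Machine` class, the `PRIVILEGED` set, and the instruction labels are illustrative only and do not model the real PDP-8 hardware or TSS/8.

```python
# Illustrative set of instructions that trap in user mode (not a real ISA table).
PRIVILEGED = {"HLT", "IOT"}


class Machine:
    """Toy CPU with a user/privileged mode flag and an execution log."""

    def __init__(self) -> None:
        self.user_mode = True
        self.log: list[str] = []

    def execute(self, instr: str) -> None:
        """Run an instruction directly, or trap if it is privileged in user mode."""
        if self.user_mode and instr in PRIVILEGED:
            self.trap(instr)
        else:
            self.log.append(f"run {instr}")

    def trap(self, instr: str) -> None:
        """Stand-in for the interrupt service routine."""
        self.user_mode = False               # interrupt switches to privileged mode
        self.log.append(f"emulate {instr}")  # supervisor emulates the instruction
        self.user_mode = True                # return to the user program
```

Executing `"TAD"` and then `"IOT"` on a fresh `Machine` leaves `["run TAD", "emulate IOT"]` in the log: the ordinary instruction runs directly, while the privileged one is intercepted and emulated, which is where the slowdown mentioned above comes from.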
Even then, there were folks that figured out how to trick the system to be able to subvert the protections and crash the primitive multi-user timeshared systems that ran on the PDP 8/e (TSS/8). Such features existed in various forms in computers long before the PDP 8/e came out, dating back to the 1950's.
Just change the names from "Privileged Mode" to Ring 0, and "User Mode" to Ring 3, and the concepts are much the same. It's a bit more complicated today, with all the stuff like multiple CPUs, look-ahead, caching, user and kernel memory spaces, and speculative execution, but distilled down to the base functionality, very similar.
Crowdstrike is a widely-deployed solution, as instantly became clear with outages in a huge number of systems that directly affected the public. The place I work for uses it, and we had a number of servers BSOD as a result of the update. The fix was simple as you described, except that a few had Bitlocker set up, which added an additional layer of complexity, but fortunately, the keys were all printed, and locked up in the ubiquitous very beefy and heavily fire-rated IT Department safe. It caused some downtime of a number of applications, and certainly hassles for IT to get things back up and running as quickly as possible, but it was caught very quickly and the agents shut down on other machines before it could spread across all of the servers and end-user systems.
The worry I have about all of this is that bad actors will inevitably go after the Crowdstrike kernel driver with Ghidra and other such tools and will figure out the instruction set of the p-Code interpreter, as well as finding ways to trick any security/validation wrappers put around p-Code submissions to validate them, and thusly could write their own p-Code routines to wreak havoc on systems that use Crowdstrike. Depending on what kinds of operations that the p-Code engine can perform, the consequences of someone putting together a user-mode program that loads a malicious p-Code program into the engine that causes irreparable damage would make the incident that occurred look tame in comparison.
To me this says that Crowdstrike had better get cracking on A) fixing their release chain so faulty updates have much less chance (e.g., very closely approaching zero) of slipping through; B) seriously hardening the methodology by which updates are validated to make forging any kind of update extraordinarily difficult; and C) completely revamping the p-Code instruction set such that any "old" p-Code routines fed to it will be trapped, as well as substantially hardening the p-Code's execution validation methodologies (e.g., making sure that the p-Code isn't trying to do something that could lead to system instability or kernel panic). If they don't do all of these things quickly I suspect a lot of customers are going to flee to other platforms out of knee-jerk reaction, which is rather sad, and won't necessarily eliminate the risks, as just about every behavioral detection engine must run in kernel mode, making such solutions potentially vulnerable.
Crowdstrike's methodology is overall quite sound, and their methods of detection and analysis of emergent threats is very effective. Their "front-end" is pretty amazing, and has discovered quite a number of emergent threats and pushed out emergency updates that prevented our machines from being compromised. Perhaps engineering got so wrapped up in the threat identification and analysis aspect of Crowdstrike that the computer agent didn't get as much continuous attention that it should have received. Having a p-Code module of all zeroes cause a kernel panic just screams of problems in the p-Code interpreter.
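To make the "all zeroes should never reach the interpreter" point concrete, here is a minimal sketch of the kind of validation a loader could do before trusting a definition file. The file layout (a 4-byte `MAGIC`, a little-endian entry count, then a table of offsets) is invented purely for illustration and has nothing to do with Crowdstrike's real format; `validate_definition` is a hypothetical name.

```python
MAGIC = b"DEF1"  # hypothetical 4-byte signature, not a real file format


def validate_definition(blob: bytes) -> list[int]:
    """Return the table of entry offsets, or raise ValueError.

    Assumed layout: MAGIC, 4-byte little-endian entry count,
    then that many 4-byte little-endian offsets into the file.
    """
    if len(blob) < 8:
        raise ValueError("file too short")
    if not any(blob):
        raise ValueError("file is all zeroes")
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    count = int.from_bytes(blob[4:8], "little")
    table_end = 8 + 4 * count
    if table_end > len(blob):
        raise ValueError("entry table runs past end of file")
    offsets = []
    for i in range(count):
        start = 8 + 4 * i
        off = int.from_bytes(blob[start:start + 4], "little")
        # Every offset must land inside the file, after the table itself.
        if not table_end <= off < len(blob):
            raise ValueError(f"entry {i} points outside the file")
        offsets.append(off)
    return offsets
```

In kernel mode the equivalent of `raise ValueError` would be "log it, skip this file, keep running", not a bugcheck.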
No matter what the situation is that allowed this serious problem to occur, it is yet another example of how a borked (I use this word frequently, nice to hear someone else use it!) update (either accidental or supply-chain induced as with Solar Winds) can have massive consequences.
It just goes to show just how our world-wide computing infrastructure is perhaps a bit more tenuous than one might believe, and can suffer major difficulties as a result of something innocuous, or worse, maliciously crafted.
The scary part is that there are lots of independent and state-sponsored actors out there that will spend lots of money and enormous amounts of distributed time and talent to come up with a way to cause such a situation to occur with who-knows-what piece of software (I'm not necessarily saying Crowdstrike...it could be anything) that could have even worse ramifications than this Crowdstrike incident.
That day will inevitably come, and when it does, I sure hope I am retired from working in IT, as it will be a very, very unpleasant time for the world at large, and even worse for anyone who is working in systems administration.
Thank you, Dave, for your great channel. Even this jaded old systems guy who has been around the block way too many times learns something and frequently gets a good chuckle from your subtly-injected humor. God Bless.
What an educational pleasure listening to an expert explain a technical matter in such an understandable way.
Dave - this was an insanely clear, concise, and thorough explanation, which is only possible in part to your depth of experience (and in part to your eloquence, wit and dry humor, which I relate to). Thank you!
Hi Dave, I've taught operating systems for a long time at university level, so I know exactly what you're talking about. Your explanation here is excellent, short, clear and to the point, not even a little stumble or hesitation. Congrats, it was a pleasure to watch the video. I'm impressed. As a comment, I can't understand why they don't seem to have a robust test environment where they can test these updates to the hilt; the corrupted file is _also_ part of the software.
I believe the reason here is obviously corporate rules: cutting costs for maximum profit. The risk of huge f-ups is calculated, like in the car industry. Haven't you watched "Fight Club"?
Look, there is a right way to do it, and a profit maximizing way to do it.
My suspicion (and of course I have no evidence) is that because the distributed file contained only null values, the issue may have been after the testing farm. The update may have passed testing just fine but the file became corrupted when being transferred into the update distribution system. This is no excuse though, there are plenty of ways to easily validate that the file transferred as designed before distribution. Never just trust it. I am looking forward to the details when they are released.
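One of the "plenty of ways" to check that a file survived the hop from testing to distribution is to compare cryptographic digests. A minimal sketch, assuming the test farm records a SHA-256 of the exact bytes it tested; `digest` and `safe_to_publish` are hypothetical names, and nothing here is known about how Crowdstrike's pipeline actually works.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 of the file contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def safe_to_publish(received: bytes, expected_digest: str) -> bool:
    """True only if the bytes about to be distributed match what was tested."""
    return digest(received) == expected_digest
```

A file that turned into zeroes in transit has the same length but a different digest, so `safe_to_publish` returns `False` and the release stops there.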
I'm guessing they never saw this zeros/NULL filled file being distributed as a point of failure so there were no tests. It may be there is extensive testing but it never picked up a file corruption before distribution. Suffice to say, there will be a LOT more eyeballs on it now. The driver should have handled it better as well rather than just crashing the Kernel.
@@TC2290-wh5cb Both points you make are true. A driver with error trapping is 'one more chance' to handle an invalid definition file. But the driver executes at Ring 0. If I understand what Dave said, processes operating there cannot access user memory?
You're obviously a skilled and experienced technical powerhouse, but the writing style (sarcasm, wit, technical aptitude combination) and delivery make this more than just a "system dump" of data the viewer has to try and digest. Instead, we're treated to a bit of entertainment as we debug.
Thank you for the package deal.
I've been waiting for a decent explanation of this issue. Duly provided by Dave. Thank you, Sir.
Yeah, no one really explained what it was all about.
Old school BSOD t-shirt is awesome!
"BSOD t-shirt" in your favorite search tool finds several. Mine's on the way!
That busted code snippet would make a cool new one lol
As someone with a computer science major and worked on software design, your definitions of kernel and user modes and how they were different and how they work were great... Better than my professors i had in college...
I haven't heard talk like this in almost 40 years! Thanks for the memories! 😁
We still had people talking sweet to our ears like this in the 1990s the last decade before good computers and internet, i miss it so much because it's simple but somehow so stimulating i'm excited now lol. We had kids programmes like art attack and sMart that i frequently watched and somehow had an impact, and they just talked to us in such simple and kind ways and not infantilizing like even kids were little adults with a brain capable of learning.
Now nearing my middle 30s i'm relating more to how people talked in older shows, i've been watching bullseye the game show and i really like Jim Bowen great man but the way he talks is just like described. I've seen a few episodes of Tomorrow's World and i felt myself lapping every word up while falling into a relaxed lull, there's just nothing better about the way things used to be explained something special about it that appeals in the right way to the brain.
I've read a bunch of stuff about this issue over the last few days and this video is, by far, the best and most understandable explanation of exactly what happened.
You crack me up, Dave. 😂 The blue screen of death shirt, the offhand reference to using a MacBook (at 0:40) to investigate. Brilliant.
Of course it wouldn't be the same without your skill and technical insight to follow up with. I always enjoy your hearing your perspective and learning from your expertise. Keep up the good work.
You're seeing something you want to see that isn't there in the way you want to assume.
Yeah, I want a BSOD shirt too! 🤣
Dave - this was brilliant. Simple - direct, easy to understand, and your outlining of the solution was amazing. Well done. Good job. Thanks. It just shows that our media (newspapers, TV, online commentators) do not really communicate, and their focus is more on sensationalist news - anything that sells their channel. You have done the most amazing job of succinctly explaining exactly what went wrong and how to fix it.
Your explanation is so brilliant, you deserve an award of some kind, for such excellent communication, and understanding. You should be on TV, you are much better than the people who talk about tech on TV, you actually know what you are talking about, and know it very very well. Thank you.
Dave's award could be a hundred thousand new likes and subscribes -- so what are people waiting for? Do it!
I loved how you balanced CrowdStrike's need to ship updates swiftly by circumventing WHQL against the importance of rigorous testing to ensure the delivered updates don't compromise the integrity and stability of the underlying operating system and kernel. “With great power comes great responsibility” - software that lives in ring 0, aka kernel mode, should handle changes carefully to prevent such incidents in future. Thank you so much for the great video. As always I’m a little late to the party; the YouTube algorithm recommended this video to me 2 months after its release. I wish I had watched it in July ’24. 😅😊
I'm not even a programmer, but between you and Steve Gibson, I feel like an engineer. This is by far the most clear and in-depth explanation of what happened (based on the current knowledge) that I have heard. Thank you!
if memory serves me right it was someone using a null pointer.
and the fact that the error was not caught by anyone in the chain makes me doubt the quality of their programmers.....not just the ones doing the grunt work but the ones that are supposed to conduct the code reviews.
closed source software is not trustworthy.
Do you know if Steve Gibson has a YouTube channel?
Upvote for mentioning Steve Gibson!
I've been a heavy PC user forever and started with MS-DOS and IBM-DOS in 1985, 8086 etc. I've never written a line of code beyond a complicated batch file. Yet I actually followed you thru your entire presentation. I'm not that smart. You are that good.
well said, i too have rudimentary programming knowledge only and yet could grasp the gist of what happened despite all the technical jargon !
I dont know much about IT and programming but man.. your explanation was perfect for a novice like me. Thank you Dave.
Also as a deaf person i am thankful that you spoke in calm and clear sentences because that helped the subtitles to work nearly perfectly so thank you again.
Thing is it was perfect for a novice and for IT veterans alike,, Dave's quite the guy..
Not to take anything away, but I would've liked to have been told what P-code is
@@Ryan-lk4pu Wiki is your friend :) It seems it's a Microsoft version of bytecode, that is, code intended to be run on some virtual machine.
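Since the question was what P-code is: in the general sense it is bytecode, i.e. instructions for a small virtual machine implemented in software rather than for the real CPU. A toy stack machine shows the idea. The opcodes and the `run` function below are invented for illustration and are not any real P-code instruction set.

```python
PUSH, ADD, MUL, HALT = 1, 2, 3, 4  # made-up opcodes


def run(program: list[int]) -> int:
    """Interpret a tiny stack-machine program and return the top of the stack.

    Every fetch is checked, so a garbage program raises ValueError
    instead of reading outside the program or the stack.
    """
    stack: list[int] = []
    pc = 0
    while True:
        if pc >= len(program):
            raise ValueError("ran off the end of the program")
        op = program[pc]
        if op == PUSH:
            if pc + 1 >= len(program):
                raise ValueError("PUSH is missing its operand")
            stack.append(program[pc + 1])
            pc += 2
        elif op in (ADD, MUL):
            if len(stack) < 2:
                raise ValueError("stack underflow")
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == ADD else a * b)
            pc += 1
        elif op == HALT:
            if not stack:
                raise ValueError("empty stack at HALT")
            return stack[-1]
        else:
            raise ValueError(f"unknown opcode {op}")
```

`run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])` computes (2 + 3) * 4, while a program of all zeroes is rejected at the first fetch because 0 is not a valid opcode. The interpreter, not the bytecode, is what touches the machine, so the interpreter has to be the careful one.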
@@Ryan-lk4pu I had not heard of P-code before. Since he included it with assembler, I just figured that it was another low-level language that is able to work directly with the hardware.
well, sometimes people get comfy and forget that "with great power comes great responsibility", thanks for the video Dave
Great explanation Dave. I'm retired as well and spent the majority of my career developing microprocessors at Motorola and AMD. I would bet at this point that CrowdStrike has at least 4 lawyers for every engineer looking into this with another group of spin doctors looking at how to disclose what happened. It's not a business to be in if you have a weak stomach.
This explanation makes sense, and seems knowledgeable. I've been a systems programmer for 46 years, and I've done kernel programming on various operating systems, including windows.
We are worried about getting p0wned so we install a kernel driver, mark it as critical, and then let a supplier with a history of screwups push updates to it whenever they like with no testing or controls. Good job. Good job.
Yeah,,,WTF
"We are worried about getting p0wned "
You hit the nail on the head here. The fear of the danger is quite often worse than the danger itself...
I'm here with my mouth open, amazed that this is how Windows works. What is the point of the certification process if a driver can do whatever it wants after it is certified? How is there no system in place to disable non-MS drivers that are causing kernel mode errors even if they are boot-start? I'm not sure if this is a valid concern but I'm thinking about all the Chinese computer products that install drivers on my system and what they could be doing in the background even if certified.
@@WhoTnT did you watch the video? The answer to your question is in the video - Windows does everything you said, but CrowdStrike marked its driver as critical and restricted booting without it.
@@sas408 Maybe you didn't read my comment fully. "disable non-MS drivers that are causing kernel mode errors EVEN IF THEY ARE BOOT-START" The OS of the system should not be overridden by a third party driver. The fact that the system can even be stuck in a boot loop because of a third party driver is insane.
I grew up with computers, I basically learned how to read on MS-DOS back in the Windows 3.1 era. So when I found your channel I had to subscribe because learning about Windows and everything makes me so happy. I know this has absolutely no relationship to your video, I just wanted to share and tell you "thank you" for making this channel and taking your time to explain stuff.
As an Amiga user, I was very happy to watch this guru meditation.
guru meditation? where?!
"Borked" I have not heard that term in a LONG time. Thank you
I use that term all the time. It's just such a phonetically appropriate description of something misconfigured to the point of failure.
Usually accompanied by a loud, BOING!
Thank you for the technical dive. Nobody else seemed to really offer the reason, outside of "bad update".
Sabotage of an update probably by an intel service. Most QA guys are saying no way was this green lit for release without someone changing something post QA. Even the worst of the worst QA guy would have caught that bug so either the QA rubber stamped it, or someone changed something post sign off.
Or conspiracy theories. Not that I'm bashing those per se, but with those it's always something way worse than what it really is. Occam's razor etc
I actually learned what the Kernel does 😅 yeah this video was very informative
CrowdStrike go into some detail on their blog. They say it’s a logic error. They also say more info will be forthcoming.
Excellent explanation and overview of kernel mode and rings 0 and 1. I am retired also. I was a C/C++ UNIX/Linux and some Windows programmer. It is refreshing to hear someone who worked on the bleeding edge and knows his stuff explain this problem so completely well. Thanks Dave!
For people like me who have no IT expertise or any particular skill for programming, I learned a lot and also made me understand the basics of operating systems. RING 0 or 1 were completely unknown to me. In short, you gave a good presentation on the subject. THANKS.
This is one of my favorite tech channels on YouTube. Excellent work, as always.
It has been a decade since I did development at this level. I have no idea if I will ever return to the field. Why am I mesmerized into watching this whole video? Dave, I think you offered a clearer presentation than any of my university CS professors.
There's no financial incentive for great CS field personnel to remain in academia - private industry pays a lot more and with far better conditions.
I must admit I couldn't understand most of the terms... but no doubt you, Sir, must have a LOT of experience! Please keep educating people with your videos. Even if some of us are total beginners, it's inspiring
"It's a fair bet that update 291 will never be needed or used again" Dave you're a legend 😂🤣
Never needed again? Yeah.
Never used again? Bring Your Own Vulnerability says ehhh...
Update 291 should have its own Wikipedia page. "We really 291'ed that release."
Hi, I'm Dave and I'm based as f---, should be how he starts every video
@@Dwigt_Rortugal yeah it's now a meme like Room 101
accurate summary. the source of the zeroed file is either a crash during writeout during the build process (full disk/stopped vm scenario likely) or a cdn corruption. both would have been caught by the inclusion of a checksum/manifest pair to validate the payloads were intact. the moment the driver decided to bypass certification and dynamically include contents to speed up the process they should have known they needed to supplement it with a checksum manifest but chose not to for unknown reasons. this is sadly a VERY common outcome in cdn mapped content due to a variety of corruption vectors, and the trust modern software has in network integrity is rather badly misplaced. always verify your content is intact regardless of how small/large
I'm just a hobby programmer and even I would have thought to do checksum testing. It's ridiculous, frankly speaking. In a chrome extension I wrote for my personal use which modified existing functions on a page, I only replaced the functions that I tested for the checksum, and the code warned me if the underlying page has updated the JS functions, so I could update my own extension to match the update (and this worked pretty flawlessly and saved me a lot of headaches.)
The scary thing is that this kernel hog doesn't even seem to have a way to vet the driver files, the program blindly trusts those files to be the real deal.
If the reason is corruption, then it is mind-boggling that they would not at least have signed their updates with their own certificates prior to running them through QA. That would act as protection against corruption, but also as an additional layer of protection against tampering. Imagine if their distribution machines were compromised and an attacker replaced the update with a malicious rootkit. I'm tempted to say that with the cavalier approach that they took to bypass quality certification by Microsoft to execute code on ring 0, if they didn't sign their updates, then they are amateurs, should lose all business, and their company should disappear. It might happen anyway, if they get sued to oblivion.
It's also an insane risk if they're blindly accepting the file. It's lucky it just ran into a zero byte file and not something created and injected via a malicious third party.
They did it on purpose so they can pretend to be a "bad actor" and insert whatever they want into systems hosting their rootkit malware for whatever purpose they want including but not limited to taking servers and services offline, hard.
@@hesido checksum manifests are an advanced concept. the avg programmer doesn't understand why files would not be what they wrote in the first place. your description of function rehooking sounds just like multiple other good projects. same concept. search, compare, replace/skip.
sadly a lot of shady crap goes on in driver land. there are a lot fewer examples of good ways to do things that low level so the expertise isn't available.
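To make the checksum/manifest idea from this thread concrete, here is a minimal sketch in Python. It is not CrowdStrike's actual update format or pipeline: the file name `channel-file.bin`, the manifest layout, and the function names are all made up for illustration. It only shows the general technique of recording a size and SHA-256 digest when a file is built, then refusing any file that doesn't match on the receiving end.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes]) -> str:
    """Publisher side: record the size and digest of every payload file."""
    entries = {
        name: {"size": len(data), "sha256": sha256_hex(data)}
        for name, data in files.items()
    }
    return json.dumps(entries, sort_keys=True)


def verify_payload(name: str, data: bytes, manifest_json: str) -> bool:
    """Consumer side: accept a payload only if it matches the manifest."""
    entry = json.loads(manifest_json).get(name)
    if entry is None:
        return False  # file is not listed in the manifest at all
    if len(data) != entry["size"]:
        return False  # truncated or padded somewhere along the way
    return sha256_hex(data) == entry["sha256"]


# Hypothetical file name and contents, for illustration only.
good = b"example channel file contents"
manifest = build_manifest({"channel-file.bin": good})

# A same-length file of zeros passes the size check but fails the digest check.
zeroed = bytes(len(good))
```

With this, `verify_payload("channel-file.bin", zeroed, manifest)` rejects the zeroed file before anything tries to parse it. Note that a plain checksum only catches accidental corruption; protecting against deliberate tampering would also need the manifest to be signed, as the reply about certificates points out.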
Wow, nice, thanks. As a developer for IBM starting way back in PC DOS 1.0, I understood everything you presented and appreciate your time in explaining not only what happened but how to fix it.
Our engineer dodged this one by not signing up for CS and keeping Sophos. CS charges about $30k extra for content filtering, which Sophos includes. We have computers all over the world so this would have hit us hard not being able to get to all those remote users and sites.
I hope you bought that engineer a beer!
Sophos crashed our distributed servers every week, sometimes every night. Since we changed to Crowdstrike we've only had this one crash, so we're staying with CS for sure.
@@franmotero OMG are we secured by the equivalent of the Mexican cartels? Are there no good products on the market?
@@franmotero so you like leaving a huge backdoor open day and night. Interesting choice. I prefer the crashes to getting hacked by opening up the kernel to third party custom code.
@@zemm9003 Friendly crashes are better than missile attack
Crowdstrike: Run our software and we guarantee no one will access your system.
😂
"just trust me, bro", and then they drop billions in BSODs.
*no one else
They took “no one” too literally
A little over 10 years ago I was working on a project in the corporate offices of a major bank trying to upgrade from Windows XP to Windows 7 when the geniuses in charge of software deployment decided to force an uninstall of a password vault that tied into the Windows login. The problem was the uninstall process required a reboot and connection to the network for the new software to install. RIP the over 30% of workers that were working from home. I figured out pretty quickly how to fix the problem by using Safe Mode with the Command Prompt to edit the registry but then the geniuses in charge of IT security decided to disable Safe Mode on every system. The ineptitude of that place was astounding.
It is called shooting oneself in the foot with a semi automatic with a large magazine, never stopping pulling the trigger or moving the foot.
I feel for you, I have been there.
No, this is basically IT feeling threatened that someone might try to install malicious code into the network and take the whole damn bank down. So they try to make it idiot proof, with no one allowed to enter the standard back door of Windows and do things they really are not authorized to do. You may know what you're doing, but Joe Schmoe next to you doesn't, and one wrong command later he can take everyone out. But honestly no security software should ever run next to the kernel for any reason.
The dumbest thing that companies do is adminlock every necessary tool 😂😂😂
Came into work one morning, many years ago, and a Windows Engineer had run a script that accidentally started deleting domain user accounts. Within seconds, before he hit Ctrl+C, around 500 users were deleted. It took a couple hours to restore from backup. The help desk was pounded. Shit Happens, all the time. You learn from the mistakes and you make sure it won't happen again. Nobody would have dreamed something like this could happen, with so many systems being impacted so rapidly. That's the problem with complex interactive large systems. Take Joyent and AWS, they both had an engineer accidentally reboot an entire US East data center because they typo'd on the command console. Apparently the default behavior was to execute the command on all nodes in your data center. Both Joyent and AWS fixed their command line consoles to stop that from ever happening again. One must specify a group of nodes or individual nodes to operate against or the command will refuse to run. One day, someone at Pixar deleted the wire frames for the Toy Story 2 characters. Pixar had to shut down, have all hands on deck for weeks, practically sleeping at the office. They found a copy with a work from home artist on maternity leave who had a Silicon Graphics SGI Irix workstation at home. They drove to her home, wrapped the computer in pillows and carried it on a gurney to a station wagon and drove 15 mph to the Pixar office with their hazard lights on. Then recovered the data from the drives. They still had to spend tens of thousands of hours version checking every file to rebuild what was deleted from the NFS shares that were wide open with practically zero permissions. Local restaurants delivering food to the Pixar office started dropping off food for free because they had their best month financially ever in the history of their business. Scale that scenario up to 8+ Million computers broken by Crowdstrike / Microsoft since Thursday night. Techs everywhere are exhausted beyond measure.
I didn't know you could disable Safe Mode.
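On the Joyent/AWS console fix described a comment up (name a group of nodes or individual nodes, or the command refuses to run): a rough sketch of that guard in Python, for anyone curious what "refuse by default" looks like. This is not either company's real tooling; `run_on_nodes`, the node names, and the behaviour of returning a plan instead of executing anything are all assumptions for illustration.

```python
def run_on_nodes(command: str, targets: list[str], all_nodes: list[str]) -> list[str]:
    """Plan a command for explicitly named nodes; never default to 'everything'."""
    if not targets:
        # An empty target list is treated as a mistake, not as "all nodes".
        raise ValueError("refusing to run: no target nodes specified")
    unknown = [t for t in targets if t not in all_nodes]
    if unknown:
        # A typo in a node name stops the whole command instead of being skipped.
        raise ValueError(f"refusing to run: unknown nodes {unknown}")
    # Hypothetical: return the plan rather than actually executing anything.
    return [f"{node}: {command}" for node in targets]


# Hypothetical inventory, for illustration only.
nodes = ["node-1", "node-2", "node-3"]
plan = run_on_nodes("reboot", ["node-2"], nodes)
```

The design choice is that the dangerous case (everything at once) has to be spelled out explicitly, so a typo or an empty argument fails loudly instead of rebooting a data center.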
You covered what was going on far better than all the news articles I’ve come across thanks Dave 😊
You are an absolute pleasure to listen to. Thank you sir.
if only my college professors taught in this easy going and comfortable way, I would have become a better CS graduate who still had a job!
Thanks Dave, you are a gem!
You made that look simple. I've seen people scrambling to fix and talk as if they are piloting the star trek enterprise. This is straight to the point.
It's simple to fix a single machine if it's sitting on your desk. It's a bit different when you have thousands of machines on racks, and have to track down and physically interact with each one.
Ha!
It is not simple to fix if the disk is encrypted with BitLocker and the CrowdStrike files are protected by group policy, as they are in most corporations. Even if you manage to get the BitLocker key to boot into safe mode, you still can't delete the files as local admin.
the clarity you bring to your subjects is beyond impressive!
I am a mechanical engineer and I Know fuck all about computers on this level but damn that's the best explanation I have heard for a complex system made simple. You sir are an amazing explainer.
Well said! Deep knowledge is great, but the ability to concisely convey it to others is much rarer, and deeply undervalued.
I used to develop Windows device drivers and your explanation of the process was the best I have ever seen. Keep up the great work!
Aren't you the guy that created the task manager?! It's amazing that you chose to use your time to create quality informative content. I'm subscribing immediately!
He's got a whole video or three about that.
he worked at Microsoft before but he is retired now.
Woe. Ctrl-Alt-Del is a real lifesaver. Especially if you run Mathematica, which can really bork out sometimes.
no, he isnt the task manager guy
@@ikategame actually, he is. Follow the link to his book, and read the preview. He specifically says "if you've ever used Task manager, you're using some of my code"
Bought your book. I thought to myself, "This guy looks a lot like me and talks a lot like me." Then you shared your book at the end of the video and I read the entire sample. Thank you for your hard work and dedication to creating such an awesome read.
Hope you enjoy it, and find it useful!
Ironically, I also worked at a company hit by NotPetya and other attacks that actually used drivers to destroy the boot loader and attempt to encrypt the data drives (while technically masking searching for financial data). Crowdstrike in their early days swooped in and I spent days on the ground with their Senior engineers taking apart the malware and hacking together methods to recover from it.
Now almost a decade later I work at Microsoft and Crowdstrike does a driver based corruption of the boot process, and I spent Friday and Saturday identifying, tracing, and fixing that aftermath.
If you live long enough, you become the villain?
Does this mean that CrowdStrike is using methods that they borrowed from malware?
I've just been getting into coding in my free time for fun and have made a couple little programs and I just love how much I learn from your videos. So funny, entertaining, informative, and inspiring. Thank you, Dave.
You always cover IT related topics very well easy to understand and follow. Appreciate the time you put into making these videos, thank you.
Glad you like them!
I use computers every day for programming, but at a higher level. I work with end users explaining our complex software.
Dave's explanation style, depth, and content are A1. Outstanding video saved for reference.
I made BSoD T-shirts back in the day (mid 90's) and wore one to COMDEX LV one year. The front had a small pocket logo "BSoD The OS of Choice." and the back was a graphics driver BSoD in the classic hex dump version. It went over not so well visiting the Microsoft booth...but it seemed everyone else liked it. I still have a handful of them left over and periodically wear them when working on my cars.
I wore a BSoD t-shirt the day I met Kevlin Henney at a conference :)
Booting into safe mode works great if you don't have BitLocker full disk encryption activated like most enterprises. If you didn't print out a hardcopy of the key to enter when you try to boot into safe mode and your AD servers were also impacted, you better hope you have accessible backups of that data or a lot of information is going to be lost.
Someone found a way to get into Safe Mode without the BitLocker key (many people, myself included, were happy about this). I had a few machines that had an old BitLocker key, and the new one hadn't updated to AD or MBAM.
From what I read, the EFI partition with BCD and Boot Manager isn't encrypted and you still need to login with your account in safe mode.
@@joester4life bootloader can never be encrypted if you want it to be bootable by non proprietary hardware. But this still shouldn't allow you to decrypt the drives using the tpm if you manage to boot into a new boot entry crafted for this. This is ensured by additional checks in the bootloader itself.
@@山田ちゃん actually, it's not supposed to boot into safe mode without triggering bitlocker, at least that's how it works on my machine. i would love to know how they did it
@@Space_Rat1 Unpatched recovery partition most likely. Missing KB5034441 and similar
@@Space_Rat1 same i've gone over to my friends house who runs a hobby server and bitlocker has fucked him on multiple occasions
You're one of the few actual technical YouTubers. Thanks for explaining it in a bit more depth.
lots of technical youtubers out there :D
Dave’s explanation of the CrowdStrike IT outage was highly informative. By focusing on the role of the kernel mode driver, he sheds light on the depth and complexity of the issue. This serves as a crucial reminder of the importance of every component in our security infrastructure and the significant consequences when things fail.
Thank you for lifting the lid on this issue. I consider myself technical and I had firsthand experience with this issue impacting my work system; the remedy was applying the fix you described. My system is a corporate Windows system with the usual corporate protections in place, so when it occurred, I was prevented from peeking under the covers by the limited privileges granted to my user account. Seeing your explanation, I now realise that, as I only have a limited understanding of kernel mode and device drivers, it is unlikely I would have figured this out the way you have. I find your explanation has relieved the stress I feel when not knowing the root cause of an issue. Once again thank you for making this information both detailed and accessible to the average person. Good work.
Excellent summary, as a MSP engineer, this was a really refreshing and concise summary, very grateful.
CrowdStrike seems a perfect vehicle for a state actor to gain back door access with or without MS knowledge.
Incredibly unlikely in that regard. Any backdoor the government can use is a backdoor malicious attackers can use.
In that case, Crowdstrike would have been screwed long before this.
Security through obscurity, the fact everyone relied on this platform tells a cookie cutter story of IT security with no redundancy
I'm sure they're much better at it than that.
Backdoors are already built into windows, and the CPU itself.
@@LyricsQuest and routers and switches and servers and hosting providers etc etc
Gotta rescind the CrowdStrike WHQL kernel certification. Great explanation - best I've seen from someone who knows...
"It crashes, because it has to" - I feel that the Matrix would have been more complete had they worked this line into the story. Preferably spoken by Morpheus or maybe even Cypher in a serious, deliberate, and credible manner. There's so much to unpack with this line...
Please, unpack
They kinda maxed out the playtime getting Reeves to say "There is no spoon" , otherwise Morpheus would have to monologue about the cat glitching out, and then he would've said it.
"the body cannot live without the mind" - Morpheus
They wouldn't call it "Windows" if it wasn't going to get broken and replaced from time to time.
@@joep9617 bro, profound.