That slap in the back about the null pointer is how my father taught me everything in life, it worked wonders, I'm definitely using it on all my children
I've seen several places today that still have the computers messed up. They are running but something else is going on. Files that they try to retrieve are no longer there. Customer histories wiped out. One place said the computers came back up on Friday and this morning the server fried.
Never attribute to malice that which can be attributed to incompetence. They simply never expected the file to contain null bytes so they never checked for it.
When their business model is entirely predicated on claiming they're less incompetent than the vendors of actually needed software on the same system, and will cover for those, this level of incompetence is gross negligence at best.
I had to Google Bret Hart's clip about dereferencing a null pointer to know whether it was AI-generated or not. It just looked so random, but I was relieved it was real footage.
4:30 I think this is the same thing as “CrowdStrike didn’t check their code”? There is a CNA article which states that CrowdStrike “Skipped checks”. The article also mentions that the update “should have been pushed to a limited pool first”.
We had bunches of autotest updates from our tester. They failed regularly, and there was a tiny bit of code showing who to blame o) And we had full testing before going to production, and still had minor, rarely occurring bugs there.
0:38 Once upon a time, I posted that picture of McAfee on FB and it was more-or-less immediately taken down. Sometimes I think that Anti-Virus / Anti-Malware companies were invented to make Microsoft look simply excellent in comparison.
It probably was an organisational failure... Like Boeing, the manifestation is doors blowing off etc... but the real cause is unwise organisational changes... to boost profits, personal and corporate... at the expense of quality... I don't know if Crowdstrike has shareholders, but there will be pressure to increase profitability... ... outsource IT to India... employ under qualified, cheaper, staff etc... put pressure on managers to deliver... who put pressure on staff to deliver... ... and the manifestation is... global computer failure...
@@XIIchiron78 No it isn't, those are just next-word or next-pixel predictors. They don't understand anything and they're definitely not thinking. There's no "they" there to do the thinking. These are just glorified calculators. You could build one from vacuum tubes.
@@JxH While technically x86 has four security rings, in practice most operating systems use just two: Ring 0 for the kernel and Ring 3 (or occasionally Ring 1) for all user code.
This is actually the big argument against simulation theory: an environment of that complexity running for any length of time at all wouldn't have glitches, it'd have bricked by now.
There probably was a problem report, but management couldn't read it because their account didn't have access because they reduced the number of licenses for the tool because it was too expensive.
If your CI/CD process does not include a checksum validation against known good code as its last step before deployment, then you run an outdated process. This validation step was difficult to implement in an existing container-based microservices process, but it proved invaluable many times over. I would think it would be relatively easy in a monolithic build like CS's.
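A minimal sketch of that last-step checksum gate, assuming a hypothetical artifact path and an expected value passed in from the pipeline; CRC32 is used only to keep the example self-contained, a real gate would use a cryptographic hash such as SHA-256 or a signed manifest:

```cpp
// Minimal sketch: refuse to deploy an artifact whose checksum doesn't match
// a known-good value. CRC32 and the CLI interface are illustrative only.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <vector>

static uint32_t crc32(const std::vector<unsigned char>& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: verify <artifact> <expected-crc32-hex>\n");
        return 2;
    }
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
    uint32_t expected = static_cast<uint32_t>(std::strtoul(argv[2], nullptr, 16));
    if (crc32(bytes) != expected) {
        std::fprintf(stderr, "checksum mismatch: refusing to deploy\n");
        return 1;   // non-zero exit fails the pipeline stage
    }
    std::puts("artifact matches known-good checksum");
    return 0;
}
```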
One of the big issues is that crowdstrike tries to do two things differently than their competitors: 1. They want to be fastest to protect machines around the world from novel new malware techniques. 2. They want their sensors to be extremely lightweight. There are two types of antivirus updates: Agent/Sensor updates, and Content updates. Agent updates are slowly rolled out by an IT organization. (This allows IT to test on and brick say, 10 machines before they go and brick 10,000.) Content updates (definition updates) are pushed to all machines, because what the bad guys are doing is constantly changing Most EDR software vendors make major changes to kernel-level detection logic with Agent updates. Because of Crowdstrike's goals however, they push most of that logic into Content updates. That philosophy and design choice has come back to haunt them.
CrowdStrike is better than that. Lol, you mention just 10 test machines! 8,500,000 machines will give you much better coverage! /s Not sure whether the /s is really needed here 😉
Having made a lot of money preparing for and avoiding Y2K side effects, I can say with confidence that Y2K has absolutely nothing in common with this single point failure.
I was once an employee at Carbon Black, a competitor to CrowdStrike, working in automated testing. It was competitive with the worst software development practices of any organization I've ever been exposed to. The devs were fairly smart, but the assumption was that the purpose of testing was to bless the code they had written. I agreed to step up and manually test one dev's code, and I reported back that every time I tried to run it, it killed the process without leaving any diagnostics. The dev said, "How can I troubleshoot the problem without any good data?" I looked at his code and identified that he was not checking for null pointers, just dereferencing them anyway. This was an important step in getting myself terminated for not being a team player.
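For anyone who hasn't written C or C++: the missing check described above is a one-liner. A minimal sketch, with a made-up `Record` type and `find_record` lookup standing in for whatever the real code did; the point is that a null return gets a diagnostic instead of a dereference:

```cpp
// Minimal sketch of the missing null check. Record and find_record are
// invented for the example.
#include <cstdio>

struct Record { int value; };

// Hypothetical lookup that can legitimately fail and return nullptr.
Record* find_record(int id) { return (id == 42) ? new Record{7} : nullptr; }

int process(int id) {
    Record* rec = find_record(id);
    // Without this check, rec->value on a missing record dereferences a
    // null pointer: the process dies with no useful diagnostics.
    if (rec == nullptr) {
        std::fprintf(stderr, "record %d not found; skipping\n", id);
        return -1;   // fail gracefully and leave evidence behind
    }
    int v = rec->value;
    delete rec;
    return v;
}

int main() {
    std::printf("%d\n", process(42));  // 7
    std::printf("%d\n", process(1));   // -1, plus a diagnostic on stderr
}
```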
"Say you're a Karen without saying you're a Karen" - robot woman's voice
Bro you weren't a team player. Didn't you hear, real men only fix after production
Classic Señor dev
ZII. Zero is initialization. Controlled failure vs uncontrolled failure
@@MrDragonorp TAF - Test After Failure
Anyone working in IT should only be surprised that this doesn't happen every month.
Only through the tireless efforts of countless engineers correcting other people's stupidity (and sometimes their own) does the world make it through another day without disaster.
If only things were about, you know, quality and truth... instead of "who can get the most attention the fastest."
that's exactly what I was thinking, indeed
Working in IT and working at a multibillion-dollar security company are different things.
I'm working in IT and I'm everyday amazed how far we've made it without the modern civilization folding on itself. Being part of that clusterf%ck in the belly of the beast is equally awesome and terrifying.
It's very reassuring to know that I and a programmer at one of the most advanced tech companies have the same practices.
I always copy the entire path of the existing code into a -bak or -date directory, then run the new code in place to test it on one production-level server before deploying it to the rest of the servers. That way I can use scp to copy the old working copy back if things really go belly up.
But I guess when they rely on automation and all kinds of layers of abstraction between them and the code, they cannot do it that simply and easily.
Of course they cannot (simply) test their code in production level environments. Corporations have made un-maintainability into an art form, where a single deployment-step is so automated, but requires so many manual steps as well, that no single person can ever deploy anything easily.
And when you were new to the organization and learned for the first time how insanely convoluted their deployment process is, you undoubtedly asked "why!?". But as always, the answer is "it has grown historically" (legacy). And by the time you entered the organization, it would have taken weeks or months to re-implement this insane architecture as something which can actually be deployed in a sane manner.
But we all know to never touch a running system. Even if it's a running nuclear bomb close to detonation.
@@PhilLesh69 have you heard about Git?
So what’s up with the whole Diddy thing?
I don't see many engineers on LinkedIn with more than a year or two of experience at a company before they move on. It took me a few months to understand our codebase, and with all of the reorgs and compounding layers of rotating management, it was difficult for anyone to sit and focus on much of anything for very long.
As a web developer, I can confirm testing in production is the best way to go: the added pressure focuses you, and it saves having to push things to production. The way I like to do it is over ftp with notepad. Or if I'm on the toilet I'll use my phone and edit the files directly using the cpanel file manager. If my software was running on all the most vital computers in the world, I imagine that pressure would make me sharp as a knife and I'd never make a mistake.
Chad
Real web devs fix in production.
Holy fuck I haven't laughed this hard at a comment in months
As a developer I like to watch CI/CD on the toilet too 😂 it really helps me not sit too long on the toilet
Giga Chad!
As a Rust programmer I can confirm that this was our plan to get Rust into production.
Even upgrading to C++23 would work, instead of rewriting the whole codebase in Rust.
@@ashwithchandra2622 wait until C++60, safe memory edition.
@@ashwithchandra2622 works! but fails in prod 😢
@@ashwithchandra2622, how? What is so special about CPP 23
@@ashwithchandra2622 what feature of c++23 will make it immune to null pointer de-referencing?
I just finished my first puzzle on Brilliant and got an invitation for a job interview at CrowdStrike. Wish me luck boys.
Crack up, like it👍😆
Lololol
@@smithwillnot good luck i know you will do a better job, test check, test check, test check, lol
@@smithwillnot oh i forgot disable automatic updates on any OS before you send out the updated patches ha ha,
Don't forget to test in production on a Friday and have another job on standby!
Real men test in production... ON A FRIDAY
The best code is the code written on Friday, 5 minutes before go-home time.
The competent folks are on summer vacation
F yeah, I ain't even a code nerd
Literally never coded in my life
BUT I KNOW: DON'T UPDATE ON FRIDAY
At 5:30 PM😂😂😂
actually real men don't test their code at all, they just push their code and wait for a scream test
Hahaha damn that "Real men test in production" pic with the submarine guy killed me
it certainly killed the sub guy...
Wait until you find out that his last name was *Rush.* 😮
I bet he thought the same thing
Jeff made that joke in a vid right after that disaster already. If you enjoyed this, go back and watch that.
This was 100% a culture issue. I left Crowdstrike in March of 2024 specifically because of these types of quality issues. I never expected anything to blow up this bigly, but the culture that enables this type of thing is why I left.
Emphasizing quality assurance and the organization's responsibility underscores why continuous integration and proper testing are so crucial.
Commented faster than CrowdStrike devs push into production
and way faster than the rollback :D
Smooth
Doubt
Impossible
@@DylanEdd_1 Considering the BSOD, is automated rollback even possible?
The company I work for has under 200 employees, under 30 devs, and we devs are writing education software. But even we have 5 levels of test environments before any change hits production. That's besides the automated tests written by the API devs, the automated tests written by the front-end devs, and the automated end-to-end testing by the QA team. Then there are required peer reviews of all code, and the QA dev's manual testing. It's scary that a software company with such a critical product is releasing code without at least these guard rails.
"The company I work for has under 200 employees, under 30 devs, ..."
FYI - The number zero (two places) is compatible with that sentence structure.
I heard from one employee that there's no automated testing. Also, this update was flagged to bypass all canary testing at individual companies and to deploy everywhere immediately. And the driver itself is flagged so that if it fails during boot, Windows won't just disable it and boot anyway. The file that caused the crash contained nothing but zeros. This is either intentional and someone shorted a lot of stock, or it's criminally negligent.
@@JxH I'd be highly impressed if zero devs managed to pull off that much procedure.
Worked at a place that had 15 "QA" people; all they did was click on the functionality, they didn't even read the code, then they'd send it to the client to click around and then push to production. Worst company I've ever worked at, fuck those guys - when I raised this issue I was fired within 2-3 weeks!
I have a feeling I know what company you're talking about, not because I worked there but I used to work for a competitor with only one level of testing. 😂
1:13 what the hell is this stock video lmao
My Class Portrait!
F!
I would pay for it.
HR
Search “Bret Hart null pointer” and you’ll find it.
This isn't even the first time this quarter CrowdStrike caused a bunch of machines to kernel panic/bug check. In June, Falcon Sensor was causing RHEL 9.4 to kernel panic. In April, it caused Debian to kernel panic. In both of those cases though it was a Linux kernel bug.
CrowdStrike crashed Linux a few months back, crashed Microsoft now; it's Macs' turn next, dun dun dun 🤪🤪
Programmers are generally terrified about missing deadlines and will do whatever you command them to. It's up to the project manager to track delays and ensure the boss is notified in advance that deadlines will be missed. It's up to the boss to ensure they have good project managers and QA testing practices. Yes, this is indeed an organizational failure.
So are all devices that used Crowdstrike unusable now and need a fresh windows install?
@@WhiteSharks-wz6kn Not really, just boot into safe mode and get rid of the borked driver.
This is sure going to be annoying for the IT team if they need physical access to do it, and don't forget this must be done for EVERY DEVICE.
@WhiteSharks-wz6kn No. Just need to delete the latest Crowdstrike driver. Usually 2 major steps.
A) Either get the specific encryption key access to the company laptop/desktop first or go straight into safe mode.
B) Go to the command prompt and delete the latest Crowdstrike driver file (c-00000291*.sys).
FYI... I work in IT. Our team of 13 had to go through this process for 700+ employee laptops 💻 on Friday. Some old and some new. Interesting stories to tell at a bar or on Reddit.
@@akin242002 What everyone hears: "Delete the latest Crowdstrike driver file (c-00000291*.sys)."
What every malware author hears: "Delete all CrowdStrike files (c-00*.sys).".
the existence of project managers is often an organizational failure
"Most of us will be dead by then", that got me rolling
There is nothing funny about nuclear war.
That got me rolling too😅
I thought I got the joke... until I actually did and was like "wait a minute..."
@@ImperativeGames So you say. I find it hilarious! 😅😎
In 2021 my previous employer invited a pair of supposed doctors to tell us with a straight face that we would literally all be dead in five years if we didn't get the experimental injection. 2026 confirmed!
that's right - a classic null pointer dereference... nobody expects the spanish inquisition
It's such an insufficient explanation; a null pointer dereference is a symptom, not the root cause.
They must now sit in the comfy (gamer) chair.
LOL! This has been a problem since the late 1970s.
and it STILL IS!!! OMG!!!!
Again, it was not a null pointer, there was a null check in the code.
The idea that a rust enthusiast would "prove a point" is the most believable thing in the world.
idk, I like the theory that this was done on purpose as practice for a real event.
I mean, if the driver was written in Rust then it would crash anyway, since Rust by default crashes on memory unsafety.
The c++ code already checked for null pointer as mentioned in the last twitter thread in the video.
"The idea that a rust enthusiast would "prove a point" is the most believable thing in the world."
Well, other than what actually happened
@@rj7250a I feel like a kernel driver should probably have a panic handler that unloads or maybe restarts the driver with a count of number of retries. That way any unrecoverable errors (bar compiler bugs/unsafe block promises not being kept) will not bring down the system
@@minerscale I see you don't know much about writing kernel-mode stuff. Unlike a userspace application, nothing is tracked in kernel space, so there's no way to know how to "restart" or unload the offending driver... or anything that has been commingled with it. You have to trust the driver's shutdown and exit code; once it's done anything "bad", none of its data structures can be trusted, and by extension, neither can the entire kernel, since in ring 0 it could've messed with literally anything.
3:50 This felt like a personal attack, I'm literally writing a to-do list application right now as one of my first apps.
Don't feel attacked, writing your own to-do app is a rite of passage.
Let's also mention that not only does the driver run in kernel mode, it's also flagged as running on boot. That is why this outage was so bad: bluescreen because of the driver -> reboot -> ah, this driver is marked as an essential part of the system that we can't boot without -> bluescreen. Meaning that rolling out a fix will not repair machines automatically; an IT tech has to go to every single machine and manually reboot into safe mode for the fix to actually be applied.
that "real men test in production" meme was sick!
That got created right after the incident. But it's gold. 😂
One could say it was...Titanic.
"Everyone has a QA system, but not everyone has a production system"
I don't always test my code, but when I do, I do it in production...
@@seanburke424 That saying doesn't make sense to me.
WHY DIDNT I KNOW THAT 1:30 VIDEO EXISTS?
Pure GOLD
Please share a link to it!!!
It's called "Making of WrestleMania: The Arcade Game"; it's on YT.
Some memes never get old
So apparently CrowdStrike Falcon broke a Debian image about three months ago, but because Linux doesn't actually force software updates, it fucked the VMs of a few dozen nerds who reported the issue and rolled back to the previous image before the entire global ecosystem went down.
Seems like there's a few lessons to be learned here.
Interesting
Classic small stick that keeps the entire global infrastructure from collapsing
wait wtf
This update was not "forced by Windows". It wasn't even done by Windows. CrowdStrike updated the rules itself.
The only lesson you need to learn, is to shut up and update your windows system as soon as possible or else we'll do it for you!
- Microsoft.
In my experience here's how problem solving with code works.
"I want to solve this problem with code. Here is my plan"
"Let's write the code now."
"Testing the code. Oh no, there are bugs."
"I fixed the bugs."
"Oh wait, what's this?"
"This problem has to do with stuff I can't just fix, guess I'll work around it."
"I hate my life, this is really hard."
"This works, but it shouldn't. It looks ugly and I hate it."
"Whatever, it's working."
*pushes code in production*
*crash*
"This works, but it shouldn't."
and that's where you ask for your friend's device
The fact that QA doesn't seem to be a thing anymore is mind-boggling.
What do you think? We are in the Agile era now. Fail fast, fail often, QA is not needed.
"It's an organization failure" - A great programmer once said.
Because afterwards he was fired for not being a team player.
And a few rogue bank traders
This goes to show that outsourcing to one single third party for Kernel intrusion detection isn't the best idea ever, lol
or having universal automatic updates pushed to your machine.
So you want Norton, and McAfee, and Kaspersky, and CrowdStrike, and ... ALL installed at once ?
@@JxH More like some companies use product A and other companies use product B, not a single one using all of them at once. To use an agricultural analogy, you want a security polyculture, as a monoculture is vulnerable to disease.
@@lachlanmckinnie1406 The great clownstrike famine of `24.
It's Boeing all over again, engineers and QA replaced with suits.
Yeah. From reflex, I compared it to that disaster when explaining to others.
@@csibesz07 crowdstrike is now blaming businesses for not having disaster recovery!
replaced with slaves.
Also done in India in a “low cost engineering center”. Lunch time Friday roll out of updates…
@@allangibson8494 Hopefully after this fiasco and Trump being president, the damn suits can stop outsourcing important shits
I worked at a small Dot Com in the early 2000's. We had a QA process for pushing changes to the production web sites.
After the QA department had tested a new release, the QA manager manually signed a form that was printed on a sheet of paper, then that sheet of paper was handed to the sysadmin responsible for deploying changes to production.
Seems like a foolproof process?
Nope.
After working there a few months, the QA manager told me that the producers (product owners) were printing out those forms and forging the QA manager's signature.
We had no idea we were pushing untested code to production, yet until we found out about this we were being blamed because the production web sites were unreliable.
Worked this year at a company that had their QA not look at the code at all, just test the functionality by clicking shit on the website, then send it to the client (which doesn't understand code) to test it by also clicking shit around, and if QA and the client said OK it was pushed to production. Once I said code needed to be reviewed, I lasted another 2-3 weeks before getting fired! Fuck those guys, I hope that reporting their asses actually made something happen, but I doubt it.
Interned at a major national telecom company as a Security Business Partner. The company had quite a rigid pentesting system where every new system or update requires a form with 2 written signatures: one from the higher-ups of the cybersec team confirming that the new asset is good to go for prod, and one from the dev team. Turns out some dev teams (the company had multiple dev teams for different projects) just pushed to prod anyway without ever having this signed form, or even requesting one from the cybersec team.
It's still a more robust system than most software companies employ these days. Somehow, in a lot of teams, Agile is thought to mean: if it compiles, it's good to go.
I'm a retired IT guy, part of a team that did global pushes quite regularly. While a flaw in one of our pushes might "only" take down our presence on the web, there were layers upon layers of pre-push testing, staged releases, and so forth. I remember the pucker factor each and every time we did a "for real" push. I empathize when I hear of D'oh!!! misadventures.
Regarding option 3: just wait to see if in 2025 you start hearing "The new government requested data that unfortunately was irrevocably lost during the Crowdstrike debacle."
Wouldn't be surprised to see that happening just in 2024 itself
😮! well not that 😮
Funny how that only happens to government systems
or Secret Service internal communication history was lost during Crowdstrike situation... as they say: don't let crisis go to waste...
don't tell me they are going to fly a plane straight into a server and blame the Asians and their Buddhism...
What's crazy is that the update didn't even change any executable file. A change to a data file should not be able to crash the entire program and even operating system.
Not true: misconfigured config files (YAML, JSON, TOML) regularly cause parsing crashes. However, it's unacceptable that a tool like this isn't resilient enough to fail safely and gracefully. It's running as Windows root, or in ring 0; perhaps it crashed because it detected itself as a threat, or the OS? Unsure, but static config can definitely cause crashes. Unsure why the BSOD was happening, unless the OS runtime requires this service to be running, or to fail this way, which would be weird.
One level’s data is another level’s code, sometimes.
@@AIrtfical It's exactly what you said, actually. The program forces itself as a requirement for Windows to be functional.
@@AIrtfical It's a boot-start driver. If any boot-start driver experiences an unhandled exception, the entire boot sequence fails. If Windows detects and disables a bad boot-start driver (I don't know if it can), the system would be running (yay), but it would violate company policy by running without a required software (uh-oh).
Kernel-level operations have to crash the system when encountering an error, because not crashing can lead to far worse outcomes when dealing with direct memory access. It is by design, and smart design at that.
Now you can argue that not being able to boot without the faulty driver right after is not the smartest design, but that's on CrowdStrike for flagging their drivers as boot-start drivers.
"Failing upwards" seems to equal "They ~sure~ look great in a suit, let's promote them!". I've seen this over, and over, and over, over the last 30+ years, and it never ends well. It usually goes one of two ways:
1. The person in charge of a thing ends up being so bad at or disinterested in their job that some really important thing ends up spectacularly failing even though they avoid blame (i.e., today's example), and they stick around to screw up the next thing they're put in charge of. Occasionally they suffer the consequences of their ignorance, but by then the organizational and reputational damage is done.
2. They muck around for a few years, cluelessly rising on the org chart until they shuffle off to some new employer who's even more impressed with their fashion sense, usually leaving behind a two-comma morass of overdue projects, impossible deadlines, expensive and inappropriate software subscriptions, disgruntled technical staff, and the like.
More that they know how to talk. The distance one can get simply by confidently bullshitting your way through life is incredible.
I'm convinced that people need to be a certain level of psychopath to be "leaders", and it has nothing to do with their competence.
Does anyone remember on what basis Israel chose their first king?
... That guy would look good in a crown
The thing to understand is that the C level doesn't work for the company or for the customers. They work for the shareholders. So CEOs who make obviously and openly stupid decisions outwardly are often just in effect cooking their books by sacrificing everything else to cut expenses and deliver a quarterly return. And then they bail with a great resume and a bunch of money before everything implodes. Or sometimes even after it implodes, because shareholders don't care and can easily move on to the next legacy brand with their gains. They know when to get out.
This practice of corporate looting that pervades America started pretty much with Jack Welch who gutted GE while managing to earn an entire cult following for doing so.
@@XIIchiron78Someone that gets it.
When I was in IT, we would release security updates to IT computers & servers & volunteers a week before releasing to the rest of the company.
What's worse is that Crowdstrike updates bypass staging policies. So even the smart companies that run critical software updates in their own test systems first to make sure they don't break anything before updating all computers still got the CS update forced upon them. So not only did they ignore their own staging and testing policies, they also ignored everyone else's staging and testing policies.
Yeah, the problem seems to be that those staging/testing policies apply to new versions of the sensor, but not to the data definition files. Which might be ok in theory if they were actually bulletproof against bad data files. But no matter what, they shouldn't have sent out the update to all their clients at the same time. Even if they sent it to a few thousand and waited an hour before sending the rest, it probably would have been enough to prevent this huge disaster. Just bad policies on top of bad policies
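A staged (canary) rollout of the kind that comment describes is simple to express. A minimal sketch with invented wave sizes and a placeholder health check standing in for real deployment and telemetry:

```cpp
// Minimal sketch of a staged rollout: small waves first, halt on failure.
// Wave sizes and update_and_check_health are invented placeholders.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder: in reality this would push the update to one host and
// confirm via telemetry that the host is still healthy afterwards.
bool update_and_check_health(const std::string& host) {
    std::printf("updating %s\n", host.c_str());
    return true;  // assume success for the sketch
}

// Roll out in growing waves; stop the moment a wave shows failures.
bool staged_rollout(const std::vector<std::string>& hosts) {
    const std::vector<std::size_t> wave_sizes = {1, 10, 100};  // then the rest
    std::size_t next = 0;
    for (std::size_t wave = 0; next < hosts.size(); ++wave) {
        std::size_t count = wave < wave_sizes.size() ? wave_sizes[wave]
                                                     : hosts.size() - next;
        std::size_t failures = 0;
        for (std::size_t i = 0; i < count && next < hosts.size(); ++i, ++next)
            if (!update_and_check_health(hosts[next])) ++failures;
        if (failures > 0) {
            std::printf("wave %zu had %zu failures: halting rollout\n",
                        wave, failures);
            return false;  // most of the fleet is still untouched
        }
        // A real system would soak here (e.g. an hour) before the next wave.
    }
    return true;
}

int main() {
    staged_rollout({"hostA", "hostB", "hostC", "hostD"});
}
```

Even a crude scheme like this leaves most of the fleet untouched when the first wave goes down.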
The prosecutor: Show me on this graph where did Crowdstrike touch you?
Windows: "points at the kernel and starts to sob"
The prosecutor: I have no further questions, your honor.
underrated 😂
😅😂😅
Hiring George Kurtz for your C suite seems to be a bad idea.
He might as well retire after this.
@@libertybelllocks7476 , that's the problem: he probably will get hired as the CEO somewhere else if he wants to be. Give it a few years, and he'll be fine. Instead of, you know, being poor and unemployed, like he deserves.
He is the founder too
@@loggjohnable , which makes it even worse.
And yet they hired him for their C++ suite
That stock footage of the smiling people all flipping off the camera is golden
I think it was personal for the blonde in the background. 😂👌
By the way, Friday 26th is "Admin appreciation day", where you can thank your system administrators who probably spend their weekend reading up on the issue and rebooting all the machines in safe-mode to remove the problematic config file.
I worked in software QA for years. Insane that they literally didn't have a battery of various OS configurations set up to test their builds on, in either real or virtual form, before live updating. 😮
It's also crazy that apparently a lot of companies bought and deployed the CrowdStrike software without having their penetration testers penetration test it first.
"Nah, the marketing guy from CrowdStrike said that they did that test."
"Did you also ask whether their test was successful?"
"Yes, but suddenly there were free bottles of Champagne and free ladies everywhere..."
I opt for multidimensional lizard overlords, because incompetence is scarier
It does make sense, they like to do test runs before the main event.
This is probably why conspiracy theories have the following they do, in the face of the more likely reality of incompetence.
Why not both?
@@KatR264 That's a significant part of the reason. People are distressed by chaos, so they look for patterns and signs to explain things away, and also enjoy feeling like they know more than others. Put those together and you get conspiracy theories that both explain chaos and strange events and let them feel superior for 'seeing the truth'.
4:16 this Stockton Rush OceanGate meme is unhinged 💀
darkest image☠
"...willing to die on that hill" 💀
@@kv4648 More like valley.
Thanks for context. I thought it was a nuke, rather than a sub. Too tired tonight, I guess.
I immediately knew this was a management/structural problem not a simple IT/QA "standard" miss. So not at all shocked by that being one key takeaway lesson from this.
This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment, patch deployment, software deployment, the whole thing, on both workstations and servers. We tested our patches extensively before pushing them out to the entire population of the environment. This first included a sandbox environment, then a select user / system environment, then we would stage our patches out over several hours so that if something happened we could back out before catastrophe struck. And honestly, sometimes we would find problems with the patches, and we would be able to immediately stop, suspend and even back out.
Yes, we would use 3rd-party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix, normal business. We would never shirk our procedures to test first, then deploy. To me this is a total failure of IT Governance and a failure to maintain standards. (IT Governance is setting and maintaining standards and policies for the IT infrastructure.)
Also an IT professional. You must be very lucky and VERY sheltered, because way too much of the industry works like this nowadays. It's the kind of thing that happens when you let the normals worm their way in. They immediately make a run on all the leadership positions that all the competent staff don't want to have to do anyway, and then they start getting rid of every policy, procedure, and precaution that could potentially stand in the way of their yearly bonus. Eventually, that shit metastasizes all the way up to the C-Suite, and that's when the seriously unethical and even illegal shit starts happening. I just got kicked off a project this month due to refusing to perform a task that the customer leadership made an extremely public show of ordering us not to touch. The PR hire who was told to take it over waited for months for their leadership to be sequestered in a multi-week meeting, then went psycho on my entire department until my management gave in. There was never any complaint that I was wrong, that I caused any problems, or that I crossed any lines. The official reason is that I was "seen butting heads" too many times. Meanwhile, this guy has almost completely destroyed one application, and is very likely going to tank an upgrade for another, MUCH more vital one by the end of the year.
tl;dr....Stay where you are. NEVER leave that company.
@@TheSacredDude Alas I retired from that job and no longer have to fight any of those battles
The school of "Hey it compiled, it must work." I've been coding for almost 40 years. Yeah, I'm old. It drives me nuts that we do not learn lessons. Company hiring a guy who thinks delivering and using software is testing should have the entire C-suite fired. What happened to the concept of continuous integration, automated testing? Bosses are always too cheap, arrogant, impatient, whatever to put money into testing. And clients, to be fair, are also disinclined to plan for and budget testing.
There are languages where that's far closer to the truth. Of course, some people complain bitterly when GHC says their program is incomplete rather than producing a broken executable. Ada in particular was designed with this goal, published in 1983, but it will likely never get the huge marketing campaigns Rust or Java enjoyed.
Sheesh, I love opening YouTube to a fresh Fireship 🥺
A huge part of the blame must go to the CTOs of the corporations. They are the ones who are "testing in production" by allowing auto-updates to run on production servers, and without a working DR plan.
It is gross negligence to push any change to production without running it in a testing environment first.
"Real Men Test in Production", such a great mem...er, process.
We've started calling the practice of deploying to Production without testing ... CrowdStriking
4:10 Well, it's both an employee and an organizational issue. So many "developers" write bad code (and that's a gentle way to put it) and have zero professionalism about it. And it's organizational too, of course, because knowing that, you have to create safeguards around deployments.
I use arch btw
Blahahah
Why hospitals and airlines don't run Arch is mind-boggling.
Dude, me too!
“It runs on my machine”
I test in prod btw
4:15 that version of "real men test in production" is ... WOW!
First test, you must.
Production testn't, you don't.
-Yoda, coding of art
Never a Friday you release
if (nullptr == ptr) thou shalt write, the accidental assignment inside a comparison to avoid (those are called "Yoda conditions", btw).
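A minimal C++ sketch of the idea, with a made-up Config struct purely for illustration: with the constant on the left, accidentally typing = where == was intended becomes a hard compile error instead of a silent assignment.

```cpp
#include <iostream>

struct Config { int retries = 3; };  // hypothetical data, just for the example

int main() {
    Config* cfg = nullptr;

    // The classic bug: "if (cfg = nullptr)" compiles (usually with only a
    // warning), silently overwrites cfg, and the branch below never runs.

    // Yoda condition: writing "nullptr = cfg" by mistake cannot compile,
    // because you can't assign to the nullptr literal.
    if (nullptr == cfg) {
        std::cout << "cfg is null, falling back to defaults\n";
        return 0;
    }

    std::cout << "retries: " << cfg->retries << "\n";
}
```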
Extra little detail: There wasn't so much a logic problem in the channel file... the channel file was null. Not zero size, but full of nothing but null bytes. And their kernel module apparently does ZERO checking for validity before trying to work with such files. Should be criminal negligence, but that is literally legally impossible since zero enforceable software standards of any kind exist.
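A hedged sketch of the kind of sanity check being described here; CrowdStrike's actual driver internals aren't public, and the minimum size, magic bytes, and file names below are invented for illustration. The point is just: reject an empty, truncated, or all-null-bytes content file and keep using the previous known-good one.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Hypothetical header constants, purely for the example.
constexpr std::size_t kMinFileSize = 16;
constexpr std::uint8_t kExpectedMagic[4] = {0xAA, 0xAA, 0x00, 0x01};

bool looks_valid(const std::vector<std::uint8_t>& data) {
    if (data.size() < kMinFileSize) return false;  // empty or truncated
    if (std::all_of(data.begin(), data.end(),
                    [](std::uint8_t b) { return b == 0; }))
        return false;                              // nothing but null bytes
    return std::equal(std::begin(kExpectedMagic), std::end(kExpectedMagic),
                      data.begin());               // header/magic check
}

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: check <channel-file>\n"; return 2; }

    std::ifstream in(argv[1], std::ios::binary);
    std::vector<std::uint8_t> data((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());

    if (!looks_valid(data)) {
        std::cerr << "rejecting update: file failed validation\n";
        return 1;  // fail safe: keep the previous known-good file in use
    }
    std::cout << "file passed basic validation\n";
}
```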
I shudder at the day I let politicians describe how my code needs to be written. Yikes on that whole concept.
@@tetrahedrontri thankfully in the USA they've decided that judges are more important than experts when it comes to this kind of thing. Wouldn't want people knowledgeable in a field to make decisions in it.
@@NoConsequenc3 This saddens me because most of what Congress decides isn't vetted by a task force of experts first.
It's also why licensing a game sucks compared to buying it outright, Steam vs. GOG.
Think of the old Bruce Willis iTunes story: probably made up, but it made headlines anyway because nobody cross-checked it (compare Ahoy's videos on physical games).
Extra detail: most digital-ownership law gets shaped in the States, and the lawyer behind that Bruce Willis iTunes article was making exactly the licensing-a-game-versus-owning-it argument.
I think it was from Eurogamer, but I only know it from a hyperlink rabbit hole via Chrome suggestions.
But it's not just their negligence, it's also on the heads of people running those systems. When you have a critical system, one of the things you control is updates. Because every update is a potential disaster.
Back at university, we had a simple rule - if your code crashes, you're finished; zero points. Never assume anything. Just because a specification says that you'll receive two integers doesn't mean that you'll always receive two integers. Always fail gracefully. There should be no input that causes your program to crash.
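In that spirit, a small C++ sketch (a toy program, not from the video) that expects two integers but refuses to assume it will actually get them:

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line;
    if (!std::getline(std::cin, line)) {
        std::cerr << "error: no input provided\n";
        return 1;
    }

    std::istringstream in(line);
    long a = 0, b = 0;
    if (!(in >> a >> b)) {
        // The spec may promise two integers; the input stream does not care.
        std::cerr << "error: expected two integers, got: \"" << line << "\"\n";
        return 1;
    }

    std::cout << a + b << "\n";  // fail gracefully above, compute only here
}
```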
The speaker casually rolls over "staggered roll out" as if it's just one of a laundry list of safeguards. But isn't this kinda the big one? Code errors happen. Stagger the roll out and you minimize the damage.
My hunch is that the juniors are left to fend for themselves to make the release.
Love the skit with Bret "The Hitman" Hart lecturing the computer nerds about dereferencing a null pointer.
I was watching this video when I remembered that I didn't check for a nullptr before attempting to dereference my variable. Thanks for the reminder
Now switch to rust and that won't happen 🤣
@@Caellyan Switch to Rust and your program won't compile
@@FemboyCatGaming better not to compile, than to compile and have things break in unforeseeable ways
@@DankMemes-xq2xm Rust's borrowing and shadowing system is far more convoluted than C pointers.
4:53
“Pre-planned in advance”
Bruh
That slap on the back about the null pointer is how my father taught me everything in life; it worked wonders, and I'm definitely using it on all my children.
Looks like Bjarne Stroustrup pulled all his hair out while creating the language.
he went malding?
Are you sure he didn't get that from writing C code? So he wrote C++ while he still had some hair left.
He put his hair into C++
Very relatable
@@javabeanz8549 Nah, I'm pretty sure it was the ++ that did it.
"Well, not so fast" @2:41 pie in the face ABSOLUTE FREAKIN GOLD 🤣🤣🤣
Real men test in production… Insert OceanGate meme
I've seen several places today that still have the computers messed up. They are running but something else is going on. Files that they try to retrieve are no longer there. Customer histories wiped out. One place said the computers came back up on Friday and this morning the server fried.
Never attribute to malice that which can be attributed to incompetence. They simply never expected the file to contain null bytes so they never checked for it.
When their entire business model is predicated on claiming they're less incompetent than the vendors of the actually needed software on the same system, and that they'll cover for those vendors' mistakes, this level of incompetence is gross negligence at best.
I had to Google Bret Hart's clip about dereferencing a null pointer to find out whether it was AI-generated or not. It just looked so random, but I was relieved it was real footage.
1:30 THIS IS GOLD! Where do you even find this stuff?
Amazing. I love the internet.
The printer in Ring 0 is killing me.
because it is true
That's an artifact of the '90s and the bane of MSFT support. Gots to luv us some 3rd-party drivers: THEY SUCK.
That transition to the ad read was, well, brilliant!
The memes and visuals have been next level this video
"Real men test in production… "I couldn't help myself but LOL, thumbs up for that alone.
4:30 I think this is the same thing as “CrowdStrike didn’t check their code”? There is a CNA article which states that CrowdStrike “Skipped checks”. The article also mentions that the update “should have been pushed to a limited pool first”.
Also, rolling it out on a Friday. Programmer sins check list completed.
We had bunches of autotest updates from our tester, and they were failing regularly, and there was a tiny bit of code that showed who to blame :o) We had full testing before going to production, and still had rare minor bugs show up there.
Damn, the transition into the ad was super smooth, I barely noticed it.
5:06 what do you mean "most of us"? LMAO
Ww3
0:38 Once upon a time, I posted that picture of McAfee on FB and it was more-or-less immediately taken down.
Sometimes I think that Anti-Virus / Anti-Malware companies were invented to make Microsoft look simply-excellent in comparison.
It probably was an organisational failure...
Like Boeing, the manifestation is doors blowing off etc... but the real cause is unwise organisational changes... to boost profits, personal and corporate... at the expense of quality...
I don't know if Crowdstrike has shareholders, but there will be pressure to increase profitability...
... outsource IT to India... employ underqualified, cheaper staff, etc... put pressure on managers to deliver... who put pressure on staff to deliver...
... and the manifestation is... global computer failure...
3:05 The perfect moment to include a Miyazaki photo (FromSoftware).
This is 10x better than all the other explanations I've seen up until now
I love how you just casually implied that most people in the world will be dead in the next two years.
Nuclear war.
The living will envy the dead.
AI is here. Most or all of humanity is about to become obsolete. Hell, even worse - they're just competition.
@@XIIchiron78 No it isn't, those are just next-word or next-pixel predictors. They don't understand anything and they're definitely not thinking. There's no "they" there to do the thinking. These are just glorified calculators. You could build one from vacuum tubes.
1:05 is the best stock footage I’ve ever seen 😂
Dave from Dave's Garage has the best description of why this happened.
Dave discussed two "Rings", 0 and 1. Here it's four "Rings", 0 to 3.
@@JxH While technically x86 has four security rings, in practice most operating systems use just two: Ring 0 for the kernel and Ring 3 (or occasionally Ring 1) for all user code.
0:06 The blue ball is hilarious. Real or fake?
God these videos are getting polished, seamless ad transition
"But I can't test outside prod my data doesn't exist!"😅
I feel like the dev who made the mistake won't be punished. The entire fault definitely goes to the QA team
The dev shouldn't be punished. This sort of failure is an institutional one not an individual one.
The dev probably already got fired... now, I have no idea if he'll get sued as well, or whether a judge would even understand it.
Nah the higherups 100% threw him under the bus.
When you push code to production on a Friday without peer review/QA process, you deserve to be fired.
Now I understand why null pointers are called the billion-dollar mistake 💀
Okay that Brilliant ad fit perfectly. I didn't even see it coming
Can't believe Brilliant is sponsoring this guy 😂
Imagine if this happened to one of the computers that run the Matrix we live in
This is actually the big argument against simulation theory: an environment of that complexity running for any length of time wouldn't just have glitches, it would have bricked by now.
What really happened explained below:
Management doesn't actually read your PRs.
It's worse than that: the entire organization is based around maximizing profit without putting in the work.
😅😅😅
You need to test the hell out of any changes to the codebase if a failure can wipe out millions of computers. Human eyes on a PR are not enough
It's probably a 20-year-old pile of shit with high turnover; nobody wants to maintain old crap.
There probably was a problem report, but management couldn't read it because their account didn't have access because they reduced the number of licenses for the tool because it was too expensive.
1:30 bro this made my day
If your CI/CD process does not include checksum validation against known-good code as its last step before deployment, then you are running an outdated process. This validation step was difficult to implement in an existing container-based microservices process, but it proved invaluable many times over. I would think it would be relatively easy in a monolithic build like CS's.
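A toy C++ sketch of that last gate, not anyone's real pipeline: hash the built artifact and compare it against a recorded known-good value, refusing to deploy on mismatch. The FNV-1a hash stands in for whatever real digest (SHA-256, say) a production pipeline would use, and both file paths are made up.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// FNV-1a, used only as a stand-in for a real cryptographic digest.
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 1469598103934665603ull;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

int main() {
    // Hypothetical paths produced by earlier pipeline stages.
    const std::string artifact = read_file("build/artifact.bin");
    std::string expected = read_file("release/known_good.checksum");
    while (!expected.empty() && (expected.back() == '\n' || expected.back() == '\r'))
        expected.pop_back();  // strip trailing newline from the checksum file

    if (artifact.empty() || expected.empty()) {
        std::cerr << "deploy blocked: missing artifact or checksum file\n";
        return 1;
    }
    if (std::to_string(fnv1a(artifact)) != expected) {
        std::cerr << "deploy blocked: artifact does not match known-good checksum\n";
        return 1;
    }
    std::cout << "checksum OK, proceeding to deploy\n";
}
```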
Thanks for putting together the video I will be using in this coming Wednesday's postmortem.
One of the big issues is that CrowdStrike tries to do two things differently from their competitors:
1. They want to be the fastest to protect machines around the world from novel malware techniques.
2. They want their sensors to be extremely lightweight.
There are two types of antivirus updates: Agent/Sensor updates, and Content updates.
Agent updates are slowly rolled out by an IT organization. (This allows IT to test on and brick say, 10 machines before they go and brick 10,000.)
Content updates (definition updates) are pushed to all machines, because what the bad guys are doing is constantly changing.
Most EDR software vendors make major changes to kernel-level detection logic in Agent updates. Because of CrowdStrike's goals, however, they push most of that logic into Content updates, and that philosophy and design choice has come back to haunt them.
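For contrast, a minimal sketch of the kind of staged gating the Agent-update path gets; the host IDs and percentages are entirely made up. The idea is just that a deterministic bucket decides which slice of the fleet sees a new version first, so a bad build hits a few machines before it can hit millions.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Deterministically map a host ID to a bucket in [0, 100).
int rollout_bucket(const std::string& host_id) {
    return static_cast<int>(std::hash<std::string>{}(host_id) % 100);
}

// A host takes the update only if its bucket falls under the current rollout %.
bool should_update(const std::string& host_id, int rollout_percent) {
    return rollout_bucket(host_id) < rollout_percent;
}

int main() {
    // Hypothetical fleet and rollout stage (e.g. 1% canary, then 10%, then 100%).
    std::vector<std::string> fleet = {"host-0001", "host-0002", "host-0003",
                                      "host-0004", "host-0005"};
    int rollout_percent = 10;

    for (const auto& host : fleet) {
        std::cout << host << ": "
                  << (should_update(host, rollout_percent) ? "update now"
                                                           : "hold back")
                  << "\n";
    }
}
```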
CrowdStrike is better than that. Lol, you mention just 10 test machines! 8,500,000 machines will give you much better coverage! /s
Not sure whether the /s is really needed here 😉
The title is Gold.
I knew it was gonna be hard not to mention Rust.
Having made a lot of money preparing for and avoiding Y2K side effects, I can say with confidence that Y2K has absolutely nothing in common with this single point failure.
I know absolutely nothing about coding or programming, but I keep watching these videos; it makes me feel special.
0:11 pyrocynical jumpscare reference