The CrowdStrike Crisis Proves The Software Industry MUST CHANGE

Continuous Delivery

22,086 views

Published on 15 Sep 2024
The CrowdStrike disaster was a failure of software engineering at the company. It was caused by a development process that was clearly inadequate given the risks inherent in the design and approach to this critical system software.
In this episode, Dave Farley explores in more depth how the CrowdStrike system works, what went wrong, and why it went wrong. He also explores what CrowdStrike could, and should, have done to avoid this failure. This shouldn't be dismissed with a shrug and a comment about how "bad things happen sometimes".
This was an easily predictable failure, and Dave explains why, and how we, as an industry, should do better.
-
BOOKS:
📖 Dave’s NEW BOOK "Modern Software Engineering" is available as paperback, or kindle here ➡️ amzn.to/3DwdwT3
and NOW as an AUDIOBOOK available on iTunes, Amazon and Audible.
📖 The original, award-winning "Continuous Delivery" book by Dave Farley and Jez Humble ➡️ amzn.to/2WxRYmx
📖 "Continuous Delivery Pipelines" by Dave Farley
Paperback ➡️ amzn.to/3gIULlA
ebook version ➡️ leanpub.com/cd...
NOTE: If you click on one of the Amazon Affiliate links and buy the book, Continuous Delivery Ltd. will get a small fee for the recommendation with NO increase in cost to you.
-
🖇 LINKS:
🔗 "How a Tiny Bug Took Down the World", by Michele Brissoni / how-tiny-bug-took-down...
🔗 "CrowdStrike's Official Explanation 1" www.crowdstrik...
🔗 "CrowdStrike's Official Explanation 2" www.crowdstrik...
🔗 "Microsoft Engineer's Explanation 1", Dave Plummer • CrowdStrike IT Outage ...
🔗 "Microsoft Engineer's Explanation 2", Dave Plummer • CrowdStrike Update: La...
🔗 "OS Protection Levels", Wikipedia en.wikipedia.o...
🔗 "CrowdStrike's CEO has Failed in the Same Way Before" www.aa.com.tr/...
🔗 "CrowdStrike's Official Explanation 3 - LATEST UPDATE:" www.crowdstrik...
#crowdstrike #softwareengineering #programmer

Comments • 325

@mrpocock several months ago ⁺¹⁹⁶
So in summary, their code didn't check inputs, their unit testing didn't check invalid inputs, their integration testing didn't check for all deployment configurations, their release strategy didn't canary test reliably, and their management continues to prioritise cashflow over code quality.
@MichaelCampbell01 several months ago ⁺³²
Sure, but other than that, it was flawless =D
@queenstownswords several months ago ⁺¹²
There was a spoof video that went around some time ago (I think from Atlassian). The 'joke' was to fire the QA team - to save money. The joke is on us, the users.
@mrpocock several months ago ⁺¹¹
@@MichaelCampbell01 I am sure the existing tests all passed, after they'd been edited or commented out to pass :)
@HartleySan several months ago ⁺¹⁴
As ThePrimeagen said in his video analysis: "Make me the Senior VP of Engineering, and I'll introduce revolutionary ideas to your testing pipeline like installing the update on a Windows machine, and then TURNING THE MACHINE ON! If it blows up, well, now you know."
@timop6340 several months ago ⁺⁴
@@mrpocock someone realised that spending time modifying valid unit tests to pass instead of modifying actual code is a much faster way to work and that way they could release a new version already on Friday? 😅
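The first two points in the summary at the top of this thread (code that does not check its inputs, and tests that never feed it invalid inputs) can be illustrated with a minimal Python sketch. Everything here is hypothetical: `ContentRecord`, `read_field` and `InvalidContentError` are names invented for the illustration, and nothing below is taken from CrowdStrike's code. It only shows the general shape of validating a record before indexing into it, plus the matching negative test.

```python
from dataclasses import dataclass


class InvalidContentError(ValueError):
    """Raised when a content record does not match what the reader expects."""


@dataclass(frozen=True)
class ContentRecord:
    # Hypothetical stand-in for one entry of a content/configuration update.
    fields: tuple


def read_field(record: ContentRecord, index: int, expected_count: int) -> str:
    """Return one field, rejecting malformed input instead of reading past the end."""
    if len(record.fields) != expected_count:
        raise InvalidContentError(
            f"expected {expected_count} fields, got {len(record.fields)}"
        )
    if not 0 <= index < expected_count:
        raise InvalidContentError(f"field index {index} out of range")
    return record.fields[index]


def test_rejects_short_record() -> bool:
    """A negative test: a record with too few fields must be refused."""
    short = ContentRecord(fields=("a", "b"))
    try:
        read_field(short, index=2, expected_count=3)
    except InvalidContentError:
        return True
    return False
```

The design point is that the failure becomes an ordinary, catchable error at the boundary where the data enters, and that the test suite exercises that error path as deliberately as the happy path.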
@FrostSpike several months ago ⁺³¹
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair." - Douglas Adams
@noControl556 several months ago ⁺⁶⁷
About 6 or 9 years ago I heard a VP say, "Why do we need QA? Why can't developers just not write bugs?" She was basically laughed out of the room. In the years since, software companies have totally abandoned the role of QA to the point where a basic null pointer can shut down the airline industry because no one can be bothered to test kernel-level code updates.
@Meritumas several months ago ⁺²
"saving" ... :-)
@HartleySan several months ago ⁺⁶
For some reason, when you said "no one can be bothered to test kernel-level code updates", I just shuddered. I know that's what happened with Crowdstrike, but when I saw it in those words, I was exasperated all over again.
@timop6340 several months ago ⁺⁶
VP thinks about next quarterly results with the savings. When things eventually crash and burn catastrophically, VP just fails upwards and enjoys a golden parachute while the workers scramble to fix everything VP's decisions have been messing up badly.
@Meritumas several months ago
@@timop6340 and after that workers get laid off, or even before they manage to fix the mess… been there, seen that
@Marck1122 several months ago ⁺³
That's what happens when human QA is replaced by a bunch of stupid automated tests which this channel also recommends
@Meritumas several months ago ⁺⁵²
Yet, they automatically decline job applications from experienced software engineers that happen to be 40+ years of age....
@test143000 several months ago
A simple search of LinkedIn shows that most (or at least a significant part) of their senior engineers are 40+ and there are engineers who turned 60.
@davidvernon3119 several months ago ⁺¹
If you can produce evidence of this, that is actionable. At least in the US.
@ehsnils several months ago
@@davidvernon3119 It's hard to get evidence for something like that. They just claim "obsolete competence" in the rejection letter.
@nezbrun872 several months ago
How old is Vincent Flibustier?
@slider799 several months ago ⁺⁵
Yes this is because they would not be stupid enough to circumvent the protections that were in place and refuse to implement them. I have refused to do requested tasks for moral / ethical issues and had problems from it.
I have also had the other problem where people forced changes in code review by mob rule, which they failed to understand because of their prior design flaws, and then blamed me when it went wrong.
The industry lacks accountability. Both in terms of companies in public but also internally in the jobs.
@RickTheClipper several months ago ⁺³⁸
The clusters of competence vanish, be it Boeing, NASA, and IT.
The beancounters rule.
I lost count of how often I told the management that "I do not ship sh.t"
I lost the job, the software was a beta, and 80% ended in disaster, the other 20% went live with a 500 to 1000 percent budget overrun
Crowdstrike did not perform canary testing, I ask why
@Bozebo several months ago ⁺⁴
That's why you do contracting. They can have you off the job but they're still paying you the rest of the contract either way :)
@justinlynch6691 several months ago ⁺²⁴
We just had the 737 max. That is engineering 101 with robust standards around it.
As long as regulators refuse to require someone to put their name on it, aka a P.Eng stamp, and treat it like every other critical branch of engineering, this will continue.
@rodrigoserafim8834 several months ago ⁺³
Regulations based on certifications are a racket. It solves nothing. You either are doing the right thing or you aren't. Signing something bad doesn't make it good.
What you have here is a failure of management to take responsibility for the cost cutting they impose on engineers. The lawsuits against Crowdstrike will do more for SE quality than any amount of regulations and certifications will ever hope to achieve.
@justinlynch6691 several months ago
@@rodrigoserafim8834 that will never work. These big companies will wiggle out of everything. Almost nobody has been held accountable for the 2008 market crash, Boeing has continued to wiggle out of accountability, even Deepwater Horizon was a drop in the bucket.
The stick isn't big enough. I know it's not perfect, but the stamp has mitigated a lot of this stuff.
@calkelpdiver several months ago
@@rodrigoserafim8834,
I doubt it. As a Software Tester with over 35 years experience in the industry I've seen my fair share of "JSTF" (Just Ship the F'er) moments when Development and Test have told senior/executive management that the software isn't ready to ship.
It's all about money, the bottom line. And until a company totally loses their shirt over release of faulty software this will never change. I've only once seen a major company lose it all because of executive pressure and ill thought out decisions. That company was Ashton Tate in 1989 when dBase IV was released and bombed. And it wasn't the Test group that screwed the pooch. It was the CEO and the President of the company that did the JSTF.
@sasukesarutobi3862 several months ago ⁺²⁰
Ironically, the bug also illustrates one of the key problems in defining "malware": that developer intent is irrelevant to the functional effect of malware and/or software bugs.
@black-snow several months ago ⁺⁵
Malware: you pay to get it off of your system
Antivirus: you pay to get it onto your system (and perhaps to get it off again)
@kevinmcnamee6006 several months ago ⁺¹⁶
According to Crowdstrike's own incident report, the update that caused the issue was never tested on an actual Falcon sensor. It was only verified by some sanity check software called a "content validator" that had some "bugs" and didn't detect the problem. The first time the faulty configuration file was actually used was in a customer system... and it crashed. It is inexcusable that a software company could release an update to their customers without actually testing it on a real device. Crowdstrike should be held accountable for the financial loss.
@vitalyl1327 several months ago ⁺²
Yep. Unit tests suck. Integration tests must be the final line of defence, always.
@nezbrun872 several months ago ⁺²
@@vitalyl1327 My list's longer than yours:
No full coverage unit testing.
No configuration testing.
No system testing.
No regression testing.
No rollback.
No staged release.
No internal or external change control.
Difficult to pick one of the Swiss cheese holes.
Maybe allowing third parties to update your business critical systems as they please without any change control?
@nezbrun872 several months ago
"Crowdstrike should be held accountable for the financial loss"
This is an example of pure corporate negligence. Any expert witness at trial will rip them a new one.
Here's my list...
No full coverage unit testing.
No configuration testing.
No system testing.
No regression testing.
No rollback.
No staged release.
No internal or external change control.
@edwardcullen1739 several months ago
@@vitalyl1327 Whenever someone has written unit tests, I find it easier to debug their code.
I am a fan of UTs 🤷‍♂️
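Two items that recur in the lists in this thread, "no staged release" and testing on a real device before customers see the update, can be sketched in a few lines of Python. This is a hypothetical illustration rather than a description of any vendor's pipeline: `staged_rollout`, `deploy` and `is_healthy` are invented names, and the health check is assumed to be something as simple as "the machine booted and reported in".

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy stage by stage and stop as soon as a stage is unhealthy.

    Returns the hosts that were updated. Raises RuntimeError on failure
    so the caller can halt the release and start a rollback.
    """
    updated: list[str] = []
    for stage_number, hosts in enumerate(stages, start=1):
        for host in hosts:
            deploy(host)
            updated.append(host)
        # Check the whole stage before any later stage is touched.
        failed = [h for h in hosts if not is_healthy(h)]
        if failed:
            raise RuntimeError(
                f"stage {stage_number} failed on {failed}; halting rollout"
            )
    return updated
```

With a first stage made up of the vendor's own test machines, a crash on boot would stop the release there, before the later and larger stages are reached.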
@PaulSebastianM several months ago ⁺¹³
I suspect this might hide a darker secret. They might have had a serious vulnerability that might have been already in play and they rushed to fix it and release the fix as soon as possible, which probably meant they skipped a few steps like testing.
@ultravioletiris6241 several months ago ⁺²
Yes I suspect this too
@Evilanious several months ago ⁺²⁰
Small correction: Usermode is indeed sometimes called ring 1 but incorrectly so. It is actually in ring 3. Admittedly ring 1 and 2 are barely used but the hardware does support them.
@davidjulitz7446 several months ago ⁺⁴
Yes, Intel CPUs implement 4 rings. Usually, ring 3 is used for user mode. Rings 1 and 2 are omitted (to my knowledge no one implemented them) and the kernel runs in ring 0. So from the abstract idea, it is still correct. Just call ring 3 "ring 1", no harm done.
@MonochromeWench several months ago ⁺³
That is an x86 specific thing. x86 has 4 rings with user code placed in ring 3, the lowest, least privileged level. The original concept of rings only had 2 rings. When Intel implemented rings in the 80386, they added two intermediary rings that no one ever uses. x86 ring 1 is too privileged to be used for user code so ring 3 must be used instead. As x86 ring 1 and 2 are never used it is just easier to talk about things as according to the original theory with only ring 0 and ring 1. Of course garbage software like crowdstrike would be a good reason to use one of those unused rings. It can have a ring for itself and the kernel is protected from it.
@nezbrun872 several months ago ⁺¹
Conceptually speaking, there are also rings -1, -2 and -3 for the hypervisor, system management mode, and the Management Engine respectively.
@jeffbaskin6851 several months ago ⁺⁵
During the CrowdStrike outage, the neighbor across the street had a heart attack and died. His invalid wife, who had an emergency monitor, tried to help him and fell to the floor. She remained on the floor for over 12 hours because the emergency monitoring software ran on a Windows machine. I don't know if there was anything that could have been done for him, but she was needlessly suffering for what was likely someone's cost cutting measure.
@ContinuousDelivery several months ago ⁺³
What a tragic story.
@jeffbaskin6851 several months ago ⁺⁴
@ContinuousDelivery The wife is okay now. Their children have arranged hospice for her.
@edwardcullen1739 several months ago
This is why, whenever anyone says "we're not writing flight control software", I struggle not to lose my isht.
You never know what the indirect consequences of a failure can be.
Must be professional at all times.
@eltreum1 25 days ago
@@ContinuousDelivery One of our customers does medical rescue helicopters. They were all grounded because they couldn't connect to their weather feed service or file flight plans with local air traffic control. Anyone that needed a medical life flight transport that day didn't get one.
@user-wg5hf5et9r several months ago ⁺²⁶
Imagine if a chip manufacturer made a small bug in the microcode of a CPU, slightly messing up voltage on cores, so they would degrade with time. That would be a disaster!
@delamar6199 several months ago ⁺¹
LOL gotcha
@LordOOTFD several months ago
Nah that was deliberate, Intel has been pushing the power limits of their silicon to compete with AMD since at least 10th gen. The TDP on the top tier of CPUs has been increasing since the easiest solution with their architecture is to just overclock it a bit more. They've finally found the limits of that strategy and it's biting them now.
@roganl several months ago
@@LordOOTFD "Chomp!" the noise of a return getting swallowed...
@nezbrun872 several months ago ⁺¹
@@LordOOTFD Pah! 320W dissipation in a 0.4 square inch die, what could possibly go wrong?
@slider799 several months ago
What if I told you about 30-40% of the switches on the market don't do even remotely close to what their spec sheets say they will do?
@daverooneyca several months ago ⁺⁴
When I was coaching at a large telecom company in 2010-2012, I was struck by how much the phone switch code focused on error and failure scenarios to the point that the "happy path" was almost an afterthought! The systems weren't without their problems, of course, but there was a ton of deliberate thought that went into "how could this go wrong". We should be thinking like that with every single line of code we write, IMNSHO.
@astronemir several months ago ⁺²
Of course we should.
@mallninja9805 several months ago
I find that's true for most software I write. The "do what you're supposed to do with known good input" code is easy. Finding & handling the potential failures takes a lot more work.
@SpecialK845 several months ago ⁺¹⁶
Been a QA engineer for 10 years now and I can attest to the low level of care businesses put into solid QA practice.
QA has become an afterthought, so the understanding has dropped over time. It’s now resulted in a low standard of QA and testing in the tech industry; but mixed with a need for quicker releases 😬
@Meritumas several months ago
100%, always rush, always cut time and budget for proper testing.
@LordOOTFD several months ago ⁺¹
The problem is that QA doesn't make money, it reduces risk and the potential loss of money which can occur when an incident occurs. This is much harder to quantify and doesn't fit on the line goes up chart, so the bean counters might look at the QA department and ask why they're paying all these people who don't make the line go up, forgetting that they're stopping it from going down.
@leerothman2715 several months ago ⁺¹
Testing shouldn’t be a separate process after the code is written, but whilst the code is being written. The State of DevOps report shows that lots of small releases result in fewer issues than releasing lots of changes in bulk.
@mallninja9805 several months ago
Everything after "write code, deliver (semifunctional) product, take money!" has become an afterthought
@marcbotnope1728 several months ago ⁺³¹
This is a management problem, developers are powerless to change this.
@Meritumas several months ago ⁺⁷
100%, devs go to jail (see Volkswagen) or are blamed and fired. Management stays and collects bonuses.
@black-snow several months ago ⁺¹
No they aren't. Whoever is asked to do the wrong or unprofessional thing always has the choice. Are there people who depend on their job by a great deal? Yes. Tons. Would you expect someone to greenlight a broken bridge design because they really needed that job? You can play the same game with pretty much any profession.
Of course this doesn't mean the opposite. Whoever knew it was shit is to blame plus everyone that should have ...
@marcbotnope1728 several months ago ⁺⁵
@@black-snow if you ever worked for "American management" you know that you will be instantly fired if you "do the right thing".
@robertluong3024 several months ago ⁺²
@@black-snow I like how vindictive you seem about those incidents when we have so many examples of people being thrown under the bus.
Remember how many stock market collapses where only one person went to jail and that person was a low-level worker.
The company is liable, not the individual.
@timop6340 several months ago
@@Meritumas and at worst they enjoy the golden parachute while failing upwards in their career.
@test143000 several months ago ⁺⁴
This crisis proved that commentators who positioned themselves as experts in software development had little knowledge of system software development. I was surprised that people with titles of former principal engineers or PhDs in computer science didn't have even basic ideas about system programming.
@AlecBickerton several months ago ⁺⁴
I've been screaming about this for the last few years. This is an obvious consequence of the way we've been forced to work for years.
@rodrigoserafim8834 several months ago ⁺⁵
I am so tired of this idea that SE (Software Engineering) needs to "grow up" or "catch up to the big boys". Not to discount the massive failure that was Crowdstrike, but it's not like Intel didn't bungle multiple series of CPUs with a hardware bug. It's not like electronic devices aren't failing all the time, or the car industry never had to do a mass recall. Even Boeing is having issues with passenger planes.
Try to have an Electronics Engineer build a custom-made TV for every single different person and you will see the bugs skyrocket as well. Mass production amortizes quality costs. That is the only reason those industries have more "quality". Because the cost-benefit analysis pushes it.
The issue is that 90% of Software Development isn't Software Engineering. We are building custom Apps that are going to run on one single service and are different for another. We are building CSS or some command line tool or some nice to have feature meant to show off. That is programming, that is not SE.
SE has source versioning, design patterns, coding tools, static code analysis, unit/integration/e2e testing automation, CI/CD. If you look at the entire range of processes SE has built, we are way, way above and beyond what other industries have for verification processes. If management decides to avoid or ignore them to cut costs, that is a management problem, not an SE problem. We can have as much quality as you wish, but you need to pay for that quality in time to delivery and engineering time.
And this is exactly what seems to have happened with Crowdstrike, your whole spiel of "push to production asap" devops mantra caused the issue here. Because that is not adequate for a critical piece of software. But management wants to push things fast to "stay ahead of the competition" and "use devops". And here we are.
@Fred-yq3fs several months ago
And Boeing's space capsule too. 2 Astronauts have been stranded on the ISS for 50+ days instead of their 8 days demo mission. They might get rescued in 8 months. Boeing is a shadow of itself.
@yogibarista2818 several months ago ⁺¹³
MS did try to push 3rd-party kernel drivers back into user mode and provide the necessary access via a dedicated marshalling API, but all the AV companies made such a stink about it - claiming (without evidence) that Defender would not do the same and thus have a commercial advantage - that the EU took up their cause and blocked MS from doing it. Thus leaving the situation where access is provided to the kernel essentially on a "trust" basis, which obviously can't be verified by the O/S in cases such as this, despite the requirement of certified drivers, as replaceable code/data is being used.
@vitalyl1327 several months ago
All the antivirus companies are just a security theatre and a scam. Microsoft should have never caved to them, AVs must be pushed out of the market.
@astronemir several months ago
Sometimes Microsoft just needs to pay the EU tax and be more like Apple. EU is doing these things to collect “tax” from big tech that otherwise all goes to the US or elsewhere. Doesn’t mean it’s right
@berkes several months ago ⁺¹
Given the evidence we have now, that was little more than "malicious compliance" by Microsoft
@mallninja9805 several months ago ⁺³
The EU ruled that Microsoft must allow the same access to 3rd party AV devs as they allow for their internal solution. They didn't tell MS that they couldn't invest some R&D into a more robust kernel (as opposed to an AI spyware screenshot tool). They didn't tell Crowdstrike that they had to use regexes on unvalidated sets of parameters in the kernel. The "Oh noes government made it bad!!" take is an absolute red herring.
@vexorian several months ago ⁺⁶
So let's check something out.
- We all agree there were failures in the QA process. We should probably mention that half the QA team was laid off earlier this year under the pretense of a pivot to AI.
- Crowdstrike claim that these templates are generated by AI. They were even spotlighted by nvidia as a "no code" solution in their convention. Was the original bug an AI hallucination that was not detected correctly by the QA process due to budget and staff cuts related to AI hype?
- Are these ring 0 changes being released untested because there's some assumption that AI is magically free of human error?
I suspect there's plenty of angles in which we should ask ourselves how much of a factor AI hype played here.
@astronemir several months ago
Interesting conjecture. One can imagine creating intentionally poisoned datasets to make AI write vulnerable code.
@growtocycle6992 several months ago
Why is no one else talking about this!?? CrowdStrike had been hyping up its deployment of AI and dropping its workforce of human engineers all year
@ehsnils several months ago ⁺²
As I see it - ring zero shall have a minimal codebase and be updated with care.
Also - if you are an OS maker - don't let ANYONE else into ring zero. A driver can always request access to I/O and memory through defined system calls and the kernel can then decide if it's a permitted access and deny or kill the caller if it's out of bounds.
@Fred-yq3fs several months ago
Yup, access to kernel mode is a dead end. Europe can go eat sand.
@ulrichborchers5632 several months ago ⁺¹
That describes very well what went wrong at the technical level and the obvious absence of quality control in the release procedure. I think that not much time would have been lost by first releasing the 291 "patch" on a staging host with the OS which most of their customers are running and monitoring a health check. They would immediately have seen the crash. There must be tools for that, even on Windows.
BUT something VERY important is missing here: It seems that there is an agreement between MS and the EU from 2009 which forced MS to have this backdoor for kernel access. This is because MS was accused of not allowing others to provide security software for their OS because of their own product, Defender.
This acting by the EU obviously ignored the technical problem which we have now observed. So who did not tell the EU which problems this could cause? Did they force MS into this agreement without any technical knowledge? Or who remained silent back then, and why?
On top of that: You sort of mentioned that this software is certified and unchanged because it has this backdoor of importing executable code, so that the threat could not be detected. BUT: How did the software get certified in the first place? Wasn't there any code review, or why didn't they detect or report that part of the software which imported arbitrary executable code?
There are three options here:
1) There is technical incompetence just everywhere, including potential technical advisors of the EU and internal technical staff at MS, doing code reviews or similar of potentially problematic external software with unlimited privileges, and not finding this backdoor.
2) People got paid to remain silent or to give misleading technical advice on behalf of people wanting to force MS to provide that access so that others could invade this space with new businesses (taking the risk of the event we have now observed)
3) A complete failure of responsibility. Just an example: I know that there are people doing security audits at the largest and most important data centers in Europe by PHONE, never visiting the place once.
We do have a much larger problem, because we have completely lost control with respect to quality and responsible behaviour, even for critical systems and infrastructure. There are obviously more people to blame for careless acting. It is not wise to always have free competition as the top priority. This thinking is dangerous, even if we think that it has something to do with being fair. It does NOT, but encourages exactly the opposite.
@eltreum1 25 days ago ⁺¹
They had tests but they were not good ones. You can automate Windows testing and even GUI widget operation with scripted keyboard/mouse inputs to simulate users doing various activities. This was a kernel driver and easier to test. They could have deployed it on a VM as part of the auto test pipeline. This CrowdStrike story seems like it was a perfect storm of incompetence internally and externally or the most well executed industrial sabotage job in history.
@rui1863 several months ago ⁺¹⁰
The scrum and sprint processes just produce bad code. As a DBA, sprints drive me crazy. Developers develop for the sprint instead of a proper solution. The result is schemas that are poorly designed. Instead of proper design, what is delivered is something that "works" within a sprint but is badly designed, which just continues to compound. Programming with a stopwatch and "sprint" coding is just a bad programming environment. Why, why, do we do this?
@leerothman2715 several months ago ⁺²
Well if you mean most companies don’t implement best engineering practices whilst they ‘think’ they’re agile I agree. However that’s not a problem with an agile framework just a bad implementation of one. Sprints don’t lead to bad design, lack of good development practices does.
@rui1863 several months ago ⁺²
@@leerothman2715 Just look at the terminology; burn rate, sprint, etc. Humans are not meant to be stressed every week after week. We are better suited for cycles, low and high stress instead of the constant stress of sprints week after week. I hate scrum; however, I'm very agile -- I just can't be agile day after day after day after day. Plus, if you take the time to design something it is much less effort and better than the continual treadmill programming style of today. Scrum-agile is soulless. You can't create a framework for being agile; that's an oxymoron.
@leerothman2715 several months ago ⁺¹
@@rui1863 Agreed. I’ve heard it described as ‘the language of violence’. ‘You must complete all the stories that you committed to during sprint planning’, ‘you have to increase your velocity’, ‘if you don’t deliver this functionality then you’ve failed the sprint’ etc etc. It’s often used as a shitty stick to beat people with. I still think that the main problem is a bad understanding of what agile is. I’ve posted on here before about how Martin Fowler wrote a blog post years ago and labelled it as ‘flaccid scrum’. Some think that because you have a sprint board, use story points and have a standup every morning then they’re agile. If you’re not doing the whole engineering excellence thing then you ain’t agile. I much prefer kanban to scrum personally. No planning an iteration worth of work, just the highest priority item, no sprint goal. As for the use of the word framework, I’m just quoting the scrum alliance definition. ‘As an agile framework, scrum provides………’
@colinmiddleton8127 several months ago
Thanks for this video. It's really nice to get a clear and easy-to-understand explanation of the issues behind the CrowdStrike disaster.
@SubTroppo several months ago ⁺⁴
There are lots of channels like this where an expert points out the internal issues in any given technical situation but the political, legal and ethical issues are mainly left to someone else to grapple with. Engineering of any type exists within a network of rules, and if those rules allow failure-ism to creep in, disaster is more likely to strike. The failure here is Microsoft's as it allows outsiders to manipulate the kernel. I liken the situation to a railway operator allowing the signalling to be sub-contracted because it will not (or cannot) retain the experts to do the job and the sub-contracted company or companies fail too for the usual reasons. So you end up with cascading failures encouraged by limited liability.
@pierrelautrou1210 several months ago ⁺⁵
I think the answer is cost
Writing and maintaining good quality tests can be expensive and businesses usually prefer spending their money on new features rather than on testing as they usually don't see the value of testing. I believe it's our job as software engineers to not accept this and try to improve practices but it's not easy, until there is a catastrophic failure. That being said some companies will always prefer to take the risk of a big hit some time in the future than constantly investing to prevent it.
Over time it can also happen that reasons why some quality checks were put in place are forgotten and those quality checks are progressively abandoned.
In that regard I don't think that our industry is that different from other engineering industries. Taking the aircraft manufacturing industry for instance, safety practices and procedures are usually implemented after there's been a crash and over time can be loosened or ignored leading to other crashes. However in our industry we rarely think about the lives that can be lost as "we only build software".
@nezbrun872 หลายเดือนก่อน
"I think the answer is cost"
Indeed. $5.4bn.
@Bookstorewalla หลายเดือนก่อน
Businesses spend little money on "new features." Nope. They much prefer to siphon money from QA and qualified engineers to the C Suite. And we all know how that's working out!
@leerothman2715 หลายเดือนก่อน ⁺¹
Maintaining automated tests is a lot cheaper than fixing bugs and spending half your time debugging. Teams that follow XP practices spend more time writing new features because they're not fixing bugs.
@quantumangel หลายเดือนก่อน ⁺²
It's not a matter of "producing software seriously". It's a matter of *not using Microsoft trash on critical infrastructure.*
@godblessCL หลายเดือนก่อน ⁺²
It seems the problem is more complex than quality control.
@mgbrown09 หลายเดือนก่อน ⁺¹
Companies are incapable of doing the appropriate QA because it's always cheaper not to and there is always a competitor that's prepared to take more risks (often because they are unskilled). Customers don't have the know how or resources to protect themselves from this either. It's the whole reason why the construction industry in the UK has building control officers.
@anonymousyoutube7259 หลายเดือนก่อน ⁺⁶
The event also showed the weakness in the customers' deployment practices. They all just blindly accepted the code change from CrowdStrike. They allowed their systems to get the update directly from CrowdStrike without internal canaries. Will practices at those companies change? Can they? If the software is written such that it automatically receives updates from the manufacturer's site, can companies protect themselves from such errors?
@lost-prototype หลายเดือนก่อน ⁺⁴
This was actually my first thought when the whole ordeal began.
This and also, how Windows is often much more difficult to maintain or "recycle" the way say a more headless and configuration driven Linux container setup would be.
Most container based software setups would just say "redeploy the previous version", and that would be it.
But not in Windows land where VMs become precious and one of a kind, mandating possible days or even weeks to rebuild a machine or VM.
It doesn't make it Microsoft's fault. But I do think some of the blame on how intensive the mitigation was lies at the feet of the very convoluted nature of Windows and Windows oriented systems.
@black-snow หลายเดือนก่อน ⁺³
At this point I wouldn't be surprised if you could just snatch the domain and turn half the world into toasters with some little channel file update.
@jeffwells641 หลายเดือนก่อน ⁺⁴
My understanding is Crowdstrike does not allow customers to manage updates because they aren't a traditional anti-virus. Customers have asked for this feature in the past and been flatly denied.
So the only blame on the customers here is trusting their AV company.
@JeffCaplan313 หลายเดือนก่อน
@@jeffwells641 Which AV do you use, Jeff?
@JeffCaplan313 หลายเดือนก่อน
@@jeffwells641 also, nice trick
@NilsElHimoud หลายเดือนก่อน ⁺²
That's a good point: The failure was not hard to detect ...
@_winston_smith_ หลายเดือนก่อน ⁺²
There is a huge difference in cost between software that is "good enough" and high quality reliable software. Most people cannot tell the difference until a disaster like this occurs. Very few are willing to pay for all the extra effort. That said, the level of incompetence and/or negligence on display in this fiasco is difficult to comprehend.
@ContinuousDelivery หลายเดือนก่อน ⁺¹
I think that if you adopt the right ways of working, it is not more expensive. That is what Continuous Delivery, as a practice, is all about!
@henryvaneyk3769 หลายเดือนก่อน ⁺²
The Management at CrowdStrike did not have the backs of their Developers. That is quite obvious from their Root Cause Analysis Report. They basically blamed it all on their developers.
@kevinmcfarlane2752 29 วันที่ผ่านมา
That’s not the impression I formed from that. It looked clear to me that the overall processes were at fault. And that’s down to management.
@eltreum1 25 วันที่ผ่านมา
Well, it was the dev's fault. A suit didn't write and check in the bad code calling a function with incorrect inputs with no validation in the app to protect itself and essentially blindly trusting inputs on a security product of all things. They had a test suite but devs built it poorly and intentionally setup in a way to ignore the error that caused the failure as they were phasing in features on their roadmap. The devs forgot to go back and update the tests when the feature went live and became mandatory instead of optional to operate that specific function that failed. Suits micromanage time and money and the paperwork torture devices. Devs write the code and tests and deployment pipeline scripts. The CI/CD and immediate release cadence could be a project management decision, but they could have left it in their devs hands to define and just rubber stamped it because they are relying on their experts suggestions, we will never know.
@lost-prototype หลายเดือนก่อน ⁺⁸
The blame is with leaderships plump with middle managers and VPs who tend the org chart around them like a nest made up of whatever they can access with their missing or dwindling relevance.
Non technical leadership strangles organizations until even the rank and file are coopted into the politics and accounting.
Find the leaders in your org chart who are trying to convert everyone around them into accountants and project managers, and fire them in a hurry.
Then, let the technical people BE TECHNICAL with TECHNICAL PRODUCTS.
🤬
@KawazoeMasahiro หลายเดือนก่อน
Something incredibly powerful about this failure is that a lot of people in and out of the industry know about it. This led to me and people I know starting to use their name to describe critical failures. Like: "They just got Crowdstriked." And people know exactly what I'm talking about.
@Immudzen หลายเดือนก่อน ⁺²
CrowdStrike seems to have lower quality standards than my group does writing simulation software. What Microsoft needs to do is much more heavily restrict kernel space, and stuff like CrowdStrike should run in user space.
@astronemir หลายเดือนก่อน ⁺¹
Imo part of getting access to ring 0 should be an independent analysis of the QA and development processes. Have someone like Dave Farley work with the devs and you’ll find out in 1 week if that company is ready or not.
@ofiraz หลายเดือนก่อน
I fully agree with everything mentioned in this good video.
I would add to it the responsibility of the SW consumers to not deploy SW automatically, but rather run it on a test environment first, and deploy it to production machines in phases.
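As a rough illustration of the phased deployment this comment describes, here is a minimal sketch. The function name `phased_rollout`, the phase fractions, and the `install` / `is_healthy` callbacks are all hypothetical, not anything CrowdStrike or its customers are known to use.

```python
def phased_rollout(hosts, phases, install, is_healthy):
    """Install on growing fractions of hosts and halt at the first unhealthy batch.

    hosts      -- ordered list of host names (test machines first)
    phases     -- cumulative fractions of the fleet, e.g. [0.01, 0.1, 1.0]
    install    -- hypothetical callback that applies the update to one host
    is_healthy -- hypothetical callback that reports whether a host still works
    """
    done = 0
    for fraction in phases:
        # Each phase covers at least one new host and never more than the fleet.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        for host in batch:
            install(host)
        if not all(is_healthy(host) for host in batch):
            return {"status": "halted", "updated": target}
        done = target
    return {"status": "complete", "updated": done}


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    touched = []
    # Simulate an update that breaks every machine it lands on.
    result = phased_rollout(fleet, [0.01, 0.1, 1.0], touched.append, lambda h: False)
    print(result["status"], len(touched))
```

With a bad update, only the first batch is affected before the rollout stops, which is the whole point of putting a test environment or canary group in front of production.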
@THEMithrandir09 หลายเดือนก่อน ⁺³
The wild part is that you are still downplaying the issue. Hospitals shut down, and that always leads to deaths. People (very likely) died because of this.
@KeeperKen30 หลายเดือนก่อน ⁺²
Entirely too many non IT people making IT decisions.
@kddubbco หลายเดือนก่อน ⁺¹
CrowdStrike has blamed the release of the bug on a bug in their automated testing process that let the bug mentioned here through its validation. I would guess that they did have sophisticated automated testing before the release of ring 0 updates, but their testing process itself had a bug and didn't fail appropriately. For me the more interesting question is: how much do you test your automated testing process? Every level of development needs rigor, and examples like this show that no part of the development process can be ignored or taken for granted.
@mwwhited หลายเดือนก่อน
The only way large companies will take software development more seriously is for criminal and civil liability to return to the C-suite as well as making shareholders pay dearly after EULAs are invalidated due to negligence.
@ndewet หลายเดือนก่อน ⁺⁴
Uncle Bob, in his one presso, suggested, or predicted, that things will only change in the software industry when there is an event that causes significant loss of life.
At the boardroom level we might be hearing of tales of "From DevOps to DevEx" or GenAI and similar stuff, but those are vague in the extreme and in my mind act as a summary of where minds are at (even AFTER Crowdstrike).
@bulwarkjm2 หลายเดือนก่อน ⁺¹
@@ndewet You took the words right out of my mouth. When the 737 MAX incident happened, I thought of Uncle Bob's prediction, and thought that was it, but alas, crickets.
@leerothman2715 หลายเดือนก่อน
@@ndewet Well there’s certainly been loss of life due to bugs. NHS systems didn’t send out letters to patients to book follow-up scans for breast cancer. You can guess the outcome for some of those patients. Toyota had brake failures in 2009 which resulted in deaths, caused by a software issue. Several suicides have been linked to the Horizon scandal. A self-driving car caused the death of a pedestrian in 2018. I’ll stop there. 😢
@eltreum1 25 วันที่ผ่านมา
They put out a tech brief of the failure. The kernel null pointer was a side effect of bad code and a hacked test that hid it. They passed a function an array of 20 things when it was expecting 21. Their testing was limited to mock data fixtures, and it treated the 21st object as optional and ignored it if it was missing. When the function was changed to make it mandatory for a new feature, the test and mock data were not updated. The test was actually modded to stop failing until the feature was ready to test, and then they forgot about it. They had a 1-tier test/deploy strategy. If it passed the mock unit tests, it was auto-deployed immediately to the global CDN; there was never any real integration or live run testing or phased rollout to mitigate unexpected failures. Agile is a failure, TDD is a failure, obsession with type hints is a failure. Whatever methodologies they come up with next will fail too because humans do not run like computers. Now the government is mandating Rust for certain software because some expert told them it would solve all their problems and be completely secure and un-hackable rofl.
@roganl หลายเดือนก่อน ⁺¹
Their management speak ridden RCA document should make your eyes bleed with all its obfuscated bad practices, from dev through deployment.
@ulrichborchers5632 หลายเดือนก่อน ⁺²
The EU forcing MS to allow kernel access by third party software is like forcing me to allow brain surgery on me by anyone who claims that I need this treatment.
To then certify such a software with a backdoor from a company with low quality standards is like then allowing a doctor for brain surgery on me who has never learned to do this ... and who then brings his students to execute the operation ... because somebody else has found a way how to damage my brain. So they do that to me, whether I like it or not, for my health and without any professional care or precautions in the interest of my survival.
@mallninja9805 หลายเดือนก่อน
The EU said MS has to allow 3rd party AV developers the same access as they allow for their internal developers. From an antitrust perspective, that is a good thing. They didn't say MS has to prioritize developing AI spyware & start-menu ad delivery over a more fault-tolerant kernel. This whole "Oh it's the EUs fault!" is a bad, bad, bad take.
@ulrichborchers5632 หลายเดือนก่อน
@@mallninja9805 My intent was to point out that the root cause of the problem is not a technical one. I did not say that it was the fault of the EU, but that running 3rd party code in the Kernel is like unauthorized brain surgery. To just throw away encapsulation because of political or economical reasons is maybe not the best approach. The execution of the political decision obviously did not include any technical considerations. These obviously were thrown overboard (by multiple participants). It is a system failure, or at least a failure at the software architecture level, and not about who is to blame, please don't get me wrong there.
@raymitchell9736 หลายเดือนก่อน ⁺¹
Will it change things? Probably not... that's a sad way to open a video, but I think you're right. What I think it will do is start a lot of finger pointing and then everyone tightens up and becomes risk averse. This might end up improving things after all those storms blow over, so who knows. Exciting times.
@vitalyl1327 หลายเดือนก่อน ⁺¹
Software engineering is an engineering discipline. Obviously. So why aren't we regulating it like the other branches of engineering? Why can software engineers work without any certification, and why do they not have personal liability for their incompetence? It is high time software engineering was regulated to the same level as civil engineering or the medical professions.
@vicaya หลายเดือนก่อน ⁺²
Isolation is the job of OS. No amount of human process can prevent such disaster if the NT kernel is such a PoS without an equivalent of eBPF mechanism of Linux. I'd primarily blame Microsoft for this disaster. It's remarkable that the consumer OSes (Windows and MacOS) are so antiquated compared with modern open source Linux kernel, which invariably will take over the world 😀
@mallninja9805 หลายเดือนก่อน
The Year of the Linux Desktop is gonna happen any day now...
@rzabcio3 หลายเดือนก่อน ⁺²
"Probably not." The most dark words spoken here. And probably true.
@LastMomentMan หลายเดือนก่อน
25 years ago, studying for my first network certificate, the MCSA, I learned that I should install any Windows update on a test computer first, before installing it on the computers in the network.
After I make sure that it's OK and it's not creating any problems, I apply it to the network.
I am surprised today that a company worth billions of dollars does not do such a simple exercise!
@briand1337 หลายเดือนก่อน ⁺¹
Crowdstrike certainly epitomized the mantra of move fast and break things
@mallninja9805 หลายเดือนก่อน ⁺¹
I always thought that quote really sums up the absolute disdain of sociopathic tech bros. They imagine themselves to be the smartest person in whatever room they're in, have zero respect for anyone else, and view themselves as better than the other 8 billion plebs on earth.
@GhassanPL หลายเดือนก่อน ⁺¹
And yet, when I propose a government agency to oversee the software industry, an analogue of the FSIS/USDA/FDA in the US, I get laughed out of the room.
@mallninja9805 หลายเดือนก่อน
Half of the US government is made of a party whose mission is to actively sabotage US government by hook or by crook.
@nexovec หลายเดือนก่อน ⁺²
I doubt anything will happen. The same "ah, a very bad thing I don't understand has happened, let's talk about it for a week and then move on" public reaction will occur.
@laneromel5667 หลายเดือนก่อน ⁺¹
I would be shocked if CrowdStrike existed in 10 years, they will be sued out of existence.
@rogerdeutsch5883 หลายเดือนก่อน
14:08 "An important perhaps central aspect of an engineering culture is to ask the question, 'What could go wrong?'"
Few companies ask this question, because they don't want the detailed answer, because it would mean they have to assign resources to solve those problems. It's much easier for them to never ask that question. Which is why most companies (especially software companies) do not have an "engineering culture". But, also, most of them think they do have an engineering culture because "they're creating software, aren't they?"
@SolidCollegeTry หลายเดือนก่อน ⁺¹
Honest question: What do you mean by thorough and fast? How can you be comprehensive without taking time to deeply understand what is going on? Most situations I have been in, although limited, have required a large amount of learning to understand enough to make informed decisions. In consulting, this is very difficult, and most clients don’t understand that it takes time to be helpful because their mess takes a long time to untangle.
@ContinuousDelivery หลายเดือนก่อน ⁺²
The best way to achieve "Thorough and Fast" is to build testing into the dev process with TDD. You start with the test, which specifies what the code should do. The effect of this is that it makes you specify more, and that deals with the thorough part; you run all of the tests on every commit, and that applies the pressure on us to make it fast.
@tiberiusmaximilian5591 หลายเดือนก่อน
A file that only contains zero-bytes is not programmed by accident or even produced by some software. The lack of a checksum test before using a downloaded file is the main defect. Most likely, the file got corrupted when transferring it from the CrowdStrike facilities to the download site. When copying a file, the receiving end allocates / memory maps the right amount of bytes on the file storage before receiving the bytes and writing them into the file. Another process copied that file before even the first byte was written to it. Transfer via network takes time, but copying a file locally is fast.
I am sure that CrowdStrike did the programming right and also did extensive testing. But the distribution process was never tested.
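The checksum test mentioned above could look roughly like this. It is only a sketch: `is_safe_to_load` and the idea of a digest published alongside the file are assumptions made for illustration, not a description of CrowdStrike's actual distribution format.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_safe_to_load(payload: bytes, expected_digest: str) -> bool:
    """Accept a downloaded file only if it is non-empty, not all zero bytes,
    and matches the digest the publisher computed before distribution."""
    if not any(payload):  # empty, or every byte is 0x00
        return False
    return sha256_hex(payload) == expected_digest


if __name__ == "__main__":
    published = b"example rule data"           # stand-in for the real content file
    digest = sha256_hex(published)             # hypothetically shipped with the file
    zeroed = bytes(len(published))             # same length, all zero bytes
    print(is_safe_to_load(published, digest))  # intact copy
    print(is_safe_to_load(zeroed, digest))     # corrupted copy
```

A check like this only catches corruption in transit or on disk; it does nothing for content that was already wrong when the digest was computed.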
@tigerscott2966 หลายเดือนก่อน
I've been using Linux for 18 years and I have never had a blue screen or a black screen error. If something goes awry, the system freezes and I just reboot and go back to where I started.
@WhyteLis21 หลายเดือนก่อน ⁺¹
I believe the moral of this incident is that it will happen again and again.
It is just the nature of our society and economy worldwide, which keep depending more and more on our digital technology. Errors will slip through, in my opinion.
@mallninja9805 หลายเดือนก่อน
Sure, if OS vendors keep prioritizing global AI spyware over fault tolerant kernels, if kernel-mode software vendors prioritize deployment over robust test suites, if cloud vendors execs don't go to jail when their company deletes $135B in pension funds (oops I'm mixing incidents) - then yes, it will happen again and again. Next quarters penny is worth way more than next years stability! Capitalism, baby!
@FlaviusAspra หลายเดือนก่อน
I loved the end when you started to ask all those hard questions like where was the canary, etc
The crazy part is, if you have real experience, it's easy to set up the SDLC such that you get 80% there with little effort.
They probably put 300% effort into it to get 30% there. Testing wrongly is easy. Testing ineffectively also. Or testing doubly.
@kellymoses8566 หลายเดือนก่อน
CrowdStrike's kernel module should have reverted to a working checkpoint after crashing. The Windows kernel should have reverted to a working checkpoint after crashing. Kernels should use 3 rings of security so that there is the userland, kernel, and executive, so that errors in the kernel don't crash everything.
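One way to picture "revert to a working checkpoint after crashing" is a pending flag that is set before new content is loaded and cleared only once the system has come up successfully. The sketch below is purely illustrative: the `state` dictionary stands in for storage that would have to survive a reboot, and the names `select_content` and `confirm_boot` are hypothetical.

```python
def select_content(state):
    """Pick which content version to load at startup.

    state is a dict with keys 'candidate', 'last_known_good' and 'pending'.
    In a real system it would have to be persisted across reboots.
    """
    if state["pending"]:
        # The previous start began loading the candidate and never confirmed:
        # assume it crashed, discard the candidate and fall back.
        state["candidate"] = None
        state["pending"] = False
        return state["last_known_good"]
    if state["candidate"] is not None:
        state["pending"] = True  # must be recorded before the risky load
        return state["candidate"]
    return state["last_known_good"]


def confirm_boot(state):
    """Call once the system is up and stable: promote the candidate."""
    if state["pending"]:
        state["last_known_good"] = state["candidate"]
        state["candidate"] = None
        state["pending"] = False


if __name__ == "__main__":
    state = {"candidate": "v2", "last_known_good": "v1", "pending": False}
    print(select_content(state))  # first attempt tries the new content
    # ... crash here: confirm_boot() is never reached ...
    print(select_content(state))  # next start falls back to the old content
```

The cost is one crashed boot per bad update instead of a boot loop, which is a much cheaper failure mode than one that needs someone physically at the machine.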
@mcsee หลายเดือนก่อน
Amazing Video!. I think we can deal with these problems using fault tolerant or redundant systems like the Voyager software which is 50 years old but has .... redundancy
@tigerscott2966 หลายเดือนก่อน
The big issue is: it is not safe to use Windows across a network or on the internet. Sandbox everything and use a virtual machine.
@davidjulitz7446 หลายเดือนก่อน ⁺¹
Looks to me, that at least a little bit of testing should have revealed this issue. Probably they rushed this out without proper testing or there was confusion with the files tested and the ones finally released. However, this is a risky driver design, they should consider running as much code in user mode as possible.
@attribute-4677 หลายเดือนก่อน
Scrum and Agile have turned software engineering into a back-whipping sweat shop where performance is judged by velocity and story points. Meanwhile, the product owners try to cram as many features into a release as possible while the scrum masters “babysit” and ask the same obvious 5 questions every standup. “What’s taking so long? What blockers do you have? Can I help? Have you asked for help?” YES. End the damn meeting so we can get back to work. It’s maddening. “Squeeze the blood from the rocks”.
@haxi52 หลายเดือนก่อน ⁺¹
18:22 I really really really hope that there are no nuclear power plants running on Windows. In any way shape or form.
@notthere83 หลายเดือนก่อน
As long as companies aren't heavily penalized by governments, and C-level execs and middle management (both of whom are usually responsible for engineers cutting corners) don't possibly face prison sentences for severe issues like what we saw at CrowdStrike, there's no incentive to change things.
Because obviously, the free market is full of companies that produce rushed, buggy software. So it's not like you have the choice to go with a company that's maybe a little more expensive but emphasizes quality. (Which reminds me of how larger organizations often seem to pick who gets a contract. Total cost. That's it.)
@TheEvertw หลายเดือนก่อน
So, CrowdStrike had a hack that allowed them to write Ring 0 code with processes only suitable for Ring 1. Great idea. Then they had no way to undo the inevitable failure. And they didn't test the code.
I will be sorely disappointed if they aren't hit with malpractice / negligence lawsuits that will bankrupt them.
@ProgrammingZombie 23 วันที่ผ่านมา
Maybe the IT industry should think more about giving a third party unfettered control over all their hardware, especially when the CEO of said company has presided over similar lapses in policy in the past.
@disgruntledtoons หลายเดือนก่อน ⁺¹
Bugs will happen, but there's no excuse for releasing them.
@alfredomoreira6761 หลายเดือนก่อน ⁺¹
This video gets some things wrong.
There is no code in their sys files, only data.
There were at least 2 bugs in their validated ring 0 code that was released back in February. And the data in the sys files released in July triggered the bugs. So the issue is not in the architecture, but in the code and the lack of proper testing. They lied about e2e testing, since their tests only tested mocked data, not the actual sys files released to production.
@npc9710 หลายเดือนก่อน ⁺²
i have never understood why medical devices are running windows. somebody, explain it to me please 😢
@dawnrazor หลายเดือนก่อน
291, the new number of the beast 😈. Nice vid. I’m not sure that the occurrence of this event means there is a problem in the industry. It’s just a one off by a company that should have known better given the high risk status of their code. If there was a problem in the industry then we’d see this happening more often. I suppose there could be near misses that we’re not aware of, but at least now that this has happened, it will make people sit up, take notice and hopefully force a bit of navel gazing and correction. I do hope that crowdstrike are forced to pay compensation to all affected clients even to the point of making them bankrupt. That will send a clear message to other corporations that have bad practices intended or not , that this is not acceptable.
@kellymoses8566 หลายเดือนก่อน
I really want to know what conversations Microsoft is having with Crowdstrike.
@danielwilkowski5899 หลายเดือนก่อน
@Continuous Delivery Hi Dave! I think your channel is awesome! But I have one question. On one side, I see you're suggesting we should be doing all kinds of tests, performance tests, stress tests, all kinds of stuff, which makes sense! But on the other hand, I'm hearing about working in an agile way, and actually only solving the problems of the user (for example based on user stories). If I'm developing software for a small-to-medium company, which doesn't really care much about a few minutes of downtime or slower applications, do you think I should still worry as much about performance and stress testing?
@therealsailorfred หลายเดือนก่อน ⁺²
A slightly more charitable explanation is that the proper 291 patch was tested, dogfooded, etc., but that the package that was actually pushed was not the one tested.
If there is an internal format, like a zip file that gets all the scrutiny, and then there's a final packaging that created the nul bytes version, it could explain it.
There's still no excuse for not testing that final version, though.
@ContinuousDelivery หลายเดือนก่อน ⁺¹
That may have happened as well, but that wasn't the cause according to their most recent report, which dropped as this video was released. They had a function that took 21 params, and only sent it 20!
@therealsailorfred หลายเดือนก่อน ⁺²
😱Thanks for the update. "...The Content Configuration System has been updated with new test procedures to ensure
that every new Template Instance is tested,..." confirms the lack of adequate testing.
@Fred-yq3fs หลายเดือนก่อน ⁺¹
@@therealsailorfred yup, input validation 101. And in kernel mode! Cowboys! And that's not fair to most cowboys.
@TheEvertw หลายเดือนก่อน
"Probably not"
We have had many, many SW disasters before. At least this one (probably) didn't kill anyone.
@AndreaGriffini หลายเดือนก่อน
What is the difference between adding new code to ring 0 and just having ring 0 code read and use data that is invalid (e.g. making the ring 0 code loop forever and get stuck while holding some fundamental lock)? What about making a mistake in considering some activity as suspicious because of a new rules file, and blocking something fundamental for, say, using the network? You get a bricked system anyway, impossible to fix without physical presence.
Code signing buys very little even when not cheated... data is very powerful too.
@stephendgreen1502 หลายเดือนก่อน
Too much need to keep changing is what broke it. It needs to be frozen and then pruned clean so little change is needed. Pruning out whatever forces further change.
@animanaut หลายเดือนก่อน ⁺¹
is there an update on why the delivered content files contained nothing but zeroes??? makes me think something in transit must have gone wrong. you can have a perfectly working pipeline, but when the artifact gets altered after the fact, the pipeline must be extended to cover the new possibilities for modification of a pipeline artifact
@ContinuousDelivery หลายเดือนก่อน ⁺²
There is a new more detailed report that arrived just after we had uploaded this video, that describes the failure in more detail. It is linked in the notes to the video.
It says that one of the functions took 21 params, and only 20 were supplied, and the code didn't check them. The testing was based on a correct call with all 21 params, which is why it worked, it wasn't testing what was in production.
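To make the "21 expected, 20 supplied, nobody checked" failure concrete, here is a small sketch of the kind of input validation that was missing. The real code is a kernel driver, not Python; `EXPECTED_FIELDS`, `read_field` and `evaluate` are hypothetical names used only to show the shape of the check.

```python
EXPECTED_FIELDS = 21  # the count quoted from the report in the comment above


def read_field(values, index):
    """Return values[index], or None when the index is out of range,
    instead of reading past the end of the supplied data."""
    if not 0 <= index < len(values):
        return None
    return values[index]


def evaluate(values):
    """Refuse input that does not carry exactly the expected number of fields."""
    if len(values) != EXPECTED_FIELDS:
        return "rejected"
    return "accepted"


if __name__ == "__main__":
    print(evaluate(["x"] * 21))       # what the tests exercised
    print(evaluate(["x"] * 20))       # what production actually sent
    print(read_field(["x"] * 20, 20)) # the 21st field does not exist
```

A test suite that only ever feeds the happy path (21 values) into code like this will pass forever, which is why the short input needs its own test case.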
@animanaut หลายเดือนก่อน
@@ContinuousDelivery thx, will have a look. but the thought of a function taking 21 params sends chills down my spine
@animanaut หลายเดือนก่อน
just read the 12 page document and the all zero file was not (at least explicitly) addressed. but i might lack the knowledge to see any implicit mentions. from what i got from the file (worthwhile read btw): it's the perfect cheese hole analogy. a hole was present in every layer of cheese (compile, test, deployment, runtime) to make this edge case fall through the crack.
@snorman1911 หลายเดือนก่อน ⁺¹
Come on bro, they were continuously delivering.
@JonathanSwiftUK หลายเดือนก่อน
The testing process is actually flawed if you don't have tests for things that can happen. Why did nobody pass it a corrupt patch file in testing? If they had done that they would have realised. It's attention to detail and thinking about ALL possibilities.
@rodrigoserafim8834 หลายเดือนก่อน
Not to excuse the magnitude of this failure, but hindsight is 20/20. It's very easy for us to sit here, *after* being told what the problem was, and go "obviously that specific chain of events should have been tested". But thinking about *ALL* possibilities is only useful as a figure of speech. You can think of *many* possibilities. If you are really, really good, you can think of *most* possibilities. But never *all* possibilities.
Quality is about not making the same known error again, not necessarily to avoid all possible errors.
@ScottSullivan หลายเดือนก่อน ⁺¹
Ring 0 and ring 3 - not ring 1. The Windows operating systems implement user mode as ring 3. Even on hardware that supports ring 1 and 2, Windows only implements ring 0 and 3. Windows NT was a little before my time, but I’ve read that is where the precedent of only implementing 2 of the 4 modes comes from.
I’ve worked for big AV companies and I find it astonishing that this issue wasn’t caught before being shipped. I suspect that there is more to the story than “they don’t test drivers before being shipped”. Of course they do. It makes more sense that this was a bug, in the driver, that went undetected. It probably existed for quite some time.
My understanding is that it was a data file, a file that was supposed to contain the “description” of malicious file (if it was something like traditional AV) or malicious behavior (if their PR hype is to be believed), but the automated process that generates these data files generated something that was empty. Should they have detected an empty data file before shipping it? Yes, they should have. Should they have identified the bug in their driver? Yes, they should have. Is it reasonable for us to expect CrowdStrike, or any software company, to produce software with zero bugs? No, it is not reasonable. Moreover, it is not possible.
Dave, I’ve really enjoyed your video and your channel. I tend to agree with you on almost everything you teach. On this one, I think that you are mistaken about what the thing we observed means. Certainly, as engineers we should strive for safe, sound, repeatable, maintainable, and deterministic work product; however, which engineering discipline is perfect? Ask Sally Ride if those rocket scientists never make a mistake. Tesla practices the methodology you preach, and I think their engineers and QA team are top notch, yet they have had issues. I believe it was a software issue related to regulating current that led to many of their cars catching fire (I might have the details wrong there, but it is indisputable that they were catching fire).
This incident is not an indication of an industry that doesn’t take their discipline seriously. Like exploding rocket boosters on the space shuttle or battery powered cars burning in the street, this is just an unfortunate tragedy. Like those other issues, we will examine what happened and take measures to ensure it does not happen again. Alas, without fail, there will be another incident. Because we are human.
@ContinuousDelivery หลายเดือนก่อน
Yes, I took my nomenclature from Dave Plummer's video on the topic. Even though I knew that there used to be more layers. I assumed (maybe wrongly?) that Microsoft had changed their terminology since Dave was a Windows OS developer.
@arsnb9m907 หลายเดือนก่อน
I think there is another potential explanation of how this happened than dereliction of duty. Uncomfortable as it is to consider, it is also possible that the patch was released intentionally as an exploit, for a nefarious purpose that we currently aren't privy to.
@ContinuousDelivery หลายเดือนก่อน ⁺²
I generally believe in accident rather than conspiracy, because "They" aren't usually smart enough to make the conspiracies work 😉
@kjdtm หลายเดือนก่อน
lol: INTRO: "PROBABLY NOT !" like literally what went through my mind when Dave said it as well..
we humans don't even care about our environment ! why would we care about SW quality if people keep buying our products anyway....
@tigerscott2966 หลายเดือนก่อน
CrowdStrike was all about the 💰 🤑 💸.
It's deplorable that these companies still have no Linux machines in-house. Server and cloud operating systems are FREE...
@garronfish8227 หลายเดือนก่อน
I hope someone from inside the company tells us what exactly went wrong. I guess that it has something to do with automated testing of the code, so it probably did pass some tests and then the failure of the automated tests was not picked up.
@markeggers8356 หลายเดือนก่อน ⁺²
The CEO learned nothing (again), and the PR department will dance until the bad publicity dies down. The courts are too technology-ignorant to apply significant (if any) punishment.
It's telling that the board hired this CEO after his CTO performance at McAfee.
None of this is rocket science. However, it does cost money. As long as the damage done costs someone else money (and not them), the minor fines (if any) will be considered operating costs.
See every data breach ever.
@seeibe หลายเดือนก่อน
Microsoft certified that backdoor. This is entirely a Microsoft failure, as they've proven their whole certification process to be inept.
@soppaism several months ago
Quite often the tests are also software, but who tests the tests? I'm afraid that in practice a lot of it just relies on the low probability of tests consistently breaking in a way that would allow them to pass unexpectedly.
@EzFastPaws several months ago
This also bothered me about testing quite a lot. But when I started to write tests, they turned out to be quite simple and transparent pieces of code. It's not impossible to make a mistake there, but it's so much less likely in tests than in the tested code. So at the end of the day, they are not an ultimate answer, but still quite a helpful tool.
@delamar6199 several months ago
In my career working at multiple companies as an embedded software developer I always found one thing remarkable: the difference between software and hardware engineers. Hardware engineers are generally much more detail and responsibility oriented. Designing, analyzing, simulating, measuring, testing, optimizing again and again and again. Documenting each step and meticulous housekeeping of the processes and their labs. Software engineers are much more easygoing and laid back, so to speak. At basically every company it was like chaos vs accuracy... And I don't exclude myself here. But this is also part of the bad quality in software lately imho. While we SWEs routinely played ping pong at the end of the day, the hardware engineers put their heads together over measurement equipment.
@TonyWhitley several months ago
Hardware design is thousands of times less complex than software though and similarly the capacity to change it after the original design is complete is much more limited. (Speaking as another embedded software engineer.)
@delamar6199 several months ago
@TonyWhitley Well, that depends on the products I guess, but in my case (enterprise AV equipment) hardware design is a thousand times more complex. And I'm not talking about mechanics only. I'm talking about PCB design, layouts, heat flow, etc., meeting certification standards to be able to sell in certain countries, production processes, supply chain. All that stuff is part of hardware design. Software is a dip shit really compared to that imho. Building a washing machine or something like that is a different story.... :D
@TonyWhitley several months ago
Ah, I spent my career in mobile phones. The hardware is complex (though a lot of it is firmware nowadays) but nowhere near the complexity of the software. PCB designers have the luxury of starting afresh each time 😇
@mallninja9805 several months ago
@TonyWhitley Consequently, I don't think most of us rise to the level of 'engineer' - we're hacking together solutions on top of mountains of bandaids and technical debt.
@black-snow several months ago ⁺¹
Thanks Dave!
@NineInchTyrone several months ago ⁺¹
CrowdStrike. Ford. Boeing. GM. Harvard Business School management.
@GackFinder several months ago
The main issue is that CIOs/CTOs of companies whose bottom line relies on their 24/7 services are so utterly incompetent as to decide to invest in an ecosystem that requires changes to ring 0. Not only do I think these CIOs/CTOs don't understand what ring 0 or kernel space even means, they probably also ignore any advice from the more competent people that they manage.
@WhyteLis21 several months ago
This video just keeps reminding me, "Should we trust any of you software developers or software companies anymore?"
Any software developers or companies want to weigh in? Be honest about it, too. Lol.
@growtocycle6992 several months ago
I get the impression that automated testing IS the problem....
@EzFastPaws several months ago
As far as I know, the problem could only be found after rebooting the system, and automated testing usually isn't associated with that level of testing. If you mean that having automated testing was the reason for cutting out other types of testing and neglecting canary releases, then I believe the company's decisions are the problem. Although, since it's an invalid pointer problem, and it actually happened, I would question the quality of their automated tests as well.
@NineInchTyrone several months ago
Has he covered the Horizon debacle?
@Keymandll several months ago
LOL. CrowdStrike is to blame for bypassing security controls and best practices. CrowdStrike cannot be considered a security company anymore, just another money-making machine.
@szeredaiakos several months ago
I release (or at least used to) untested code all the time. Nothing interesting ever happened even if a bug was introduced... Tho, decoupling and error boundaries were respected.
Testing will NEVER ensure a resilient bug-free code.
If you isolate your sub-systems you don't need each of them individually to be resilient or bug-free.
Testing implies someone with a big wrinkled brain to ask the right questions.
Isolation requires someone with IQ80 or manager with a marker to draw a thick black line.
- But testing as a design tool and an assurance for not repeating the same mistakes over and over is priceless.
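The isolation idea can be sketched as an error boundary: a failing sub-system returns a fallback instead of taking its caller down. This is a minimal illustration, assuming a made-up `parse_rule` sub-system and a made-up `run_isolated` helper. A Python `try/except` is only an analogy here, because a fault inside a kernel-mode driver cannot simply be caught this way.

```python
from typing import Any, Callable, Optional, Tuple


def run_isolated(task: Callable[[], Any], fallback: Any = None) -> Tuple[Any, Optional[str]]:
    """Run task; on any exception return (fallback, error description)."""
    try:
        return task(), None
    except Exception as exc:  # the boundary: nothing escapes to the caller
        return fallback, f"{type(exc).__name__}: {exc}"


def parse_rule(fields):
    # Hypothetical sub-system: reads the 21st field and fails on short input.
    return fields[20]


ok_value, ok_error = run_isolated(lambda: parse_rule(["x"] * 21), fallback="skip")
bad_value, bad_error = run_isolated(lambda: parse_rule(["x"] * 20), fallback="skip")
```

With the boundary in place the short input yields the fallback and an error description for logging, and the caller carries on with the remaining work.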
@EzFastPaws several months ago ⁺¹
Why do testing if it doesn't guarantee the perfect working code - is like - why care about your health if you gonna get sick anyway
@szeredaiakos several months ago
Precisely. It will decrease the chances of your system getting sick. But if you spend all your time not getting sick you become incapable of living.
In other words, it will take 15 to 20 times more time to build, extend, and maintain a moderately complex software system.
Ex: take a func which compares 2 numbers. How can it go wrong? You test all possible primitive combinations,... check the prefetch behaviour, do performance tests,....
You can run hundreds of tests on the most simple algorithms, and yes, some people actually do that. I've seen it... including testing for the same things 3-4 times over.
@EzFastPaws several months ago
Well I hope everybody who thinks the same and chooses to ignore Crowdstrike's lesson is talented enough to do so
@qbasicmichael several months ago
User mode is ring 3.
@theoceanman8687 several months ago
We have to boot the bloody MBAs out of IT and Software.
