CrowdStrike Exposes a Fundamental Problem in Software

  • Published 6 Nov 2024

Comments • 579

  • @ArjanCodes 3 months ago +1

    ✅ Get the FREE Software Architecture Checklist, a guide for building robust, scalable software systems: arjan.codes/checklist.

  • @ropro9817 3 months ago +164

    I think the even more fundamental problem here is the security software mono-culture. I know CrowdStrike is big, but honestly, I was surprised when I heard in the news how broadly sweeping the impact was across companies and even across industries. If everyone's using the same software, that provides a ripe attack vector for hackers. 😒

    • @FrankmoonDusty 3 months ago +17

      This, 100%. We need to move away from tech monopolies like CrowdStrike, and Microsoft by extension.

    • @penfold-55 3 months ago +11

      It's a much bigger issue... Most of the biggest companies are concentrated in a very small area in northern California, US.
      Microsoft, Google, Nvidia, AMD, Intel, Facebook, Amazon, and so on.
      The issue is that Europe and Asia are just so far behind America.

    • @alexivanov4157 3 months ago +1

      Bravo! This is the main point of the issue!

    • @username7763 3 months ago +5

      It isn't entirely a monoculture. All it takes is for one layer in a massive distributed system to all be on the same thing. The problem is the crazy complexity of today's IT systems. Everything is a damned service that requires its own cluster and infrastructure.

    • @robertbutsch1802 3 months ago +11

      No enterprise IT folks in their right minds are going to say: look, everyone else is using the best threat protection software in the business, so let's use Acme Security Software so we're not promoting a mono-culture.

  • @MysticCoder89 3 months ago +22

    Your kernel has crashed. No malicious code can be executed. Your computer is completely protected now. Thank you for choosing our company!

  • @AMMullan 3 months ago +180

    So we have the CrowdStrike option ENABLED so that CrowdStrike won't push us the latest version of their software (we stay 1 version behind) - apparently they don't actually even check for this, so we got it anyway. Absolutely shoddy development :(

    • @TheGreenRedYellow 3 months ago +1

      Wouldn't you get it in the next release? So technically you are not immune to this update, unless you deploy it manually.

    • @gcaussade 3 months ago +6

      @@AMMullan wow that's interesting information. It's interesting to see what happened. Did they have an emergency release? Maybe they felt there would be a breach if they didn't release something right away? So many questions. I have a hard time believing there are so many incompetent organizations around the world. If these companies were choosing to be one version behind, specifically to avoid something like this, then how did this happen?!
      That's crazy!

    • @gcaussade 3 months ago +11

      @@TheGreenRedYellow what do you mean? You would assume that people would report the BSOD and they would stop the rollout. The problem with being one version behind is that you're not getting the latest protection. But I could see doing this to avoid this exact situation.

    • @AMMullan 3 months ago +4

      @@gcaussade yeah they killed that update so anyone not getting the latest update wouldn't have received this at all 😕

    • @TheGreenRedYellow 3 months ago +2

      @@gcaussade it is really about how many updates they released. Like, what if they released 2 updates within the same day?

  • @kwas101 3 months ago +33

    It's partly about $$$ and partly about how everything nowadays is expected to happen with speed. Back in the day (30 years ago) I worked for a bank. We maintained a very large enquiries counter system. Before anything got pushed out to branches, it was tested for weeks. We had dozens of test engineers and they would run through every conceivable action. Then and only then a release would happen to a local branch. This would be tested in the wild for a week. Then a small group of branches for two weeks, then a larger group, then finally the main group. The result was that very few (if any) show stoppers made it to production. This meant a slow cadence of releases though. Also this was a large project with extensive management backing, so the cost was not really a factor (within reason).
    This type of behaviour would never fly today. Everything has to be done on the cheap, with minimal testing, just "get it out there". I call it the "just get it f**king done" attitude - this is very common nowadays, especially among MSPs.

    • @gzoechi 3 months ago +1

      It's not necessary to go that slow. With proper CI/CD practice this would work as well at high speed.
      You still need to put a lot of work in to get proper quality.

    • @enadegheeghaghe6369 3 months ago +2

      If you spend weeks testing your cybersecurity software, you will get hacked for sure before you deploy it.
      Hackers are a lot more sophisticated now compared to a decade ago

    • @manoo2056 3 months ago

      The issue is that in the short term the "just get it f**king done" attitude "saves money", but in the long term it explodes. I see it like entropy rising and then some feedback bringing back equilibrium. Let's hope we survive that feedback!! XD

    • @michaelwills1926 3 months ago

      @@enadegheeghaghe6369 Hackers rise with the level of tech. Besides, you should still sandbox any release, even zero-day patching.

  • @whatcouldgowrong7914 3 months ago +24

    People seem to be overlooking the glaring fact that they pushed an update that was corrupted or failed its checksum, which means there was a wide-open vulnerability that allowed man-in-the-middle exploits or injecting code with modified files directly into the kernel…

    • @vitalyl1327 3 months ago

      Keep in mind that Clownstrike is a scam company selling a "cybersecurity" snake oil. Just like all the other antivirus companies. There is zero value in their product. They have no incentive whatsoever to do the right thing, because they're consciously scamming their customers anyway.

    • @fransstar8731 3 months ago

      I see a lot of answers/recommendations, but what surprises me is why CrowdStrike works on Linux and not on Windows. Apparently Microsoft needs an extra driver to let CrowdStrike work. I think it has everything to do with the different structure of Windows versus Linux. I think it is time Microsoft changed its whole structure to be like Linux. This whole thing is Microsoft's fault. It is clear ethical hacking can only be done with Linux, not Windows. Wake up people. Linux has sudo, Windows does not; this was and is the main issue. Awaiting comments. Thanks.

    • @whatcouldgowrong7914 3 months ago

      @@fransstar8731 They tried to and were blocked by Europe. At the very least Microsoft needs to revoke their WHQL certification and prevent changes after the fact.

  • @ying-ym8ut 3 months ago +20

    The CEO of CrowdStrike, George Kurtz, used to be the Chief Technology Officer of McAfee in 2010, when a security update from the antivirus firm crashed tens of thousands of computers.

    • @KC-uf1rg 3 months ago +1

      He upped the ante now 😂😂😂

    • @Discoverer-of-Teleportation 3 months ago

      😂😂😂😂

    • @yoyolim538 3 months ago +4

      We got crowd struck

    • @Jace-yt2zm 3 months ago +1

      CrowdStrike dropped the ball and brought down a big chunk of the world’s commerce and business. Meanwhile, CEO George Kurtz is thoroughly enjoying, and consumed by, his race-car-driver lifestyle at events all over the globe!

  • @James-hb8qu 3 months ago +17

    My career has been leading engineering organizations. This is not a new issue or a unique issue. Bad driver code crashes systems. Because of that, the industry has created well known and effective ways to prevent problems. You've listed them.
    The issue here is a company with wide spread driver releases that failed to follow those practices. The free market has created a process for handling that and it is called competition and consumer choice.

    • @joansparky4439 3 months ago

      Markets that prohibit or undermine competition via rules that are being enforced by the market authority (#) do not give the consumer the chance to choose a different supplier.
      #) The goal is to give one or a few players control over the supply, so it can be kept below demand, which guarantees that the consumer always pays more than it costs - which is what profit is. Or in other words: real free markets would trend towards zero profit for all involved, due to competition.

  • @on_wheels_80 3 months ago +51

    The CrowdStrike disaster didn't strike because they needed to move fast, but because they obviously hadn't tested this specific update on a single Windows machine. Because if they had, they'd have immediately noticed it would crash. And they made a similar mistake already in April. That time it could be somewhat forgiven, because it only occurred on two distributions of Linux which hadn't been in their test matrix.

    • @1DwtEaUn 3 months ago +5

      Yeah, you'd think they'd have at least one of every supported OS in a test lab and roll out to that first - is 30 minutes before global rollout that big of a delay?

    • @d3stinYwOw 3 months ago +2

      Or they have such a machine, but their testing might be influenced by local changes, or be flaky.

    • @The_Ballo 3 months ago +2

      that, or they did it on purpose

    • @Travolta12e 3 months ago +3

      Wasn't the .sys file just a bunch of zeroes? I wonder if it was a compilation or distribution problem that somehow corrupted the file, while the original file was working as intended.
      I mean, no matter how incompetent they are, it's naive to think that they just push new files to production without minimal testing first.

    • @stickman1742 3 months ago +1

      How can you say it wasn't because they needed to move fast? One of the most common reasons why software is put out without enough testing is that they are trying to move fast. You may not think they needed to move fast, but internally they may have felt pressure. These kinds of drivers normally have to go through certification tests to be put into Windows, but updates can bypass this to get out more quickly.
      Don't underestimate the ability of companies, including very big ones, to take shortcuts whenever possible. Not too long ago I spent some time working for a huge financial company that has more money than most companies would know what to do with. They are supposed to have a complete system just for testing, to protect everyone's financial data, but they didn't really want to put in the money or effort. Could they afford it? Of course! They just wanted to skip a few things; it would probably make their quarterly report look a little better. That test system was never working, so everyone had to run tests using people's actual financial data. They would just hand out people's real financial records to any employee, saying "You're not supposed to see this but we don't have any test data", just to get the job done. This is the attitude of the biggest institutions running this country.

  • @metamadbooks 3 months ago +27

    But you can have it both ways: it's called rolling updates. You don't deploy software to a billion endpoints in one go.
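The rolling-update idea this comment describes can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's real pipeline: the wave sizes, host names, and health check are all hypothetical. The point is that promotion halts the moment a wave reports failures, so a bad build never reaches the whole fleet.

```python
import math

def plan_waves(hosts, first=0.01, factor=10):
    """Split hosts into exponentially growing waves: 1%, 10%, then the rest."""
    waves, start, frac = [], 0, first
    while start < len(hosts):
        size = max(1, math.ceil(len(hosts) * frac))
        waves.append(hosts[start:start + size])
        start += size
        frac *= factor
    return waves

def rollout(hosts, deploy, healthy):
    """Deploy wave by wave; stop and report if any wave degrades."""
    done = []
    for wave in plan_waves(hosts):
        for host in wave:
            deploy(host)
        if not all(healthy(h) for h in wave):
            return done, wave  # halt: only `done` plus this wave were touched
        done.extend(wave)
    return done, None
```

With a fleet of 100 hosts and an update that bricks one machine in the second wave, only 11 hosts ever receive the bad build; the remaining 89 are never touched.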

    • @JeanPierreWhite 3 months ago +4

      Correct. This was dumb.

    • @amyhaynes3019 3 months ago

      Right

    • @Julio-ek1lw 3 months ago

      I disagree with your comment; the number of deployments doesn’t remove the dichotomy.

  • @lumeronswift 3 months ago +5

    Something that needs to be highlighted more from this issue is that companies have in recent years been offloading their IT resources while still adopting external, overseas-managed (i.e. managed in the US) solutions. Companies should always have an in-house team ready to respond to system failures. Informed, careful companies would only have had a couple of hours of downtime...

  • @ChristianSteimel 3 months ago +61

    Most surprising is that PCs still don't use A/B installs of the OS, where you use one copy and update the other copy, then switch over to the updated copy, and you can switch back if the update failed for some reason. With disk space so cheap, you'd think every Linux/Mac/Windows PC would use that by now. In Linux at least you can revert to a prior kernel version.
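The A/B scheme this comment describes (and which Android and many routers do use) can be sketched as two slots where only the inactive one is ever written, and the switch-over happens only after a successful trial boot. A minimal sketch, with hypothetical slot names and version strings; on real hardware this logic lives in the bootloader, not userspace:

```python
from dataclasses import dataclass, field

@dataclass
class ABSystem:
    """Two OS slots: update the inactive one, switch only once it proves bootable."""
    slots: dict = field(default_factory=lambda: {"A": "v1", "B": "v1"})
    active: str = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def update(self, version: str, boots_ok) -> bool:
        target = self.inactive()
        self.slots[target] = version   # write the new image to the spare slot
        if boots_ok(version):          # trial boot into the spare slot
            self.active = target       # commit: switch over
            return True
        return False                   # keep booting the known-good slot
```

A failed update leaves the machine on the old, working image; the next update simply overwrites the spare slot again.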

    • @incandescentwithrage 3 months ago +7

      Yeah but the same thing happened with Crowdstrike on Linux previously, causing a kernel panic.
      If you hook into the kernel, changing kernel isn't going to help.
      A/B is what happens with OS feature updates on Windows already.
      Nothing preventing people using backup software on the daily.

    • @JeanPierreWhite 3 months ago +9

      Bingo. Windows is not ready for critical functions. Microsoft has had over 30 years to develop a resilient OS. Time to give up on them and go to Linux systems that support immutable OSes and atomic releases.

    • @askii3 3 months ago +6

      SUSE MicroOS is essentially capable of such A/B installs. It does snapshots of atomic transactional updates where it can automatically rollback the update on failure. This is what JeanPierreWhite (above) is referring to with immutable Linux distros with atomic updates.

    • @gzoechi 3 months ago

      I found that stupid 30y ago.
      NixOS does that quite well though.

    • @gzoechi 3 months ago +2

      ​@@JeanPierreWhite They never even tried to approach the problem. If it had been 300 years they wouldn't have made any more progress on that front.

  • @samarbid13 3 months ago +30

    This is a reminder of how fragile our IT solutions are. Imagine a solar storm occurring and the devastation it would cause! We need a plan B for critical infrastructures to always be in place!

    • @henson2k 3 months ago

      We need an operating system that can disable drivers on reboot.

    • @JeanPierreWhite 3 months ago +1

      It's a reminder of how fragile Windows is. Notice how it was only Windows computers that borked?

    • @gzoechi 3 months ago +1

      How does this increase stock prices in the short term? Yeah, not gonna happen.

    • @zackang4731 3 months ago +2

      @@JeanPierreWhite Because it's software written specifically for Windows? A major bug in the Safari browser could potentially cause the same problem for ALL Mac users, and no Windows computers would be affected, because no Windows browser is embedded in the system the same way Safari is in macOS.

    • @STCatchMeTRACjRo 3 months ago +1

      @@JeanPierreWhite they could have released a buggy update for non-Windows OSes as well.

  • @ProfessionalBirdWatcher 3 months ago +2

    My rage at everyone downplaying this for CrowdStrike is immeasurable. This is a billion-dollar company, with a B, trusted by critical government, public, and private services, and they shafted every one of them. The lack of outrage from our authorities is absolutely disgusting. It speaks volumes about the state of cybersecurity and tech in general.

  • @wernerlippert5499 3 months ago +7

    Humans tend to think they can sacrifice quality for speed, which works for some time and then fails miserably. It's a bit like the uncertainty principle, there is a fundamental limit that cannot be cheated.

    • @stickman1742 3 months ago

      We are heading towards these kinds of bad events pretty quickly. Software updates are being pushed out constantly in an effort to move ahead as fast as possible. It wasn't that long ago that this was not the way: updates were treated very carefully and put out more slowly. Now it is a race to see who can update the fastest. I see computers and devices suddenly stop working on their own all the time now, always because of some recent update. This is already an issue; this is just an event so widespread that everyone is hearing about it. The industry is going to have to come up with more robust systems, as we cannot depend on computers for everything if they often are just not going to work. This is a relatively new issue with all these updates, and the problems will only get far worse, with bigger consequences, if it continues like this.

  • @askii3 3 months ago +6

    A mechanism rolling back an update after X number of failed boots/etc would help a lot here. My router does this, it keeps a copy of the old firmware it can automatically revert to in case flashing a new firmware image bricks it.
    SUSE's MicroOS does similar by having a stateless OS and transactional updates that are snapshotted in the BTRFS file system. If it crashes and reboots, it'll automatically rollback to the snapshot before the update while preserving user data.
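The "rollback after X failed boots" mechanism this comment describes can be sketched as a boot-time crash counter: increment it before anything risky runs, and clear it only once the system reaches a healthy state, so a crash loop leaves it growing until a rollback fires. This is a sketch only; the state-file format and threshold are invented, and in real firmware this logic sits in the bootloader.

```python
import json
import pathlib

MAX_FAILURES = 3  # hypothetical threshold before reverting

def early_boot(state_file, boot_current, rollback):
    """Run at the very start of boot, before anything that can crash."""
    path = pathlib.Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {"failures": 0}
    if state["failures"] >= MAX_FAILURES:
        rollback()                 # revert to the last-known-good snapshot
        state["failures"] = 0
    state["failures"] += 1         # pessimistically assume this boot will fail
    path.write_text(json.dumps(state))
    boot_current()

def mark_boot_successful(state_file):
    """Call once userspace is up and healthy: clears the crash counter."""
    pathlib.Path(state_file).write_text(json.dumps({"failures": 0}))
```

Because the counter is written before the boot proceeds, a machine that never reaches `mark_boot_successful` (e.g. one stuck in a BSOD loop) triggers the rollback on its own, with no technician walking over to boot safe mode.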

    • @gzoechi 3 months ago +4

      In NixOS you can configure how many system configs you want to keep. Switching back in the boot menu just changes a bunch of symlinks.

    • @askii3 3 months ago +3

      @@gzoechi yeah, it's really cool

  • @keithnsearle7393 3 months ago +8

    So, basically Crowdstrike could not even secure itself against itself. Well done Crowdstrike, well done! (Slowly clapping) To Microsoft, get rid of Crowdstrike, no IFS and no BUTTS!

    • @vister6757 3 months ago +1

      Other antivirus vendors also have access to the kernel due to EU regulator requirements, after McAfee and Symantec brought a case against Microsoft when Microsoft added code to stop 3rd-party software from running in its kernel.

  • @_SR375_ 3 months ago +8

    I want to add that the fact that CrowdStrike is so widely used makes it a target for bad actors, and perhaps how it operates internally, which seems to be monolithic, is also a problem. We also do not know what government and military systems were affected by this "bug". Regardless of other bad practices that were at play, CrowdStrike itself may want to reconsider its architecture, and perhaps break up its platform into shards, such that entire industries are not impacted by one bad software update or a bad pod.

  • @mitchellsmith4601 3 months ago +4

    This was an embarrassing failure for Crowdstrike. All they had to do was test their patch on Windows PCs prior to release, and they would have seen those PCs blue screen. They could have fixed the issue, tested again, and THEN deployed. The more devices you’re responsible for, the greater the duty to test prior to deployment. This was negligence, pure and simple, and there should be a class action suit against Crowdstrike for the damages they caused. Such a suit would destroy Crowdstrike, of course, but that’s as it should be. Our world needs to deter this negligence in the future.

    • @raristy1 3 months ago +1

      Basic Security+ certification teaches EXACTLY that. So my question would be: was ANYONE certified at CrowdStrike???

  • @AlvaroGilFernandez 3 months ago +2

    As an IT expert, all my life Windows has been a problem; it always presents some kind of issue. We need a new operating system that can replace Windows and that can be truly trusted.

    • @GH-oi2jf 3 months ago +1

      We had one. It was from IBM and was called OS/2. Actually, Microsoft participated in the development of OS/2 1.x. It ran in the early ATMs. OS/2 2.x came out before Windows NT and was an excellent product, but for some reason the shift to Microsoft took place and OS/2 was left behind. Business would have been better off sticking with IBM.
      I have never worked for IBM (or Microsoft) by the way, but I have worked with OS/2. I have no conflict of interest, just my opinion.

    • @sergioyichiong7269 3 months ago

      Non-Windows OSes never had problems? What are you gonna do with the legacy code?

  • @sneezyfido 3 months ago +4

    Business culture breeding and promoting incompetence is a huge issue in all large companies

    • @nurulnurul9270 3 months ago

      Ouch. Somehow I found myself agreeing with you.

  • @johnmoore8599 3 months ago +1

    You aren't thinking through the problem sufficiently. There are two ways to solve this issue at the kernel level to prevent these kinds of problems with monolithic kernels. 1. Build a subsystem that, if the kernel panics, reverts the system to the last known good configuration before the crash. 2. Build a driver subsystem that insulates the kernel from buggy or bad drivers and lets it continue to operate. This was the idea behind Nooks, written by Michael Swift in 2005. Either architectural change would build resiliency into the current OS kernels humans use. For whatever reason, no one is doing this. Their answer is always "develop better drivers"; they put these best practices into place, and along comes some company like, cough, McAfee, or, cough, CrowdStrike, who bring Windows systems down. People get angry and pissed because of ruined plans or lost money, but ultimately nothing significantly changes, because someone at Microsoft/Apple/Linux Foundation doesn't want to pay to make their OSes more reliable.

  • @StuartLynne 3 months ago +1

    “There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.” - Fred Brooks, “No Silver Bullet”, 1986

  • @daviddunkelheit9952 3 months ago +1

    This failure was quicker in onset and damage than Solarwinds. Diversity in systems …NOW!
    Need to build heterogeneity into the system rather than homogeneity.

  • @gzoechi 3 months ago +4

    CrowdStrike has shown that it has become the biggest threat to security

  • @sm5574 3 months ago +7

    A lot of the developers who are doing shoddy work don't know that they are. They may be incompetent, or they may not know the codebase as well as they think they do, or the codebase may itself be a ticking timebomb, full of patches and poor decisions that effectively hide a myriad of bugs.
    The industry is absolutely broken because it is full of people who are completely unaware of best practices and solid patterns, relying instead on their own unstructured learning that has gone unchecked for decades.

    • @stickman1742 3 months ago

      The only real solution, though, is that the systems need to be more robust. There are always going to be some software bugs; it would be impossible for everyone to always create software without a single bug. These computers have to have a design that is far more robust, so that they won't just refuse to run if there is one bug in even a kernel driver. This is a must if we are going to avoid much bigger problems like this in the future. The problem is, most companies just want to build on the current designs and move quickly, as that is how you can make the most money. It will take a disaster to make everyone stop and say: I guess we really do need a new design for this. Then all companies will be willing to pay for that newly designed system, and the computer companies will make it. Is this even big enough to make that happen? It may cause them to look at it a bit, but I kind of doubt it's big enough to push that much change. They'll probably put a band-aid on it.

    • @sm5574 3 months ago

      @@stickman1742, I agree, but I would estimate that developers (even at the senior level) who are capable of writing high-quality code are very much in the minority, and the people in charge of hiring rarely understand what to look for, as they are not, themselves, capable of writing such code. Thus, the vast majority of codebases are and will be more error-prone and difficult to maintain than should be considered acceptable.

  • @yogibarista2818 3 months ago +10

    The issue essentially is that there is a kernel-mode driver - no doubt WHQL certified - that is running uncertified p-code from installable 'definition' files, so that a bug there will cause the kernel-mode driver to execute bad code and bug-check the system. Perhaps the kernel-mode driver needs better checking and self-defence - could the WHQL certification process require this? The 'fix' is to gain access to safe mode, boot without the driver, and then remove the installable definition files, so perhaps a system should identify crashing 'boot-required' drivers and sideline them if they crash repeatedly.

    • @incandescentwithrage 3 months ago +1

      You mean, just like malware would do?

    • @JeffBartlett-kj6sq 3 months ago +2

      I heard that it ate a file of all zeros. So: 1) no signature bytes, 2) no header, 3) no header checksum, 4) no whole-file checksum, 5) no file encryption or signing. So a bad actor can figure out the p-code and put a definition file in the directory, or do a man-in-the-middle attack and own the machine from ring zero.
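The checks this comment lists (signature bytes, header, checksums) are cheap to enforce before any parser - let alone one running in kernel space - touches a content file. A minimal sketch with an invented file layout (the `CSDF` magic and header format are hypothetical, not CrowdStrike's real format), which also still omits the cryptographic signing the comment calls for:

```python
import hashlib
import struct

MAGIC = b"CSDF"  # hypothetical 4-byte signature for a definition file
HEADER_LEN = 4 + 4 + 32  # magic + little-endian length + SHA-256 digest

def pack(payload: bytes) -> bytes:
    """Wrap a payload with magic bytes, its length, and a SHA-256 checksum."""
    return (MAGIC
            + struct.pack("<I", len(payload))
            + hashlib.sha256(payload).digest()
            + payload)

def validate(blob: bytes) -> bytes:
    """Refuse to hand anything to the interpreter that fails structural checks.

    A file of all zeros fails at the very first check instead of being
    executed by a p-code interpreter running at ring zero.
    """
    if len(blob) < HEADER_LEN or blob[:4] != MAGIC:
        raise ValueError("bad signature bytes")
    (length,) = struct.unpack("<I", blob[4:8])
    digest, payload = blob[8:HEADER_LEN], blob[HEADER_LEN:]
    if len(payload) != length:
        raise ValueError("truncated or padded file")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload
```

A checksum only catches corruption, not tampering; defeating the man-in-the-middle attack described above additionally requires verifying a digital signature made with a key the attacker does not hold.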

    • @johnhebert9583 3 months ago

      Someone else who watched the Dave Plummer video about Crowdstrike. His is the most thorough explanation I've seen.

  • @omriliad659 3 months ago +2

    One (partial) solution is to have a backup computer that stays a version or a few days behind and only comes online if the main server stops responding. It would prevent this problem, and could only be exploited if a hacker could take down the most up-to-date server. You could even rotate the servers, so you update 2 versions each time.
    Another solution is to have canary distribution with a faster turnaround. Set the most secret systems to have the update first, have the next group follow within an hour, etc. It means you make your last group vulnerable for a few more hours, but you give them the peace of mind that it was tested for a few more hours and is unlikely to crash that fast.
    The last solution is to disconnect systems from the internet. No computers with an internet connection means no attack surface, and you can still work offline, or maybe even with others on the same network. Keep the protection system guarding the gateway - maybe even with different software at each of several layers - but leave the inner network isolated.

  • @miraculixxs 3 months ago +1

    It was not a security issue. It is a management issue. Perhaps MBAs should not run engineering organizations.

  • @gregorymathy2782 3 months ago +2

    That unfortunately goes beyond IT infrastructure cost… harmonization of processes and procedures… CrowdStrike, Boeing, car breakdowns… all this stuff is unfortunately driven by cost reduction and profit optimization…
    We are unfortunately only seeing the tip of the iceberg, and I am pretty sure we are only at the beginning of it… I wonder what the next big thing will be…

  • @captainnerd6452 3 months ago +3

    Error checking and error handling design. Don't trust data coming in, and don't trust data being returned from functions. Really don't trust the user.

    • @mfrunyan 3 months ago +1

      Precisely. This is amateur level code running in the kernel space.

  • @RiteGuy 3 months ago +10

    All great points, Arjan, and I agree with them. But you left out a biggy - companies want to make as much money as possible so they cut corners everywhere.
    You did lightly touch on time by saying sometimes you don’t have the time to create a proper fix for a threat. I agree, but there’s another time problem. To companies, time = money, so the time allowed to work on things is cut right away even when there isn’t a looming threat.
    Remote updating is a godsend for companies. It lets them ship a product that is incomplete and flawed thanks to time and money restraints.
    Then as the product is completed/fixed, the current installations of the software are usually automatically updated without the knowledge of the user.
    These issues and all the ones you mentioned are breaking software. I’m afraid of AI, not for the reasons most people cite but because software code is garbage in this day and age. Why would AI software be any better?

  • @MadeleineTakam 3 months ago +13

    I find it utterly incredible that they don’t test the update on a sandboxed system before sending it out.

    • @Eris123451 3 months ago +1

      I don't.
      It's a quality assurance issue, and it has turned out, for example, that after years of promoting quality systems and quality assurance, at least 2 of the biggest manufacturing companies in Japan had been falsifying their production records and data for decades.
      If it's a choice between scrapping millions of pounds of work or passing it on the nod, few if any managers are going to bite the bullet and take that kind of financial hit.
      That mindset is probably at the root of the majority of major operational failures in almost any industry.

  • @marcelogarcia5539 3 months ago +6

    I thought this was one of the lessons from COVID: resilience is as important as efficiency.

  • @krissn8111 3 months ago +6

    Was there any test in canary environments? I guess not. And how long does it take to test in canary? I can't understand a company like CrowdStrike overlooking best practices.

  • @NickThunnda 3 months ago +2

    In the good old days we had big mainframes running code which took checkpoints and did automatic rollbacks upon failure. They were replaced by lots of networked Microsoft boxes.

  • @gregharn1 3 months ago +4

    It's not a software problem. The decisions CS made were a tradeoff for functionality. The real root problem is policy. If you're a company running an EDR or even certain AV, you MUST build out redundant infrastructure - specifically to mitigate bad updates. Which really just means: if a system can run on 1 machine, you deploy at least 2 AND run a different EDR or AV on each system. If 1 crashes (like last week), no big deal. 2 is 1, 1 is none.

  • @MartinPHellwig 3 months ago +2

    When you expect something to work all the time in all circumstances, but you can't define what "all the time" actually is or what the specifics of the circumstances mean, you have unrealistic expectations. That is something each individual has to learn; those willing to be realistic will have an easier time, with less severe consequences, learning it.

  • @CraftyF0X
    @CraftyF0X 3 หลายเดือนก่อน +2

I for one always saw the possibility of something like this happening, hence my reservations against automatic forced background software updates, which would have sounded shady AF in the 90s but are a widely accepted daily occurrence today. Don't get me wrong, it has its advantages, but something like this case was always in the cards.

  • @joelmamedov404
    @joelmamedov404 3 หลายเดือนก่อน +1

Technical glitches can happen. The fundamental problem is not technical, it's managerial. "Business continuity" planning does not exist anymore. Critical systems and industries must have redundant and durable systems. All the eggs are in the same basket, unfortunately.

  • @DistortedV12
    @DistortedV12 3 หลายเดือนก่อน +8

    I think one of the problems is this automatic update culture

    • @robertbutsch1802
      @robertbutsch1802 3 หลายเดือนก่อน

      According to CrowdStrike this was not an “update” but a content delivery.

    • @luciaceba4640
      @luciaceba4640 3 หลายเดือนก่อน

@@robertbutsch1802 ...which is an update

  • @galuszkak
    @galuszkak 3 หลายเดือนก่อน +10

I think this is an interesting case of a software design decision, made 30-40 years ago, to build monolithic kernels (Linux, Windows, etc.) showing its consequences today. Prof. Andrew Tanenbaum tried to convince the software industry that microkernels are better for reliability and security, while sacrificing some performance. Looking back, my best guess is that by going with monolithic kernels, we built a whole security industry around the security flaws that can be there by design.

    • @pureabsolute4618
      @pureabsolute4618 3 หลายเดือนก่อน +2

It's also about how big "kernel space" is in general. Windows NT originally had graphics in user space. Of course, that was too slow, so they moved it "back" into the kernel (Windows 98 didn't have a protected kernel at all).

    • @CallousCoder
      @CallousCoder 3 หลายเดือนก่อน +4

The problem with microkernels is that they are complex. GNU Hurd failed because of it; Darwin is the only one now, but on x64 (I need to check ARM - I developed assembly on ARM but never from bare metal) there are only 2 security rings in practice. We used to have 4, but since all major operating systems and most CPUs since the VAX had 2 rings of protection, x64 also settled for two. So you don't have your classical ring 1 for your drivers anymore. So you may be loosely coupling your drivers, but all in all they run in the privileged area - hence macOS X on Intel did crash with shitty drivers too.

    • @gzoechi
      @gzoechi 3 หลายเดือนก่อน

      NixOS can easily switch back multiple versions of configurations (not just the kernel). That's not a problem where the kernel architecture needs to get involved.

    • @MartinMaat
      @MartinMaat 3 หลายเดือนก่อน

      It has nothing to do with this. The point of a virus scanner is that it should have control over everything by design. Which is not only a major security issue in itself but also a major privacy issue. As people get scared they tend to accept compromises, all the way down to fascism.

    • @ra2enjoyer708
      @ra2enjoyer708 3 หลายเดือนก่อน

      @@MartinMaat You meant liberalism?

  • @durand101
    @durand101 3 หลายเดือนก่อน +2

    The reason the world is so fragile right now is because of a) tech monopolies and b) efficiency over pragmatism. Why does MS have such a large share of the corporate market and why aren't our various regulators challenging that? In nature, monoculture ecosystems are the quickest to be killed by disease.
    And why does everything have to be automated to the point where there are no humans in the loop? Mostly to be more "efficient" and reduce costs - at the risk of much more expensive black swan events eventually coming to ruin your day.

  • @samable9585
    @samable9585 3 หลายเดือนก่อน +2

For a serious bug or zero-day, CrowdStrike should have simply disabled inbound traffic to the host (other than its own), worked on a fix, and rolled it out in a limited manner; if it succeeds, keep rolling it out... Would you fly a plane with this type of method? We ground planes immediately when there is a threat, but we treat security threats in computers in a slightly business-as-usual manner and take chances. This may change that... Act first, disable, and then push changes.

  • @davidgrisez
    @davidgrisez 3 หลายเดือนก่อน +1

    One main thing that allowed CrowdStrike security software to crash the computer operating system was the fact that this security software must be installed as a device driver operating at the high privilege level of the operating system kernel. Normal program software running at a lower privilege level should not be able to crash the operating system.

  • @samchristy6745
    @samchristy6745 3 หลายเดือนก่อน +1

Most threats do not require an immediate response; for many, a canary release mechanism based on system criticality level would do:
1) deploy to non-critical systems (grocery stores, small businesses, gas stations, government)
2) wait 36 hours
3) deploy to mid-level critical systems (banks, financial institutions)
4) wait 36 hours
5) deploy to critical systems (hospitals, pharmacies, airports)
For the DEFCON 1 threat-level scenarios, then perhaps use the shotgun approach.
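The tier schedule above could be expressed as plain data; this is a purely illustrative sketch (the tier table, names, and soak times are the commenter's example, and `plan_rollout` is a hypothetical helper):

```python
# Illustrative tier table: (tier name, target classes, hours to soak afterwards).
ROLLOUT_TIERS = [
    ("non-critical", ["grocery stores", "small businesses", "gas stations"], 36),
    ("mid-critical", ["banks", "financial institutions"], 36),
    ("critical",     ["hospitals", "pharmacies", "airports"], 0),
]

def plan_rollout(tiers):
    """Flatten the tier table into an ordered list of (action, detail) steps,
    inserting a soak period after every tier that has one."""
    steps = []
    for _name, targets, soak_hours in tiers:
        steps.extend(("deploy", target) for target in targets)
        if soak_hours:
            steps.append(("wait", soak_hours))
    return steps
```

Keeping the schedule as data means the "shotgun" emergency path is just a different table, not different code.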

  • @theronwolf3296
    @theronwolf3296 3 หลายเดือนก่อน +1

Maybe the kernel security layer should be virtualized, so that a corruption of the kernel can quickly be switched off.
Despite the claimed need for such deep access, if companies like CrowdStrike can corrupt the kernel, hackers (including nation-state actors) could do the same, or worse. At least the CrowdStrike bug just crashed the system, but other bugs could subvert it.

  • @PerisMartin
    @PerisMartin 3 หลายเดือนก่อน

    Well, the way you solve this is to keep doing what you are doing. Keep teaching and preaching good practices with your videos. You never know the second and third order consequences of your good work. Keep it up!

  • @steves9250
    @steves9250 3 หลายเดือนก่อน +1

    Shows how a product that works 99.99% of the time makes one mistake and it all goes to hell

  • @epiphoney
    @epiphoney 3 หลายเดือนก่อน +1

    Mark Russinovich retweeted using Rust instead of C++ for systems programming, "for no particular reason".

  • @diogotrindade444
    @diogotrindade444 3 หลายเดือนก่อน

All parties need to fix this broken system:
- Security companies cannot ever force-push without testing.
- OS vendors (especially MS) need to improve all aspects of this scenario, with lots of new, well-documented automated testing/check tools for multiple steps in the process.
- Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS running if you want to make sure you are online all the time.
We need better software built for failure, especially for essential companies that cannot stop. If companies do not fix this at all levels, it leaves a door open for failure.

  • @richardbloemenkamp8532
    @richardbloemenkamp8532 3 หลายเดือนก่อน +2

Staged/canary releases should be obligatory unless there is imminent danger, at which point the government should be involved. It is totally ridiculous that millions of PCs install kernel patches that have not even been checked on a starting group of a few thousand computers for at least one or two days. In this case there was no imminent great danger that absolutely required all of the millions of PCs to be updated within a few hours.

  • @billfrug
    @billfrug 3 หลายเดือนก่อน +3

    So your argument is that there was an imminent security threat that the update addressed? Is there any evidence of that?

    • @JeanPierreWhite
      @JeanPierreWhite 3 หลายเดือนก่อน

      I don't think so. Just a bad update. Fragile endpoints, lack of change management.
      I retire and a year later the world goes to crap. Geez, that didn't take long ;-)

    • @theronwolf3296
      @theronwolf3296 3 หลายเดือนก่อน

      Nothing I have seen so far even identifies the threat that was SO serious that this rush was essential. That's another part of the problem, security companies just go along and do things.

  • @CaribouDataScience
    @CaribouDataScience 3 หลายเดือนก่อน +3

    What’s the cliché say about putting all your eggs in one basket?

  • @charlesnicholas4758
    @charlesnicholas4758 3 หลายเดือนก่อน +3

Good video, but everyone seems to ignore the fundamental problem: how do you compile source code into a file of binary zeros?! At least if it had been an empty file, the size would have been noticed.

  • @CallousCoder
    @CallousCoder 3 หลายเดือนก่อน +6

If this clusterfuck has shown one thing, it is that important companies and institutions can't cope with disasters. There are no manual backup processes in place. And it's not only computer systems that can fail, but also long power outages, internet outages, long traffic problems that keep goods from going where they need to be. We need to rely less on centralized government infrastructure and decentralize systems. Back when I was in the energy business, I was advocating for small pebble-bed reactors for towns or larger neighbourhoods, instead of massive 1GW nuclear power stations - the bigger, the more complex, and the more material is needed. Small reactors are simpler to build and even safer. Russia understood this and is moving in that direction. It is also great against acts of war: take out 4 or 5 power plants and you disturb all of the industrial areas of the Netherlands; taking out 50-100 smaller ones is a lot more of a hassle. And we should use cash, cash, and more cash for our daily shopping. And we should actually buy locally from local farms much, much more.

    • @joansparky4439
      @joansparky4439 3 หลายเดือนก่อน

      economies of scale drive this, which means this is NOT true: _"the bigger the more complex and the more material is needed"_

    • @JeanPierreWhite
      @JeanPierreWhite 3 หลายเดือนก่อน

      Many companies do have manual processes. However they are very slow and inefficient. If that wasn't the case then we wouldn't deploy computers in the first place.

    • @CallousCoder
      @CallousCoder 3 หลายเดือนก่อน +1

@@JeanPierreWhite many did, a lot didn't - like hospitals and GP offices, where that's just unthinkable! I worked in healthcare software; we documented a backup process as part of our manual. You could print your agenda, you could print user details and treatment and medication plans. And most did that. You don't need your computer system to diagnose or treat people. Same with issuing boarding cards: the SITA system was still running - print a passenger manifest and issue the boarding cards manually. Some airlines did, most didn't.
So it showed how painfully unprepared we are. And this was only a simple computer outage, let alone something more impactful like a power outage.

    • @CallousCoder
      @CallousCoder 3 หลายเดือนก่อน

@@joansparky4439 it is true in the case of nuclear power plants and engineering.
If something is bigger, it will always require more resources to build. I can't build a reactor thinner.
And that statement is only true in the case of consumer goods where you can make billions of units. For critical systems, the cost isn't in manufacturing the actual system, but in all the safety and secondary systems.
A single-engine plane will always be simpler and cheaper than a 4-engine plane.
It's not just bolting an extra engine onto the plane: your mass increases, so that first engine should be able to hold the plane up with the added mass. You will need to monitor the two engines and balance them for wear and tear. You'll need to service the two engines. And this complexity gets worse with 4 engines. What if two engines stop? Then the other two should take over, but the whole load-bearing structure is now unevenly loaded, and that needs to be designed and tested.
Critical systems get more complex quickly as you start adding to the control systems.
You probably haven't studied engineering, and especially not done critical systems, because that's where the laws of general economics don't apply, simply because of the snowball effect.
Also, there aren't enough true mission-critical systems to get the benefits of economies of scale. How many nuclear power plants are built every year? How many satellites, etc.?

    • @CallousCoder
      @CallousCoder 3 หลายเดือนก่อน

@@joansparky4439 funny story: I got into critical systems by building a very simple device that measured the height of snow/ice in Antarctica. You would knock out a prototype for this in a few hours these days; back in 1993 it took about a week (all in assembly, with no libraries available). But since this system had to run unattended for 5 years, the complexity of the peripheral systems suddenly exploded! We needed two batteries, with charging circuits that would charge the batteries equally (if they don't have the same capacity, you are basically discharging one over the other). These charging circuits had to be redundant, which also meant two solar cells, cross-connected. You need to be aware that 6 months out of the year there is no charging, so each battery by itself should be able to hold a 7-month charge. So suddenly the batteries became twice as big, and the housing as a result became twice as big. But we also needed 3 sets of ultrasound range finders for redundancy, and that added enormous code complexity: checking whether the primary system was working by comparing it to the secondary, and if there was a discrepancy (which with snow and ice is very normal, because it forms in heaps), taking the secondary system after comparing it to the tertiary system.
If the tertiary system decided the primary and/or secondary system was defective, you don't want to keep using those range finders, to save crucial energy. As a matter of fact, let's decouple them from the CPU bus.
The cost exploded! Not only in resources, but mainly in design and development. And you never build enough of these systems to get the economic benefit.
There are simply never enough built for that. Basically all those critical systems, from planes to tanks to satellites to bespoke research equipment, are manually made by a very select few people.
It's not that you go to China and let a factory build 2200 satellites. First of all, a factory that can do that doesn't exist and would need to be designed and built.
And 2200 is a big number of satellites.

  • @Apstergo
    @Apstergo 3 หลายเดือนก่อน +2

Asking these questions is important. Actively listen to industry experts and less to corporate experts (they only optimize for return on investment, and right now that is AI).
This event should be a wake-up call, but I don't think people will think of it like that.

  • @bobdowling6932
    @bobdowling6932 3 หลายเดือนก่อน +1

There is (should be) a standard pre-release test even for time-critical security software: that the target operating system can at least boot to a point where the updater can install new versions of the security software. The test should be run twice: once on an instance that keeps upgrading the software, and once on a freshly installed operating system. If just those tests are implemented then, to an extent, you can rush the rest, because fixes can be sent out to clean up errors. This test doesn't need rewriting for each release of the software.
Other tests (does it block the malware, does it not interfere with critical applications, ...) can be run after launch, because errors can be cleaned up automatically. There is room for subtlety here: a customer might sign up for the pre-application-testing version or the post-application-testing version. Perhaps they do their own testing. Perhaps they have made a risk-balancing decision.
This sounds so obvious. Hindsight is a beautiful thing.
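The boot-then-updater gate described above fits in a few lines once the VM plumbing is stubbed out. In this hypothetical sketch, `boot` and `updater_reachable` are placeholders for real test-lab automation, and the two scenario names mirror the comment's "upgraded instance" and "fresh install" cases:

```python
def release_gate(boot, updater_reachable):
    """Minimal ship/no-ship gate: an update may only go out if a machine that
    received it still boots far enough for the updater to deliver a fix."""
    for scenario in ("upgraded-instance", "fresh-install"):
        machine = boot(scenario)
        if machine is None or not updater_reachable(machine):
            return False  # a bad build could brick the fleet: block the release
    return True
```

The key property is that the gate only checks recoverability, so it never needs rewriting per release.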

  • @gamechannel1271
    @gamechannel1271 3 หลายเดือนก่อน +6

I'll just say there is no reason this software would need to "quickly deal with a security threat". The software itself is a security threat. It should be removed from all computers, and the company should be disbanded. See the videos from people who have analyzed their driver and how poorly it validates its virus definition files. The download of a bad definition file caused this crash, NOT a driver update - because they wanted to bypass the driver update process.

    • @kjetilhvalstrand1009
      @kjetilhvalstrand1009 3 หลายเดือนก่อน

Absolutely - an update can be a back door into the kernel. If they do not check what is pushed, it shows how bad this company is.

  • @d3stinYwOw
    @d3stinYwOw 3 หลายเดือนก่อน

CrowdStrike hit Linux a few months ago too, but nobody said anything since the impact was smaller.
CrowdStrike was also able to force such upgrades. Plus, we can have both tests and velocity; Dave, the main face of the Continuous Delivery channel, has said as much as well :)

  • @TimShear-p3s
    @TimShear-p3s 3 หลายเดือนก่อน

It seems to me to be the old problem of shutting the gate after the horses have left. What's needed is a kernel process built with a number of 'gates', where the system will not continue past a gate unless it passes that gate's constraint, and then an error 'fail safe' that lets the system execute in safe mode so any changes can be made to restore operation if a gate fails. This would have saved CrowdStrike.
Further, the system should be based on an identity management framework where identities/entities and permissions can't be pivoted or navigated out of to other entities, at the kernel level and beyond. All operations should check credentials before executing programs. This is easy to do if there is a relationship between the entity requesting access and the entity controlling access. No relationship, no access.

  • @MikeHunt-pu5cm
    @MikeHunt-pu5cm 3 หลายเดือนก่อน

The CrowdStrike issue is extremely easy to fix in two minutes...
boot in safe mode
run cmd (command prompt) in admin mode
Type:
cd %windir%\System32\drivers\CrowdStrike (hit enter)
del C-00000291*.sys (hit enter)
Then reboot the machine
All done...!

  • @kellyaquinastom
    @kellyaquinastom 3 หลายเดือนก่อน +1

    This is called “Experience”

  • @natduinfo
    @natduinfo 3 หลายเดือนก่อน +2

    NSA has better access to kernel than CrowdStrike. Let that sink in. 😂

  • @eglobalsystems2554
    @eglobalsystems2554 3 หลายเดือนก่อน

That taught us again: SDETs are an important part of our software life cycle!

  • @bernhardkrickl5197
    @bernhardkrickl5197 3 หลายเดือนก่อน

The promise of Continuous Delivery (as Dave Farley explains so often) is that you can release quickly and safely *because* you have lots of tests. You work in small steps to achieve that. There might be an imminent threat, and we may have to make a big change to our software to deal with it. Then you are back at square one: how do you know your change actually deals with the threat? Oh, that's right: by testing. If you say you need to skip that phase, you don't believe in testing in the first place. If you skip that phase you get *something* to market quicker, but will it help? Or are you pouring oil onto the fire? The practice of continuous delivery with TDD is the best insurance that your software stays flexible and easy to change, so you can deal with such problems quickly when they arise suddenly.

  • @gruntaxeman3740
    @gruntaxeman3740 3 หลายเดือนก่อน +1

One root cause of the issue is bullshit security: having a lot of complexity and then adding more complexity in the form of some security application.
In reality, when someone wants to build a reliable system, the amount of complexity is minimized. That is why, in critical places, all unnecessary "moving parts" are removed and the system is locked down tightly. It can be even better if the code is formally verified to avoid bugs.
Humanity has the knowledge to do this correctly. I alone have the knowledge to do it.
Instead we see bloated software stacks, and dumb IT departments who think that endpoint security software should be installed on a critical, dedicated system - or dumb insurance companies who require it.
One issue is also that today 95% of software developers don't even know how a computer works. There is a lack of deep knowledge, yet software developers are still the people who understand the technology better than some lawyer at an insurance company.

  • @dannym817
    @dannym817 3 หลายเดือนก่อน

As a software engineer myself:
- Too little time: to test, to build well/refactor, to rebuild legacy code. Deadlines push bad/poorly tested software into production.
- Too much stress because of too much firing/people leaving and rehiring all the time. And with that, the knowledge of parts of the software is gone.
- A lot of bad managers in the IT world, who make the above happen.
- Companies see software development/IT as a cost instead of a win. For example, in some companies I worked at, salespeople get bonuses when they sell enough of the software, while software engineers don't get anything.
- There is no easy way for managers/people who can't read code to see how well or badly a software engineer has been working. Because of this, most companies only look at speed, not how well the software is written.
This has been happening for a very long time in lots, probably most, companies, with legacy code that isn't workable anymore, is very hard to maintain, and should have been replaced years ago.

  • @robinlioret7998
    @robinlioret7998 3 หลายเดือนก่อน +32

Add poor patch management in the companies: never apply patches directly in production without testing them in lower environments first...

    • @gcaussade
      @gcaussade 3 หลายเดือนก่อน +2

This is what really amazes me: the fact that so many companies were just rolling this out. But his point is correct. I give more blame to Microsoft and CrowdStrike. They're the ones that have to work very closely together and do something new, more like real-time testing. It's amazing this hasn't happened before. The largest breach in US history was the UnitedHealthcare Optum breach months ago. That was a result of companies not patching fast enough! And that was remote software, not something near the kernel.
It still led to a massive disaster and problems with the health care system for over a month! So if anything, CIOs and CISOs felt more compelled to roll out security software even faster, to make sure it is at least up to date. What would happen if you were breached because you didn't roll out CrowdStrike fast enough?
That's the dilemma he brings up.

    • @xBanki
      @xBanki 3 หลายเดือนก่อน +7

Reading anecdotal reports online, CrowdStrike likes to push their customers into enabling automatic patch updates. Logically, it makes sense why they would do that; however, historical evidence (and literally any administration handbook) says that blindly accepting updates, no matter the reputation of the company and the claimed quality of the updates, should not be done, precisely to prevent outages like the one we saw.

    • @robertbutsch1802
      @robertbutsch1802 3 หลายเดือนก่อน +3

      This was the equivalent of an AV pushing out a new virus signature file. No enterprise is going to pay the cost of CrowdStrike just to be a week behind on threat protection.

    • @silmarian
      @silmarian 3 หลายเดือนก่อน +1

      They pushed it using the same channel as signature updates, not the usual upgrade path.

    • @Lofote
      @Lofote 3 หลายเดือนก่อน +1

That is not really valid in 2024 anymore. That was the case in earlier times, but in 2024 zero-day attacks are so common and threatening that security updates are considered time-critical. Meaning the risk of crashing your systems is considered more acceptable than a successful hack, where your data may be downloaded by the attacker, which is considered a far bigger disaster. Time is critical in 2024 with security patches :(...

  • @ulrichborchers5632
    @ulrichborchers5632 3 หลายเดือนก่อน

A rant about this is perfectly fine. We need to speak the truth if something clearly goes wrong. To remain silent, not wanting to be "negative" in such a scenario, would be wrong; it only strengthens the wrong approach and thinking.
The responsibility of a software engineer includes detecting problems early, thus avoiding them in the first place. This is an essential part of implementing a good solution: avoiding a bad one.
CD is exactly about that. It is not rigid at all; don't fall into that trap. To bypass good engineering practices is never a good idea, especially at this scale. CD with all its techniques supports fast incremental progress into production. It can raise software quality dramatically, minimize defects, and also prevent disaster, both with respect to releases and to the intrinsic quality of the product.
A failed integration test with a widely used OS obviously would have prevented this. The choice here is not whether to allow a quick release into production or not (good CD practices even speed up the release, because a high degree of automation is included). The choice of whether to apply CD best practices or not is this: do you want to avoid disaster and be notified as early as possible that you have to fix the software before deploying it, or do you really want to deploy a bug into production when you could have detected and fixed it beforehand? Applying CD is not about preventing a quick deployment. In this scenario it would have been about preventing the release of a problem.
If people have trouble releasing quickly when they have to, then they DO have quality problems with their software or with the system architecture, or they do not understand CD well enough to apply it correctly. They then have to improve their engineering skills instead of blaming the necessary, professional techniques which they have not mastered.
Incompetent people in charge making decisions under pressure, or for whatever reasons, are the actual threat. It is not "AI", but yes, the threat is also about opacity and lack of knowledge.
"AI" is a marketing term, nothing more. If a piece of software can detect patterns and even learn to improve the detection of security problems, that is perfectly fine, whatever techniques or tools it uses.
The internal mechanics of a piece of software are never transparent to the "end user". This is normal, whether "AI" is included or not. But yes, we do have a problem if engineering is done without exact knowledge of what is going on. The ongoing loss of culture, education, and priorities is the actual problem, not the use and integration of technology itself. When I notice software "engineers" using the term "AI" internally and following the same belief system, that scares the hell out of me.
Technology is not a threat, but incompetence is, and always has been. There is a clear answer to the question of how to solve the global problem: education, thinking, and acting with a sense of responsibility and the right priorities as human beings.
Now this may read strangely, but what that means is this: stop cheating. If you do something, do it well for the sake of what you are doing. Money must never be the top priority; money is not even real. It is a mental construct to make exchanging things more efficient, by comparing the value of things. This requires balance to work: subtract or add the same on both sides. This is simple math and obvious stuff. If you see money as something to be maximized, then this is abuse of the limited resources we have and of other human beings, of your very own species. Magic does not exist in this world; money cannot be "generated" from nothing. The law of conservation of energy is a fundamental law of physics, and it of course applies to everything, including life itself. For every shortcut and for every selfish act, someone else will have to pay the price.
Is there a problem with a piece of software which breaks encapsulation to access the kernel of an OS? Yes, of course. Does it have to do with EU regulations forcing MS to allow access to a protected layer? It seems so. But the root cause of this may be unfair economic practices to exclude others from "competition", or pretending that a successful company does so and thus invading their space, for the sake of making money. It does not matter which side is right there. That is fundamentally the wrong type of thinking. The actual root cause is a wrong definition of "success" which has invaded our culture in a very harmful way.

  • @Colaholiker
    @Colaholiker 3 หลายเดือนก่อน +1

Surprisingly, my employer was not affected. After all, our IT usually doesn't let any chance to screw up pass without making good use of it.
But we are affected by supposed "AI security software". Some years ago, they changed from a common rules-based antivirus to something that supposedly uses AI, and it is so terrible. Among the things it likes to target are programs you just compiled (even a simple "Hello World" used to test whether a new compiler version works), parts of development software that we have used for ages, and now it has even started attacking another supposed security software (we think it is just something they use to monitor what we do) that supposedly filters web traffic. Yes, it deletes that software as well. Of course, this is all just stuff that affects individual workstations, not anything on a global scale, but it is so annoying...

  • @osark2487
    @osark2487 3 หลายเดือนก่อน +1

At this point, autopsies have popped up on youtube channels everywhere. We most definitely know what happened, how, and why.

  • @k98killer
    @k98killer 3 หลายเดือนก่อน

The driver code itself was not updated; rather, a "channel" file that contained attack detection templates was pushed out as all zeros. The driver contained faulty template-verification code that allowed the broken file to be parsed, including what should have been a valid pointer offset value. The driver then dereferenced a bad pointer and crashed the system. So really, if they had more thorough testing of their core code, they could have prevented this.
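As an illustration of the kind of validation being described: a toy parser where the 4-byte-offset layout is invented for the example (not CrowdStrike's actual channel-file format), which rejects an all-zero blob instead of handing back a bogus pointer offset:

```python
import struct

def read_template_offset(blob: bytes):
    """Parse a 4-byte little-endian offset from the start of a content blob
    and bounds-check it before anyone 'dereferences' it."""
    if len(blob) < 4:
        return None          # truncated file: refuse to parse
    (offset,) = struct.unpack_from("<I", blob, 0)
    if offset == 0 or offset >= len(blob):
        return None          # an all-zero blob lands here instead of crashing
    return offset
```

In a kernel driver the equivalent check is a couple of comparisons before the pointer arithmetic, which is exactly what the post-mortems say was missing.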

  • @allenpierce4575
    @allenpierce4575 3 หลายเดือนก่อน +1

It doesn't help that newer versions of Windows don't let you roll back an update without it trying to reinstall itself right after you remove it.

  • @center-q4k
    @center-q4k 2 หลายเดือนก่อน

Great analysis - especially the point about the dichotomy...

  • @miyu545
    @miyu545 3 หลายเดือนก่อน +3

That's what happens when you have no patch management or change management process. Microsoft does not have one; it has the public do that for them.

    • @JeanPierreWhite
      @JeanPierreWhite 3 หลายเดือนก่อน

      Not true. Microsoft does have change management as do their customers.
      The problem is that Windows is too fragile. If a problem sneaks by then you have few recovery options.

  • @pureabsolute4618
    @pureabsolute4618 3 หลายเดือนก่อน +1

First, there is *no way* someone else's driver should be pushed to *your* customers. Second, if it is pushed, protect it with the software-stack equivalent of a try-catch. If that takes too many resources, allow removing the try-catch guardrails via something like a manual group policy push.
But people focus too much on non-scaled performance. If a driver takes 10% more of your computing power by being outside of the kernel, that should be a choice you can make as the customer, and in most cases you should choose it. I remember when Windows XP(?) crashed because of a bad graphics driver. I was pissed, since the performance hit of running it outside the kernel wouldn't have affected what we were trying to use "NT" for. Kernels should have the option to be (or by default be) as slim as possible.
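In user space, the "try-catch around someone else's code" idea looks roughly like this hypothetical wrapper (kernel code has no such luxury, which is part of the point; `guarded` and its behavior are illustrative, not any real API):

```python
def guarded(callback, fallback=None):
    """Wrap an untrusted extension callback so one bad plug-in degrades
    gracefully instead of taking the whole host process down."""
    def wrapper(*args, **kwargs):
        try:
            return callback(*args, **kwargs)
        except Exception:
            return fallback  # a real system would log the failure here too
    return wrapper
```

The design choice is the same one the comment argues for: pay a little overhead per call in exchange for containing a third party's bug.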

  • @Chukwu1967
    @Chukwu1967 3 months ago +1

    Hey, bright side: Southwest Airlines just announced, proudly, that they still use Windows 3.1. So... have at it.

  • @dalenmainerman
    @dalenmainerman 3 months ago

    Great video as always! Thanks, Arjan!
    Completely off topic:
    It would be very interesting to see your take on the game "The Farmer Was Replaced", where you have to write code to automate a farming drone.
    Thanks!

  • @askii3
    @askii3 3 months ago

    I imagine Monte Carlo-like testing methods are going to become more common for testing critical software, as demonstrated by the TigerBeetle database team.
    The bug was from dereferencing a null pointer. The channel Low Level Learning had a video on this. I saw another video saying it would have been very difficult to catch this in testing (I don't recall why). Clearly automated testing needs to improve to have better odds of catching such errors.
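    A minimal version of the randomized testing the comment describes — hammering a parser with degenerate and random inputs and treating mere survival as the pass criterion — might look like this in C. The parser here is a stand-in written for this sketch, not real CrowdStrike code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in parser under test: returns 0 on reject, 1 on accept.
 * A correct parser must reject without crashing for ANY input,
 * including the all-zero file that triggered the real outage. */
static int parse_ok(const uint8_t *buf, size_t len) {
    if (len < 8) return 0;
    uint32_t magic, off;
    memcpy(&magic, buf, 4);
    memcpy(&off, buf + 4, 4);
    if (magic != 0xC0FFEE01u) return 0;
    if (off >= len) return 0;    /* bounds check before any deref */
    return buf[off] != 0;        /* safe: off has been validated  */
}

/* Fuzz harness: feed degenerate and random inputs; not crashing
 * is the test. */
static int fuzz(unsigned iterations) {
    uint8_t buf[256];

    memset(buf, 0, sizeof buf);              /* the all-zero case   */
    (void)parse_ok(buf, sizeof buf);

    srand(12345);                            /* deterministic in CI */
    for (unsigned i = 0; i < iterations; i++) {
        size_t len = (size_t)(rand() % (int)(sizeof buf + 1));
        for (size_t j = 0; j < len; j++)
            buf[j] = (uint8_t)(rand() & 0xFF);
        (void)parse_ok(buf, len);            /* must not crash      */
    }
    return 1;                                /* reached = no crash  */
}
```

    A dedicated fuzzer (e.g. a coverage-guided one) would explore inputs far more effectively, but even this loop would have fed the parser the all-zero file before any customer did.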

  • @mrtnsnp
    @mrtnsnp 3 months ago +12

    In part you want to avoid single points of failure. So don't run all your systems with the same security software, or the same base OS for all parts of your systems. A more diverse collection of systems is less likely to go down all at the same time. CrowdStrike is for sure not the only provider of these kinds of services, and for sure they won't be the first (or last) to introduce bugs in kernel drivers. There are sufficient opportunities for shit to hit a fan. On the other hand: Apple removed kernel access for all third-party software. Windows may need that as well, with an API to perform these tasks from user space rather than inside the kernel. And CrowdStrike needs better processes for developing their code, but they are not unique there.

    • @whlewis9164
      @whlewis9164 3 months ago +4

      A more diverse collection of systems also introduces complexity of support, management, monitoring, licensing, and contracting.

    • @mrtnsnp
      @mrtnsnp 3 months ago +1

      @@whlewis9164 Yes. Instead of one configuration, you have to support two, you split the options for each functional piece in two, and end up with two sets. In my view this likely beats the downtime that was just experienced.

    • @whlewis9164
      @whlewis9164 3 months ago

      @@mrtnsnp I very much doubt our corporate management overlords will opt for the best technical approaches. They will likely continue to squeeze the budgets, ship support overseas, and put the bottom line over everything else.

    • @mrtnsnp
      @mrtnsnp 3 months ago

      @@whlewis9164 They get what they pay for, that is for sure.

    • @ra2enjoyer708
      @ra2enjoyer708 3 months ago

      @@mrtnsnp Two? Try O(n^2), aka every OS with its own barely specced configuration format which depends on a specific version of parser which depends on a specific language it was written for. Also this kind of clusterfuck introduces another attack surface in the form of different parts of the stack interpreting the same value differently, in worst case with a race condition on top.

  • @christiananke667
    @christiananke667 3 months ago

    We tend to adopt solutions that nature already provides - and we did here. This CrowdStrike event can be compared with a disease attacking a species. Diversity helps prevent extinction - but is no guarantee. Diversity in (security) software helps here as well. So we can be happy that other operating systems exist, otherwise ...
    Furthermore, if we think about software bugs and AI "bugs", we should also think about why children misbehave (or are buggy). It is the same thing: train/program/validate them well and you have what you need - turned around, that means society gets the children/software it deserves. Somehow we are asking for what happened, like children running around like crazy with no idea that moving too fast can hurt them. But - at the latest - physics will teach them. We are no longer able to raise our children properly, so why do we think we do better with software or AI?
    So long, and thanks for all the fish.

  • @semibiotic
    @semibiotic 3 months ago

    This is a question of proper, profit-driven system administration.
    Companies lay off their system administrators and turn on all updates automatically - so they get automatic crashes like this one.
    Big companies should keep system administrators to manage infrastructure maintenance, growth, and updates safely.

  • @philipoakley5498
    @philipoakley5498 3 months ago

    I agree.
    There is a lack of appreciation in the general 'software' industry of the shaky foundations of logic & perfection that coding is built upon, and how it has permeated the foundations of many other parts of society.
    Those who are reaching for blame should have a read of the years of study of human error, safety studies, and their ilk to see how these major failures continue to happen.
    It's just another day in our interconnected world.
    When the underlying unexpected factor(s) is/are finally identified, they'll likely be tedious and boring from some disused cupboard that had been forgotten about (xkcd/2347).

  • @JordanEdmundsEECS
    @JordanEdmundsEECS 3 months ago

    Perhaps even more fundamental: software quality will always come with a price tag, and you've gotta please them shareholders.

  • @johnmcway6120
    @johnmcway6120 3 months ago +1

    It's just going to happen sometimes. There are construction accidents, there are medical accidents, there are accidents in every field, big or small, regardless of complexity.
    No manager is going to see this happen and say: hey guys, we just spoke with the board and decided we can invest twice as much money to ship this feature, and we're moving the expected delivery date by 2 months to ensure there's no crunch and devs are well rested. That's not how business works, not in my experience.
    A good answer is to always have backup systems in place. When driving, one should always keep a fire extinguisher and a spare tire. There are steps that all of us can take - developers, managers, users - but we won't. That's why you should just get used to the idea that this is simply going to keep happening.

    • @JeanPierreWhite
      @JeanPierreWhite 3 months ago

      There are more resilient OSes than Windows.
      Companies need to move away from Windows on the desktop for critical systems.

  • @d0wnboy
    @d0wnboy 3 months ago

    I love this. Brain-dead businesses stick their IT in the cloud and lose complete control of their businesses. It couldn't happen to a better class of people.

  • @henryvaneyk3769
    @henryvaneyk3769 3 months ago

    Part of the solution is the adoption of safe languages like Rust for system-related components.

  • @douglasengle2704
    @douglasengle2704 3 months ago

    It has always been risky to use an operating system that became popular because it could run on the preferred video-game PC platform, with the security concerns that come with that. There is a huge benefit to going with the crowd by using MS Windows, but for critical systems, the ability to switch to a Linux operating system on the same hardware will likely be exercised at companies like Tesla.

  • @daimajind7231
    @daimajind7231 3 months ago +4

    Has anyone considered how a single company can affect so many systems worldwide, single-handedly, at the kernel level? Doesn't that mean a single bad actor at the company could compromise those same systems globally with a silent malicious payload, without anyone knowing or even noticing, thanks to default automatic updates to the bleeding-edge build?

  • @esra_erimez
    @esra_erimez 3 months ago +1

    As a society we are too dependent on computers in general.

  • @danjolly9505
    @danjolly9505 3 months ago

    You just described outsourcing thinking. My single biggest problem with these systems is

  • @MelloBlend
    @MelloBlend 3 months ago

    I want to know what the actual failure was. I saw someone post a clip of the offending jump routine that was trying to move data to or from register R8. These are general-purpose registers, but what was the offending data, address, or executable issue? Did that file we all deleted serve some other purpose that no one is mentioning because they don't know? Was it something nefarious?

  • @attilazimler1614
    @attilazimler1614 3 months ago

    A bunch of those systems have no real reason to connect to the internet, just a reason to connect to another internal node. And if you can't actually reach the device directly, it wouldn't need this protection, so the protection couldn't break it. It seems what we actually forgot is proper isolation of systems.

  • @RagdollRocket
    @RagdollRocket 3 months ago +1

    How about running an automated test that installs the update on a test machine before it gets rolled out 😂 It was a null pointer; it crashed on ALL machines that got the update.

  • @davidwolianskyj809
    @davidwolianskyj809 3 months ago

    Question: 4:59
    Answer: Use Rust.

  • @calkelpdiver
    @calkelpdiver 3 months ago

    This is why you need to properly test your installation/deployment process and tools. I'm not blaming the test group for this one, as I'm sure they were pressured and overruled on releasing this patch (by both Microsoft and CrowdStrike). Been there and done that a few times in my career.
    Testing of the deployment process and the installation/configuration tool is always overlooked. Always has been. I've tested installation software for commercial products and at times found it pretty crappy at checking for version differences of files (don't overwrite a similar file - warn the user if you have to, though a lot of this runs in silent mode), at putting files in the correct locations, and at validating changes made to config/INI files or the registry. A lot of installers are "dirty" in their process - they basically just jam things on.
    But this mentality is endemic to software development, and always has been. Companies have to remember that the installer is the first piece of software an end user encounters. If it isn't pretty much bulletproof, you're going to have a pissed-off customer, and your reputation will be heavily impacted.

  • @rossminet
    @rossminet 3 months ago

    Thank you! Too much tech is an inference we can draw.

  • @davew-marketer8264
    @davew-marketer8264 3 months ago +2

    Just good questions! I hope more and more people will open their eyes.
    @ArjanCodes I only discovered your channel 2 weeks ago, but I love your good, clear videos. Thanks!

  • @DavidTangye
    @DavidTangye 3 months ago

    The two best ways to reduce the risk and impact: canary releases, and switching to Linux.
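    The canary-release idea mentioned above can be sketched as deterministic cohort selection: each machine hashes its ID into a stable bucket, and an update is offered only once the rollout percentage covers that bucket. The function names and the choice of FNV-1a here are illustrative, not any vendor's actual mechanism.

```c
#include <stdint.h>

/* FNV-1a hash: maps a machine ID to a stable bucket in [0, 100). */
static uint32_t rollout_bucket(const char *machine_id) {
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (const char *p = machine_id; *p; p++) {
        h ^= (uint8_t)*p;
        h *= 16777619u;                  /* FNV prime        */
    }
    return h % 100;
}

/* An update is offered only if the machine's bucket falls inside the
 * current rollout percentage. Start at 1%, watch crash telemetry,
 * then widen to 10%, 50%, 100%. */
static int update_offered(const char *machine_id, uint32_t rollout_pct) {
    return rollout_bucket(machine_id) < rollout_pct;
}
```

    At 1% only a sliver of the fleet receives the new channel file; a crash loop in that cohort halts the ramp instead of taking out every machine at once.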