CrowdStrike Exposes a Fundamental Problem in Software

  • Published on 15 Sep 2024

Comments • 582

  • @ArjanCodes
    @ArjanCodes several months ago +1

    ✅ Get the FREE Software Architecture Checklist, a guide for building robust, scalable software systems: arjan.codes/checklist.

  • @MysticCoder89
    @MysticCoder89 several months ago +22

    Your kernel is crashed. No malicious code can be executed. Your computer is completely protected now. Thank you for choosing our company!

  • @ropro9817
    @ropro9817 several months ago +165

    I think the even more fundamental problem here is the security software mono-culture. I know CrowdStrike is big, but honestly, I was surprised when I heard in the news how broadly sweeping the impact was across companies and even across industries. If everyone's using the same software, that provides a ripe attack vector for hackers. 😒

    • @FrankmoonDusty
      @FrankmoonDusty several months ago +17

      This, 100%. We need to move away from tech monopolies like CrowdStrike, and Microsoft by extension.

    • @penfold-55
      @penfold-55 several months ago +11

      It's a much bigger issue... Most of the biggest companies are concentrated in a very small area on the US West Coast.
      Microsoft, Google, Nvidia, AMD, Intel, Facebook, Amazon, and so on.
      The issue is that Europe and Asia are just so far behind America.

    • @alexivanov4157
      @alexivanov4157 several months ago +1

      Bravo! This is the main point of the issue!

    • @username7763
      @username7763 several months ago +5

      It isn't entirely a monoculture. All it takes is for one layer in a massive distributed system to all be on the same thing. The problem is the crazy complexity of today's IT systems. Everything is a damned service that requires its own cluster and infrastructure.

    • @robertbutsch1802
      @robertbutsch1802 several months ago +11

      No enterprise IT folks in their right minds are going to say, "Look, everyone else is using the best threat protection software in the business, so let's use Acme Security Software so we're not promoting a monoculture."

  • @whatcouldgowrong7914
    @whatcouldgowrong7914 several months ago +25

    People seem to be overlooking the glaring fact that they pushed an update that was corrupted or failed its checksum, which means there was a wide-open vulnerability that allowed man-in-the-middle exploits or injecting code with modified files directly into the kernel…

    • @vitalyl1327
      @vitalyl1327 several months ago

      Keep in mind that Clownstrike is a scam company selling "cybersecurity" snake oil, just like all the other antivirus companies. There is zero value in their product. They have no incentive whatsoever to do the right thing, because they're consciously scamming their customers anyway.

    • @fransstar8731
      @fransstar8731 several months ago

      I see a lot of answers and recommendations, but what surprises me is why CrowdStrike works on Linux and not on Windows. Apparently Microsoft needs an extra driver to let CrowdStrike work. I think it has everything to do with the different structure between Microsoft and CrowdStrike. I think it is time that Microsoft changed its whole structure to be like Linux. This whole thing is Microsoft's fault. It is clear ethical hacking can only be done with Linux and not Microsoft. Wake up, people. Linux has sudo and Windows does not; this was and is the main issue. Awaiting comments. Thanks.

    • @whatcouldgowrong7914
      @whatcouldgowrong7914 several months ago

      @fransstar8731 They tried to and were blocked by Europe. At the very least, Microsoft needs to revoke their WHQL certification and prevent changes after the fact.

  • @AMMullan
    @AMMullan several months ago +179

    So we have the CrowdStrike option ENABLED so that CrowdStrike won't release the latest version of their software to us (we stay 1 version behind) - apparently they don't actually even check for this, so we got it anyway. Absolutely shoddy development :(

    • @TheGreenRedYellow
      @TheGreenRedYellow several months ago +1

      Wouldn't you get it in the next release? So technically you are not immune to this update, unless you manually deploy it.

    • @gcaussade
      @gcaussade several months ago +6

      @AMMullan Wow, that's interesting information. It's interesting to see what happened. Did they have an emergency release? Maybe they felt there would be a breach if they didn't release something right away? So many questions. I have a hard time believing there are so many incompetent organizations around the world. If these companies were choosing to be one version behind, specifically to avoid something like this, then how did this happen?!
      That's crazy!

    • @gcaussade
      @gcaussade several months ago +11

      @TheGreenRedYellow What do you mean? You would assume that people would report the BSOD and they would stop the rollout. The problem with being one version behind is that you're not getting the latest protection. But I could see doing this to avoid this exact situation.

    • @AMMullan
      @AMMullan several months ago +4

      @gcaussade Yeah, they killed that update, so anyone not getting the latest update wouldn't have received this at all 😕

    • @TheGreenRedYellow
      @TheGreenRedYellow several months ago +2

      @gcaussade It is really about how many updates they released. What if they released two updates within the same day?
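The "stay one version behind" policy discussed in this thread can be sketched as a small selection function. This is only an illustration, not CrowdStrike's actual mechanism: `pick_version` and the version strings are hypothetical. It also shows the edge case raised above, since two releases on the same day move an N-1 customer onto the first of them almost immediately.

```python
from typing import Optional, Sequence


def pick_version(released: Sequence[str], lag: int = 1) -> Optional[str]:
    """Return the version that is `lag` releases behind the newest one.

    `released` is ordered oldest to newest. Returns None when there are
    not enough releases to honour the requested lag.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    index = len(released) - 1 - lag
    return released[index] if index >= 0 else None


# Hypothetical version numbers, for illustration only.
releases = ["7.14", "7.15", "7.16"]
print(pick_version(releases))          # N-1 policy: prints 7.15
print(pick_version(releases, lag=0))   # always-latest policy: prints 7.16
```

A policy like this only protects a customer if every kind of update, including content or definition files, is routed through it.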

  • @ying-ym8ut
    @ying-ym8ut several months ago +19

    The CEO of CrowdStrike, George Kurtz, was the Chief Technology Officer of McAfee in 2010, when a security update from the antivirus firm crashed tens of thousands of computers.

    • @KC-uf1rg
      @KC-uf1rg several months ago +1

      He upped the ante now 😂😂😂

    • @Discoverer-of-Teleportation
      @Discoverer-of-Teleportation several months ago

      😂😂😂😂

    • @yoyolim538
      @yoyolim538 several months ago +3

      We got crowd struck

    • @Jace-yt2zm
      @Jace-yt2zm several months ago +1

      CrowdStrike dropped the ball and brought down a big chunk of the world's commerce and business, while CEO George Kurtz is thoroughly enjoying, and consumed by, his race-car-driver lifestyle at events all over the globe!

  • @kwas101
    @kwas101 several months ago +33

    It's partly about $$$ and partly about how everything nowadays is expected to happen with speed. Back in the day (30 years ago) I worked for a bank. We maintained a very large enquiries counter system. Before anything got pushed out to branches, it was tested for weeks. We had dozens of test engineers and they would run through every conceivable action. Then and only then a release would happen to a local branch. This would be tested in the wild for a week. Then a small group of branches for two weeks, then a larger group, then finally the main group. The result was that very few (if any) show stoppers made it to production. This meant a slow cadence of releases though. Also this was a large project with extensive management backing, so the cost was not really a factor (within reason).
    This type of behaviour would never fly today. Everything has to be done on the cheap, with minimal testing, just "get it out there". I call it the "just get it f**king done" attitude - this is very common nowadays, especially among MSPs.

    • @gzoechi
      @gzoechi several months ago +1

      It's not necessary to go that slow. With proper CI/CD practice this would work as well at high speed.
      You still need to put a lot of work in to get proper quality.

    • @enadegheeghaghe6369
      @enadegheeghaghe6369 several months ago +2

      If you spend weeks testing your cybersecurity software, you will get hacked for sure before you deploy it.
      Hackers are a lot more sophisticated now compared to a decade ago.

    • @manoo2056
      @manoo2056 several months ago

      The issue is that in the short term the "get it the f**k done" attitude "saves money", but in the long term it explodes. I see it like entropy rising and then some feedback bringing back equilibrium. Let's hope we survive that feedback! XD

    • @michaelwills1926
      @michaelwills1926 several months ago

      @enadegheeghaghe6369 Hackers rise with the level of tech. Besides, you should still sandbox any release, even zero-day patches.

  • @James-hb8qu
    @James-hb8qu several months ago +17

    My career has been leading engineering organizations. This is not a new issue or a unique issue. Bad driver code crashes systems. Because of that, the industry has created well-known and effective ways to prevent problems. You've listed them.
    The issue here is a company with widespread driver releases that failed to follow those practices. The free market has created a process for handling that, and it is called competition and consumer choice.

    • @joansparky4439
      @joansparky4439 several months ago

      Markets that prohibit or undermine competition via rules that are being enforced by the market authority (#) do not give the consumer the chance to choose a different supplier.
      #) The goal is to give one or a few control over the supply, so it can be kept below demand, which guarantees that the consumer always pays more than it cost - which is what profit is. In other words, real free markets would trend towards zero profit for all involved due to competition.

  • @ChristianSteimel
    @ChristianSteimel several months ago +61

    Most surprising is that PCs still don't use A/B installs of the OS, where you use one copy and update the other copy, then switch over to the updated copy, and you can switch back if the update failed for some reason. With disk space so cheap, you'd think every Linux/Mac/Windows PC would use that by now. In Linux at least you can revert to a prior kernel version.

    • @incandescentwithrage
      @incandescentwithrage several months ago +7

      Yeah, but the same thing happened with CrowdStrike on Linux previously, causing a kernel panic.
      If you hook into the kernel, changing the kernel isn't going to help.
      A/B is what happens with OS feature updates on Windows already.
      Nothing is preventing people from using backup software daily.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago +9

      Bingo. Windows is not ready for critical functions. Microsoft has had over 30 years to develop a resilient OS. Time to give up on them and go to Linux systems that support immutable OSes and atomic releases.

    • @askii3
      @askii3 several months ago +6

      SUSE MicroOS is essentially capable of such A/B installs. It takes snapshots of atomic transactional updates, so it can automatically roll back the update on failure. This is what JeanPierreWhite (above) is referring to with immutable Linux distros with atomic updates.

    • @gzoechi
      @gzoechi several months ago

      I found that stupid 30 years ago.
      NixOS does that quite well though.

    • @gzoechi
      @gzoechi several months ago +2

      @JeanPierreWhite They never even tried to approach the problem. If it had been 300 years, they wouldn't have made any more progress on that front.

  • @mitchellsmith4601
    @mitchellsmith4601 several months ago +4

    This was an embarrassing failure for CrowdStrike. All they had to do was test their patch on Windows PCs prior to release, and they would have seen those PCs blue screen. They could have fixed the issue, tested again, and THEN deployed. The more devices you're responsible for, the greater the duty to test prior to deployment. This was negligence, pure and simple, and there should be a class action suit against CrowdStrike for the damages they caused. Such a suit would destroy CrowdStrike, of course, but that's as it should be. Our world needs to deter this negligence in the future.

    • @raristy1
      @raristy1 several months ago +1

      The basic Security+ cert teaches EXACTLY that. So my question would be: was ANYONE certified at CrowdStrike???

  • @keithnsearle7393
    @keithnsearle7393 several months ago +8

    So, basically CrowdStrike could not even secure itself against itself. Well done CrowdStrike, well done! (Slowly clapping) To Microsoft: get rid of CrowdStrike, no IFS and no BUTS!

    • @vister6757
      @vister6757 several months ago +1

      Other antivirus products also have access to the kernel, due to EU regulator requirements that followed the case McAfee and Symantec brought against Microsoft when Microsoft added code to stop third-party software from running in its kernel.

  • @samarbid13
    @samarbid13 several months ago +30

    This is a reminder of how fragile our IT solutions are. Imagine a solar storm occurring and the devastation it would cause! We need a plan B for critical infrastructure to always be in place!

    • @henson2k
      @henson2k several months ago

      We need an operating system that can disable drivers on reboot.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago +1

      It's a reminder of how fragile Windows is. Notice how it was only Windows computers that borked?

    • @gzoechi
      @gzoechi several months ago +1

      How does this increase stock prices in the short term? Yeah, not gonna happen.

    • @zackang4731
      @zackang4731 several months ago +2

      @JeanPierreWhite Because it's software written specifically for Windows? A major bug in the Safari browser could potentially cause the same problem for ALL Mac users, and no Windows computers would be affected, because the browser is not embedded in the system the same way Safari is in macOS.

    • @STCatchMeTRACjRo
      @STCatchMeTRACjRo several months ago +1

      @JeanPierreWhite They could have released a buggy update for non-Windows OSes as well.

  • @EmperorShang
    @EmperorShang several months ago +2

    My rage at everyone downplaying this for CrowdStrike is immeasurable. This is a billion-dollar company, with a B, trusted by critical government, public, and private services, and they shafted each and every one. The lack of outrage from our authorities is absolutely disgusting. Speaks a lot to the state of cybersecurity and tech in general.

  • @bulversteher
    @bulversteher several months ago +51

    The CrowdStrike disaster didn't strike because they needed to move fast, but because they obviously hadn't tested this specific update on a single Windows machine. If they had, they'd have immediately noticed it would crash. And they had already made a similar mistake in April. That time it could be somewhat forgiven because it only occurred on two distributions of Linux which hadn't been in their test matrix.

    • @1DwtEaUn
      @1DwtEaUn several months ago +5

      Yeah, you'd think they'd have at least one of every supported OS in a test lab and roll out to that first. Is 30 minutes before a global rollout that big of a delay?

    • @d3stinYwOw
      @d3stinYwOw several months ago +2

      Or they have such a machine, but their testing might be influenced by local changes, or be flaky.

    • @The_Ballo
      @The_Ballo several months ago +2

      that, or they did it on purpose

    • @Travolta12e
      @Travolta12e several months ago +3

      Wasn't the .sys file just a bunch of zeroes? I wonder if it was either a compilation or distribution problem that somehow corrupted the file, while the original file was working as intended.
      I mean, no matter how incompetent they are, it's naive to think that they just push new files to production without minimal testing first.

    • @stickman1742
      @stickman1742 several months ago +1

      How can you say it wasn't because they needed to move fast? One of the most common reasons why software is put out without enough testing is that they are trying to move fast. You may not think they needed to move fast, but internally they may have felt pressure. These kinds of drivers normally have to go through certification tests to be put into Windows, but updates can bypass this to get out more quickly.
      Don't underestimate the ability of companies, including very big ones, to take shortcuts whenever possible. Not too long ago I spent some time working for a huge financial company that has more money than most companies would know what to do with. They are supposed to have a complete system just for testing to protect everyone's financial data, but they didn't really want to put in the money or effort. Could they afford it? Of course! They just wanted to skip a few things, which would probably make their quarterly report look a little better. That test system was never working, so everyone had to run tests using people's actual financial data. They would just hand out people's real financial records to any employee, saying "You're not supposed to see this but we don't have any test data", just to get the job done. This is the attitude of the biggest institutions running this country.

  • @_SR375_
    @_SR375_ several months ago +8

    I want to add that the fact that CrowdStrike is so widely used makes it a target for bad actors, and perhaps how it operates internally, which seems to be monolithic, is also a problem. We also do not know what government and military systems were affected by this "bug". Regardless of other bad practices that were at play, CrowdStrike itself may want to consider a lessre and perhaps break up its platforms into shards, such that entire industries are not impacted by one bad software update or a bad pod.

  • @lumeronswift
    @lumeronswift several months ago +5

    Something that needs to be highlighted more from this issue is that companies have in recent years been offloading their IT resources but are still adopting external, overseas-managed (i.e. managed in the US) solutions. Companies should always have an in-house team ready to respond to system failures. Informed, careful companies would only have had a couple of hours of downtime...

  • @MadeleineTakam
    @MadeleineTakam several months ago +13

    I find it utterly incredible that they don’t test the update on a sandboxed system before sending it out.

    • @Eris123451
      @Eris123451 several months ago +1

      I don't.
      It's a quality assurance issue, and it has turned out, for example, that after years of promoting quality systems and quality assurance, at least 2 of the biggest manufacturing companies in Japan had been falsifying their production records and data for decades.
      If it's a choice between scrapping millions of pounds' worth of work or passing it on the nod, few if any managers are going to bite the bullet and take that kind of financial hit.
      That mindset is probably at the root of the majority of major operational failures in almost any industry.

  • @wernerlippert5499
    @wernerlippert5499 several months ago +7

    Humans tend to think they can sacrifice quality for speed, which works for some time and then fails miserably. It's a bit like the uncertainty principle, there is a fundamental limit that cannot be cheated.

    • @stickman1742
      @stickman1742 several months ago

      We are pushing towards these kinds of bad events pretty quickly. Software updates are being pushed out constantly in an effort to move ahead as fast as possible. It wasn't that long ago that this was not the way. Updates were treated very carefully and put out more slowly. Now it is a race to see who can update the fastest. I see computers and devices suddenly stop working on their own all the time now, always because of some recent update. This is already an issue; this is just an event so widespread that everyone is hearing about it. The industry is going to have to come up with more robust systems, as we cannot depend on computers for everything if they often are just not going to work. This is a relatively new issue with all these updates, and the problems will only get far worse, with bigger consequences, if it continues like this.

  • @metamadbooks
    @metamadbooks several months ago +27

    But you can have it both ways: it's called rolling updates. You don't deploy software to a billion endpoints in one go.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago +4

      Correct. This was dumb.

    • @amyhaynes3019
      @amyhaynes3019 several months ago

      Right

    • @Julio-ek1lw
      @Julio-ek1lw several months ago

      I disagree with your comment; the number of deployments doesn't remove the dichotomy.
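The rolling update idea in this thread (and the branch-by-branch bank rollout described earlier) can be sketched as a loop over deployment rings that halts once a ring fails too often. This is a minimal sketch under stated assumptions: `staged_rollout`, the ring names, and the `deploy` callback are all hypothetical, and a real system would also wait and collect health signals between rings.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy ring by ring and stop once a ring's failure rate is too high.

    `deploy(host)` returns True when the host came back healthy.
    Returns the hosts that were updated successfully before any halt.
    """
    updated: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the update
    return updated


# Hypothetical rings, smallest first.
rings = [["canary-1"], ["branch-1", "branch-2"], ["everyone-else"]]
print(staged_rollout(rings, lambda host: False))  # a bad update prints []
```

With an update that crashes every host, only the single canary machine is ever touched; the other rings keep running the previous version.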

  • @askii3
    @askii3 several months ago +6

    A mechanism rolling back an update after X number of failed boots/etc would help a lot here. My router does this: it keeps a copy of the old firmware it can automatically revert to in case flashing a new firmware image bricks it.
    SUSE's MicroOS does something similar by having a stateless OS and transactional updates that are snapshotted in the Btrfs file system. If it crashes and reboots, it'll automatically roll back to the snapshot before the update while preserving user data.

    • @gzoechi
      @gzoechi several months ago +4

      In NixOS you can configure how many system configs you want to keep. Switching back in the boot menu just changes a bunch of symlinks.

    • @askii3
      @askii3 several months ago +3

      @gzoechi Yeah, it's really cool.
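The "roll back after X failed boots" mechanism described in this thread can be sketched as a boot counter over two image slots. This is an illustrative model only, not how any particular router, MicroOS, or NixOS implements it; `BootState`, `on_boot_attempt`, and `mark_boot_healthy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    active: str = "A"       # slot the system currently boots from
    fallback: str = "B"     # slot holding the previous known-good image
    failed_boots: int = 0   # consecutive boots that never reported healthy
    max_failures: int = 3


def on_boot_attempt(state: BootState) -> str:
    """Run by the (hypothetical) bootloader before each boot; returns the slot to boot."""
    if state.failed_boots >= state.max_failures:
        # Too many crashes on the new image: swap back to the old one.
        state.active, state.fallback = state.fallback, state.active
        state.failed_boots = 0
    state.failed_boots += 1  # assume failure until the OS reports otherwise
    return state.active


def mark_boot_healthy(state: BootState) -> None:
    """Run once the OS is up and healthy; clears the failure counter."""
    state.failed_boots = 0


# An update was just written to slot B, and B crashes on every boot.
state = BootState(active="B", fallback="A")
print([on_boot_attempt(state) for _ in range(4)])  # prints ['B', 'B', 'B', 'A']
```

The counter is incremented before the boot and cleared only after the system reports healthy, so a crash during boot needs no cooperation from the crashing image.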

  • @AlvaroGilFernandez
    @AlvaroGilFernandez several months ago +2

    As an IT expert, all my life Windows has been a problem; it always presents some kind of problem to begin with. We need a new operating system that can replace Windows and that is tough and trustworthy.

    • @GH-oi2jf
      @GH-oi2jf several months ago +1

      We had one. It was from IBM and was called OS/2. Actually, Microsoft participated in the development of OS/2 1.x. It ran in the early ATMs. OS/2 2.x came out before Windows NT and was an excellent product, but for some reason the shift to Microsoft took place and OS/2 was left behind. Business would have been better off sticking with IBM.
      I have never worked for IBM (or Microsoft) by the way, but I have worked with OS/2. I have no conflict of interest, just my opinion.

    • @sergioyichiong7269
      @sergioyichiong7269 several months ago

      Non-Windows OSes never had problems? What are you gonna do with the legacy code?

  • @sm5574
    @sm5574 several months ago +7

    A lot of the developers who are doing shoddy work don't know that they are. They may be incompetent, or they may not know the codebase as well as they think they do, or the codebase may itself be a ticking timebomb, full of patches and poor decisions that effectively hide a myriad of bugs.
    The industry is absolutely broken because it is full of people who are completely unaware of best practices and solid patterns, relying instead on their own unstructured learning that has gone unchecked for decades.

    • @stickman1742
      @stickman1742 several months ago

      The only real solution, though, is that the systems need to be more robust. There are always going to be some software bugs; it would be impossible for everyone to always create software without a single bug. These computers have to have a design that is far more robust, so that the system won't just refuse to run if there is one bug in even a kernel driver. This is a must if we are going to avoid much bigger problems like this in the future. The problem is, most companies just want to build on the current designs and move quickly, as that is how you can make the most money. It will take a disaster to make everyone stop and say, "I guess we really do need a new design for this." Then all companies will be willing to pay for that newly designed system and the computer companies will make it. Is this even big enough to make that happen? It may cause them to look at it a bit, but I kind of doubt it's big enough to push that much change. They'll probably put a band-aid on it.

    • @sm5574
      @sm5574 several months ago

      @stickman1742, I agree, but I would estimate that developers (even at the senior level) who are capable of writing high-quality code are very much in the minority, and the people in charge of hiring rarely understand what to look for, as they are not, themselves, capable of writing such code. Thus, the vast majority of codebases are and will be more error-prone and difficult to maintain than should be considered acceptable.

  • @gzoechi
    @gzoechi several months ago +4

    CrowdStrike has shown that it has become the biggest threat to security.

  • @gregharn1
    @gregharn1 several months ago +4

    It's not a software problem. The decisions CS made were a tradeoff for functionality. The real root problem is policy. If you're a company running an EDR or even certain AV, you MUST build out a redundant infrastructure, specifically to mitigate bad updates. Which really just means that if a system can run on 1 machine, you deploy at least 2 AND run a different EDR or AV on each system. If 1 crashes (like last week), no big deal. 2 is 1, 1 is none.

  • @yogibarista2818
    @yogibarista2818 several months ago +10

    The issue essentially is that there is a kernel-mode driver - no doubt WHQL certified - that is running uncertified p-code from installable 'definition' files, so that a bug there will cause the kernel-mode driver to execute bad code and bug-check the system. Perhaps the kernel-mode driver needs better checking and self-defence - could the WHQL certification process require this? The 'fix' is to gain access to safe mode, boot without the driver, and then remove the installable definition files, so perhaps a system should identify crashing 'boot-required' drivers and sideline them if they crash repeatedly.

    • @incandescentwithrage
      @incandescentwithrage several months ago +1

      You mean just like malware would do?

    • @JeffBartlett-kj6sq
      @JeffBartlett-kj6sq several months ago +2

      I heard that it ate a file of all zeros. So: 1) no signature bytes, 2) no header, 3) no header checksum, 4) no whole-file checksum, 5) no file encryption or signing. So a bad actor can figure out the p-code and put a definition file in the directory, or do a man-in-the-middle attack and own the machine from ring zero.

    • @johnhebert9583
      @johnhebert9583 several months ago

      Someone else who watched the Dave Plummer video about CrowdStrike. His is the most thorough explanation I've seen.
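The checks listed in this thread (signature bytes, a header, a length, a whole-file checksum) can be sketched as a loader that refuses anything malformed before it is interpreted. The file layout here is invented for illustration: `MAGIC`, `HEADER`, `pack_definition`, and `load_definition` are hypothetical and say nothing about CrowdStrike's real format.

```python
import hashlib
import struct

MAGIC = b"DEFS"                    # hypothetical 4-byte signature
HEADER = struct.Struct(">4sI32s")  # magic, payload length, SHA-256 of payload


def pack_definition(payload: bytes) -> bytes:
    """Build a file in the hypothetical format: header followed by payload."""
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(MAGIC, len(payload), digest) + payload


def load_definition(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the file is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    magic, length, digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad signature bytes")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload


try:
    load_definition(bytes(1024))  # a file of all zeros
except ValueError as err:
    print(f"rejected: {err}")     # prints rejected: bad signature bytes
```

A checksum like this catches corruption but not tampering, because an attacker can recompute it. Guarding against the man-in-the-middle scenario raised above would need a digital signature verified against a trusted key, which this sketch does not attempt.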

  • @gamechannel1271
    @gamechannel1271 several months ago +6

    I'll just say there is no reason this software would have a need to "quickly deal with a security threat". The software itself is a security threat. It should be removed from all computers, and the company should be disbanded. See videos from people who have analyzed their driver, and how poorly it is validating its virus definition files. The download of a bad definition file caused this crash, NOT a driver update. Because they wanted to bypass the driver update process.

    • @kjetilhvalstrand1009
      @kjetilhvalstrand1009 several months ago

      Absolutely. An update can be a back door into the kernel; if they do not check what is pushed, it shows how bad this company is.

  • @user-bd7dn6yt8b
    @user-bd7dn6yt8b several months ago +2

    NSA has better access to kernel than CrowdStrike. Let that sink in. 😂

  • @johnmoore8599
    @johnmoore8599 several months ago +1

    You aren't thinking through the problem sufficiently. There are two ways to solve this issue at the kernel level to prevent these kinds of problems with monolithic kernels. 1. Build a subsystem that, if the kernel panics, reverts the system to the last known good configuration before the crash. 2. Build a driver subsystem that insulates the kernel from buggy or bad drivers and lets it continue to operate. This was the idea behind Nooks, written by Michael Swift in 2005. Either architectural change would build resiliency into the current OS kernels humans use. For whatever reason, no one is doing this. Their answer is always to develop better drivers, and they put these best practices into place, and along comes some company like, cough, McAfee, or, cough, CrowdStrike, who brings Windows systems down. People get angry and pissed because of ruined plans or lost money, but ultimately nothing significantly changes, because someone at Microsoft/Apple/Linux Foundation doesn't want to pay to make their OSes more reliable.

  • @gregorymathy2782
    @gregorymathy2782 several months ago +2

    That unfortunately goes beyond IT infrastructure cost… harmonization of process and procedure… CrowdStrike, Boeing, car breakdowns… all that stuff is unfortunately driven by cost reduction and profit optimization…
    We are unfortunately only seeing the tip of the iceberg, and I am pretty sure we are only at the beginning of it… I wonder what the next big thing will be…

  • @DistortedV12
    @DistortedV12 several months ago +8

    I think one of the problems is this automatic update culture.

    • @robertbutsch1802
      @robertbutsch1802 several months ago

      According to CrowdStrike this was not an “update” but a content delivery.

    • @luciaceba4640
      @luciaceba4640 several months ago

      @robertbutsch1802 Which is an update.

  • @RiteGuy
    @RiteGuy several months ago +10

    All great points, Arjan, and I agree with them. But you left out a biggy - companies want to make as much money as possible so they cut corners everywhere.
    You did lightly touch on time by saying sometimes you don’t have the time to create a proper fix for a threat. I agree, but there’s another time problem. To companies, time = money, so the time allowed to work on things is cut right away even when there isn’t a looming threat.
    Remote updating is a godsend for companies. It lets them ship a product that is incomplete and flawed thanks to time and money constraints.
    Then, as the product is completed/fixed, the current installations of software are usually automatically updated without the knowledge of the user.
    These issues and all the ones you mentioned are breaking software. I’m afraid of AI, not for the reasons most people cite but because software code is garbage in this day and age. Why would AI software be any better?

  • @captainnerd6452
    @captainnerd6452 several months ago +3

    Error checking and error handling design. Don't trust data coming in, and don't trust data being returned from functions. Really don't trust the user.

    • @mfrunyan
      @mfrunyan several months ago +1

      Precisely. This is amateur level code running in the kernel space.
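The "don't trust data coming in" advice in this thread can be sketched as a bounds-checked lookup: a rule that refers to a field the input does not have should fail to match instead of reading past the end. This is a generic illustration, not a reconstruction of the driver's code; `get_field` and `evaluate_rule` are hypothetical names.

```python
from typing import Optional, Sequence


def get_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when the index is out of range."""
    if 0 <= index < len(fields):
        return fields[index]
    return None


def evaluate_rule(fields: Sequence[str], wanted_index: int, pattern: str) -> bool:
    """A rule that refers to a missing field simply does not match."""
    value = get_field(fields, wanted_index)
    if value is None:
        return False  # reject the input instead of reading past the end
    return pattern in value


# The rule expects a sixth field, but the input only supplies two.
print(evaluate_rule(["a.exe", "C:\\temp"], 5, "exe"))  # prints False
print(evaluate_rule(["a.exe", "C:\\temp"], 0, "exe"))  # prints True
```

In Python an unchecked `fields[5]` would raise an exception that can be caught; in kernel-mode C the equivalent unchecked read can take the whole machine down, which is why the check has to come first.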

  • @krissn8111
    @krissn8111 several months ago +6

    Was there any testing in canary environments? I guess not. And how long does it take to test in canary? I can't understand a company like CrowdStrike overlooking best practices.

  • @marcelogarcia5539
    @marcelogarcia5539 several months ago +6

    I thought this was one of the lessons from COVID: resilience is as important as efficiency.

  • @omriliad659
    @omriliad659 several months ago +2

    One (partial) solution is to have a backup computer that stays a version or a few days behind and only comes online if the main server stops responding. It would prevent this problem and could only be exploited if a hacker could take down the most up-to-date server. You could even rotate the servers, so you update 2 versions each time.
    Another solution is to have canary distribution with faster turnaround. Set the most secret systems to get the update first, have the next group get it within an hour, etc. It means you leave your last group vulnerable for a few more hours, but you give them the peace of mind that the update was tested for a few more hours and is unlikely to crash that fast.
    Last solution is to disconnect systems from the internet. No computers with internet connection means no attack surface, and you can still work offline or maybe even with others on the same network. Keep the protection system guarding the gateway, maybe even keep several layers of different software at each layer, but leave the inner network isolated.

  • @durand101
    @durand101 several months ago +2

    The reason the world is so fragile right now is because of a) tech monopolies and b) efficiency over pragmatism. Why does MS have such a large share of the corporate market and why aren't our various regulators challenging that? In nature, monoculture ecosystems are the quickest to be killed by disease.
    And why does everything have to be automated to the point where there are no humans in the loop? Mostly to be more "efficient" and reduce costs - at the risk of much more expensive black swan events eventually coming to ruin your day.

  • @steves9250
    @steves9250 หลายเดือนก่อน +1

    Shows how a product that works 99.99% of the time makes one mistake and it all goes to hell

  • @davidgrisez
    @davidgrisez หลายเดือนก่อน +1

    One main thing that allowed CrowdStrike security software to crash the computer operating system was the fact that this security software must be installed as a device driver operating at the high privilege level of the operating system kernel. Normal program software running at a lower privilege level should not be able to crash the operating system.

  • @billfrug
    @billfrug หลายเดือนก่อน +3

    So your argument is that there was an imminent security threat that the update addressed? Is there any evidence of that?

    • @JeanPierreWhite
      @JeanPierreWhite หลายเดือนก่อน

      I don't think so. Just a bad update. Fragile endpoints, lack of change management.
      I retire and a year later the world goes to crap. Geez, that didn't take long ;-)

    • @theronwolf3296
      @theronwolf3296 หลายเดือนก่อน

      Nothing I have seen so far even identifies the threat that was SO serious that this rush was essential. That's another part of the problem, security companies just go along and do things.

  • @user-fed-yum
    @user-fed-yum หลายเดือนก่อน +9

    It's getting boring hearing every commentator say that this so called security product needs to run in kernel. That's just simply not true. The operating system designers need to do better, much better. No amount of testing and canary deployments and whatever else you come up with, will stop this happening again. Having unbreakable kernels that can't be infested with alien bloatware, should be mandatory. You need to think safety critical, running a nuclear power plant, a helicopter, an X-ray machine, thus, risk mitigation is not enough. Risk eradication is what is needed here.

    • @henson2k
      @henson2k หลายเดือนก่อน +2

      Totally agree, Microsoft is also responsible for this mess.

    • @vladimus9749
      @vladimus9749 หลายเดือนก่อน +3

      Exactly. Instead of admitting defeat after the XP debacle and starting fresh, they patched and patched in order to maintain backwards compatibility for decades. I'm not an apple fan at all, but I think they do a decent job of massive architecture transitions providing emulation for a few years before dropping support for the old.

    • @zackang4731
      @zackang4731 หลายเดือนก่อน +3

      Not a defender, but I feel that's a mistaken understanding of how the software structure works.
      Kernel access is needed for interacting with the hardware components of the computer - to allow the usage of third party hardware like network cards, sound cards, memory chips, etc, the OS must allow third parties to install their own stuff in the kernel. But it cannot automatically verify that all such installed stuff is safe. So having an "unbreakable kernel" is ideal, but unrealistic. At least outside of completely controlled ecosystems like Mac where no other third party can develop stuff for the Mac.
      The way to actually properly secure those kinds of systems you mention, is to not use any commercial products, and those things should not even be exposed to the Internet - At most only running on their intranet. If they need to be exposed to the Internet, then their own IT team should come up with an inhouse solution.

    • @user-fed-yum
      @user-fed-yum หลายเดือนก่อน +2

      @@zackang4731 My understanding is not mistaken at all. Step away from what you know and understand. "Must" is a very strong word, there are many other ways to do things that you (and all of us) haven't necessarily thought of yet.

    • @zackang4731
      @zackang4731 หลายเดือนก่อน

      @@user-fed-yum I'm open-minded enough to step away from what I know and understand only if someone presents an idea to contest. But imagining an arbitrary fluffy ideal that has no basis and dismissing all current understanding in lieu of that is no way to make progress.
      Like, I CAN think of a way to make it such that a security software does not have to have kernel/superuser access - We just need each software to be wrapped with file permissions to allow other software to read/write their files(In Linux style). But that's far from saying all software cannot have kernel access.

  • @StuartLynne
    @StuartLynne หลายเดือนก่อน +1

    There is no Silver Bullet. "... improvement within a decade in productivity, in reliability, in simplicity." - Fred Brooks, 1986.

  • @MartinPHellwig
    @MartinPHellwig หลายเดือนก่อน +2

    When you expect something to work all the time in all circumstances but you can't define what "all the time" actually is or what the specifics of "circumstances" mean, you have unrealistic expectations. That is something each individual has to learn; those willing to be realistic will have an easier time, with less severe consequences, learning that.

  • @galuszkak
    @galuszkak หลายเดือนก่อน +10

    I think this is an interesting case where the software design decision to build monolithic kernels 30-40 years ago is showing its consequences today (Linux, Windows etc.). Prof. Andrew Tanenbaum was trying to convince the software industry that microkernels are better for reliability and security, while sacrificing some performance. Looking back, my best guess is that by going with monolithic kernels we built a whole security industry around them because of security flaws that can be there by design.

    • @pureabsolute4618
      @pureabsolute4618 หลายเดือนก่อน +2

      It's also how big "kernel space" is in general. Windows NT has graphics in user space. Of course, that was too slow, so they moved it "back" (windows 98 didn't have a protected kernel).

    • @CallousCoder
      @CallousCoder หลายเดือนก่อน +4

      The problem with microkernels is that they are complex. GNU Hurd failed because of it. Darwin is the only one now, but x64 (I need to check ARM; I developed assembly on ARM but never from bare metal) only has 2 security rings. We used to have 4, but since all major operating systems and most CPUs since VAX had 2 rings of protection, x64 also settled for two. So you don’t have your classical ring 1 for your drivers anymore. So you may be loosely coupling your drivers, but all in all they run in the privileged area - hence Mac OS X on Intel did crash with shitty drivers too.

    • @gzoechi
      @gzoechi หลายเดือนก่อน

      NixOS can easily switch back multiple versions of configurations (not just the kernel). That's not a problem where the kernel architecture needs to get involved.

    • @MartinMaat
      @MartinMaat หลายเดือนก่อน

      It has nothing to do with this. The point of a virus scanner is that it should have control over everything by design. Which is not only a major security issue in itself but also a major privacy issue. As people get scared they tend to accept compromises, all the way down to fascism.

    • @ra2enjoyer708
      @ra2enjoyer708 หลายเดือนก่อน

      @@MartinMaat You meant liberalism?

  • @sneezyfido
    @sneezyfido หลายเดือนก่อน +4

    Business culture breeding and promoting incompetence is a huge issue in all large companies

    • @nurulnurul9270
      @nurulnurul9270 หลายเดือนก่อน

      Ouch. Somehow I found myself agreeing with you

  • @richardbloemenkamp8532
    @richardbloemenkamp8532 หลายเดือนก่อน +2

    Staged/canary releases should be obligatory unless there is imminent danger, at which point the government should be involved. It is totally ridiculous that millions of PCs install kernel patches that have not even been checked on a starting group of a few thousand computers for at least one or two days. In this case there was no imminent great danger that absolutely required all of the millions of PCs to be updated within a few hours.

  • @diogotrindade444
    @diogotrindade444 หลายเดือนก่อน

    All parties need to fix this broken system:
    - Security companies cannot ever force push without testing.
    - OS vendors (especially MS) need to improve all aspects of this scenario, with lots of new, well-documented automated testing/check tools for multiple steps in the process.
    - Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS running if you want to make sure that you are online all the time.
    We need better software, built for failure, especially for essential companies that cannot stop. If companies do not fix this on all levels, it can open a new door for failure.

  • @CallousCoder
    @CallousCoder หลายเดือนก่อน +6

    If this clusterfuck has shown one thing, it is that important companies and institutions can’t cope with disasters. There are no manual backup processes in place. And it’s not only computer systems that can fail, but there can also be long power outages and internet outages, or long traffic problems that keep goods from going where they need to be. We need to not rely as much on government global infrastructure but decentralize systems. Back when I was in the energy business, I was advocating for small pebble reactors for towns or larger neighbourhoods, instead of massive 1GW nuclear power stations - the bigger, the more complex, and the more material is needed. Whereas small reactors are simpler to build and even safer. Russia understood this and they are moving in that direction. And it is also great against acts of war. Take out 4 or 5 power plants and you disrupt all of the industrial areas of the Netherlands. Taking out 50-100 smaller ones is a lot more of a hassle. And we should use cash, cash and more cash for our daily shopping. And we should actually buy locally from local farms much, much more.

    • @joansparky4439
      @joansparky4439 หลายเดือนก่อน

      economies of scale drive this, which means this is NOT true: _"the bigger the more complex and the more material is needed"_

    • @JeanPierreWhite
      @JeanPierreWhite หลายเดือนก่อน

      Many companies do have manual processes. However they are very slow and inefficient. If that wasn't the case then we wouldn't deploy computers in the first place.

    • @CallousCoder
      @CallousCoder หลายเดือนก่อน +1

      @@JeanPierreWhite Many did, but a lot didn’t, like hospitals and GP offices; that’s just unthinkable! I worked in healthcare software, and we documented a backup process as part of our manual. You could print your agenda, you could print user details and treatment and medication plans. And most did that. You don’t need your computer system to diagnose or treat people. Same with issuing boarding cards. The SITA system was still running: print a passenger manifest and issue the boarding cards manually. Some airlines did, most didn’t.
      So it showed how painfully unprepared we are. And this was only a simple computer outage. Let alone something more impactful like power outage.

    • @CallousCoder
      @CallousCoder หลายเดือนก่อน

      @@joansparky4439 it is true in case of nuclear power plants and engineering.
      If something is bigger it will always require more resources to build. I can’t build a reactor thinner.
      And this statement is only true in case of consumer goods where you can make billions. But critical systems its cost isn’t manufacturing the actual system. But all the security and secondary systems.
      A single engine plane will always be simpler and cheaper than a 4 engine plane.
      It’s not just bolting an extra engine on the plane, but your mass increases so that first engine should be able to hold the plane up with that added mass. You will need to monitor the two engines and balance them for wear and tear. You’ll need to service the two engines. And this complexity and problem gets worse with 4 engines. What if two power engines stop? Then the other two should take over but also the whole load bearing structure is indiscriminately loaded and that needs to be designed and tested.
      Critical systems where you start adding to the control systems get quickly more complex.
      You probably haven’t studied engineering and especially not done critical systems. Because that’s where the laws of general economics don’t apply. Simply because of the snowball effect.
      Also there are not enough true mission-critical systems to get the benefits of standard economics. How many nuclear power plants are built every year? How many satellites etc.

    • @CallousCoder
      @CallousCoder หลายเดือนก่อน

      @@joansparky4439 funny story, I got into critical systems by building a very simple device that measured heights of snow/ice on Antarctica. You would knock out a prototype for this in a few hours these days, back in 1993 in about a week (all in assembly and no libraries available). But since this system had to run unattended for 5 years, suddenly the complexity of the peripheral systems exploded! We needed two batteries, and charge circuits that would charge the batteries equally (if they don’t have the same capacity you are basically discharging one over the other). These charging circuits had to be redundant. Which also means two solar cells, which were cross connected. You need to be aware that 6 months out of the year there’s no charging, so each battery itself should be able to hold a 7 month charge. So suddenly the batteries became twice as big. The housing as a result became twice as big. But we also needed 3 sets of ultrasound range finders, for redundancy, and that added enormous code complexity: to see if the primary system was working by comparing it to the secondary, and, if there was a discrepancy (which with snow and ice is very normal because that forms in heaps), to take the secondary system after comparing it to the tertiary system.
      If the tertiary system decided the primary and or secondary system is defective, you don’t want to use those range finders to save crucial energy. As a matter of fact let’s decouple them from the CPU bus.
      The cost exploded! Not only in resources but mainly in design and development. And you never build enough of these systems to get the economic benefit.
      There are simply never enough built for that. Basically all those critical systems, from planes to tanks to satellites to bespoke research equipment, are manually made by a very select few people.
      It’s not that you go to China and let a factory build 2200 satellites. First of all, the factory that can do that doesn’t exist and needs to be designed and built.
      And 2200 is a big number of satellites.
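
The three-range-finder comparison described in this comment can be sketched roughly as below. This is only an illustration in Python (the original was assembly); the function names, the tolerance value and the tie-breaking order are assumptions, not the logic of the actual device.

```python
from typing import Optional, Set, Tuple

TOLERANCE_CM = 5.0  # hypothetical: largest disagreement allowed between healthy sensors


def agree(a: float, b: float, tolerance: float) -> bool:
    """Two readings agree if they differ by no more than the tolerance."""
    return abs(a - b) <= tolerance


def select_reading(primary: float, secondary: float, tertiary: float,
                   tolerance: float = TOLERANCE_CM) -> Tuple[Optional[float], Set[str]]:
    """Return (chosen reading, sensors judged defective).

    Primary is trusted if secondary confirms it. Otherwise the tertiary
    sensor acts as tie-breaker. With no agreeing pair, nothing is trusted.
    """
    if agree(primary, secondary, tolerance):
        return primary, set()
    if agree(secondary, tertiary, tolerance):
        return secondary, {"primary"}    # primary is the odd one out
    if agree(primary, tertiary, tolerance):
        return primary, {"secondary"}    # secondary is the odd one out
    return None, set()                   # no consensus, skip this sample
```

A sensor that keeps showing up in the defective set could then be powered down and decoupled from the bus, as the comment describes; that bookkeeping is left out here.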

  • @daimajind7231
    @daimajind7231 หลายเดือนก่อน +4

    anyone consider how a single company can affect so many systems single handedly at kernel level worldwide. Does that mean a single bad actor at the company has the potential to compromise those same systems globally with a silent malicious payload without anyone knowing or even noticing thanks to the default automatic update to the bleeding edge build version?

  • @samable9585
    @samable9585 หลายเดือนก่อน +2

    for serious bug or zero day bug -- CrowdStrike should have simply disabled inbound traffic to the host (other than itself) and work on fix and roll it in limited manner. If it succeeds keep rolling it.... Would you fly a plane with this type of method? We ground planes immediately when there is threat -- but we treat security threat in computers in slightly business-as-usual method and take chance. This may change it ... Act first, disable and then push changes

  • @charlesnicholas4758
    @charlesnicholas4758 หลายเดือนก่อน +3

    Good video but everyone seems to ignore the fundamental problem. How do you compile source code into a file of binary zeros?! At least if it had been a null file the size would have been noticed.

  • @CraftyF0X
    @CraftyF0X หลายเดือนก่อน +2

    I for one always saw the possibility of something like this happening, hence my reservations against automatic forced background software updates, which would have sounded shady AF in the 90s while today they are a widely accepted daily occurrence. Don't get me wrong, it has its advantages, but something like this case was always in the cards.

  • @miyu545
    @miyu545 หลายเดือนก่อน +3

    That's what you have when you have no patch management or change management process. Microsoft does not have one. It has the public to do that for them.

    • @JeanPierreWhite
      @JeanPierreWhite หลายเดือนก่อน

      Not true. Microsoft does have change management as do their customers.
      The problem is that Windows is too fragile. If a problem sneaks by then you have few recovery options.

  • @theronwolf3296
    @theronwolf3296 หลายเดือนก่อน +1

    Maybe the kernel security layer should be virtualized, so that a corruption of the kernel can quickly be switched off.
    Despite the claimed need for such deep access, if companies like CrowdStrike can corrupt the kernel, hackers (including nation state actors) could do the same, or worse. At least the CrowdStrike bug just crashed the system, but other bugs can subvert it.

  • @ulrichborchers5632
    @ulrichborchers5632 หลายเดือนก่อน

    A rant about this is perfectly fine. We need to speak the truth if something clearly goes wrong. To remain silent, not wanting to be "negative" in such a scenario would be wrong, it only strengthens the wrong approach and thinking.
    The responsibility of a software engineer includes detecting problems early, thus avoiding them in the first place. This is an essential part of implementing a good solution, by avoiding a bad one.
    CD is exactly about that. It is not rigid at all, don't fall into that trap. To bypass good engineering practices is never a good idea, especially at this scale. CD with all its techniques supports fast incremental progress into production. It can raise software quality dramatically, minimize defects, and it also does prevent disaster, both with respect to releases and to the intrinsic quality of the product.
    A failed integration test with a widely used OS obviously would have prevented this. The choice here is not whether to allow a quick release into production or not (good CD practices even speed up the release because a high degree of automation is included), BUT the choice whether to apply CD best practices or not is this: Do you want to avoid disaster and be notified as early as possible that you have to fix the software before deploying it ... or do you really want to deploy a bug into production if you could have detected and fixed it before? To apply CD is not about preventing a quick deployment. It would have been about preventing the release of a problem in this scenario.
    If people experience problems not being able to release quickly when they have to, then they DO have quality problems with their software or with the system architecture, or they do not have enough understanding of CD and how to apply it correctly. They then have to improve their engineering skills instead of blaming the necessary and professional techniques which they have not mastered.
    Incompetent people in charge making decisions under pressure, or for whatever reasons, are the actual threat. It is not "AI", but yes, the threat is also about lack of transparency and lack of knowledge.
    "AI" is a marketing term, nothing more. If a software can detect patterns and even learn to improve the detection of security problems, that is perfectly fine, whatever techniques or tools it uses.
    The internal mechanics of a software are never transparent to the "end user". This is normal, whether "AI" is included or not. But yes, we do have a problem if engineering is done without exact knowledge of what is going on. The ongoing loss of culture, education and priorities is the actual problem, not the use and integration of technology itself. When I notice software "engineers" using the term "AI" internally and following the same belief system, that scares the hell out of me.
    Technology is not a threat, but incompetence is, and always has been. There is a clear answer to the question of how to solve the global problem: Education, thinking, acting with a sense of responsibility and the right priorities as human beings.
    Now this may read strangely, but what that means is this: Stop cheating. If you do something, do it well for the sake of what you are doing ... money must never be the top priority, money is not even real. It is a mental construct to make exchanging things more efficient, by comparing the value of things. This requires balance to work. Subtract or add the same on both sides. This is simple math and obvious stuff. If you see money as something to be maximized, then this is abuse of the limited resources we have and of other human beings, of your very own species. Magic does not exist in this world, money cannot be "generated" from nothing. The law of conservation of energy is a fundamental law of physics and it of course applies to everything, including life itself. For every shortcut and for every selfish act, someone else will have to pay the price.
    Is there a problem with a software which breaks encapsulation, to access the Kernel of an OS? Yes of course. Does it have to do with EU regulations, forcing MS to allow access to a protected layer? It seems. But the root cause of this may be unfair economic practices to exclude others from "competition". Or it is pretending that a successful company does so and thus invading their space, for the sake of making money. It does not matter which side is right there. That is fundamentally the wrong type of thinking. The actual root cause is a wrong definition of "success" which has invaded our culture in a very harmful way.

  • @osark2487
    @osark2487 หลายเดือนก่อน +1

    At this point autopsies have popped up on YouTube channels everywhere. We most definitely know what happened, how and why.

  • @robinlioret7998
    @robinlioret7998 หลายเดือนก่อน +32

    Add poor patching management in the companies: never apply patches directly in production without testing it in lower environments before...

    • @gcaussade
      @gcaussade หลายเดือนก่อน +2

      This is what really amazes me, the fact that so many companies were just rolling this out. But his point is correct. I give more blame to Microsoft and CrowdStrike. They're the ones that have to work very closely together and do something new, more like real-time testing. It's amazing this hasn't happened before. The largest breach in US history was the UnitedHealthcare Optum breach months ago. That was a result of companies not patching fast enough! And that was remote software, not something near the kernel.
      Still, it led to a massive disaster and problems with the health care system for over a month! So if anything, CIOs and CISOs felt more compelled to roll out security software even faster to make sure that it at least is up to date. What would happen if you were breached because you didn't roll out CrowdStrike fast enough?
      That's the dilemma he brings up.

    • @xBanki
      @xBanki หลายเดือนก่อน +7

      Reading from anecdotal reports online, CrowdStrike likes to push their customers into enabling automatic patch updates. Logically, it makes sense why they would do that, however historical evidence (And literally any administrative handbook) says blindly accepting updates, no matter the reputation of the company and the claimed quality of the updates should not be done to prevent outages like we saw.

    • @robertbutsch1802
      @robertbutsch1802 หลายเดือนก่อน +3

      This was the equivalent of an AV pushing out a new virus signature file. No enterprise is going to pay the cost of CrowdStrike just to be a week behind on threat protection.

    • @silmarian
      @silmarian หลายเดือนก่อน +1

      They pushed it using the same channel as signature updates, not the usual upgrade path.

    • @Lofote
      @Lofote หลายเดือนก่อน +1

      That is not really valid for 2024 anymore. That was the case in earlier times, but in 2024 zero-day attacks are so common and threatening that security updates are considered time-critical. Meaning the risk of crashing your systems is considered more acceptable than having a successful security hack, where your data may be downloaded by the hacker, which is considered a far bigger disaster. Time is critical in 2024 with security patches :(...

  • @danleedev
    @danleedev หลายเดือนก่อน +2

    Normally, I would immediately add a channel with a thumbnail that says "Things will get worse" to the "Don't recommend channel" list. But, I have loved your channel for years and so I will give you the benefit of the doubt. It was a bad call, Arjan. You're better than that.

  • @miraculixxs
    @miraculixxs หลายเดือนก่อน +1

    It was not a security issue. It is a management issue. Perhaps MBAs should not run engineering organizations.

  • @samchristy6745
    @samchristy6745 หลายเดือนก่อน +1

    Most threats do not require an immediate response. For many, a canary release mechanism based on system criticality level would do:
    1) deploy to non-critical systems (grocery stores, small businesses, gas stations, government)
    2) wait 36 hours
    3) deploy to mid-level critical systems (banks, financial institutions)
    4) wait 36 hours
    5) deploy to critical systems (hospitals, pharmacies, airports)
    For the DEFCON 1 threat-level scenarios, then perhaps use the shotgun approach.
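
A rough sketch of the tiered rollout listed above, in Python. Everything here is hypothetical: the `Tier` type, the `deploy`, `healthy` and `wait` callbacks, and the rule of halting at the first unhealthy tier are assumptions made for illustration, not how any vendor actually ships updates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tier:
    name: str
    hosts: List[str]
    soak_hours: int  # how long to wait after deploying before judging health


def staged_rollout(tiers: List[Tier],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   wait: Callable[[int], None] = lambda hours: None) -> List[str]:
    """Deploy tier by tier, least critical first.

    Stops at the first tier where any host is unhealthy after the soak
    period, so later (more critical) tiers never receive the update.
    Returns the names of the tiers that completed successfully.
    """
    completed = []
    for tier in tiers:
        for host in tier.hosts:
            deploy(host)
        wait(tier.soak_hours)  # e.g. sleep or schedule; a no-op by default
        if not all(healthy(host) for host in tier.hosts):
            break  # halt the rollout; rollback handling would go here
        completed.append(tier.name)
    return completed
```

With tiers ordered as in the list (non-critical, mid-level, critical) and 36-hour soak times, a bad update would stop at whichever tier first reports unhealthy hosts.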

  • @joelmamedov404
    @joelmamedov404 หลายเดือนก่อน +1

    Technical glitches can happen. The fundamental problem is not technical. It’s managerial. “Business continuity” planning does not exist anymore. Critical systems and industries must have redundant and durable systems. All the eggs are in the same basket, unfortunately.

  • @NickThunnda
    @NickThunnda หลายเดือนก่อน +2

    In the good old days we had big mainframes running code which took checkpoints and did automatic rollbacks upon failure. They were replaced by lots of networked Microsoft boxes.

  • @rydmerlin
    @rydmerlin หลายเดือนก่อน +3

    Why couldn’t Windows be written to quarantine any driver that behaved like crowdstrikes? Wouldn’t that have allowed recovery to be quicker?

    • @MrShoorf
      @MrShoorf หลายเดือนก่อน

      It's like installing 2 antiviruses on the same machine. We might even call it *SecondStrike* 🤣

    • @Together707
      @Together707 หลายเดือนก่อน

      As far as I understood, it did exactly that. The CrowdStrike update tried to meddle with registries which it wasn't supposed to touch, and the solution is a blue screen and immediate reboot to stop the process and revert the changes. Only it got stuck in a loop this time.

    • @kjetilhvalstrand1009
      @kjetilhvalstrand1009 หลายเดือนก่อน

      Well, companies other than Microsoft often have their own way to push packages. It might not be up to Windows at all, and as I understand it, that was the case here: they had their “driver” that executed other code that was not signed.

  • @jamespong6588
    @jamespong6588 หลายเดือนก่อน +1

    Imagine running at the kernel level, not validating your data, and just dereferencing a null pointer.
    Imagine being this useless as an antivirus, and managing to be installed everywhere
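
The kind of input validation this comment is asking for might look like the sketch below. The file layout (a `CFG1` magic, a 16-bit field count, then 32-bit fields) and the expected field count are invented for illustration; this is not CrowdStrike's channel-file format. The point is only that a consumer rejects malformed or all-zero content before reading any field from it.

```python
import struct
from typing import List

MAGIC = b"CFG1"                 # hypothetical file signature
EXPECTED_FIELDS = 20            # hypothetical: number of fields the consumer reads
HEADER = struct.Struct("<4sH")  # magic + little-endian uint16 field count


class InvalidContent(ValueError):
    """Raised when a content blob fails validation."""


def parse_content(blob: bytes) -> List[int]:
    """Validate a content blob and return its fields, or raise InvalidContent."""
    if len(blob) < HEADER.size:
        raise InvalidContent("too short to contain a header")
    if not any(blob):
        raise InvalidContent("file is all zero bytes")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidContent("bad magic")
    if count != EXPECTED_FIELDS:
        raise InvalidContent(f"expected {EXPECTED_FIELDS} fields, got {count}")
    if len(blob) != HEADER.size + 4 * count:
        raise InvalidContent("length does not match declared field count")
    # Only now is it safe to read the fields.
    return list(struct.unpack_from(f"<{count}I", blob, HEADER.size))
```

On a validation failure the caller would keep using the last known-good content instead of crashing, which in kernel code is the difference between a logged error and a boot loop.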

  • @mrtnsnp
    @mrtnsnp หลายเดือนก่อน +12

    In part you want to avoid single points of failure. So don't run all your systems using the same security software, the same base OS for all parts of your systems. A more diverse collection of systems is less likely to go down all at the same time. Crowdstrike for sure isn't the only provider for these kinds of services, and for sure they won't be the first (or last) to introduce bugs in kernel drivers. There are sufficient opportunities for shit to hit a fan. On the other hand: Apple removed access to the kernel for all third party software. That may be needed for Windows as well, with an API to perform these tasks from user space rather than inside the kernel. And crowdstrike needs to have better processes for developing their code, but they are not unique there.

    • @whlewis9164
      @whlewis9164 หลายเดือนก่อน +4

      A more diverse collection of systems also introduces complexity of support, management, monitoring, licensing, and contracting.

    • @mrtnsnp
      @mrtnsnp หลายเดือนก่อน +1

      @@whlewis9164 Yes. Instead of one configuration, you have to support two, you split the options for each functional piece in two, and end up with two sets. In my view this likely beats the downtime that was just experienced.

    • @whlewis9164
      @whlewis9164 หลายเดือนก่อน

      @@mrtnsnp I very much doubt our corporate management overlords will opt for the best technical approaches. They will likely continue to squeeze the budgets, ship support overseas, and consider the bottom line over everything else.

    • @mrtnsnp
      @mrtnsnp หลายเดือนก่อน

      @@whlewis9164 They get what they pay for, that is for sure.

    • @ra2enjoyer708
      @ra2enjoyer708 หลายเดือนก่อน

      @@mrtnsnp Two? Try O(n^2), aka every OS with its own barely specced configuration format which depends on a specific version of parser which depends on a specific language it was written for. Also this kind of clusterfuck introduces another attack surface in the form of different parts of the stack interpreting the same value differently, in worst case with a race condition on top.

  • @epiphoney
    @epiphoney หลายเดือนก่อน +1

    Mark Russinovich retweeted using Rust instead of C++ for systems programming, "for no particular reason".

  • @Apstergo
    @Apstergo หลายเดือนก่อน +2

    Knowing these questions is important. So is actively listening to industry experts and less to corporate experts (they only lead to better return on investment, and now that is AI).
    This event should be a wakeup call, but I don't think people will think of it like that.

  • @bobdowling6932
    @bobdowling6932 หลายเดือนก่อน +1

    There is (should be) a standard pre-release test even for time-critical security software: The test should be that the target operating system can at least boot to a point where the updater can allow new versions of the security software to be installed. The test should be run twice: once on an instance that keeps upgrading the software and once on a freshly installed operating system. If just those tests are implemented then, to an extent, you can rush the rest because fixes can be sent out to clean up errors. This test doesn’t need re-writing for each instance of the software.
    Other tests (does it block the malware, does it not interfere with critical applications, ...) can be run after launch because errors can be cleaned up automatically. There is room for subtlety here: a customer might sign up for the pre-application-testing version or the post-application-testing version. Perhaps they do their own testing. Perhaps they have made a risk-balancing decision.
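
As a sketch of the pre-release test described in this comment: a build is released only if the machine still boots far enough to accept a further update, in both scenarios (upgraded in place, fresh install). The scenario names and the `boots_and_can_update` callback are placeholders; in practice that callback would drive real test machines or VMs.

```python
from typing import Callable, Dict, Sequence, Tuple

SCENARIOS = ("upgraded-in-place", "fresh-install")  # the two runs from the comment


def release_gate(build: str,
                 boots_and_can_update: Callable[[str, str], bool],
                 scenarios: Sequence[str] = SCENARIOS) -> Tuple[bool, Dict[str, bool]]:
    """Run the boot check in every scenario.

    Returns (ok, per-scenario results); ok is True only if every
    scenario booted to the point where the updater could run again.
    """
    results = {s: boots_and_can_update(build, s) for s in scenarios}
    return all(results.values()), results
```

Because the check does not depend on what the update contains, it can stay unchanged across releases, which is the property the comment points out.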
    This sounds so obvious. Hindsight is a beautiful thing.

  • @Chukwu1967
    @Chukwu1967 หลายเดือนก่อน +1

    Hey. Bright side. Southwest Airlines just announced, proudly, that they still use Windows 3.1. So ....have at it.

  • @gruntaxeman3740
    @gruntaxeman3740 หลายเดือนก่อน +1

    One root cause of issue is that bullshit security, having a lot of complexity and adding more complexity in form of some security application.
    In reality, when someone wants to make a reliable system, the amount of complexity is minimized. That is why in critical places all unnecessary "moving parts" are removed and the system is locked down tightly. It can be even better if code is formally verified to avoid bugs.
    Humanity has knowledge how to do this correctly. I even have alone the knowledge how to do it.
    Instead we see bloated software stacks, dumb IT who thinks that end point security software should be installed on critical, dedicated system. Or dumb insurance company who require it.
    One issue is also that today 95% of software developers don't even know how a computer works. There is a lack of deep knowledge, and software developers are actually the people who understand technology better than some lawyer in an insurance company.

  • @PerisMartin
    @PerisMartin several months ago

    Well, the way you solve this is to keep doing what you are doing. Keep teaching and preaching good practices with your videos. You never know the second and third order consequences of your good work. Keep it up!

  • @user-sl8kv8hi2q
    @user-sl8kv8hi2q several months ago

    It seems to me to be the old problem of shutting the gate after the horses have left. What's needed is a kernel process built with a number of 'gates', where the system will not continue past a gate unless it passes that gate's constraint, plus an error 'fail safe' that lets the system run in safe mode so that changes can be made to restore operation if a gate fails. This would have saved CrowdStrike.
    Further, the system should be based on an identity management framework where identities/entities and permissions can't be pivoted or navigated out of to other entities. At the kernel level and beyond. All operations should check credentials before executing programs. This is easy to do if there is a relationship between the entity requesting access and the entity controlling access. No relationship, no access.
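
The 'gates' idea in this comment can be sketched roughly as follows. This is a user-space illustration only, not how any real kernel boots; the gate names and the `run_boot_gates` helper are made up for the example.

```python
from typing import Callable, Iterable, Tuple

# A gate is a (name, check) pair; the check returns True if it passes.
Gate = Tuple[str, Callable[[], bool]]


def run_boot_gates(gates: Iterable[Gate]) -> str:
    """Run gates in order and stop at the first one that fails.

    A check that raises is treated the same as a check that returns False,
    so a crashing component leads to safe mode instead of a crash.
    """
    for name, check in gates:
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            return f"safe-mode (failed gate: {name})"
    return "normal"


def broken_driver_check() -> bool:
    # Stand-in for a component that blows up while loading its data.
    raise RuntimeError("invalid content file")


if __name__ == "__main__":
    print(run_boot_gates([
        ("filesystem", lambda: True),
        ("security-driver", broken_driver_check),
        ("network", lambda: True),
    ]))
```

Later gates are deliberately not run after a failure, which matches the comment's requirement that the system does not continue past a gate it has not passed.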

  • @bernhardkrickl5197
    @bernhardkrickl5197 several months ago

    The promise of Continuous Delivery (as Dave Farley explains so often) is that you can release quickly and safely *because* you have lots of tests. You work in small steps to achieve that. There might be an imminent threat and we will have to make a big change to our software to deal with it. You are back at square one: How do you know your change actually deals with the threat? Oh, that's right: By testing. If you say you need to skip that phase you don't believe in testing in the first place. If you skip that phase you get *something* to the market quicker. But will it help? Or are you pouring oil on the fire? The practice of continuous delivery with TDD is the best insurance that your software stays flexible and easy to change so you can deal with such problems quickly when they arise suddenly.

  • @eglobalsystems2554
    @eglobalsystems2554 several months ago

    That's taught us again: SDETs are an important part of our software life cycle!

  • @daviddunkelheit9952
    @daviddunkelheit9952 several months ago +1

    This failure was quicker in onset and damage than SolarWinds. Diversity in systems... NOW!
    We need to build heterogeneity into the system rather than homogeneity.

  • @MikeHunt-pu5cm
    @MikeHunt-pu5cm several months ago

    The CrowdStrike issue is extremely easy to fix in two minutes...
    Boot into safe mode
    Run cmd (Command Prompt) as administrator
    Type:
    cd %WINDIR%\System32\drivers\CrowdStrike (hit Enter)
    del C-00000291*.sys (hit Enter)
    Then reboot the machine
    All done...!

  • @Colaholiker
    @Colaholiker several months ago +1

    Surprisingly, my employer was not affected. After all, our IT usually doesn't let any chance to screw up pass without making good use of it.
    But we are affected by supposed "AI security software". Some years ago, they changed from a common rule-based antivirus to something that supposedly uses AI. And it is so terrible. Things it likes to target are programs that you just compiled (even a simple "Hello World" that you used to test if a new compiler version is working) and parts of development software that we have used for ages, and now it has even started attacking another supposed security software (we think it is just something they use to monitor what we do) that supposedly filters web traffic. Yes, it deletes that software as well. Of course, this is all just stuff that has an effect on individual workstations, not on a global scale, but it is so annoying...

  • @CaribouDataScience
    @CaribouDataScience several months ago +3

    What’s the cliché say about putting all your eggs in one basket?

  • @christiananke667
    @christiananke667 several months ago

    We tend to adopt solutions that are already provided by nature - and we did here. This CrowdStrike thing can be compared to a disease attacking a species. Diversity helps to prevent extinction - but is no guarantee. Diversity in (security) software helps here as well. So we can be happy that other operating systems exist, otherwise ...
    Furthermore, if we think about software bugs and AI "bugs", we should also think about why children misbehave (or are buggy). It is the same thing: train/program/validate them well and you have what you need - turning that around means society gets the children/software it deserves. That means we are somehow asking for what happened, like children running around like crazy with no idea that moving too fast can hurt them. But - at the latest - physics will teach them. We are not able to raise our children properly anymore, so why do we think we do better in software or AI?
    So long, and thanks for all the fish.

  • @allenpierce4575
    @allenpierce4575 several months ago +1

    Doesn't help that the newer versions of Windows don't allow you to roll back the update without it trying to reinstall right after you remove it.

  • @k98killer
    @k98killer several months ago

    The driver code itself was not updated, but rather a "channel" file that contained attack detection templates was pushed out with all zeros. The driver contained faulty template verification code that allowed the broken file to be parsed, and this included what should have been a valid pointer offset value. The driver then dereferenced a bad pointer and crashed the system. So really, if they had more thorough testing of their core code, they could have prevented this.
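
To illustrate the kind of validation this comment says was missing, here is a small sketch that checks a header and an offset before using them. The file layout (`TMPL` magic, entry count, table offset, 8-byte entries) is entirely invented for the example and is not CrowdStrike's actual channel-file format.

```python
import struct

# Hypothetical layout: 4-byte magic, uint32 entry count, uint32 table offset.
HEADER = struct.Struct("<4sII")
MAGIC = b"TMPL"
ENTRY_SIZE = 8


def parse_template_file(data: bytes) -> list:
    """Return the fixed-size entries, rejecting anything malformed.

    Every value read from the file is checked against the real buffer
    length before it is used to index into the data.
    """
    if len(data) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, count, offset = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    end = offset + count * ENTRY_SIZE
    if offset < HEADER.size or end > len(data):
        raise ValueError("entry table out of bounds")
    return [
        data[offset + i * ENTRY_SIZE: offset + (i + 1) * ENTRY_SIZE]
        for i in range(count)
    ]


if __name__ == "__main__":
    good = HEADER.pack(MAGIC, 2, HEADER.size) + b"A" * 8 + b"B" * 8
    print(parse_template_file(good))
    try:
        parse_template_file(bytes(64))  # a file consisting only of zeros
    except ValueError as err:
        print("rejected:", err)
```

In a memory-unsafe language the same checks would sit in front of the pointer arithmetic; the principle is that a content file is untrusted input even when it comes from your own pipeline.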

  • @TheEvertw
    @TheEvertw several months ago +1

    While I agree with your analysis, this does display a callous disregard by CrowdStrike for the wellbeing of their customers. I expect they will be sued into oblivion.

  • @pureabsolute4618
    @pureabsolute4618 several months ago +1

    First, there is *no way* someone else's driver should be pushed to *your* customers. Second, if it is pushed, protect it with the software stack equivalent of a try-catch. If that takes too many resources, have it remove the try-catch guardrails via something like a manual group policy push.
    But people focus too much on non-scaled performance. If a driver takes 10% more of your computing power by being outside of the kernel, that should be a choice you can select as the customer... and in most cases you should select that. I remember when... Windows XP? crashed because of a bad graphics driver. I was pissed, since the performance hit caused by being outside the kernel wouldn't affect what we were trying to use "NT" for. Kernels should have the option to be (or by default be) as slim as possible.
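
A rough user-space analogy for the "try-catch around the driver" idea: run the risky component in its own process so that a crash or hang there cannot take the host down. This is only an illustration of the isolation trade-off, not a description of how Windows drivers work; `run_isolated` and the plugin snippets are hypothetical.

```python
import subprocess
import sys


def run_isolated(plugin_source: str, timeout_s: float = 5.0) -> bool:
    """Run plugin code in a separate Python process.

    Returns True only if the child exits cleanly. A crash, a non-zero
    exit code, or a hang past the timeout is reported as failure while
    the calling process keeps running.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", plugin_source],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


if __name__ == "__main__":
    print(run_isolated("print('plugin ran fine')"))
    # Simulate a hard crash inside the plugin; the host survives it.
    print(run_isolated("import os; os._exit(3)"))
```

The cost is the one the comment names: process boundaries add overhead, which is the price paid for containing failures.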

  • @boydr7160
    @boydr7160 several months ago

    Proper development Processes = Pay more money for cybersecurity.
    Managers: No, I would rather outsource all my cybersecurity to some other company for as LITTLE money as possible. Then blame them instead.

  • @normanlorrain
    @normanlorrain several months ago

    One suggestion I heard: make EULAs illegal. There is no reason that CrowdStrike should not be subject to a giant class action lawsuit over this failure. If companies are legally liable and face financial repercussions, they will implement the processes needed on their own. E.g., Ford Pinto.

  • @esra_erimez
    @esra_erimez several months ago +1

    As a society we are too dependent on computers in general.

  • @semibiotic
    @semibiotic several months ago

    That is a question of proper, profit-related system administration.
    Companies lay off their system administrators and turn all updates on automatically? So they get automatic crashes like that.
    Big companies should have system administrators to manage infrastructure maintenance, growth and updates safely.

  • @mojoneko8303
    @mojoneko8303 several months ago

    The complete failure of the internet would be a good premise for a sci-fi doomsday disaster movie. This has me thinking that being a prepper might be a good idea...

  • @attilazimler1614
    @attilazimler1614 several months ago

    A bunch of those systems have no real reason to connect to the internet, just a reason to connect to another internal node. And if you cannot get to the device directly, it wouldn't need this protection, so the protection couldn't break it. It seems that what we actually forgot is proper isolation of the systems.

  • @Calphool222
    @Calphool222 several months ago

    In this particular case, there's something more basic that would have caught the problem, and it requires no slowdown of the development process when new malware shows up. When they deploy code, they need their software to phone home once the OS boots up, that's it. If, after deploying a new channel file and rebooting their test servers, they had waited for the "phone home" before moving the code down the line for further testing and eventual deployment, they would have caught this problem. The user-land code would never have phoned home because *the OS was stuck* in boot. This is really just basic smoke testing, and it shows how immature their deployment pipeline must be.
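
The "wait for the phone home" step could look something like the sketch below. `has_checked_in` is an assumed callback (for instance, a query against whatever service records agent check-ins); the function names and defaults are illustrative.

```python
import time
from typing import Callable


def wait_for_phone_home(
    has_checked_in: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
) -> bool:
    """Poll until the rebooted test machine reports in or time runs out.

    Returns True if a check-in was seen before the deadline, else False.
    A machine stuck in a boot loop never checks in, so it fails the gate.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if has_checked_in():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def promote_if_alive(has_checked_in: Callable[[], bool], timeout_s: float) -> str:
    """Decide whether the candidate moves to the next pipeline stage."""
    if wait_for_phone_home(has_checked_in, timeout_s, poll_s=0.01):
        return "promote"
    return "block"


if __name__ == "__main__":
    # Simulated machine that never comes back after the reboot.
    print(promote_if_alive(lambda: False, timeout_s=0.05))
```

The check says nothing about whether the update detects malware; it only proves the machine is still alive enough to receive a fix.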

  • @kellyaquinastom
    @kellyaquinastom several months ago +1

    This is called “Experience”

  • @eduardodiaz5459
    @eduardodiaz5459 several months ago

    The fundamental problem is that all the apps and OSes install, change or update whatever they want without asking the user.

  • @d0wnboy
    @d0wnboy several months ago

    I love this. Brain-dead businesses stick their IT in the cloud and lose complete control of their businesses. It couldn't happen to a better class of people.

  • @rickchandler2570
    @rickchandler2570 several months ago +1

    I think the responsibility lies with Microsoft. Kick everyone out of the kernel. I worked in support at CS for 6 years and remember when Apple kicked everyone out of the kernel about 3-4 years ago. We figured out how to make our product work in that situation, and they can figure out the same for Windows.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago

      In addition, Microsoft should have two copies of their OS in separate partitions. OS upgrades should use a blue/green methodology: update the inactive partition and boot into it; if it comes up, great, make it the active partition; if it fails, revert to the original partition. The OS should also be able to detect boot loops before loading drivers, so that it can revert to the other partition should a boot loop occur.
      If ChromeOS can do this, why can't Windows?
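
The blue/green partition scheme described in this reply can be modelled as a small state machine. This is a simplified sketch under assumed names (`BootState`, slots "A" and "B", a retry counter); real A/B update implementations differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    active: str = "A"              # slot known to boot correctly
    pending: Optional[str] = None  # freshly updated slot awaiting proof
    tries_left: int = 0            # boot attempts remaining for pending


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


def stage_update(state: BootState, max_tries: int = 3) -> None:
    """Record that the update was written to the inactive slot."""
    state.pending = other(state.active)
    state.tries_left = max_tries


def choose_slot(state: BootState) -> str:
    """Called by the (hypothetical) bootloader on every boot."""
    if state.pending is not None:
        if state.tries_left > 0:
            state.tries_left -= 1
            return state.pending
        # Attempts exhausted without a success mark: treat it as a boot
        # loop and abandon the new slot.
        state.pending = None
    return state.active


def mark_boot_successful(state: BootState) -> None:
    """Called once the new slot has fully booted; makes it permanent."""
    if state.pending is not None:
        state.active = state.pending
        state.pending = None
        state.tries_left = 0


if __name__ == "__main__":
    state = BootState()
    stage_update(state, max_tries=2)
    # The new slot never reports success, so the third boot falls back.
    print([choose_slot(state) for _ in range(3)])
```

The key property is that the known-good slot is never modified until the new one has proven itself, so a bad update costs a few failed boots rather than a manual recovery.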

    • @sergioyichiong7269
      @sergioyichiong7269 several months ago

      @@JeanPierreWhite ChromeOS is a new OS.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago

      @@sergioyichiong7269 If 2011 qualifies ChromeOS as new, then yep, it's new. Windows is old. That's kinda the point.

  • @henryvaneyk3769
    @henryvaneyk3769 several months ago

    Part of the solution is the adoption of safe languages like Rust for system-related components.

  • @dannym817
    @dannym817 several months ago

    As a software engineer myself:
    - Too little time: to test, to build well/refactor, to rebuild legacy code. Deadlines push bad or poorly tested software into production.
    - Too much stress because of too much firing, people leaving and rehiring all the time. And with that, the knowledge of parts of the software is gone.
    - A lot of bad managers in the IT world, who make the above happen.
    - Companies see software development/IT as a cost instead of a win: for example, in some companies I worked at, salespeople get bonuses when they make enough sales selling the software, while software engineers don't get anything.
    - There is no really easy way for managers or people who can't read code to see how well or badly a software engineer has been working. Because of this, most companies only look at speed, not at how well the software is written.
    This has been happening for a very long time in lots of companies, probably most of them, with legacy code that isn't workable anymore and is very hard to maintain, and that should have been replaced years ago.

  • @johnmcway6120
    @johnmcway6120 several months ago +1

    it's just going to happen sometimes. there are construction accidents, there are medical accidents, there are accidents in every field big or small regardless of their complexity.
    no manager is going to see this happen and say, hey guys we just spoke with the board and we decided that we can invest twice as much money to ship this feature and decided to move the expected delivery date by 2 months to ensure there's no crunch and devs are well rested. that's not how business works, not in my experience.
    a good answer is to always have backup systems in place. when driving cars one should always keep a fire extinguisher and spare tire. there are steps that all of us can take, developers, managers, users, but we won't. that's why you should just get used to the idea that this is simply going to keep happening.

    • @JeanPierreWhite
      @JeanPierreWhite several months ago

      There are more resilient OSes than Windows.
      Companies need to move away from Windows on the desktop for critical systems.

  • @Douglas_Blake_579
    @Douglas_Blake_579 several months ago

    Two takeaways from this mess ...
    1) The day will come when system complexity reaches a level where failure is inevitable
    2) We are rapidly becoming so reliant upon our technology that one day the lights will all go off and there won't be anyone to get them back on. (And trust me, Google and AI won't even be options when that happens).
    Oh boy, I sure wouldn't want to be the guy who pushed that update out over the wires. He got some serious 'splainin to do.

  • @TheWolverine1984
    @TheWolverine1984 several months ago +2

    I thought this all was a long-winded setup for "I wrote a free book about how to be a senior software engineer"
    It's like "How do we solve all those problems? Well, I don't know, but I wrote a free book about how to be a better software engineer"
    😆

    • @ArjanCodes
      @ArjanCodes  several months ago +2

      Haha, now I only need to write a book 😁.

    • @TheWolverine1984
      @TheWolverine1984 several months ago +1

      @@ArjanCodes That would be great actually.

    • @joansparky4439
      @joansparky4439 several months ago

      @@ArjanCodes You did not ask the important question - why was CrowdStrike relied on by so many? What about ALTERNATIVES, COMPETITION, REDUNDANCY? How did a competitive economic system create a "monopoly" (which is intrinsically subject to this kind of failure)?
      _The fundamental problem here is sociological in nature and (if one digs deeper) actually caused by how life itself functions, but that is far outside of programming._

    • @Ramdileo_sys
      @Ramdileo_sys several months ago

      @@ArjanCodes Why didn't my Win10 computers here crash or anything?? ... and everybody was working normally... because I don't let somebody I don't know update my computers just because some as^&%%shole said I have to... I update my computer if I need to... and only after I have tried that update on a non-essential machine for at least some days or a week... today my Windows is running with the same files it was running yesterday... and also last week... and last month... and the same software that I installed on it last year... it boils my piss that these imbeciles are constantly beta-testing their crap on the computers I use for work... as if this were a 1960s terminal instead of a PERSONAL computer... I don't understand how the hell you people tolerate this nonsense over there... were there medical centers with this problem over there?? ... and probably nuclear plants were affected too... yes... because those things are connected to the internet sewer, right... because the retardation is overwhelming in this world... so yeah, probably...

  • @laurentitolledo1838
    @laurentitolledo1838 several months ago +1

    don't use a life saving device that needs constant connection to the internet....

  • @DavidTangye
    @DavidTangye several months ago

    The two best ways to reduce the risk and impact: canary releases, and switching to Linux.
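
For readers unfamiliar with canary releases, here is a minimal sketch of the idea: expose a small, stable slice of machines to an update first, and widen the rollout only while the observed failure rate stays low. The stage percentages, the 1% threshold and the function names are assumptions made up for the example.

```python
import hashlib


def rollout_bucket(host_id: str) -> int:
    """Map a host to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_canary(host_id: str, percent: int) -> bool:
    """True if this host should receive the update at this rollout stage."""
    return rollout_bucket(host_id) < percent


def next_stage(
    current_percent: int,
    failure_rate: float,
    max_failure_rate: float = 0.01,
    stages=(1, 5, 25, 100),
) -> int:
    """Return the next rollout percentage, or 0 to halt and roll back."""
    if failure_rate > max_failure_rate:
        return 0
    for stage in stages:
        if stage > current_percent:
            return stage
    return current_percent  # already fully rolled out


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(1000)]
    print(sum(in_canary(h, 5) for h in hosts), "of", len(hosts), "hosts in the 5% stage")
    print(next_stage(1, failure_rate=0.0))   # healthy canary: widen
    print(next_stage(5, failure_rate=0.4))   # canary hosts crashing: halt
```

Hashing the host ID keeps each machine in the same bucket across stages, so the hosts updated at 1% are a subset of those updated at 5%.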