What Happened?
Today, a faulty content update published by security vendor CrowdStrike is causing a significant global IT outage that affects industries including aviation, banking, healthcare, technology, retail, and media.
Worldwide, thousands of computers running Microsoft Windows and CrowdStrike’s Falcon security software now show the Blue Screen of Death (BSOD) after receiving an automatic CrowdStrike update. CrowdStrike reported over 24,000 customers in its last earnings report, so it’s certainly possible that hundreds of thousands of systems are affected. Let’s be clear: CrowdStrike is a highly reputable vendor. If this can happen to them and their customers, it shows how vulnerable our IT environments can be.
Full disclosure: Cequence operations have not been affected in any way, and we continue to protect our customers’ applications and APIs without any disruption due to this outage.
What is the Impact?
As of publication, this incident is still ongoing. For example, major airlines are unable to operate because so many of their computers run Windows and CrowdStrike, and many of their key systems can now show only the BSOD. It bears repeating that this affects only Windows machines running CrowdStrike as the security component.
When a Windows machine hits the Blue Screen of Death, it is just that: the operating system will not boot normally, the screen turns blue, and an error message explains that Windows cannot run. Usually when this happens, an experienced, technical user boots into “safe mode,” tracks down the offending file (typically a driver or a DLL, a Dynamic Link Library of shared code that programs load at run time), and removes or replaces it. In this incident, that means every affected machine needs a human to physically stop by and correct things. A machine that crashes before it finishes booting generally cannot receive a recall update or a roll-forward update, so nothing can be sent to it remotely, which slows recovery efforts immensely. Also, many IT teams are now outsourced to virtual centers and cannot physically get to the computers they are paid to support and fix.
What Can Organizations Learn from It?
If you weren’t impacted by this, what can you take away? I have been focusing on how centralized systems cause huge problems when they break. Earlier this year, healthcare organizations couldn’t process payments because an API attack took an organization that brokers their transactions offline. Understanding where your centralized services are, and knowing what to do if they go down, is important both for business resumption and for understanding real operational risk.
Additionally, this happened because an update was pushed and automatically deployed. Patching is very important, and it needs care and thought. Early in my career I was called to another office building because someone pushed a patch that broke the login screen on every computer. Twenty-four deskside techs with 3.5” disks in hand had to visit each machine and reboot it to a batch file that we used to update the login screen. The goal of IT is to enable business; incidents like these can really drag it down.
Closing Thoughts
So how do you avoid this happening in your own organization? Here are some suggestions:
- Start by understanding where your centralized services are and where they get their directives. A cloud-managed software agent could deliver malware, bugs, or other issues, so auto-deploy shouldn’t be the default first step.
- IT teams should validate anything that comes in, making sure it doesn’t break a proprietary or legacy system and then authorize the auto deploy to a sample group of machines. If there are no issues, then everyone gets it. This assumes a small IT team. If you have hundreds of desk-side support personnel, blast away. Most organizations don’t have enough IT “hands on keyboard” type folks to survive a problem like this.
- Adopting emerging technology is fine, as long as the appropriate backup and restoration systems are in place to ensure business continuity.
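The validate-then-sample-group idea in the list above can be sketched as a ring-based rollout. What follows is a minimal Python illustration, not a description of how CrowdStrike or any other vendor ships updates. The function name `staged_rollout`, the ring names, and the `deploy` and `is_healthy` callbacks are all hypothetical stand-ins for whatever deployment tooling and health checks your organization actually uses.

```python
from typing import Callable, Dict, List, Tuple


def staged_rollout(
    rings: List[Tuple[str, List[str]]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Deploy ring by ring and stop as soon as a ring exceeds the failure budget.

    rings: ordered (ring_name, hosts) pairs, smallest and safest group first.
    deploy / is_healthy: hypothetical hooks into your own tooling.
    """
    deployed: List[str] = []
    for ring_name, hosts in rings:
        # Push the update to every host in the current ring only.
        for host in hosts:
            deploy(host)
            deployed.append(host)

        # Check the ring before anyone else gets the update.
        failures = [h for h in hosts if not is_healthy(h)]
        failure_rate = len(failures) / len(hosts) if hosts else 0.0
        if failure_rate > max_failure_rate:
            # Halt here: later rings never receive the bad update.
            return {"halted_at": ring_name, "deployed": deployed, "failures": failures}

    return {"halted_at": None, "deployed": deployed, "failures": []}


if __name__ == "__main__":
    # Made-up host names, purely for illustration.
    rings = [
        ("lab", ["lab-01", "lab-02"]),
        ("pilot", ["pilot-01", "pilot-02", "pilot-03"]),
        ("everyone", ["desk-%03d" % i for i in range(1, 6)]),
    ]
    broken = {"pilot-02"}  # pretend this machine fails after the update
    result = staged_rollout(
        rings,
        deploy=lambda host: None,
        is_healthy=lambda host: host not in broken,
    )
    print(result["halted_at"], len(result["deployed"]))
```

In this toy run the rollout halts at the pilot ring, so five machines need attention instead of ten. The design choice that matters is the ordering: the smallest, most recoverable group goes first, and the default failure budget of zero means a single broken machine is enough to stop the push.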
CrowdStrike has a suggested fix, and their CEO reported the problem has been identified. The fix is basically this: boot the machine into safe mode, delete the faulty CrowdStrike channel file (the one matching C-00000291*.sys), and reboot. Hands on every machine. This is a great example of the kind of scenario organizations should be rehearsing in tabletop exercises. Unfortunately, today’s event isn’t a practice run.
To relate this event to API security and bot management, which is Cequence’s core business, one of our fundamental principles has been to protect as many applications and APIs as possible without requiring significant changes to the customer applications or infrastructure. Software and networks are highly dynamic, so the more effective a solution can be despite those changes, the better. Resilience in the face of unexpected change is a requirement for cybersecurity that won’t result in business continuity failures. We fully support the organizations impacted by this outage, CrowdStrike, and the overall cybersecurity community, and we are here to help in any way we can.
Update 20 Jul 2024: See CrowdStrike’s “Technical Details: Falcon Content Update for Windows Hosts” and Microsoft’s “Helping our customers through the CrowdStrike outage.”

 
  
  
 