There has been a ton of commentary, analysis, and response regarding the CrowdStrike incident that happened on July 19, 2024. As of today, July 26, 2024, there are still companies recovering and trying to get back to normal operations. In the coming months, I’m sure there are going to be hearings and other outcomes related to this, but I wanted to offer up my take on what happened and how we prevent this from happening again.
First off, I want to thank all the people out there working tirelessly to recover from this. You all deserve a vacation when this is back to normal. It’s been a few years since I did operations, but I know these types of incidents are taxing and stressful.
Also, thank you to all the creators out there, the memes coming from this have been straight gold.
Secondly, I have been around long enough to have been part of similar incidents (maybe not at this size / scope, but still significant). I remember when Symantec Endpoint Protection pushed a bad virus definition update and caused a lot of issues, when McAfee endpoint protection pushed a bad signature update, and other similar ‘security’ related incidents. Everyone is quickly pointing out things like:
1. This is a problem with Microsoft OS architecture and how insecure it is.
2. This was a hack / geo-political attack to disrupt news and US elections.
3. This was a bad actor acting on behalf of a nation state to disrupt worldwide operations.
I don’t agree with any of the above. I view this as a change management and business operations problem. Given the current threat actors and landscape, we have prioritized speedy response and protection over established processes. Because of the global reach and active deployment of the protections CrowdStrike provides, we prioritized that speed over managed deployments, testing, and production management. Yes, there may have been poor testing on the CrowdStrike side in this incident, but that misstep just highlighted the misstep we have established as an industry.
Immediately after the above incidents, I remember establishing plans for virus signature downloads and central pushing / managed updates. This allowed for thorough testing and validation before an update hit the mass of machines. In this case, we allowed a vendor to push out an update to every machine without testing or validating that change. On top of that, this vendor had full access to core infrastructure, operations, etc. Oddly enough, in the responses coming from CrowdStrike leadership this week, they have reached the same conclusion and are introducing more testing, staggered deployments, etc. to limit the impact if / when this happens again.
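To make that concrete, here is a minimal sketch of the kind of internal gate I mean, with hypothetical names and a placeholder health check rather than any vendor’s actual API: vendor content lands in a staging channel, gets validated on a small test group, and only then is promoted to what the rest of the fleet pulls.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    STAGING = "staging"        # where vendor updates land first
    PRODUCTION = "production"  # what the broader fleet actually pulls


@dataclass
class ContentUpdate:
    version: str
    channel: Channel = Channel.STAGING


def run_smoke_test(host: str, update: ContentUpdate) -> bool:
    # Hypothetical: in practice this would push the update to the host,
    # wait for it to check back in, and verify the agent / OS is healthy.
    print(f"smoke-testing {update.version} on {host}")
    return True


def validate_on_test_group(update: ContentUpdate, test_hosts: list[str]) -> bool:
    """Deploy to a small, representative test group and confirm every host
    stays healthy before the update is allowed anywhere near production."""
    return all(run_smoke_test(host, update) for host in test_hosts)


def promote(update: ContentUpdate, test_hosts: list[str]) -> ContentUpdate:
    """Only move an update from staging to production if validation passes."""
    if not validate_on_test_group(update, test_hosts):
        raise RuntimeError(f"{update.version} failed validation; holding in staging")
    update.channel = Channel.PRODUCTION
    return update


if __name__ == "__main__":
    update = ContentUpdate(version="sensor-content-2024.07.19")
    promote(update, test_hosts=["canary-01", "canary-02", "canary-03"])
    print(f"{update.version} promoted to {update.channel.value}")
```

The point is not the specific code; it is that there is a deliberate approval step between “the vendor shipped it” and “every endpoint is running it.”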
So, how do we fix this (and I hope we do)? We need to push our vendors, especially security vendors and others with privileged access, to provide staged deployment, testing, and validation capabilities. We should not be allowing direct, unvetted updates to endpoints.
Now, you’re going to say it is critical to get these updates out quickly since they are active protection mechanisms for endpoints and infrastructure. Ok, I agree, speedy deployment of these updates is important. But I cannot think of a change management process for firewalls, patching, etc. that allows changes to go straight to production without testing / validation. Everyone, soon to include CrowdStrike, has to go through change control and/or staged deployment and testing.
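The same idea applies to the rollout itself. Here is a hedged sketch of ring-based (staged) deployment, assuming made-up ring names, a toy soak time, and a placeholder health check: each ring only receives the update after the previous ring has stayed healthy, and the rollout halts the moment a ring looks bad.

```python
import time

# Hypothetical rings: a handful of canaries, then a pilot slice, then everyone.
RINGS = {
    "ring0-canary": ["canary-01", "canary-02"],
    "ring1-pilot": [f"pilot-{i:02d}" for i in range(1, 21)],
    "ring2-broad": [f"host-{i:04d}" for i in range(1, 1001)],
}

SOAK_SECONDS = 5  # stand-in for the hours or days of soak time between rings


def deploy_to_hosts(version: str, hosts: list[str]) -> None:
    print(f"deploying {version} to {len(hosts)} hosts")


def ring_is_healthy(hosts: list[str]) -> bool:
    # Placeholder: in practice, check crash rates, agent check-ins, boot loops,
    # help-desk ticket volume, etc. for the hosts in this ring.
    return True


def staged_rollout(version: str) -> None:
    """Promote the update ring by ring, halting at the first unhealthy ring."""
    for ring_name, hosts in RINGS.items():
        deploy_to_hosts(version, hosts)
        time.sleep(SOAK_SECONDS)  # let the ring soak before judging it
        if not ring_is_healthy(hosts):
            print(f"halting rollout: {ring_name} unhealthy after {version}")
            return
        print(f"{ring_name} healthy, promoting {version} to the next ring")
    print(f"{version} fully deployed")


if __name__ == "__main__":
    staged_rollout("sensor-content-2024.07.19")
```

A bad update caught at the canary ring is an annoyance; the same update pushed to every machine at once is a global outage.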
Had we had this in place ahead of the update, we would have caught the problem before it got as far as it did and, hopefully, limited the impact. Going forward, we need to find a balance between prioritizing protections / responses and established change control processes. This is not a ‘security’ incident; it is a change management and business operations problem. And, if anything, it exposed the gap before someone truly malicious found the same hole and pushed out something far more dangerous.