
If you were flying on Friday, July 19th, 2024, it is highly likely that some aspect of your trip was impacted by a global system outage caused by a significant issue with the CrowdStrike Falcon endpoint protection software used by all six of the largest global airlines. Everything from internal systems such as gate agent workstations and aircraft scheduling tools to customer-facing systems like check-in kiosks and flight status monitors was affected. The result was a massive number of flight disruptions that left millions of travelers delayed or stranded.

The impact was not limited to the airlines. Large enterprise customers in every industry were affected, with IT professionals scrambling around the clock to get systems back online. Hospitals had to delay non-emergency procedures, and some banks were completely offline. According to the global risk management advisory firm Parametrix, 124 of the Fortune 500 businesses were impacted, with an estimated financial impact exceeding $5.4 billion.

Complicating matters was the fact that in the initial hours no one knew the root cause, or whether the event was the result of a coordinated global cyber attack or something more mundane. As a result, it was hard in those early hours to know if and when a full recovery would be possible.

Fortunately, as the smoke clears, we have a more complete view of what happened, and from it we can glean a number of key insights that apply to any organization.

What Happened?

According to CrowdStrike’s own Post Incident Review (PIR), in the early morning hours of Friday, July 19th, CrowdStrike released a real-time content configuration update for its Falcon sensor product. The configuration file was flawed in such a way that it caused any Windows PC or server running the sensor to crash into an unrecoverable reboot loop, displaying the error screen well known to IT technicians as the “Blue Screen of Death” (BSOD). The problem was isolated and resolved within 90 minutes, and any computer that first came online after that point was unaffected by the outage.

The CrowdStrike platform is architected so that client systems check in regularly to download new configuration files. The frequency of those check-ins, as well as how quickly new releases are installed, is configurable within the software for each customer. Many customers choose near-continuous check-ins and immediate installation so that new protections are adopted rapidly to help prevent zero-day attacks.
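
To make that configuration tradeoff concrete, here is a minimal sketch of how an agent-style check-in and install policy can be modeled. It is a hypothetical illustration, not CrowdStrike’s implementation; the names (UpdatePolicy, fetch_latest_version, agent_loop) and the specific intervals are assumptions made for the example.

```python
from dataclasses import dataclass
import time

@dataclass
class UpdatePolicy:
    check_in_seconds: int = 60       # how often the endpoint polls for new content
    install_delay_seconds: int = 0   # 0 = install immediately; >0 = soak before installing

def fetch_latest_version() -> str:
    """Stand-in for a call to the vendor's update service."""
    return "content-2024.07.19"

def install(version: str) -> None:
    print(f"installing {version}")

def agent_loop(policy: UpdatePolicy, installed: str) -> None:
    """Poll for new content on the configured interval and install per policy."""
    while True:
        latest = fetch_latest_version()
        if latest != installed:
            # Aggressive policies install immediately for fastest protection;
            # cautious policies wait out a soak period in case the release is pulled.
            time.sleep(policy.install_delay_seconds)
            install(latest)
            installed = latest
        time.sleep(policy.check_in_seconds)
```

The point is simply that check-in frequency and installation delay are policy choices: the more aggressive the settings, the faster new protections arrive, and the faster a flawed release arrives as well.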

Any system that was online and checking in at the time the flawed configuration was published received the file and went into the BSOD loop. Once a system entered the loop, the only path to recovery was manual intervention; in other words, every affected machine had to be individually touched to repair and restart. While the repair process was reasonably straightforward and quick, the sheer volume of affected systems thwarted rapid recovery. Organizations using virtual machines had a slightly easier recovery path, but were still impacted.

Interestingly, CrowdStrike had solid testing, quality assurance, and deployment processes in place designed to prevent exactly this kind of failure. However, the flawed configuration passed all of their internal validation checks (the PIR attributes this to a bug in the content validator itself), showing that even a well-thought-out architecture is subject to flaws in design and logic.

Key Lessons

1. Develop Business Resilience – How sensitive is your business to technology outages? Is there a single system that could take your business down? How long could you survive if key infrastructure was unavailable, for example your network, your data center, or your cloud applications? If necessary, could you pivot to manual processes for a period of time to keep critical processes running? Smart businesses evaluate each piece of technology by asking these key questions and, for critical processes and applications, predetermine what steps would be necessary to keep the business running. It is also important to understand what is most critical so that you can prioritize recovery efforts. For example, in the CrowdStrike outage, airlines prioritized getting key flight management and operations systems back online before pivoting to customer-facing technologies. While this created customer frustration, at least the planes could take off and land safely.

2. Consider Product Diversity – There will always be single points of failure in a system, and key enterprise applications are by design generally single-threaded. But there are instances where it is possible to implement diverse products across the enterprise. In this case, some organizations had more than one endpoint protection solution in place, so endpoints running the alternatives remained fully functional. There are, of course, tradeoffs to this approach, including added complexity, higher cost, and, in the case of security tools, potential gaps in coverage.

3. Evaluate Your Internal Quality Paradigms – Just like CrowdStrike, most organizations have established product delivery pipelines. Do you have processes in place to ensure that, once a product has completed testing, it can be delivered without unauthorized changes that could introduce unanticipated problems? Many system outages can be traced directly to inadequate change control: a problem is discovered late in the deployment process, and tech teams slipstream changes on the assumption that “this one little change won’t hurt anything” just to keep the pipeline moving. One simple safeguard is to pin the exact artifact that completed testing and reject anything else at deploy time, as sketched below.
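
The snippet below is a minimal illustration of that safeguard, assuming a simple two-stage pipeline. The function names (record_tested_artifact, gate_deployment) and the manifest format are hypothetical and not taken from any particular CI product.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Fingerprint the artifact's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_tested_artifact(artifact: Path, manifest: Path) -> None:
    """Run at the end of the test stage: pin the exact build that passed QA."""
    manifest.write_text(json.dumps({"artifact": artifact.name, "sha256": sha256_of(artifact)}))

def gate_deployment(artifact: Path, manifest: Path) -> None:
    """Run at the start of the deploy stage: reject anything that changed after testing."""
    pinned = json.loads(manifest.read_text())
    if sha256_of(artifact) != pinned["sha256"]:
        raise RuntimeError(
            f"{artifact.name} does not match the build that passed testing; deployment rejected"
        )
```

Even a check this simple makes slipstreamed, untested changes visible before they reach production.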

4. Implement Deployment Staging – It can be tempting with vendor-supplied products to always move to the latest and greatest releases, and with security products, time is often of the essence in getting the latest protections into the field to blunt emerging threats and zero-day attacks. But products can be configured to roll out in stages, so that negative impacts are detected and isolated early, before the entire enterprise is affected.

As an example, here is an excerpt from CrowdStrike’s PIR that specifically discusses their product staging approach: “The sensor release process includes…a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.”

This is a solid staging strategy for sensor releases, and organizations that opt for an N-1 or N-2 policy on critical systems gain a meaningful buffer against a bad sensor build. Notably, the flawed file in this event was a rapid-response content update delivered through a faster path that was not governed by those sensor policies, which is why it reached so many fleets at once; CrowdStrike’s PIR commits to applying similar staged rollouts to content updates going forward. Either way, the tradeoff is speed to protection versus product stability, and each organization needs to decide which is more important, sometimes all the way down to the individual technology asset. A simplified sketch of a ring-based rollout with automated health checks follows.
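
The example below walks a release through progressively larger deployment rings and halts automatically when health telemetry crosses a threshold. It is a rough sketch only: the ring names, fleet fractions, and failure threshold are illustrative assumptions, and the telemetry call is a stand-in for real crash and health reporting.

```python
import random  # used only to simulate health telemetry in this sketch

# Deployment rings, from smallest blast radius to largest. Names and fleet
# fractions are illustrative, not any vendor's actual configuration.
RINGS = [
    ("internal dogfood", 0.01),
    ("early adopters", 0.05),
    ("critical systems", 0.30),
    ("general fleet", 1.00),
]

MAX_FAILURE_RATE = 0.002  # halt the rollout if more than 0.2% of a ring reports crashes

def deploy_to(ring: str, fraction: float) -> float:
    """Push the release to one ring and return its observed failure rate.
    In a real pipeline this would come from crash and health telemetry."""
    print(f"deploying to {ring} ({fraction:.0%} of fleet)")
    return random.uniform(0.0, 0.004)  # simulated telemetry

def staged_rollout() -> bool:
    """Advance ring by ring, stopping as soon as a ring looks unhealthy."""
    for ring, fraction in RINGS:
        failure_rate = deploy_to(ring, fraction)
        if failure_rate > MAX_FAILURE_RATE:
            print(f"halting rollout: {failure_rate:.2%} failure rate in {ring}")
            return False
    print("rollout complete across all rings")
    return True

if __name__ == "__main__":
    staged_rollout()
```

The same pattern applies whether the staging is implemented by the vendor, as CrowdStrike describes, or by the customer through update policies on their own fleet.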

5. Evaluate Your Supply Chain – Business resilience doesn’t end at your own operations. What would happen if a key supplier was taken offline by a significant event? In this instance, even companies that weren’t CrowdStrike customers found themselves impacted by the chain of events. Organizations should challenge their suppliers, especially their IT product vendors, to ensure they have had these same conversations and addressed them in their own internal processes.

Conclusion

There will be much written in the coming months about the causes and impacts of the CrowdStrike incident. Some will conclude that CrowdStrike is no longer worth the risk as a security vendor and look to pivot away. I recommend a more pragmatic view: CrowdStrike is not the first to experience an event like this, nor will it be the last, and the vendor you pivot to could be the next one impacted. Indeed, much larger vendors (whose names begin with M, A, and G) have experienced near-catastrophic outages, and yet they continue to grow and mature in this cloud-centric world. CrowdStrike, for its part, remains a leader in its market space, and one could argue that this event will only make the company stronger as it applies the lessons learned to its internal processes.

Instead, I recommend that organizations use this incident as a case study for evaluating their own exposures going forward. Smart IT and governance professionals will tee up conversations about the topics above to evaluate where it makes sense to drive additional maturity into their product delivery and vendor management processes.

Andy Weeks
Director
SunHawk Consulting, LLC
Andy.Weeks@SunHawkConsulting.com

Andy Weeks is a Director with SunHawk Consulting, LLC and has a dynamic background spanning Executive Management, Information Security, Identity Management, Information Technology, and Strategic Planning. He has proven expertise in leading high-performance teams and driving results across multiple business areas.

SunHawk experts are highly experienced professionals ready to assist you within our focus areas of:

Healthcare Compliance | Corporate Investigations
Corporate Compliance | Litigation Disputes
