Real-Case Analysis #29: CrowdStrike Update Crashes Millions of Systems
Elisabeth Do
July 23, 2024
A major IT outage occurred after CrowdStrike, a well-known American cybersecurity company, published a faulty update. The incident, also known as the "CrowdStrike Disruption," is regarded as one of the largest IT outages in history.
Incident Details
Incident Description
On July 19, 2024, at 04:09 UTC, CrowdStrike released a flawed content update for its Falcon Sensor software, triggering a global IT disruption that affected an estimated 8.5 million Windows computers. Affected systems crashed with a blue screen of death (BSOD), and many were unable to restart correctly. The outage had far-reaching consequences, disrupting vital services across industries including airlines, healthcare, finance, retail, and government.
Timeline
Detection: The problem was discovered almost immediately, as Windows virtual machines on Microsoft Azure began rebooting and crashing. By 06:48 UTC, Google Compute Engine had also reported the issue.
Notification: At 07:15 UTC, Google publicly confirmed that the CrowdStrike update was to blame. CrowdStrike CEO George Kurtz confirmed the issue within hours.
Mitigation: CrowdStrike rolled back the problematic content update at 05:27 UTC. However, many affected systems required manual intervention to recover.
Resolution: At 09:45 UTC, Kurtz confirmed that a fix had been deployed. However, full resolution for all affected systems took longer, because manual fixes were frequently required.
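The timeline is grouped by phase rather than by clock time, so the order of events is easy to misread. The short sketch below puts the reported times in chronological order and expresses each as minutes after the 04:09 UTC release. The event labels and the function name are illustrative choices made here; only the times come from the timeline.

```python
from datetime import datetime, timezone

# Times (UTC, July 19, 2024) as reported in the timeline; labels are paraphrased.
EVENTS = {
    "Faulty update released": "04:09",
    "Update rolled back": "05:27",
    "Google Compute Engine has reported the issue": "06:48",
    "Google attributes the crashes to the update": "07:15",
    "Kurtz confirms the fix is deployed": "09:45",
}


def minutes_after_release(events, release_key="Faulty update released"):
    """Return (label, minutes since release) pairs in chronological order."""
    def parse(hhmm):
        hour, minute = map(int, hhmm.split(":"))
        return datetime(2024, 7, 19, hour, minute, tzinfo=timezone.utc)

    release = parse(events[release_key])
    offsets = [
        (label, int((parse(t) - release).total_seconds() // 60))
        for label, t in events.items()
    ]
    return sorted(offsets, key=lambda pair: pair[1])


for label, minutes in minutes_after_release(EVENTS):
    print(f"+{minutes:3d} min  {label}")
```

Read this way, the rollback (78 minutes after release) came before the public attribution by Google, which is why the remaining work was recovering already-crashed machines rather than stopping the update.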
Root Cause Analysis
Initial Cause: The outage was triggered by a logic error in a sensor configuration update, specifically in Channel File 291, which controls how Falcon evaluates named pipe execution on Windows systems.
Underlying Issues: The update was designed to target newly observed malicious named pipes used by common command-and-control (C2) frameworks in cyberattacks. However, a flaw in its logic caused the operating system to crash. Two further factors made recovery harder:
Many affected systems were stuck in boot loops and could not automatically download the fix.
Many corporate devices used BitLocker encryption, which complicated recovery by requiring recovery keys.
Impact Analysis
Immediate Impacts
Widespread Service Interruptions
Airlines: Delta Air Lines was heavily affected, canceling over 4,000 flights and experiencing significant operational delays.
Healthcare: Hospitals and healthcare services, including the UK's National Health Service, faced disruptions, although many systems were gradually restored.
Financial Sector: Global banks experienced outages, affecting transactions and operations.
Economic and Logistical Consequences
The financial damage from the outage is estimated to be at least $10 billion, considering the widespread service interruptions and recovery efforts required.
Freight and supply chain sectors faced prolonged disruptions, with experts noting that recovery could take up to three times longer than the duration of the outage itself.
Long-Term Impacts
Reputation and Trust
CrowdStrike's reputation suffered a serious blow. Following the incident, the company's shares fell by more than 13%, indicating investor concerns about the trustworthiness of its services.
The event has exposed the risks of relying too heavily on a single cybersecurity supplier, leading to discussions about the need for more resilience and redundancy in IT systems.
Regulatory and Compliance Considerations
The incident may result in heightened regulatory scrutiny and possible legislative steps to improve quality assurance and compliance with frameworks such as the Secure Software Development Framework (SSDF).
Companies are likely to reconsider their compliance with cybersecurity regulations and practices in order to avoid such risks in the future.
Sector-Specific Impacts
Aviation: The aviation sector, already strained by high demand, faced additional challenges due to the disruption. This incident underscored the fragility of the global supply chain and the need for robust contingency plans.
Government and Public Services: Government agencies and public services experienced significant disruptions, highlighting the critical nature of cybersecurity in maintaining public infrastructure and services.
Recovery and Remediation
Incident Response and Initial Actions
CrowdStrike acted quickly after discovering the problem on July 19, 2024, to limit the impact of the faulty update that had caused widespread system crashes. The erroneous Channel File 291 update was reverted at 05:27 UTC, 78 minutes after its distribution. This prompt action was critical in preventing more systems from being affected. However, devices that had already downloaded the faulty update required manual intervention to restore normal operation.
Manual Recovery Process
CrowdStrike provided detailed remediation steps for the affected systems. The primary fix involved booting affected Windows hosts into Safe Mode or the Windows Recovery Environment. Administrators were told to go to the %WINDIR%\System32\drivers\CrowdStrike directory and delete the faulty file matching "C-00000291*.sys" with a timestamp of 04:09 UTC. After removing this file, the system could be rebooted normally. On BitLocker-encrypted systems, administrators had to supply the recovery key to access Safe Mode, which complicated the recovery process even further.
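The file-removal step can be sketched in a few lines of Python. This is an illustration only, not CrowdStrike's or Microsoft's tooling: it assumes a Python interpreter is available on the host, which is often not the case in Safe Mode or a recovery environment, and the function names and the C:\Windows fallback are hypothetical. It defaults to a dry run and prints each match with its modification time in UTC so the 04:09 timestamp can be checked by eye before anything is deleted.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

# Pattern quoted in the remediation guidance described above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def default_driver_dir():
    """Build %WINDIR%\\System32\\drivers\\CrowdStrike (the fallback path is an assumption)."""
    windir = os.environ.get("WINDIR", r"C:\Windows")
    return Path(windir) / "System32" / "drivers" / "CrowdStrike"


def find_channel_files(driver_dir):
    """Return (path, modification time in UTC) for each file matching the pattern."""
    found = []
    for path in sorted(Path(driver_dir).glob(CHANNEL_FILE_PATTERN)):
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        found.append((path, modified))
    return found


def remove_channel_files(driver_dir, dry_run=True):
    """Delete matching files and return their paths; with dry_run=True nothing is deleted."""
    paths = [path for path, _ in find_channel_files(driver_dir)]
    if not dry_run:
        for path in paths:
            path.unlink()
    return paths


if __name__ == "__main__":
    # List candidates only; deletion requires an explicit dry_run=False call.
    for path, modified in find_channel_files(default_driver_dir()):
        print(f"{modified:%Y-%m-%d %H:%M} UTC  {path}")
```

Calling remove_channel_files(default_driver_dir(), dry_run=False) performs the deletion. Keeping the dry run as the default is a deliberate choice here, since the directory also holds channel files that should not be touched.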
Automated Tools and External Support
To speed up recovery, Microsoft released a dedicated recovery tool to help IT administrators repair affected systems more quickly. The tool, distributed as a bootable USB solution, offered two primary options: recovery from the Windows Preinstallation Environment (WinPE) and recovery from Safe Mode. It was especially useful where the scale of the impact made host-by-host manual intervention impractical. CrowdStrike also worked with major cloud service providers such as AWS and Azure to provide tailored guidance and tools for virtual environments, so that cloud-hosted systems could be restored with minimal downtime.
Communication and Customer Support
CrowdStrike maintained open and consistent communication with its customers throughout the crisis. Updates were posted regularly on the CrowdStrike Support Portal and official blog, offering the most recent information and advice on remediation measures. CEO George Kurtz issued a public apology, underlining the company's commitment to resolving the problem and restoring customer systems as soon as possible. Customers were encouraged to communicate only through official channels to avoid exploitation by malicious actors.
Post-Incident Analysis and Future Prevention
Following the initial recovery efforts, CrowdStrike conducted a thorough root cause analysis to determine the underlying issues that led to the faulty update. The company has committed to improving its testing and validation methods to avoid similar incidents in the future. This includes more rigorous pre-release testing and better mechanisms for rolling back updates in the event of unforeseen problems. The incident demonstrated the need for strong quality assurance and comprehensive disaster recovery strategies to limit the impact of such disruptions.
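One common way to combine pre-release checks with rollback is a staged rollout, where an update reaches a small group of hosts first and is reverted if any of them becomes unhealthy. The sketch below illustrates that general idea only; it is not a description of CrowdStrike's actual release process, and every name in it (staged_rollout, RolloutResult, the host names, the health check) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: bool
    updated: List[str] = field(default_factory=list)      # hosts left on the new version
    rolled_back: List[str] = field(default_factory=list)  # hosts reverted, newest first


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Update one stage at a time; revert everything if any host in a stage is unhealthy."""
    updated: List[str] = []
    for stage in stages:
        for host in stage:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in stage):
            reverted = list(reversed(updated))
            for host in reverted:
                rollback(host)
            return RolloutResult(completed=False, rolled_back=reverted)
    return RolloutResult(completed=True, updated=updated)


# Hypothetical fleet: one canary, two pilot hosts, then everything else.
stages = [["canary-1"], ["pilot-1", "pilot-2"], ["fleet-1", "fleet-2"]]
versions = {host: "old" for stage in stages for host in stage}

result = staged_rollout(
    stages,
    apply_update=lambda host: versions.__setitem__(host, "new"),
    is_healthy=lambda host: host != "pilot-2",  # simulate a crash on one pilot host
    rollback=lambda host: versions.__setitem__(host, "old"),
)
print(result.completed, result.rolled_back)
```

In this simulated run the failure in the pilot stage stops the rollout before the remaining hosts are touched. That containment matters more than the rollback itself, because, as this incident showed, a host stuck in a boot loop may not be able to receive a reverted update at all.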