Digital Event Horizon
Cloudflare, a leading content delivery network (CDN) provider, recently faced an unexpected technical challenge when a sudden doubling in size of an important file set off a chain reaction that brought down several core services. In this article, we explore what led to the incident and how Cloudflare is taking measures to prevent similar issues in the future.
The outage was not the result of a large-scale cyberattack, but of an unforeseen permission change that doubled the size of a critical Bot Management feature file. When the expanded file exceeded a hard limit of 200 machine-learning features, the software that routes Cloudflare's traffic panicked and core services were overwhelmed. Cloudflare staff worked to stop the propagation of the erroneous feature file, restoring most services after several hours. The incident highlighted the importance of robust system design, continuous monitoring, and ongoing improvement.
The Cloudflare Outage: A Hyper-Scale DDoS or an Internal Fault?
When a widespread outage occurred on November 18th, 2025, Cloudflare initially suspected that it had been hit by a hyper-scale distributed denial-of-service (DDoS) attack. CEO Matthew Prince expressed his concern in an internal chat room, writing "I worry this is the big botnet flexing." Upon further investigation, however, it became clear that the issue was not a large-scale cyberattack but rather an unexpected doubling in size of an important file.
The Root Cause: An Unforeseen Permission Change
The root cause of the problem was a permission change made to one of Cloudflare's database systems. The change caused a database query to output duplicate entries into a feature file used by Cloudflare's Bot Management system, roughly doubling the file's size. The larger-than-expected feature file was then propagated across the entire network.
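To illustrate the failure mode, here is a minimal sketch, not Cloudflare's actual pipeline: if the job that exports feature metadata receives each row twice (for example, because a permission change exposes a second copy of the same table metadata) and nothing deduplicates the rows, the generated file silently doubles in size. The names `FeatureRow` and `export_feature_file` are hypothetical.

```rust
// Hypothetical sketch of how duplicate query results can double a generated file.
// FeatureRow and export_feature_file are illustrative names, not Cloudflare's code.

#[derive(Clone)]
struct FeatureRow {
    name: String,
}

/// Writes one line per row. If the upstream query suddenly returns each row
/// twice, the output doubles in size and no error is raised here.
fn export_feature_file(rows: &[FeatureRow]) -> String {
    rows.iter().map(|r| format!("{}\n", r.name)).collect()
}

fn main() {
    let base = vec![
        FeatureRow { name: "feature_a".into() },
        FeatureRow { name: "feature_b".into() },
    ];

    // Before the permission change: two entries.
    println!("before: {} bytes", export_feature_file(&base).len());

    // After the change: the same rows appear twice, so the file doubles.
    let duplicated: Vec<_> = base.iter().cloned().chain(base.iter().cloned()).collect();
    println!("after:  {} bytes", export_feature_file(&duplicated).len());
}
```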
Consequences: Software Failure and System Overwhelm
The software that routes traffic across the Cloudflare network reads this feature file to keep its Bot Management system up to date with evolving threats. That software, however, enforces a limit of 200 machine-learning features; when the bloated file exceeded this threshold, it triggered a panic in the software, and the number of 5xx HTTP error responses surged.
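A minimal sketch of why a hard feature limit can turn an oversized input file into a crash rather than a handled error: the 200-feature limit comes from the article, but the code below, including `MAX_FEATURES`, `load_features`, and the use of `.unwrap()`, is an illustrative assumption, not Cloudflare's implementation. Running it panics intentionally.

```rust
// Hypothetical sketch: a loader with a hard cap on machine-learning features.

const MAX_FEATURES: usize = 200;

/// Parses a feature file into a list, returning an error when the file carries
/// more features than the limit allows.
fn load_features(file_contents: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = file_contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

fn main() {
    // Simulate the doubled file: twice as many entries as the limit allows.
    let doubled_file: String = (0..MAX_FEATURES * 2)
        .map(|i| format!("feature_{i}\n"))
        .collect();

    // Calling .unwrap() on the error is what turns a bad input file into a
    // process-wide panic instead of a handled failure.
    let _features = load_features(&doubled_file).unwrap();
}
```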
Cloudflare's Recovery: A Story of Trial and Error
After discovering the issue, Cloudflare staff worked tirelessly to stop the propagation of the erroneous feature file and manually inserted a known good version into its distribution queue. They then forced a restart of their core proxy service. The process took several hours but eventually restored most services to normal.
The Outage's Aftermath: Lessons Learned
Cloudflare acknowledged that this incident was their worst outage since 2019, resulting in significant disruption to online services and users worldwide. CEO Matthew Prince emphasized the importance of learning from past experiences, stating that previous outages have "always led to us building new, more resilient systems."
Measures to Prevent Similar Incidents in the Future
Cloudflare is now taking steps to strengthen its system's integrity by:
1. Hardening the ingestion of Cloudflare-generated configuration files (a sketch of this idea follows the list).
2. Enabling more global kill switches for features.
3. Eliminating the ability for core dumps or other error reports to overwhelm system resources.
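As a sketch of what hardened ingestion might look like, and only as an assumption rather than Cloudflare's actual design: the consumer could validate a newly received configuration file and keep serving the last known-good version when validation fails, instead of panicking. The names `FeatureConfig`, `validate`, and `apply_or_keep` are hypothetical.

```rust
// Hypothetical sketch of hardened configuration ingestion: validate the new
// file and fall back to the last known-good version on failure.

const MAX_FEATURES: usize = 200;

struct FeatureConfig {
    features: Vec<String>,
}

/// Validates an incoming feature file against basic sanity limits.
fn validate(file_contents: &str) -> Result<FeatureConfig, String> {
    let features: Vec<String> = file_contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.to_string())
        .collect();

    if features.is_empty() {
        return Err("feature file is empty".into());
    }
    if features.len() > MAX_FEATURES {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(FeatureConfig { features })
}

/// Applies the new config only if it validates; otherwise keeps the current one.
fn apply_or_keep(current: FeatureConfig, incoming: &str) -> FeatureConfig {
    match validate(incoming) {
        Ok(new_config) => new_config,
        Err(reason) => {
            eprintln!("rejecting new feature file ({reason}); keeping last known-good");
            current
        }
    }
}

fn main() {
    let known_good = FeatureConfig { features: vec!["feature_a".into()] };

    // An oversized (doubled) file is rejected and the old config stays active.
    let oversized: String = (0..MAX_FEATURES * 2).map(|i| format!("f{i}\n")).collect();
    let active = apply_or_keep(known_good, &oversized);
    println!("active features: {}", active.features.len());
}
```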
A Lesson in System Reliability and Continuous Improvement
The recent incident highlights the importance of robust system design, continuous monitoring, and ongoing improvements. It serves as a reminder that even seemingly minor changes can have far-reaching consequences when it comes to large-scale systems like Cloudflare's core services.
In conclusion, this incident shows how an unforeseen permission change led to a widespread technical failure at one of the world's most critical CDN providers. By analyzing the incident and strengthening the reliability of its systems, Cloudflare aims to make such occurrences rarer in the future.
Related Information:
https://www.digitaleventhorizon.com/articles/A-Cloudflare-Conundrum-The-Unsettling-Story-of-a-Sudden-File-Doubling-that-Brings-Down-Several-Core-Services-deh.shtml
https://arstechnica.com/tech-policy/2025/11/cloudflare-broke-much-of-the-internet-with-a-corrupted-bot-management-file/
Published: Wed Nov 19 17:05:51 2025 by llama3.2 3B Q4_K_M