Cloudflare's Code Orange: How 'Fail Small' Built a Stronger Network

Over the past two quarters, Cloudflare undertook a major engineering initiative called Code Orange: Fail Small. The goal was to make the network more resilient, secure, and reliable after two global outages in November and December 2025. The project is now complete. It focused on making configuration changes safer, reducing the impact of failures, improving break-glass procedures, and enhancing customer communication during incidents. Below, we answer common questions about what changed and what it means for you.

What exactly was Code Orange: Fail Small?

Code Orange: Fail Small was an intensive engineering effort spanning roughly six months. Its mission was to prevent outages like those on November 18 and December 5, 2025, from happening again. The team redesigned how configuration changes are deployed, introduced health-monitoring tools, tightened incident response, and built new systems to catch problems before they impact customers. While resilience is an ongoing priority, this specific set of improvements is now complete and has already strengthened the network.

Source: blog.cloudflare.com

How did Cloudflare make configuration changes safer?

Previously, internal configuration changes could reach the entire network instantly. Cloudflare has now identified its high-risk configuration pipelines and moved their deployments to a health-mediated methodology: changes are rolled out gradually with real-time health monitoring, and if a problem is detected, the system automatically rolls back before affecting customer traffic. This approach was already used for software releases. It now applies to all configuration deployments by the product teams that were affected by the past outages.
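To make the gradual rollout concrete, here is a minimal Python sketch of a staged deployment that reverts itself when a health check fails. It is an illustration only, not Cloudflare's implementation: the stage fractions, the function name staged_rollout, and the apply, revert, and healthy callbacks are all assumptions made for this example.

```python
from typing import Callable, Sequence

# Hypothetical stage sizes: the fraction of the fleet that has the change after each step.
STAGES = (0.01, 0.10, 0.50, 1.00)


def staged_rollout(
    servers: Sequence[str],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    healthy: Callable[[Sequence[str]], bool],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Apply a change stage by stage; revert every updated server if a check fails.

    Returns True if the change reached every server, False if it was rolled back.
    """
    updated: list[str] = []
    for fraction in stages:
        # Number of servers that should have the change once this stage is done.
        target = min(len(servers), max(1, round(len(servers) * fraction)))
        for server in servers[len(updated):target]:
            apply(server)
            updated.append(server)
        # Check health after each stage, before widening the rollout.
        if not healthy(updated):
            for server in reversed(updated):
                revert(server)
            return False
    return True
```

With 100 servers and these stages, a change that turns unhealthy at the second stage touches 10 servers and is reverted on all of them. The other 90 never see it, which is the "fail small" outcome.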

What is Snapstone and why is it important?

Snapstone is a new internal component that brings health-mediated deployment to configuration changes. It bundles a configuration change into a package and then releases it gradually, using health checks to decide whether to continue or roll back. Before Snapstone, each team had to build its own rollout mechanism, so gradual rollout was applied inconsistently. Snapstone provides a unified, flexible system that works with any type of configuration, whether a data file like the one that caused the November outage or a control flag like the one involved in the December incident. This makes the entire network more resilient to human error and unexpected side effects.
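The article does not describe Snapstone's internals, so the following Python sketch is purely hypothetical. It shows one way a single package type could carry either a data file or a control flag, together with the previous value needed to undo it. ConfigPackage, make_package, verify, and rollback_of are invented names for illustration, not Snapstone's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigPackage:
    """Hypothetical unit of rollout: one change plus what is needed to undo it."""
    name: str
    kind: str              # "data_file" or "flag" (illustrative categories)
    new_value: bytes
    previous_value: bytes
    digest: str            # SHA-256 of new_value, to detect a corrupted payload


def make_package(name: str, kind: str, new_value: bytes, previous_value: bytes) -> ConfigPackage:
    """Bundle a change into an immutable package with an integrity digest."""
    if kind not in ("data_file", "flag"):
        raise ValueError(f"unknown config kind: {kind}")
    return ConfigPackage(name, kind, new_value, previous_value,
                         hashlib.sha256(new_value).hexdigest())


def verify(package: ConfigPackage) -> bool:
    """Return True if the payload still matches the digest recorded at build time."""
    return hashlib.sha256(package.new_value).hexdigest() == package.digest


def rollback_of(package: ConfigPackage) -> ConfigPackage:
    """Build the inverse package, which restores the previous value."""
    return make_package(package.name, package.kind,
                        package.previous_value, package.new_value)
```

Treating data files and flags as the same kind of object means the same rollout and rollback path can serve both, instead of each team writing its own.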

How does health-mediated deployment actually prevent outages?

Health-mediated deployment works like a safety net. When a configuration change is rolled out to a small subset of servers, observability tools monitor for anomalies—such as increased error rates or latency spikes. If something looks wrong, the deployment is halted and automatically reversed. This ensures that a bad change never reaches the full network. It's the same principle used in modern software rollouts: catch issues early, fail small, and recover fast. For customers, this means fewer disruptions and more consistent performance.
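As a rough sketch of the "does something look wrong?" decision, the Python snippet below compares metrics from the servers that received the change against a baseline from servers that did not. The Metrics fields, the function name looks_healthy, and both thresholds are assumptions chosen for illustration. The article does not say which signals or limits Cloudflare uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of requests that failed, 0.0 to 1.0
    p99_latency_ms: float   # 99th-percentile latency in milliseconds


def looks_healthy(
    baseline: Metrics,
    canary: Metrics,
    max_error_increase: float = 0.005,   # hypothetical: at most +0.5 points of errors
    max_latency_ratio: float = 1.25,     # hypothetical: at most 25% slower at p99
) -> bool:
    """Return True if the canary group is not meaningfully worse than the baseline."""
    error_ok = canary.error_rate - baseline.error_rate <= max_error_increase
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok
```

Comparing against a live baseline rather than a fixed limit keeps the check meaningful when overall traffic conditions shift for reasons unrelated to the change.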

What other improvements were made besides configuration changes?

The project also revised break-glass procedures (the emergency access methods used during severe incidents) to reduce the risk of human error. Incident management processes were updated to improve coordination and decision-making. Additionally, the team introduced measures to prevent drift and regressions over time, so that new code or configuration doesn't reintroduce old vulnerabilities. Finally, the team strengthened how Cloudflare communicates with customers during outages, providing clearer, faster updates so you know what's happening and what's being done.

What does this mean for Cloudflare customers?

For most customers, the improvements are invisible but impactful. Configuration changes that previously could have caused incidents are now deployed more carefully. The network is more resilient to single points of failure, and if something does go wrong, the blast radius is smaller. You'll also receive better communication during any future incidents. In short, your traffic is more likely to stay online, and if there is an issue, it will be resolved faster with less disruption. Cloudflare's network is stronger today than before Code Orange.

Is this project truly finished, or will there be more work?

The specific engineering effort under Code Orange: Fail Small is complete. However, resilience work is never finished; it's a continuous priority. Cloudflare has integrated the lessons and tools from this project into its standard development lifecycle. New teams and products will adopt health-mediated deployment by default. Snapstone and the new procedures will evolve as the network grows. So while this chapter is closed, the commitment to reliability continues, with the same fail-small mindset applied to future challenges.
