In less than a month, two of the internet’s core infrastructure providers—Amazon Web Services and Cloudflare—suffered major outages that rippled across the global economy.

- On October 20, 2025, AWS’s US-EAST-1 region experienced a disruption lasting roughly 14–15 hours, triggered by a latent race condition in DynamoDB’s DNS management system. The defect wiped out DNS records for critical endpoints, cascading into failures across dozens of AWS services and the applications that depend on them.

- On November 18, 2025, Cloudflare’s Bot Management system pushed a configuration change that doubled the size of a “feature file.” That oversized file exceeded a hard-coded limit in the traffic proxy software, causing processes to crash and restart repeatedly across their global network. Prominent services, including X, ChatGPT, Canva, and many others, saw widespread 5xx errors and elevated latency.
Both events followed a similar pattern: a subtle defect in a single subsystem triggered global, cascading failures. The AWS incident began with packet loss at edge nodes, followed by DNS failures that prevented DynamoDB endpoint resolution; dependent services and health checks kept failing for over 15 hours even though the DNS fix was applied early in the outage. The Cloudflare outage started when a database permission change produced errant data in a configuration file used by the Bot Management system. That abnormal file propagated rapidly across Cloudflare’s global network, triggering core proxy restarts and repeated 5xx server errors. Mitigations such as bypassing the proxy for certain services reduced the impact temporarily before a complete rollback resolved the incident. In both cases, a failure inside a single subsystem cascaded globally, impacting thousands of services that rely on these cloud providers.
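The Cloudflare chain in particular is easy to reason about in code. The sketch below is purely illustrative (it is not Cloudflare’s actual software; the limit, file format, and function names are assumptions): it contrasts a loader that treats a hard-coded capacity as a fatal check, turning an oversized file into a crash loop, with one that rejects the bad file and keeps serving from the last known-good configuration.

```python
# Illustrative sketch only; not Cloudflare's or AWS's actual code.
# MAX_FEATURES, the file format, and both functions are hypothetical.

MAX_FEATURES = 200  # hypothetical capacity baked into the proxy at build time

def load_feature_file_fragile(path: str) -> list[str]:
    """Fatal on oversized input: every process that loads the file dies,
    and its supervisor restarts it straight into the same failure."""
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    assert len(features) <= MAX_FEATURES, "feature file exceeds capacity"
    return features

def load_feature_file_resilient(path: str, last_known_good: list[str]) -> list[str]:
    """Graceful on oversized input: reject the bad file and keep serving
    with the previous configuration instead of crashing."""
    try:
        with open(path) as f:
            features = [line.strip() for line in f if line.strip()]
        if len(features) > MAX_FEATURES:
            raise ValueError(f"{len(features)} features > limit of {MAX_FEATURES}")
        return features
    except (OSError, ValueError) as err:
        print(f"config rejected ({err}); continuing with last known-good set")
        return last_known_good
```

The difference is not the limit itself but what happens when the limit is hit: a process that dies on bad input exports the failure to everything downstream, while one that degrades gracefully contains it.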
From these events, there are three key lessons for enterprises relying on cloud and network services:
**Design Out Single Points of Dependency**
The recent AWS and Cloudflare outages underscore the critical risk of over-reliance on a single cloud provider, region, or infrastructure component. While hyperscaler outages are becoming more frequent due to increasing complexity, from network intricacies to software misconfigurations, simultaneous outages across multiple providers remain rare. This trend makes adopting a multi-cloud approach essential for failover and resiliency. Moreover, both outages showed how cascading failures can propagate widely when interdependent services share a common provider or region, intensifying the impact of a single error.
Enterprises need to architect for diversity and isolation:
- Multi-region by default for critical services, with active-active or warm standby failover – not just “we have backups somewhere.”

- Multi-cloud and multi-provider paths for DNS, connectivity, and security services so a single provider’s bad day doesn’t become your bad week.

- Segmentation and blast-radius control: separate control, data, and management planes; avoid patterns where one region, cluster, or file can take everything down.
The goal isn’t zero outages – that’s unrealistic. The goal is that any single failure is locally painful, not globally catastrophic.
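As a concrete, deliberately simplified illustration of multi-provider paths, here is a minimal Python sketch assuming two interchangeable endpoints in different regions or clouds. The URLs, timeout, and the idea of doing the selection in application code are assumptions for illustration; production setups usually drive this from DNS health checks or a global load balancer.

```python
"""Minimal failover sketch (illustrative only).

Assumes two interchangeable endpoints for the same service: a primary
region/provider and a standby elsewhere. All URLs and thresholds are
hypothetical.
"""
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://api.primary-region.example.com/healthz",    # hypothetical primary
    "https://api.standby-provider.example.net/healthz",  # hypothetical standby
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any 2xx health-check response within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def pick_endpoint() -> str:
    """Return the first healthy endpoint, preferring the primary.

    If nothing is healthy, fall back to the primary rather than failing hard,
    so callers degrade instead of refusing to try at all."""
    for url in ENDPOINTS:
        if is_healthy(url):
            return url
    return ENDPOINTS[0]

if __name__ == "__main__":
    print("routing traffic via:", pick_endpoint())
```

However the decision is implemented, the principle is the same: the failover path must already exist, be exercised regularly, and not share a failure domain with the primary.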
**Prioritize Real-Time, AI-Powered Monitoring. Use AI to Help With Remediation**
Detecting true failures quickly among noisy alerts remains a challenge for hyperscalers and enterprises alike. Both the AWS and Cloudflare incidents illustrated how error signals can be numerous and confusing during outages. Leveraging AI-driven monitoring that correlates data, filters false positives, and triggers automated rollback or failover reduces downtime significantly.
For enterprises, the takeaway is this: Monitoring without an automated, policy-driven response is just an expensive alerting system.
Modern architectures should:
- Use AI/ML to correlate signals across regions, providers, and services – filter noise, surface causal relationships, and highlight where a problem truly originates.

- Codify safe, automated actions or use AI to accelerate remediation: roll back a bad configuration, route around a failing region, or disable risky feature flags.

- Guardrails first, automation second: implement checks on configuration size, cardinality, and blast radius before rollout, especially for systems that touch global routing or security (see the sketch below).
**Regularly Practice and Test Disaster Recovery Plans**
Both outages highlighted something every SRE already knows: you do not rise to the level of your runbook. You fall to the level of your practice.
Disaster recovery drills, tabletop exercises, and failover tests build team readiness, reveal gaps, and confirm that failover systems actually meet recovery objectives. Such preparation enables smoother, calmer responses during prolonged outages with phased, cascading failures like those seen at AWS and Cloudflare. Effective disaster recovery testing should cover a variety of scenarios, including technical failures, cyberattacks, and human error, to ensure comprehensive preparedness. Tests must be realistic and frequent, with detailed documentation and post-test evaluations that continuously update strategies as technologies and threats evolve. Use AI to help pressure-test your plan and simulate potential risks and impacts.
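One way to keep such drills repeatable is to script them. The sketch below is a hypothetical game-day harness: inject_failure, restore, and probe stand in for whatever failure-injection and health-check tooling you actually use, and the 300-second RTO is an assumed objective, not a recommendation.

```python
"""Game-day drill harness (illustrative sketch).

`inject_failure`, `restore`, and `probe` are placeholders for your own
failure-injection and health-check tooling; the 300-second RTO is assumed.
"""
import time

RTO_SECONDS = 300  # hypothetical recovery time objective for this service

def run_drill(inject_failure, restore, probe,
              poll_interval: float = 5.0, timeout: float = 1800.0) -> float:
    """Inject a failure, wait until probe() reports healthy again, and return
    how many seconds recovery took, compared against the stated RTO."""
    inject_failure()
    start = time.monotonic()
    try:
        while time.monotonic() - start < timeout:
            if probe():
                recovered = time.monotonic() - start
                verdict = "within" if recovered <= RTO_SECONDS else "OVER"
                print(f"recovered in {recovered:.0f}s ({verdict} RTO of {RTO_SECONDS}s)")
                return recovered
            time.sleep(poll_interval)
        raise TimeoutError("service did not recover within the drill window")
    finally:
        restore()  # always put the environment back, even if the drill fails

# Example wiring (all names hypothetical):
# run_drill(inject_failure=lambda: block_traffic_to("us-east-1"),
#           restore=lambda: unblock_traffic_to("us-east-1"),
#           probe=lambda: http_health_check("https://app.example.com/healthz"))
```

Recording these numbers drill after drill turns “we have a DR plan” into evidence that the plan actually meets its objectives.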
**An Industry Wake-Up Call**
These lessons point to a broader architectural priority. The complexity surface keeps expanding with every new region, service, and feature, multiplying the ways a small change can cascade into global impact. The sustainable answer is to design for failure: limit blast radius, decouple control, data, and management planes, automate validation and rollback, and avoid single points of dependency, whether that is one cloud, one network provider, or one control plane. The same approach serves AI-era infrastructure needs, where workloads require low-latency, highly available connections between users, data, and GPU clusters. Networks must be cloud-agnostic, multi-region, policy-driven, and able to reroute, fail over, or isolate issues in real time when a provider has a “bad day.”
The AWS and Cloudflare outages should serve as a wake-up call to all enterprises. No cloud or network provider is immune to failure. Preparation and architectural resilience are essential to minimize disruption and ensure your business keeps running when the next service disruption strikes.
