On October 20, 2025, at approximately 3:11 AM ET, Amazon Web Services experienced one of its most significant outages in recent history. For roughly 15 hours, a critical failure in the US-EAST-1 region (Northern Virginia) cascaded across the internet, taking down thousands of websites and applications that power our daily digital lives.
Snapchat went dark for 375 million daily users. Fortnite and Roblox became unplayable for millions of gamers. Ring doorbells stopped recording. McDonald’s mobile orders failed. United Airlines booking systems stuttered. Even the British government’s tax website (HMRC) became inaccessible.
The scale was staggering: Downdetector received over 6.5 million reports spanning more than 1,000 services globally. This wasn’t just an AWS problem—it exposed the fragility of our cloud-dependent world.
What Happened: The Technical Breakdown
Based on AWS status updates and monitoring data from affected customers, here’s what we know so far:
- Root Cause (severity: Critical): Error in AWS’s EC2 internal network subsystem responsible for monitoring network load balancer health. DNS resolution failures for DynamoDB API endpoints in US-EAST-1.
- Affected Services (severity: Critical): 14 AWS services including EC2, DynamoDB, SQS, Amazon Connect, Lambda, and S3. Cascading failures across dependent services.
- Duration (severity: High): Approximately 15 hours from first reports (3:11 AM ET) to full restoration. Partial recovery began after 8 hours.
- Geographic Scope (severity: High): Primary impact in US-EAST-1 (N. Virginia), but global services affected due to control plane dependencies and cross-region service dependencies.
The Chain Reaction
The failure originated in a subsystem responsible for monitoring the health of network load balancers in EC2. When this subsystem failed, it triggered a cascade:
- Network Load Balancer monitoring failed → Load balancers couldn’t properly route traffic
- DynamoDB API endpoints became unreachable → DNS resolution failures prevented connections
- Dependent services cascaded into failure → SQS, Lambda, S3, and others couldn’t function without DynamoDB
- Global services impacted → Even services in other regions failed due to control plane dependencies in US-EAST-1
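The chain reaction above can be read as a walk over a dependency graph: once one component fails, everything that transitively depends on it is at risk. Here is a minimal Python sketch of that idea. The dependency map and service names are simplified, hypothetical assumptions for illustration, not AWS’s actual internal architecture.

```python
from collections import deque

# Hypothetical, simplified dependency map: service -> services it depends on.
# The edges are illustrative assumptions, not AWS's real internal architecture.
DEPENDS_ON = {
    "dynamodb-api": ["dns"],
    "sqs": ["dynamodb-api"],
    "lambda": ["dynamodb-api", "sqs"],
    "checkout-app": ["lambda"],
    "static-site": [],
}


def blast_radius(depends_on, failed):
    """Return every service that transitively depends on the failed component."""
    # Invert the map so we can walk from a failure to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(blast_radius(DEPENDS_ON, "dns")))
```

In this toy map, a DNS failure reaches every service except the one with no dependencies, which is the pattern the outage followed at much larger scale.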
The Global Impact
The outage didn’t just affect tech companies—it rippled through every sector of the digital economy:
- Snapchat (Social Media): 375M daily users
- Roblox (Gaming): 70M+ daily users
- Fortnite (Gaming): Millions affected
- Ring (Smart Home): Global disruption
- McDonald’s App (Food & Retail): Order systems down
- United Airlines (Travel): Booking systems affected
- Robinhood (Finance): Trading disrupted
- Bank of Scotland (Banking): Service interruptions
Why Did Global Services Fail from a Single Region?
This is the critical question. Many affected services were deployed across multiple regions. So why did they still fail?
1. Control Plane Dependencies: AWS’s global services (IAM, Route 53, CloudFront) run critical control plane infrastructure in US-EAST-1. Even healthy regions couldn’t perform certain operations.
2. Single-Region Data Stores: Many multi-region applications kept their primary databases (DynamoDB, RDS) in US-EAST-1 only, making them single points of failure.
3. Configuration and Secrets: Applications in healthy regions couldn’t start or scale because they relied on AWS Secrets Manager or Parameter Store in US-EAST-1.
4. Async Processing Bottlenecks: SQS queues and Lambda functions were often centralized in US-EAST-1 for cost optimization, creating hidden dependencies.
What Engineering Teams Should Learn
This outage is a masterclass in distributed systems failure modes. Here are the critical lessons:
1. Multi-Region Architecture is Non-Negotiable
Services running only in US-EAST-1 had zero availability. Multi-region deployments with active-active or active-passive failover could have maintained partial service.
Action: Design for multi-region from day one, even if it seems expensive. The cost of downtime far exceeds infrastructure costs.
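As a rough illustration of active-passive failover at the client level, the sketch below tries an operation against an ordered list of regions and returns the first success. The `fetch_profile` function is a hypothetical stand-in for a real regional API call, and the sketch assumes the data it needs is already replicated to the secondary region.

```python
def call_with_failover(operation, regions):
    """Try operation(region) in order; return (region, result) from the first success."""
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except Exception as exc:  # broad on purpose for the sketch; narrow this in real code
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")


# Hypothetical stand-in for a real regional API call.
def fetch_profile(region):
    if region == "us-east-1":
        raise ConnectionError("endpoint unreachable")
    return {"user": "demo", "served_by": region}


region, profile = call_with_failover(fetch_profile, ["us-east-1", "us-west-2"])
print(region, profile)
```

Client-side retry like this only helps if the secondary region can actually serve the request, which is why data replication and the remaining lessons matter.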
2. Don’t Put All Dependencies in One Region
Many services with multi-region deployments still failed because their DynamoDB databases, SQS queues, or Lambda functions were only in US-EAST-1.
Action: Map ALL dependencies. Ensure critical data stores and async processing exist in multiple regions with replication.
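One way to start mapping dependencies is to scan the endpoints your application is configured with and check whether they all point at the same region. The sketch below is a starting point under stated assumptions: the `ENDPOINTS` config is hypothetical, and the regex only recognizes standard regional hostnames of the form `service.region.amazonaws.com`. It will not catch SDK defaults or transitive dependencies.

```python
import re
from collections import Counter

# Matches regional AWS endpoints such as dynamodb.us-east-1.amazonaws.com
REGION_RE = re.compile(r"\.([a-z]{2}(?:-[a-z]+)+-\d)\.amazonaws\.com")


def regions_by_dependency(endpoints):
    """Map each dependency name to the region in its endpoint (None if not regional)."""
    found = {}
    for name, url in endpoints.items():
        match = REGION_RE.search(url)
        found[name] = match.group(1) if match else None
    return found


def single_region_risk(endpoints):
    """Return the region if every regional dependency sits in the same one, else None."""
    regions = Counter(r for r in regions_by_dependency(endpoints).values() if r)
    return next(iter(regions)) if len(regions) == 1 else None


# Hypothetical application config.
ENDPOINTS = {
    "orders_table": "https://dynamodb.us-east-1.amazonaws.com",
    "job_queue": "https://sqs.us-east-1.amazonaws.com/123456789012/jobs",
    "secrets": "https://secretsmanager.us-east-1.amazonaws.com",
}

print(single_region_risk(ENDPOINTS))
```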
3. Test Your Disaster Recovery Plan
Having a DR plan on paper is worthless if you’ve never actually failed over. The outage exposed companies with untested recovery procedures.
Action: Run quarterly chaos engineering exercises. Kill US-EAST-1 deliberately and measure your actual recovery time.
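Measuring actual recovery time can be as simple as injecting a failure and timing how long the health check takes to pass again. The harness below is a minimal sketch. `inject_failure` and `is_healthy` are hypothetical hooks you would supply, and the demo uses a simulated clock so it runs instantly.

```python
import time


def measure_recovery(inject_failure, is_healthy, timeout_s=900.0, poll_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Inject a failure, then poll until healthy.

    Returns the seconds elapsed until recovery, or None if the timeout is hit.
    """
    inject_failure()
    start = clock()
    while clock() - start < timeout_s:
        if is_healthy():
            return clock() - start
        sleep(poll_s)
    return None


# Simulated drill: the service becomes healthy on the third health check.
state = {"checks": 0, "now": 0.0}


def fake_sleep(seconds):
    state["now"] += seconds


def fake_health_check():
    state["checks"] += 1
    return state["checks"] >= 3


rto = measure_recovery(
    inject_failure=lambda: None,  # a real drill would cut traffic to the primary region here
    is_healthy=fake_health_check,
    clock=lambda: state["now"],
    sleep=fake_sleep,
)
print(rto)
```

Recording this number on every drill gives you a measured recovery time to compare against the target in your DR plan.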
4. Control Plane vs Data Plane Awareness
Even services in healthy regions failed because some AWS control plane operations (for example, changes to Route 53 DNS records, IAM, or CloudFront) depended on US-EAST-1 infrastructure.
Action: Understand which AWS services have regional vs global control planes. Design systems to operate during control plane outages.
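One common way to keep running through a control plane outage is to serve the last known good value when a refresh fails, a pattern sometimes called static stability. The class below is a minimal sketch of that idea. `LastKnownGood` and `flaky_fetch` are hypothetical names, and a real implementation would also need to decide how stale is too stale.

```python
import time


class LastKnownGood:
    """Serve a cached value when the upstream fetch fails."""

    def __init__(self, fetch, ttl_s=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        fresh = (self._fetched_at is not None
                 and self._clock() - self._fetched_at < self._ttl_s)
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        except Exception:
            # Upstream (e.g. a config or secrets API) is down: keep the stale value.
            if self._fetched_at is None:
                raise  # nothing cached yet, so there is no safe fallback
        return self._value


# Hypothetical fetcher that succeeds once, then fails as if the control plane were down.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("control plane unavailable")
    return {"db_host": "replica.internal"}


config = LastKnownGood(flaky_fetch, ttl_s=0.0)  # ttl of 0 forces a refetch on every call
print(config.get())
print(config.get())
```

The second call hits the failing fetcher but still returns the cached configuration, so the data path keeps working while the refresh path is broken.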
5. Monitoring and Observability Must Be External
Many companies couldn’t access their own monitoring dashboards because they were hosted on AWS infrastructure that was down.
Action: Use external observability tools (Datadog, New Relic, external status pages) that don’t depend on your primary cloud provider.
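An external check does not have to be elaborate. The sketch below is a minimal HTTP probe using only the Python standard library, meant to run from infrastructure outside your primary cloud provider. The health-check URL in the comment is a hypothetical placeholder, and the `opener` parameter exists only so the probe can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # DNS failures, refused connections, timeouts, and HTTP errors all count as down.
        return False


# Example call (hypothetical URL), run from a host outside your primary cloud:
# probe("https://status.example.com/healthz")
```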
6. Communication Plans for Extended Outages
Companies struggled to communicate with customers during the outage because their status pages, email systems, and notification services were down.
Action: Maintain status pages and communication channels on separate infrastructure (different cloud provider or on-prem).
Why US-EAST-1 Outages Are Especially Catastrophic
US-EAST-1 (Northern Virginia) isn’t just another AWS region—it’s special:
- Oldest AWS Region: Launched in 2006, it has the most mature services, and new features often launch here first.
- Default Region: Many AWS services default to US-EAST-1 in SDKs and the console, leading to accidental dependencies.
- Control Plane Hub: Global AWS services (CloudFront, Route 53, IAM) have critical infrastructure here.
- Largest Deployment: Estimated to host 30-40% of all AWS workloads globally.
- Cost Optimized: Lowest pricing, incentivizing companies to centralize here despite risks.
This combination makes US-EAST-1 outages uniquely impactful. When this region fails, the internet notices.
What Happens Next?
AWS will publish a detailed post-mortem report in the coming days or weeks. These reports typically include:
- Precise timeline of events down to the minute
- Root cause analysis with technical depth
- Why detection and mitigation took so long
- What corrective actions AWS is implementing
- How they’ll prevent similar failures
We’ll update this article once AWS releases their official incident report. In the meantime, engineering teams should be reviewing their own architectures for similar vulnerabilities.
Questions AWS Needs to Answer
- Why did a network load balancer monitoring subsystem have such broad cascading impact?
- Why couldn’t the issue be detected and isolated faster?
- Why did services in other regions experience control plane failures?
- What redundancy existed (or didn’t exist) for this critical subsystem?
- How will AWS improve blast radius isolation for future incidents?
The Bigger Picture: Cloud Dependency Risk
This outage exposes a fundamental tension in modern software architecture:
Cloud providers publish availability targets as high as five nines (99.999% uptime) for some services and configurations, but achieving that in practice requires architectural discipline that most companies don’t implement. Multi-region deployments are expensive and complex. Many startups and even mature companies accept the risk of single-region deployment to move faster and reduce costs.
The result? We’ve created a world where a networking issue in a single region in Northern Virginia can disable critical services globally, from Ring cameras to banking apps to government tax systems.
As an industry, we need to have honest conversations about acceptable risk, true cost of downtime, and realistic expectations for cloud reliability.
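To ground that conversation, it helps to turn availability percentages into a yearly downtime budget and compare the budget with a roughly 15-hour outage. The short calculation below is a sketch that ignores leap years and assumes the outage is the only downtime in the year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_minutes(availability_pct):
    """Yearly downtime budget, in minutes, for a given availability target."""
    return HOURS_PER_YEAR * 60 * (1 - availability_pct / 100)


def availability_after_outage(outage_hours):
    """Best-case yearly availability (%) if this outage were the only downtime."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
print(f"15 h outage -> {availability_after_outage(15):.2f}% at best")
```

A 99.999% target allows about 5.3 minutes of downtime per year, while a single 15-hour outage caps the year at about 99.83% for any workload that was fully down for the duration.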
Final Thoughts
The October 20, 2025 AWS outage will likely be studied in computer science courses for years to come. It’s a stark reminder that even the world’s most sophisticated cloud infrastructure has failure modes we don’t fully understand until they manifest.
For engineering teams, this incident should trigger honest architectural reviews. Ask yourself:
- Could our service survive a US-EAST-1 outage?
- Have we actually tested our disaster recovery procedures?
- Do we have dependencies we don’t know about?
- Is our monitoring independent of our infrastructure?
The best time to fix these issues is before the next major outage—not during it.
This article will be updated as new information becomes available and when AWS publishes their official post-mortem report. Last updated: October 21, 2025.