On Tuesday, October 20, 2025, a major disruption ripped through AWS’s us-east-1 region, causing widespread outages across many internet services.
According to AWS, the issue was not a malicious attack or a mere hardware failure. Instead, it stemmed from DNS-resolution failures affecting the DynamoDB API endpoint in the us-east-1 region. The outage spanned several hours, with initial service recovery by mid-morning US ET, though backlog impacts continued for longer. It disrupted many AWS services and platforms worldwide. For engineers and technical leads, this event served as a stark reminder that even the most robust systems have breaking points.
The incident showed just how dependent each cloud component is on the others. A failure in one region’s DNS system rippled through both regional data-plane workloads and the global control-plane services anchored in us-east-1.
This newsletter dissects the anatomy of the us-east-1 outage to extract critical lessons for building more resilient, failure-tolerant systems, focusing on the following areas:
The technical root cause and its impact scope.
Architectural patterns that failed and those that could have helped.
Practical strategies for mitigating regional dependencies.
A comparative look at resilience across major cloud providers.
To understand how a single DNS failure triggered such widespread disruption, we need to look at how the incident unfolded. The timeline shows how quickly a localized issue escalated into a global service degradation.
The incident escalated quickly. What began as isolated errors turned into a region-wide disruption in under an hour. In us-east-1, the progression unfolded as follows (all times are in Pacific Daylight Time, PDT).
12:11 AM: Investigation begins into increased error rates and latencies in us-east-1.
12:51–1:26 AM: Widespread degradation is confirmed across core services (DynamoDB, EC2, Lambda, IAM).
2:01 AM: The root cause is identified: DNS resolution failures affecting the DynamoDB API endpoint.
2:22–2:27 AM: Initial mitigation starts. Partial recovery begins, but retry storms create backlogs in various services such as Lambda and CloudTrail.
3:03–3:35 AM: Global control-plane services (IAM, STS) start to recover; throttling persists as backlogs are processed.
4:15 AM: The core DNS issue is fully mitigated. Most operations are stable, but backlogs continue to affect some workloads.
4:48–5:10 AM: Recovery efforts focus on clearing queues. Lambda/SQS processing resumes and EC2-dependent services regain capacity.
The timeline below visualizes the escalation and recovery phases, showing how the outage evolved from early detection to full mitigation:
This timeline shows how a failure in a foundational service like DNS can quickly initiate a chain reaction across dozens of dependent systems within an hour. The next step is understanding how broadly that impact spreads across services, workloads, and regions.
The us-east-1 outage was not contained within its regional boundaries. Within minutes, effects appeared globally. One contributing factor is that us-east-1 hosts many AWS global control-plane services (e.g., IAM, STS), so disruptions there can extend beyond a single region. When the region faltered, these global management capabilities degraded, disrupting account operations and API access even in otherwise unaffected regions.
Many core AWS services, including EC2, S3, DynamoDB, Lambda, RDS, and ECS, were impacted or experienced elevated error rates. The impact surfaced in multiple forms, as described below.
Elevated error rates: API requests for provisioning, describing, and modifying resources frequently failed.
Increased latency: Successful calls experienced significant delays due to retries and internal timeouts.
Launch failures: Attempts to start new EC2 instances, Lambda functions, or ECS tasks failed because control-plane services could not resolve internal endpoints through DNS.
The us-east-1 region (Northern Virginia) was AWS’s first region. Over time, many global AWS services and legacy workloads became anchored there, increasing dependencies that only became obvious during the failure.
The outage also demonstrated how control-plane dependencies can break regional isolation. Operations such as IAM authentication or AWS CLI commands failed globally, since these rely on services located in us-east-1. This revealed a systemic truth: a regional data-plane failure can easily escalate into a global control-plane failure when shared discovery and identity layers are involved.
The following diagram illustrates how the DNS outage cascaded through dependent layers, turning a single service fault into a multi-tier disruption:
This widespread impact was a direct result of the outage’s technical origin, which reveals deep-seated dependencies within AWS’s own architecture.
At its core, the outage was caused by a systemic failure within the DNS resolution system internal to the us-east-1 region. In a cloud environment, DNS is how services discover one another: every call to an endpoint such as the DynamoDB API first requires resolving its hostname to an IP address. When that resolution fails, dependent services cannot reach their targets, even if those targets are otherwise healthy.
This failure affected both control plane and data plane operations:
On the control plane, services like EC2 could not resolve the endpoints of their own dependencies, preventing the launch of new instances or scaling operations.
On the data plane, existing workloads running on EC2 instances failed when attempting to reach other AWS services such as DynamoDB or S3.
This may have led to a retry-amplification loop: applications issuing automatic retries increased load on already degraded DNS resolution services. These repeated attempts further overwhelmed dependent services, compounding the disruption across the system. The resulting self-sustaining failure storm persisted until throttling and mitigation efforts eventually stabilized operations.
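A standard client-side defense against this failure mode is to bound retries and spread them out. Below is a minimal sketch, assuming a generic Python callable rather than any specific AWS SDK, of a retry wrapper with capped exponential backoff and jitter, so a client backing off from a degraded dependency does not itself become part of the retry storm:

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Invoke `operation`, retrying transient failures with capped, jittered
    exponential backoff so a degraded dependency is not hammered by retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):  # treated as transient here
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Exponential backoff with "full jitter": sleep a random amount
            # between 0 and the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))


# Example usage with a hypothetical operation that may fail transiently:
# result = call_with_backoff(lambda: fetch_user_record("user-123"))
```

The jitter is what matters most at scale: without it, thousands of clients retry on the same schedule and hit the recovering service in synchronized waves.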
Historical context: This incident echoes the 2017 S3 outage in us-east-1, which was caused by human error during a debugging session on the S3 billing subsystem. Both outages exposed how a failure in a seemingly isolated or internal system within us-east-1 could have a massive disruption footprint due to critical service dependencies.
The layered diagram below visualizes how the DNS subsystem failure propagated upward through AWS’s infrastructure layers:
The deep integration of DNS highlights a fundamental challenge in distributed systems, forcing us to reconsider how we manage such critical dependencies.
The us-east-1 outage serves as a clear reminder of the risks created by implicit, centralized dependencies. Although AWS markets regional isolation as a resilience feature, this event showed that foundational global services (such as DNS and centralized control-plane endpoints) can still create single points of failure. For system designers, the lesson is simple: a dependency on a shared regional service is a dependency on the entire region’s stability.
The control plane experienced the most critical impact. Many resilience strategies, such as auto scaling and automated failover, depend on a functioning control plane. When that layer fails, a system’s ability to self-heal disappears. You cannot launch replacement instances or redistribute load if the EC2 API or IAM token service is unreachable. This exposes a critical weakness. Recovery mechanisms are often built on the very infrastructure that is failing.
Architectural insight: True resilience means designing systems that can continue operating even when their self-healing mechanisms are unavailable. Auto scaling, service discovery, and failover must have fallback modes independent of the primary control plane.
The outage also highlighted how cascading effects can amplify their impact. DNS failures prevented endpoint resolution, which triggered retry storms and throttling across dependent services. What began as a localized fault in service discovery quickly expanded into a region-wide performance collapse.
The dependency chain is easier to see in a layered view. The diagram below shows how a DNS fault disables control plane recovery and then reaches the application layer:
This forces architects to ask tough questions about their own designs:
Does the application rely on control-plane actions during a failure scenario?
Can the application continue in a degraded state if it cannot reach service endpoints?
How does the application cache DNS records, and how long do they persist?
The incident makes one point clear. DNS is not an invisible utility. It is a critical component that must be designed, monitored, and tested with the same rigor as any core service.
From technical failure to global business disruption: The DNS outage in AWS us-east-1 cascaded far beyond its regional boundaries, disrupting major platforms such as Netflix, financial institutions, and IoT services like Ring. The impact appeared as complete service unavailability, authentication failures, and API timeouts. This event exposed the fragility of the SaaS supply chain, where a single infrastructure failure impacted critical functions like auto scaling and triggered widespread revenue loss.
These cascading failures reveal clear lessons for engineers designing for resilience and reliability.
The most important lesson from the October 2025 outage is to design systems with the explicit assumption that foundational services, such as DNS and cloud control planes, will fail. Resilience is not just about surviving hardware failures; it is about surviving the failure of your dependencies.
The focus must shift from reactive recovery to proactive resilience. Building robust systems requires embedding failure tolerance directly into the application layer itself, rather than relying solely on the infrastructure provider. Here are some key strategies.
Application-level caching: Implement aggressive, long-lived DNS caching at the application level. If a DNS resolver is unavailable, the application should be able to continue operating with the last known-good IP address (see the first sketch after this list).
Graceful degradation: Design systems to operate in a “degraded” mode. If a non-critical feature’s back-end dependency fails, the application should disable that feature gracefully rather than failing entirely. Circuit breakers are an essential pattern here to prevent repeated calls to a failing service (see the second sketch after this list).
Separate critical paths: Identify the absolute critical path for your application’s core function and ruthlessly eliminate dependencies on it. For an e-commerce site, the ability to process payments is critical, whereas displaying user recommendations is not. Isolate these paths architecturally.
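To illustrate the first strategy, here is a minimal sketch of application-level DNS caching with a last-known-good fallback. It assumes Python’s standard socket resolver and a simple in-process cache; a production system would use a bounded, shared cache and handle record TTLs more carefully:

```python
import socket
import time

# Cache of hostname -> (ip_address, resolved_at). Kept deliberately simple:
# a plain in-process dictionary.
_dns_cache = {}


def resolve_with_fallback(hostname, ttl_seconds=300.0):
    """Resolve `hostname`, preferring a fresh lookup but falling back to the
    last known-good address if the resolver is unavailable."""
    cached = _dns_cache.get(hostname)
    if cached and time.time() - cached[1] < ttl_seconds:
        return cached[0]  # fresh enough, skip the lookup entirely
    try:
        ip = socket.gethostbyname(hostname)
        _dns_cache[hostname] = (ip, time.time())
        return ip
    except socket.gaierror:
        if cached:
            # Resolver is failing: serve the stale, last known-good address
            # rather than failing the request outright.
            return cached[0]
        raise  # nothing cached; the failure must surface to the caller
```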
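For the second strategy, here is a bare-bones circuit breaker sketch, illustrative rather than any particular library’s API: after a configurable number of consecutive failures, the breaker opens and routes calls to a fallback for a cooldown period, letting the application degrade gracefully instead of hammering a failing dependency:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after `failure_threshold` consecutive
    failures, fast-fail to a fallback while open, probe again after
    `reset_timeout` seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()  # circuit open: degrade gracefully
            self.opened_at = None  # cooldown elapsed: allow one probe call
        try:
            result = operation()
            self.failure_count = 0  # success resets the breaker
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            return fallback()


# Example: show an empty (cached) recommendation list if the
# hypothetical recommendations service is down.
# breaker = CircuitBreaker()
# items = breaker.call(fetch_recommendations, fallback=lambda: [])
```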
Pro tip: Utilize chaos engineering practices, such as those pioneered by Netflix’s Chaos Monkey, to proactively test your system’s resilience against dependency failures. Deliberately inject DNS failures or API errors in a controlled environment and observe how your system behaves.
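One lightweight way to start is to inject DNS failures at the client level in a test environment. The sketch below monkey-patches Python’s resolver for selected hostnames; the target host and failure rate are assumptions for the experiment, and this should only ever run in a controlled, non-production environment:

```python
import random
import socket

_real_getaddrinfo = socket.getaddrinfo

# Hosts whose resolution we deliberately break, and how often (assumptions).
CHAOS_TARGETS = {"dynamodb.us-east-1.amazonaws.com"}
FAILURE_RATE = 0.3


def flaky_getaddrinfo(host, *args, **kwargs):
    """Simulate DNS resolution failures for selected hosts during a test run."""
    if host in CHAOS_TARGETS and random.random() < FAILURE_RATE:
        raise socket.gaierror("injected DNS failure for chaos experiment")
    return _real_getaddrinfo(host, *args, **kwargs)


# In a test environment only: patch the resolver, run the workload, and verify
# that caching, circuit breakers, and fallbacks behave as intended.
# socket.getaddrinfo = flaky_getaddrinfo
```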
Ultimately, these application-level strategies must be paired with robust infrastructure patterns designed to mitigate large-scale regional failures.
Surviving a regional outage requires architectural patterns that treat a single region as an ephemeral resource. While complex to implement, a multi-region strategy is the most effective defense against events like the us-east-1 failure.
A resilient architecture goes beyond simply deploying infrastructure in multiple locations. It involves a holistic approach to data replication, traffic routing, and service discovery that is independent of any single region’s control plane. Consider the following best practices.
Multi-region deployment: Adopt an active-active or active-passive multi-region architecture. In an active-active setup, traffic is served from multiple regions at all times. In an active-passive setup, traffic fails over to a secondary region during an outage (a minimal DNS-failover sketch follows this list).
Redundant service discovery: Do not rely solely on a single cloud provider’s DNS for failover. Utilize a multi-provider DNS solution (e.g., AWS Route 53 combined with Cloudflare) that includes health checks, allowing for automatic traffic routing away from an unhealthy region.
Note: Multi-provider DNS cannot fully mitigate failures of internal service-endpoint discovery within a provider’s control plane. Therefore, the architecture must also incorporate service-endpoint redundancy and fallback logic.
Regular failover testing: A failover plan that has not been tested is merely a recovery theory. Conduct drills regularly where you simulate a regional failure and execute your failover playbook. This builds operational muscle and uncovers flaws in your process and architecture.
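To make the active-passive pattern concrete, here is a hedged boto3 sketch of DNS-level failover in Route 53: a health check watches the primary region’s public endpoint, and a PRIMARY/SECONDARY record pair shifts traffic to a standby region when that check fails. The hosted zone ID, domain names, and regions are placeholders, and per the note above, a single-provider setup like this would still be paired with multi-provider DNS and endpoint fallback logic in practice:

```python
import uuid

import boto3  # assumes AWS credentials and an existing Route 53 hosted zone

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # placeholder: your hosted zone ID

# 1. Health-check the primary region's public endpoint (assumed name and path).
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-us-east-1.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# 2. PRIMARY/SECONDARY failover records: traffic normally resolves to us-east-1
#    and shifts to us-west-2 when the health check reports the primary unhealthy.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "api-us-east-1.example.com"}],
                    "HealthCheckId": health_check["HealthCheck"]["Id"],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "api-us-west-2.example.com"}],
                },
            },
        ]
    },
)
```

The short TTL is what lets resolvers pick up the failover quickly; a multi-provider setup would mirror equivalent records and health checks at the second DNS provider.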
The following schematic shows a high-level view of a resilient multi-region architecture designed to withstand a regional DNS failure:
In the wake of the outage, we can anticipate several key improvements from AWS aimed at strengthening its infrastructure. The primary focus will likely be on further decoupling regional services and decentralizing the control planes, which have proven to be bottlenecks. This involves decoupling monolithic dependencies so that a failure in one part of the system is less likely to cascade into a region-wide or global event.
Continued investment can be expected in cell-based architectures, where services are deployed in smaller, fully independent instances (cells) even within a single region. This approach contains the impact of a failure to a single cell. Furthermore, AWS will likely introduce more resilient tooling and managed services that make it easier for customers to implement and test multi-region failover strategies. The industry as a whole may see a renewed push toward multi-cloud and decentralized architectures, as organizations seek to avoid vendor lock-in and single points of failure.
This event will undoubtedly accelerate the evolution of cloud resilience.
The October 2025 AWS outage was a painful but necessary lesson for the entire industry. It reminded us that failure is inevitable in complex distributed systems and that resilience must be a deliberate architectural choice, not an afterthought. The key takeaways are clear: treat your dependencies as potential failures, isolate critical workflows, and build for multi-layered recovery. Truly fault-tolerant systems come from designing for failure at both the application and infrastructure levels. The goal is not merely to survive the next outage but to emerge stronger from it.
If you want to go deeper and master the skills needed to build failure-tolerant systems, explore our expert-led courses. Whether you’re designing multi-region architectures, optimizing caching strategies, or preparing for your next System Design interview, these paths offer practical frameworks to help you build truly resilient and scalable services.