AWS Widespread Outage
Examine the 2021 AWS outage to understand how internal network congestion and monitoring failures cascade into global service disruption. Learn crucial System Design lessons about identifying hidden dependencies, implementing contingency plans, and ensuring independent communication systems for resilience.
Introduction
On December 7, 2021, a major AWS outage disrupted services for over eight hours. The incident propagated across many internet services, affecting both consumer devices and large commercial platforms. Headlines like the Financial Times’ “From angry Adele fans to broken robot vacuums” highlighted the scale of the disruption across users and businesses. This event underscored the risks of heavy reliance on centralized cloud providers. According to Gartner, five companies account for roughly 80% of the cloud market, with Amazon holding a 41% share. Failures at this scale can cause widespread disruptions. Such outages reflect Leslie Lamport’s well-known definition: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
The following sequence describes the chain of technical events that led to the outage.
Sequence of events
Trigger: An automated capacity expansion in the internal network triggered unexpected behavior from a large number of clients.
Congestion: A surge in connection activity overwhelmed the networking equipment linking the internal network to the main AWS network.
Retry storm: Communication delays increased latency and error rates. This triggered a feedback loop of aggressive retries and ping requests.
Device overload: The persistent traffic spike caused constant overload and performance degradation in the devices connecting the two networks.
Monitoring blindness: The network congestion cut off real-time monitoring data, preventing operations teams from identifying the root cause.
Manual diagnosis: Blind to the system state, operators relied on logs and eventually identified a spike in internal DNS failures.
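The retry storm in the sequence above arises when many clients fail at once and retry on the same schedule, synchronizing their traffic into waves that keep the congested devices overloaded. A standard mitigation is capped exponential backoff with jitter. The sketch below is illustrative only, not AWS's actual client code; the `operation` callable and the use of `ConnectionError` as the transient failure signal are assumptions for the example.

```python
import random
import time

def call_with_backoff(operation, max_retries=5, base_delay=0.1, cap=5.0):
    """Retry a flaky operation with capped exponential backoff and full jitter.

    Without jitter, clients that fail at the same moment retry in
    synchronized waves, amplifying congestion into a retry storm.
    (Illustrative sketch; `operation` and the ConnectionError signal
    are assumptions, not AWS internals.)
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            # Full jitter: sleep a random amount up to the capped
            # exponential delay, spreading retries apart in time.
            delay = random.uniform(0, min(cap, base_delay * (2 ** attempt)))
            time.sleep(delay)
```

The randomized delay means that even if thousands of clients fail simultaneously, their retries are spread across the backoff window rather than arriving in lockstep, which gives the overloaded network devices a chance to recover.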
The following slides visualize the series of events.