Lessons from System Failures
Learn why complex distributed systems are prone to failure due to evolving user needs and emergent properties. Define the four core failure types. Understand how using independent vantage points and failure domains ensures system resilience and graceful degradation.
Introduction
Even widely used services experience failures, which can disrupt both individuals and businesses. System designers must understand why mature services built by experienced teams still experience outages. This chapter examines major failures in widely used services and the techniques used to mitigate them.
Two primary factors contribute to these failures:
Diverse users and evolution: User needs evolve, so software must be updated. Software that never changes may be stable, but it falls behind on the features users need. Continuous updates deliver those features at the cost of a higher risk of instability.
Complex systems: Large systems exhibit emergent properties, in which the interactions among components create complexity greater than the sum of the individual parts, making the system's behavior hard to predict from any one component.
Types of failure in distributed systems
Modern services are designed to contain failures, localizing impact to a subset of users. Common failure types in distributed systems include:
System failure: The most common type of failure, caused by software or hardware crashes. Data in primary memory is lost, but data in secondary storage or on replicas remains safe. The system typically recovers by rebooting.
Method failure: These failures suspend system operations. They may cause incorrect process execution or force the system into a deadlock state.
Communication medium failure: Occurs when a component or service cannot reach other internal or external entities due to network issues.
Secondary storage failure: Occurs when secondary storage or replicas go down. Data on these nodes becomes inaccessible, requiring primary nodes to generate new replicas to ensure reliability.
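As a rough illustration, the four failure types above can be written down as a small taxonomy. This is only a sketch: the names FailureType, RECOVERY, and recovery_hint are hypothetical, and the recovery notes include only what the list above states.

```python
from enum import Enum


class FailureType(Enum):
    """The four failure types named in this lesson."""
    SYSTEM = "system failure"
    METHOD = "method failure"
    COMMUNICATION_MEDIUM = "communication medium failure"
    SECONDARY_STORAGE = "secondary storage failure"


# Recovery steps mentioned in the list above. Types for which the lesson
# states no recovery step are deliberately left out (assumption: we do not
# invent one).
RECOVERY = {
    FailureType.SYSTEM: "reboot; data in secondary storage or replicas remains safe",
    FailureType.SECONDARY_STORAGE: "primary nodes generate new replicas",
}


def recovery_hint(failure):
    """Return the recovery step stated in the lesson, if any."""
    return RECOVERY.get(failure, "no recovery step stated in this lesson")


for failure in FailureType:
    print(f"{failure.value}: {recovery_hint(failure)}")
```

Keeping the taxonomy explicit makes it easier to tag incidents consistently when reviewing the case studies that follow.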
Vantage points
In large-scale systems, component failures occur regularly. The goal is graceful degradation, so that only a small portion of users is affected, and only for a short period. Confirming this requires monitoring from globally distributed vantage points that independently verify service availability and performance.
Note: Services like Downdetector rely on crowd-sourced reporting. If you check the status of popular applications, you will almost always find users somewhere in the world experiencing issues.
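The idea of combining observations from several vantage points can be sketched as follows. This is a minimal illustration, not a real monitoring tool: ProbeResult, classify_availability, the region labels, and the 50% threshold are all assumptions made for the example, and the probe results are hard-coded rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    vantage_point: str  # hypothetical label for where the probe ran
    reachable: bool     # whether the service answered the probe


def classify_availability(results, outage_threshold=0.5):
    """Summarize probe results from several independent vantage points.

    outage_threshold is an assumed cutoff: if at least this fraction of
    vantage points cannot reach the service, treat it as widespread.
    """
    if not results:
        return "unknown"
    failed = sum(1 for r in results if not r.reachable)
    failure_ratio = failed / len(results)
    if failure_ratio == 0:
        return "healthy"
    if failure_ratio >= outage_threshold:
        return "widespread outage"
    return "localized degradation"


# Hard-coded sample data: one of four vantage points cannot reach the service.
probes = [
    ProbeResult("eu-west", True),
    ProbeResult("us-east", True),
    ProbeResult("ap-south", False),
    ProbeResult("sa-east", True),
]
print(classify_availability(probes))  # localized degradation
```

A single vantage point could not tell these cases apart: from "ap-south" alone the service looks down, while from the other three it looks healthy.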
Importance of independent service providers
The original internet was designed for resilience: if one part failed, the rest continued to operate.
With the consolidation of service providers, however, a failure at a single large provider can affect many of the services that depend on it.
When a service's own status dashboard fails as well, companies often communicate updates via external channels like Twitter. Independent third-party services are therefore essential for objective failure detection and status dissemination.
Note: This relates to failure domains. A failure domain isolates components so that a failure within one domain (or network) does not affect others. Two domains are considered independent if they lie outside each other’s "blast radius."
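To make the notion of independent failure domains concrete, here is a small sketch that checks whether a set of replicas would survive the loss of one domain. It assumes, purely for illustration, that a failure domain can be identified by a (provider, region) pair; Replica, shared_failure_domains, survives_domain_failure, and the provider and region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    provider: str  # assumed: provider + region together identify a failure domain
    region: str


def shared_failure_domains(replicas):
    """Return the domains that host more than one replica."""
    by_domain = {}
    for replica in replicas:
        domain = (replica.provider, replica.region)
        by_domain.setdefault(domain, []).append(replica.name)
    return {d: names for d, names in by_domain.items() if len(names) > 1}


def survives_domain_failure(replicas, failed_domain):
    """True if at least one replica lies outside the failed domain."""
    return any((r.provider, r.region) != failed_domain for r in replicas)


replicas = [
    Replica("db-1", "provider-a", "region-1"),
    Replica("db-2", "provider-a", "region-1"),
    Replica("db-3", "provider-b", "region-2"),
]
failed = ("provider-a", "region-1")

print(shared_failure_domains(replicas))               # db-1 and db-2 share a domain
print(survives_domain_failure(replicas, failed))      # True: db-3 is outside it
print(survives_domain_failure(replicas[:2], failed))  # False: both replicas are inside it
```

The same check applies to monitoring: a status page hosted in the same domain as the service it reports on sits inside that service's blast radius.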
The following lessons analyze failures in well-known services, their causes, and the mitigation techniques used to avoid them. While analyzing past failures is an excellent way to learn, our ultimate goal is to prevent them from occurring in the first place.