Types of Failures in Distributed Systems
Learn how distributed systems stay reliable during different types of failures using redundancy, observability, and recovery.
Building a large-scale application on a single, powerful machine is inherently fragile.
The moment the machine fails, the entire service comes to a halt. To achieve scalability, reliability, and low latency, modern systems are distributed across many commodity computers, but this introduces a fundamental challenge: a system composed of hundreds or thousands of fallible components is guaranteed to experience failures.
Understanding these failure modes is a critical skill for any engineer and a frequent focus in modern system design.
This lesson explores the inevitable failures in distributed systems, categorizing them to help you design resilient and robust architectures. Common categories include hardware failures, software failures, and network failures, each of which affects system reliability in distinct ways.
The following illustration highlights failure points across different services in a distributed system.
Hardware failures in distributed systems
At the most basic level, distributed systems run on physical machines, and that hardware can break.
Hardware failures involve the malfunction of components such as disks, memory, power supplies, or network cards, but they can also affect larger failure domains such as racks or network switches. While component quality has improved over the decades, the sheer scale of modern data centers means that hardware failures are not a rare exception but a daily operational reality.
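To see why scale turns rare events into routine ones, here is a rough back-of-the-envelope sketch. The fleet size and the 2% annual failure rate are illustrative assumptions, not measured figures, and the model assumes failures are independent and spread evenly across the year.

```python
def expected_daily_failures(component_count: int, annual_failure_rate: float) -> float:
    """Expected component failures per day, assuming independent failures
    spread evenly over a 365-day year (a simplifying assumption)."""
    return component_count * annual_failure_rate / 365


def prob_at_least_one_failure_today(component_count: int, annual_failure_rate: float) -> float:
    """Probability that at least one component fails on a given day,
    under the same independence assumption."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** component_count


if __name__ == "__main__":
    # Hypothetical numbers: one server with 10 disks vs. a fleet of 100,000 disks,
    # each with an assumed 2% chance of failing in a year.
    for disks in (10, 100_000):
        per_day = expected_daily_failures(disks, 0.02)
        p_any = prob_at_least_one_failure_today(disks, 0.02)
        print(f"{disks:>7} disks: {per_day:.4f} expected failures/day, "
              f"P(at least one today) = {p_any:.4f}")
```

Under these assumptions, a single server with ten disks almost never sees a failure on a given day, while the 100,000-disk fleet should expect roughly five or six disk failures every day.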
These failures can be understood as occurring across several hierarchical levels in a distributed system, ranging from individual components up to entire data centers. The pyramid below illustrates this layered nature of hardware failures, where each lower layer represents a broader level of potential disruption.
When a server’s disk crashes or its memory becomes corrupted, the server can stop responding, leading to service unavailability. The primary strategies for mitigating these issues start from accepting that failures will occur and planning for them through replication, redundancy, and failover mechanisms.
Replication: The most common strategy is to keep multiple copies of data and application logic on different machines. If one machine fails, another can take over its workload. For instance, a database might replicate its data across three separate servers in a cluster.
Redundancy: This involves duplicating critical components. ...
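As a minimal sketch of how replication and failover fit together, the example below keeps three in-memory copies of each key and reads from the first copy that responds. The names `Replica`, `ReplicatedStore`, and `ReplicaDown` are hypothetical and exist only for illustration; a real database would also need to handle consistency between copies, timeouts, and recovery.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    """One in-memory copy of the data; `healthy` simulates hardware state."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.healthy = True

    def put(self, key, value):
        if not self.healthy:
            raise ReplicaDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Writes to every reachable replica and reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def put(self, key, value) -> int:
        # Replication: attempt the write on every copy, skipping failed machines.
        acked = 0
        for replica in self.replicas:
            try:
                replica.put(key, value)
                acked += 1
            except ReplicaDown:
                continue
        if acked == 0:
            raise ReplicaDown("no replica available")
        return acked  # number of copies that accepted the write

    def get(self, key):
        # Failover: try replicas in order until one responds.
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ReplicaDown:
                continue
        raise ReplicaDown("no replica available")


if __name__ == "__main__":
    store = ReplicatedStore([Replica("a"), Replica("b"), Replica("c")])
    store.put("user:1", "Ada")
    store.replicas[0].healthy = False   # simulate a disk crash on the first machine
    print(store.get("user:1"))          # still served by a surviving replica
```

This sketch deliberately ignores what happens when a failed replica comes back with missing writes. Handling that case requires a resynchronization step that is out of scope here.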