Availability and Fault Tolerance in Distributed Systems
Learn to design distributed systems that can withstand failures and ensure consistent service delivery.
When a service like a payment gateway or a real-time messaging app goes down, the consequences can range from financial loss to a complete erosion of user trust.
Building systems that gracefully handle these failures is a core competency in modern engineering and a crucial topic in System Design. The goal is not to prevent failures entirely (as this is impossible in real-world systems), but to design systems that can tolerate them without impacting the end user.
This lesson examines the principles and patterns employed to achieve high availability and fault tolerance, ensuring that our services remain operational even when individual components fail.
Availability in distributed systems
Availability refers to the degree to which a system or service remains operational and accessible to users when needed.
In simple terms, it measures uptime, which represents the percentage of time a system functions without interruption. High availability focuses on minimizing downtime to ensure that users can always access the service, as illustrated below.
Availability is typically expressed as a ratio of uptime to total time:
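Availability (%) = (Uptime / (Uptime + Downtime)) × 100

For example, a service that accumulates about 8.76 hours of downtime over a year has an availability of roughly 99.9%.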
Availability targets are often specified in service level agreements (SLAs) to define expected reliability and performance.
Some of the most common metrics used to evaluate and manage availability include:
Uptime: Percentage of time a system is operational and performing its required functions.
Mean time between failures (MTBF): Average time that passes between one failure and the next. A higher MTBF means the system is more reliable.
Mean time to recovery (MTTR): Average time needed to repair a failed component and restore full operation. A lower MTTR is better. MTBF and MTTR together determine availability, as shown in the sketch after this list.
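These metrics are related: steady-state availability can be estimated as MTBF / (MTBF + MTTR). A minimal Python sketch, using hypothetical failure and recovery figures, shows the calculation:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: one failure every 500 hours, 30 minutes to recover.
uptime_fraction = availability(mtbf_hours=500, mttr_hours=0.5)
print(f"Availability: {uptime_fraction:.5%}")  # ~99.900%
```

The sketch makes the trade-off explicit: availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).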
Educative byte: The famous “five nines” of availability (99.999% uptime) translates to roughly five minutes of downtime per year, a target typically reserved for the most critical systems.
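To make the “nines” concrete, the short sketch below (assuming a 365-day year) converts an availability target into its allowed downtime per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability -> {allowed_downtime:,.1f} minutes of downtime per year")
```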
Achieving high availability
The foundational principle for achieving high availability is redundancy, which means eliminating single points of failure by having duplicate components.
If one component fails, a redundant one can take over its workload, ensuring the system remains operational. Redundancy can be physical (multiple zones, regions, servers) or logical (multiple service replicas running in containers or clusters).
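As a minimal sketch of redundancy at the client level (the endpoint names are hypothetical), the snippet below tries a primary replica and fails over to a standby when a request fails. In production, this switch would usually be handled by a load balancer or DNS failover rather than application code:

```python
import urllib.request
import urllib.error

# Hypothetical redundant replicas of the same service.
REPLICAS = [
    "https://payments-primary.example.com/health",
    "https://payments-standby.example.com/health",
]

def call_with_failover(urls: list[str], timeout: float = 2.0) -> str:
    """Try each redundant endpoint in order; return the first successful response."""
    last_error: Exception | None = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode()
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error  # This replica failed; fall through to the next one.
    raise RuntimeError("All replicas are unavailable") from last_error

# Example usage (requires the endpoints above to exist):
# print(call_with_failover(REPLICAS))
```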
To understand how redundancy is practically ...