Fault Tolerance

Learn about fault tolerance, its importance, and the techniques used to achieve it.


What is fault tolerance?

In the real world, large-scale applications utilize hundreds of servers and databases to accommodate the requests of billions of users and to store vast amounts of data. These applications require a mechanism that ensures data safety and avoids the recalculation of computationally intensive tasks by eliminating any single point of failure.

Fault tolerance refers to a system’s ability to keep operating even if one or more of its components fail. Here, components can be software or hardware. Building a system that is 100 percent fault tolerant is practically very difficult.

Fault tolerance in action: when one server fails, another seamlessly takes over

Two key qualities make fault tolerance essential: availability and reliability.

Availability ensures that the system remains accessible and can receive client requests at any time. Reliability, on the other hand, ensures that the system consistently processes those requests and performs the correct actions every time.

Fault-tolerant systems aim to ensure high availability of the system by preventing disruptions arising from a single point of failure.

There are two approaches to fault tolerance:

  • Fault removal: This can take the form of either forward error recovery or backward error recovery.

  • Fault masking: The presence of one defect hides the presence of another defect in the system.

For fault tolerance with zero downtime (constantly active), a hot failover must be implemented, which instantly transfers workloads to a functioning backup system.

If maintaining a constantly active standby system is not required, a warm or cold failover can be used, where the backup system loads and starts the workloads only when needed. A warm or cold failover is slower because the system must load and initialize resources before becoming active.
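
To make the failover idea concrete, here is a minimal Python sketch; Node and FailoverClient are hypothetical names used for illustration, not part of any real framework. Each request goes to the primary, and on failure the client instantly retries against an already-running (hot) standby.

```python
# Minimal hot-failover sketch; Node and FailoverClient are hypothetical
# names used for illustration, not part of any real framework.
class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} processed {request!r}"

class FailoverClient:
    """Sends each request to the primary; on failure, it instantly retries
    against the hot standby, which is already up and running."""
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def send(self, request):
        try:
            return self.primary.handle(request)
        except ConnectionError:
            # Hot failover: the standby is already initialized, so the
            # switch adds no start-up delay.
            return self.standby.handle(request)

primary, standby = Node("primary"), Node("standby")
client = FailoverClient(primary, standby)

print(client.send("GET /orders/42"))   # served by the primary
primary.healthy = False                # simulate a primary failure
print(client.send("GET /orders/42"))   # transparently served by the standby
```

With a warm or cold standby, the except branch would first have to start and initialize the backup before serving the request, which is where the extra failover delay comes from.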

Fault-tolerant computing also provides limited protection against software failures, which remain a major cause of downtime and data center outages for many organizations.

With that understanding, let’s now explore some common techniques used to achieve fault tolerance.

Advantages of fault-tolerant systems

The primary purpose of building fault tolerance into a system is to prevent (or at least minimize) situations where its functionality becomes unavailable because of a fault in one or more of its components.

Fault tolerance is necessary for systems that protect people’s safety (such as air traffic control hardware and software systems) and for systems on which security, data protection, data integrity, and high-value transactions all depend.

Fault-tolerant systems provide an excellent safeguard against equipment failure, but they can be extraordinarily expensive to implement because they require a fully redundant set of hardware that needs to be linked to the primary system.

Fault tolerance techniques

Failures occur at the hardware or software level and eventually affect the data. Fault tolerance can be achieved through various approaches, depending on the system’s structure and design. Let’s discuss the techniques that are significant and suitable for most designs.

Forward and backward error recovery

Forward error recovery involves identifying the error and, based on this knowledge, correcting the system state that contains it. Exception handling in high-level languages such as Ada and PL/I provides a system structure that supports forward recovery.

Backward error recovery corrects the system state by restoring the system to a stable state that existed prior to the fault’s manifestation.
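
To contrast the two recovery styles, here is a small hedged sketch in Python; the ledger and the transfer functions are made up for illustration. Forward recovery detects the error and repairs the state in place, while backward recovery rolls the state back to a previously saved recovery point.

```python
import copy

# Illustrative sketch contrasting the two recovery styles on a tiny
# in-memory "ledger"; the account names and functions are made up.
ledger = {"alice": 100, "bob": 50}

def transfer_forward(src, dst, amount):
    """Forward recovery: detect the error, then repair the state in place."""
    ledger[src] -= amount
    ledger[dst] += amount
    if ledger[src] < 0:                  # error detected after the fact
        ledger[dst] -= amount            # compensate: undo the credit...
        ledger[src] += amount            # ...and the debit
        raise ValueError("insufficient funds; state repaired in place")

def transfer_backward(src, dst, amount):
    """Backward recovery: snapshot first, roll back to it if anything fails."""
    snapshot = copy.deepcopy(ledger)     # the recovery point
    try:
        ledger[src] -= amount
        ledger[dst] += amount
        if ledger[src] < 0:
            raise ValueError("insufficient funds")
    except ValueError:
        ledger.clear()
        ledger.update(snapshot)          # restore the prior stable state
        raise

try:
    transfer_backward("alice", "bob", 500)   # fails and is rolled back
except ValueError:
    pass
print(ledger)                                # {'alice': 100, 'bob': 50}
```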

Replication

One of the most widely used techniques is replication-based fault tolerance.

With this technique, we can replicate both the services and the data. We can swap out failed nodes with healthy ones and a failed data store with its replica. A large service can make the switch transparently without impacting end customers.

We create multiple copies of our data in separate storage. Whenever the data changes, all copies must be updated to maintain consistency, and keeping replicas in sync is a challenging job. When a system requires strong consistency, we can update the replicas synchronously; however, this reduces the system's availability. Alternatively, when we can tolerate eventual consistency, we can update the replicas asynchronously, accepting stale reads until all replicas converge. There is thus a trade-off between the two consistency approaches: in the face of failures, we often compromise on either availability or consistency, a reality outlined in the CAP theorem.
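
The following sketch contrasts the two update strategies; the Replica class and the write helpers are illustrative names, not a real replication protocol. A synchronous write acknowledges only after every replica has applied the update, while an asynchronous write acknowledges immediately and lets a background worker propagate the change.

```python
import queue
import threading

# Minimal sketch of the two update strategies. The Replica class and the
# write helpers are illustrative names, not a real replication protocol.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

replicas = [Replica() for _ in range(3)]
update_log = queue.Queue()

def synchronous_write(key, value):
    """Strong consistency: acknowledge only after every replica has applied
    the update. A slow or failed replica blocks the write (lower availability)."""
    for replica in replicas:
        replica.apply(key, value)
    return "ack"

def asynchronous_write(key, value):
    """Eventual consistency: acknowledge immediately and let a background
    worker propagate the update; reads may be stale until replicas converge."""
    replicas[0].apply(key, value)   # the primary applies the update right away
    update_log.put((key, value))    # the other replicas catch up later
    return "ack"

def replication_worker():
    while True:
        key, value = update_log.get()
        for replica in replicas[1:]:
            replica.apply(key, value)
        update_log.task_done()

threading.Thread(target=replication_worker, daemon=True).start()

synchronous_write("user:1", "Alice")   # all replicas are consistent on return
asynchronous_write("user:2", "Bob")    # returns before replicas 2 and 3 catch up
update_log.join()                      # block until the replicas have converged
```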

Checkpointing

Checkpointing is a technique that saves the system’s state in stable storage for later retrieval in case of failures due to errors or service disruptions. It is performed in multiple stages at regular time intervals.

When a distributed system fails, we can retrieve the last computed data from the previous checkpoint and resume work from that point.
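
As a rough illustration, the sketch below periodically saves its progress to stable storage and, after a restart, resumes from the last checkpoint instead of recomputing everything from scratch. The file name and the work loop are assumptions made for illustration, not a specific checkpointing framework.

```python
import json
import os

# Sketch of checkpoint/restore for a long-running computation. The file name
# and the work loop below are assumptions made for illustration.
CHECKPOINT_FILE = "checkpoint.json"

def save_checkpoint(state):
    # Write to a temporary file and rename it so that a crash mid-write
    # never corrupts the last good checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}   # fresh start

state = load_checkpoint()                 # resume from the last checkpoint, if any
for i in range(state["next_item"], 1_000_000):
    state["partial_sum"] += i             # the "computationally intensive" work
    if i % 100_000 == 0:                  # checkpoint at regular intervals
        state["next_item"] = i + 1
        save_checkpoint(state)
```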

Checkpointing is performed for the individual processes in a system in a way that captures a global state of the system's actual execution. Depending on that state, we can divide checkpointing into two types:

  • Consistent state: A state in which all the individual processes of a system have a consistent view of the shared state or of the sequence of events that have occurred. Snapshots taken in consistent states contain coherent data and represent a situation the system could actually have been in. For a checkpoint to be consistent, the following criteria are typically met:

    • All updates to data that were completed before the checkpoint are saved. Any updates to data that were in progress are rolled back as if they hadn’t been initiated.

    • Checkpoints include all the messages that have been sent or received up until the checkpoint. No messages are in transit (in-flight) to avoid cases of missing messages.

    • The relationships and dependencies between system components and their states align with what would be expected during normal operation.

  • Inconsistent state: This is a state where discrepancies exist in the saved states of different processes within a system. In other words, the checkpoints across different processes are neither coherent nor coordinated.

Let’s look at an example to better understand consistent and inconsistent states. Consider three processes represented by $i$, $j$, and $k$. Two messages, $m_1$ and $m_2$, are exchanged between the processes. In addition, one snapshot/checkpoint is saved for each process, represented by $C_{1,i}$, $C_{1,j}$, and $C_{1,k}$, where the subscript 1 is the checkpoint number and the letter identifies the process itself.

In the illustration on the left, the first checkpoints at processes $j$ and $i$ are consistent because $m_1$ is sent and received after the checkpoints. On the contrary, in the right-hand illustration, the first checkpoint at process $j$ doesn’t know about $m_1$, while the first checkpoint at process $i$ has recorded the reception of message $m_1$. Therefore, it’s an inconsistent state.

The left-hand illustration also represents a consistent state because no communication takes place among the processes while the system performs checkpointing. On the right side, we can see that the processes are still communicating through messages when the system performs checkpointing.
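
One simple way to capture the difference programmatically: a set of checkpoints is consistent only if no process has recorded receiving a message that its sender never recorded sending. The hedged sketch below mirrors the $i$/$j$ example with hypothetical per-process checkpoint records.

```python
# Hedged sketch: detecting orphan messages (received but never recorded as
# sent) across a set of per-process checkpoints. The records below are
# hypothetical and mirror the i/j example above.

def is_consistent(checkpoints):
    """A checkpoint set is consistent if every message recorded as received
    by some process was also recorded as sent by its sender."""
    sent, received = set(), set()
    for record in checkpoints.values():
        sent.update(record["sent"])
        received.update(record["received"])
    return received <= sent               # no orphan messages

# Left-hand case: m1 is sent and received only after the checkpoints are taken.
left = {
    "C1_i": {"sent": [], "received": []},
    "C1_j": {"sent": [], "received": []},
}

# Right-hand case: C1_i recorded receiving m1, but C1_j never recorded sending it.
right = {
    "C1_i": {"sent": [], "received": ["m1"]},
    "C1_j": {"sent": [], "received": []},
}

print(is_consistent(left))    # True  -> consistent state
print(is_consistent(right))   # False -> inconsistent state
```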