Fault Tolerance

Learn about fault tolerance, its importance, and the techniques used to achieve it.


What is fault tolerance?

In the real world, large-scale applications utilize hundreds of servers and databases to accommodate the requests of billions of users and to store vast amounts of data. These applications require a mechanism that ensures data safety and avoids the recalculation of computationally intensive tasks by eliminating any single point of failure.

Fault tolerance refers to a system’s ability to keep operating even if one or more of its components fail. Here, components can be software or hardware. Building a system that is 100 percent fault tolerant is practically very difficult.

Fault tolerance in action: when one server fails, another seamlessly takes over

Two key qualities make fault tolerance essential: availability and reliability.

Availability ensures that the system remains accessible and can receive client requests at any time. Reliability, on the other hand, ensures that the system consistently processes those requests and performs the correct actions every time.

Fault-tolerant systems aim to ensure high availability of the system by preventing disruptions arising from a single point of failure.

There are two approaches to fault tolerance:

  • Fault removal: This can take the form of either forward error recovery or backward error recovery.

  • Fault masking: The presence of one defect hides the presence of another defect in the system.

For fault tolerance with zero downtime (constantly active), a hot failover must be implemented, which instantly transfers workloads to a functioning backup system.

If maintaining a constantly active standby system is not required, a warm or cold failover can be used, where the backup system loads and starts the workloads only when needed. A warm or cold failover is slower because the system must load and initialize resources before becoming active.
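
To make the failover idea concrete, here is a minimal Python sketch; Node and FailoverClient are hypothetical names used for illustration, not part of any real framework. Each request goes to the primary, and on failure the client instantly retries against an already-running (hot) standby.

```python
# Minimal hot-failover sketch; Node and FailoverClient are hypothetical
# names used for illustration, not part of any real framework.
class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} processed {request!r}"

class FailoverClient:
    """Sends each request to the primary; on failure, it instantly retries
    against the hot standby, which is already up and running."""
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def send(self, request):
        try:
            return self.primary.handle(request)
        except ConnectionError:
            # Hot failover: the standby is already initialized, so the
            # switch adds no start-up delay.
            return self.standby.handle(request)

primary, standby = Node("primary"), Node("standby")
client = FailoverClient(primary, standby)

print(client.send("GET /orders/42"))   # served by the primary
primary.healthy = False                # simulate a primary failure
print(client.send("GET /orders/42"))   # transparently served by the standby
```

With a warm or cold standby, the except branch would first have to start and initialize the backup before serving the request, which is where the extra failover delay comes from.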

Fault-tolerant computing also provides limited protection against software failures, which remain a major cause of downtime and data center outages for many organizations.

With that understanding, let’s now explore some common techniques used to achieve fault tolerance.

Advantages of fault-tolerant systems

The primary purpose of building fault tolerance into a system is to prevent (or at least minimize) situations where its functionality becomes unavailable because of a fault in one or more of its components.

Fault tolerance is necessary for systems that protect people’s safety (such as air traffic control hardware and software systems) and for systems on which security, data protection, data integrity, and high-value transactions all depend.

Fault-tolerant systems provide an excellent safeguard against equipment failure, but they can be extraordinarily expensive to implement because they require a fully redundant set of hardware that needs to be linked to the primary system.

Fault tolerance techniques

Failures occur at the hardware or software level and eventually affect the data. Fault tolerance can be achieved through various approaches, depending on the system’s structure and design. Let’s discuss the techniques that are significant and suitable for most designs.

Forward and backward error recovery

Forward error recovery involves identifying the error and, based on this knowledge, correcting the system state that contains it. Exception handling in high-level languages such as Ada and PL/I provides a system structure that supports forward recovery.

Backward error recovery corrects the system state by restoring the system to a stable state that existed prior to the fault’s manifestation.
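
To contrast the two recovery styles, here is a small hedged sketch in Python; the ledger and the transfer functions are made up for illustration. Forward recovery detects the error and repairs the state in place, while backward recovery rolls the state back to a previously saved recovery point.

```python
import copy

# Illustrative sketch contrasting the two recovery styles on a tiny
# in-memory "ledger"; the account names and functions are made up.
ledger = {"alice": 100, "bob": 50}

def transfer_forward(src, dst, amount):
    """Forward recovery: detect the error, then repair the state in place."""
    ledger[src] -= amount
    ledger[dst] += amount
    if ledger[src] < 0:                  # error detected after the fact
        ledger[dst] -= amount            # compensate: undo the credit...
        ledger[src] += amount            # ...and the debit
        raise ValueError("insufficient funds; state repaired in place")

def transfer_backward(src, dst, amount):
    """Backward recovery: snapshot first, roll back to it if anything fails."""
    snapshot = copy.deepcopy(ledger)     # the recovery point
    try:
        ledger[src] -= amount
        ledger[dst] += amount
        if ledger[src] < 0:
            raise ValueError("insufficient funds")
    except ValueError:
        ledger.clear()
        ledger.update(snapshot)          # restore the prior stable state
        raise

try:
    transfer_backward("alice", "bob", 500)   # fails and is rolled back
except ValueError:
    pass
print(ledger)                                # {'alice': 100, 'bob': 50}
```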

Replication

One of the most widely used techniques is replication-based fault tolerance.

With this technique, we can replicate both the services and the data. We can swap out failed nodes with healthy ones and a failed data store with its replica. A large service can make the switch transparently without impacting end customers.

We create multiple copies of our data in separate storage. Whenever the data changes, all copies must be updated to maintain consistency, and keeping replicas in sync is a challenging job. When a system requires strong consistency, we can update the replicas synchronously; however, this reduces the system's availability. Alternatively, when we can tolerate eventual consistency, we can update the replicas asynchronously, accepting stale reads until all replicas converge. There is thus a trade-off between the two consistency approaches: in the face of failures, we often compromise on either availability or consistency, a reality outlined in the CAP theorem.
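
The following sketch contrasts the two update strategies; the Replica class and the write helpers are illustrative names, not a real replication protocol. A synchronous write acknowledges only after every replica has applied the update, while an asynchronous write acknowledges immediately and lets a background worker propagate the change.

```python
import queue
import threading

# Minimal sketch of the two update strategies. The Replica class and the
# write helpers are illustrative names, not a real replication protocol.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

replicas = [Replica() for _ in range(3)]
update_log = queue.Queue()

def synchronous_write(key, value):
    """Strong consistency: acknowledge only after every replica has applied
    the update. A slow or failed replica blocks the write (lower availability)."""
    for replica in replicas:
        replica.apply(key, value)
    return "ack"

def asynchronous_write(key, value):
    """Eventual consistency: acknowledge immediately and let a background
    worker propagate the update; reads may be stale until replicas converge."""
    replicas[0].apply(key, value)   # the primary applies the update right away
    update_log.put((key, value))    # the other replicas catch up later
    return "ack"

def replication_worker():
    while True:
        key, value = update_log.get()
        for replica in replicas[1:]:
            replica.apply(key, value)
        update_log.task_done()

threading.Thread(target=replication_worker, daemon=True).start()

synchronous_write("user:1", "Alice")   # all replicas are consistent on return
asynchronous_write("user:2", "Bob")    # returns before replicas 2 and 3 catch up
update_log.join()                      # block until the replicas have converged
```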

Checkpointing

Checkpointing is a technique that saves the system’s state in stable storage for later retrieval in case of failures due to errors or service disruptions. It is performed in multiple stages at regular time intervals.

When a distributed system fails, we can retrieve the last computed data from the previous checkpoint and resume work from that point.
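
As a rough illustration, the sketch below periodically saves its progress to stable storage and, after a restart, resumes from the last checkpoint instead of recomputing everything from scratch. The file name and the work loop are assumptions made for illustration, not a specific checkpointing framework.

```python
import json
import os

# Sketch of checkpoint/restore for a long-running computation. The file name
# and the work loop below are assumptions made for illustration.
CHECKPOINT_FILE = "checkpoint.json"

def save_checkpoint(state):
    # Write to a temporary file and rename it so that a crash mid-write
    # never corrupts the last good checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}   # fresh start

state = load_checkpoint()                 # resume from the last checkpoint, if any
for i in range(state["next_item"], 1_000_000):
    state["partial_sum"] += i             # the "computationally intensive" work
    if i % 100_000 == 0:                  # checkpoint at regular intervals
        state["next_item"] = i + 1
        save_checkpoint(state)
```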

Checkpointing is performed for the individual processes in a system in a way that captures a global state of the system's actual execution. Depending on that state, we can divide checkpointing into two types:

  • Consistent state: A state in which all the individual processes of a system have a consistent view of the shared state or of the sequence of events that have occurred. Snapshots taken in consistent states contain coherent data and represent a situation the system could actually have been in. For a checkpoint to be consistent, the following criteria are typically met:

    • All updates to data that were completed before the checkpoint are saved. Any updates to data that were in progress are rolled back as if they hadn’t been initiated.

    • Checkpoints include all the messages that have been sent or received up until the checkpoint. No messages are in transit (in-flight) to avoid cases of missing messages.

    • The relationships and dependencies between system components and their states align with what would be expected during normal operation.

  • Inconsistent state: This is a state where discrepancies exist in the saved states of different processes within a system. In other words, the checkpoints across different processes are neither coherent nor coordinated.

Let’s look at an example to better understand consistent and inconsistent states. Consider three processes represented by $i$, $j$, and $k$. Two messages, $m_1$ and $m_2$, are exchanged between the processes. In addition, one snapshot/checkpoint is saved for each process, represented by $C_{1,i}$, $C_{1,j}$, and $C_{1,k}$, where the subscript 1 is the checkpoint number and the letter identifies the process itself.

In the illustration on the left, the first checkpoints at processes $j$ and $i$ are consistent because $m_1$ is sent and received after the checkpoints. On the contrary, in the right-hand illustration, the first checkpoint at process $j$ doesn’t know about $m_1$, while the first checkpoint at process $i$ has recorded the reception of message $m_1$. Therefore, it’s an inconsistent state.

The left-hand illustration also represents a consistent state because no communication takes place among the processes while the system performs checkpointing. On the right side, we can see that the processes are still communicating through messages when the system performs checkpointing.
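
One simple way to capture the difference programmatically: a set of checkpoints is consistent only if no process has recorded receiving a message that its sender never recorded sending. The hedged sketch below mirrors the $i$/$j$ example with hypothetical per-process checkpoint records.

```python
# Hedged sketch: detecting orphan messages (received but never recorded as
# sent) across a set of per-process checkpoints. The records below are
# hypothetical and mirror the i/j example above.

def is_consistent(checkpoints):
    """A checkpoint set is consistent if every message recorded as received
    by some process was also recorded as sent by its sender."""
    sent, received = set(), set()
    for record in checkpoints.values():
        sent.update(record["sent"])
        received.update(record["received"])
    return received <= sent               # no orphan messages

# Left-hand case: m1 is sent and received only after the checkpoints are taken.
left = {
    "C1_i": {"sent": [], "received": []},
    "C1_j": {"sent": [], "received": []},
}

# Right-hand case: C1_i recorded receiving m1, but C1_j never recorded sending it.
right = {
    "C1_i": {"sent": [], "received": ["m1"]},
    "C1_j": {"sent": [], "received": []},
}

print(is_consistent(left))    # True  -> consistent state
print(is_consistent(right))   # False -> inconsistent state
```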