Motivation

Providing fault-tolerant services to clients is a desirable property of a system. State machine replication (SMR) is a mechanism to implement fault-tolerant services. SMR models a system as a state machine and replicates multiple copies of these state machines, such that failures are independent (meaning one failure only impacts one state machine). These state machines start with the same initial state, and the subsequent clients’ requests reach every replica in the same order, which applies those commands to transition to the same new state of the state machine (we are assuming deterministic logic that transitions the state machine from one state to the next).

If nothing could fail in a system, life would have been easier. However, real systems fail in a myriad of ways. In this lesson, we will primarily focus on node failures.

Note: SMR is an involved topic. In this abridged version, we will only highlight salient features of SMR. See the extended version for more details.

Node failures

A component is considered faulty when its behavior is inconsistent with its specification. The entire spectrum of failures considered in distributed systems is covered by its two ends.

  • Byzantine failure: This occurs in a node when it exhibits arbitrary behavior with other nodes, with the possibility of appearing to be working fine. Such failures cannot always be detected.

  • Fail-stop failure: This occurs in a node when its response to failure is to change to a state that reliably indicates its failure to other nodes.

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.