Introduction to State Machine Replication

Learn about state machine replication—a general framework for building fault-tolerant distributed services.

Motivation

Providing fault-tolerant services to the clients is a desirable property of a system. State machine replication (SMR) is a mechanism to implement fault-tolerant services. SMR models a system as a state machine and replicates multiple copies of these state machines such that failures are independent (meaning one failure only impacts one state machine). These state machines start with the same initial state, and the subsequent clients' requests reach every replica in the same order, which applies those commands to arrive at the same new state of the state machine (we are assuming deterministic logic that transitions the state machine from one state to the next).

The core component of SMR is the atomic broadcast facility, which enables every state machine to get the commands in the same order. NIST explains the purpose of state machine replication (SMR) as follows: “The objective of state machine replication (SMR) is to emulate a centralized service in a distributed, fault-tolerant fashion.” In this chapter, we will learn how to implement SMR when Byzantine and fail-stop failures are possible and how to reconfigure a replica group by excluding the faulty nodes.

Replication is a widely used technique to design fault-tolerant distributed systems where we maintain replicas of data or services. Fault tolerance is a way to achieve higher availability and reliability. To tackle different kinds of failures, the replicated sites should be on different physical servers. Such servers might be in different data centers, which might be far apart.

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.