Consensus Prerequisites and Two Generals' Problem

This chapter lays the foundation of the consensus problem in distributed systems. By consensus, we mean that a set of processes start a protocol where each of them has a choice of possible decisions, and all of them reach a common decision by exchanging network messages. Achieving consensus is easy if there are no failures (network or node failures). However, achieving consensus under different kinds of failures becomes challenging (or even impossible). While extensive research has been done to solve the consensus problem, we have picked three celebrated results for discussion in this chapter. These results are as follows:

  • The Two Generals’ Problem

  • The FLP impossibility

  • The Byzantine Generals Problem

We believe the above results give us ample background for the rest of the chapters in this course.

In this lesson, we will define the consensus problem more formally and review two important system models (synchronous and asynchronous) and different kinds of faults (crash-stop and Byzantine faults). After that, we’ll discuss the Two Generals’ Problem.

Consensus

We need consensus for many computational tasks. Let's take the examples of distributed databases and data replication.

The use of databases is commonplace, and many rely on the abstraction of transactions to work correctly. A database transaction is involved when we purchase goods or services from any online store or conduct a bank withdrawal or deposit. Transactions are impossible without a consensus where, for example, money is deducted from one account and deposited into another as one logical unit. The consensus here involves either all parties moving ahead with the transaction and committing it, or none of the parties moving ahead and aborting the translation.

The second example is data replication to increase durability under data losses or corruption threats. Each copy of data must remain consistent at all places, even when we allow data mutations. Doing so is only possible when all replicas have a consensus on specific values.

Formally, the consensus among nn processes that can choose a value from a permissible set of values is to satisfy the following three conditions:

  1. Validity: If every nonfaulty process starts with an initial value vv, their final decision must be vv. This condition helps us exclude trivial solutions to a consensus where, for example, no matter what, every process chooses a fixed value.

  2. Termination: Every participating non-faulty process must make a decision eventually. Upon termination of a consensus protocol, we are sure we have reached a final stable state.

  3. Agreement: The final decision of every process must be the same, hence the consensus.

To formally study the consensus problem, we need to make assumptions about the execution environment of the processes.

Computational models

We are primarily concerned with two major components in a distributed system—the nodes and the network. They can exhibit different characteristics and provide different guarantees to their users. Researchers have broadly grouped them into two classes of synchronous and asynchronous models.

In the synchronous model, we assume the following:

  1. Processes either work in lockstep or have a maximum fixed bound of how far ahead or behind one process can be from the other. This implies that a process takes bounded time to complete its steps. A process might fail but does not further participate in the algorithm once it has failed.

  2. The clocks on the processes are perfectly synchronized or have fixed known bounds on the skew.

  3. The network delivers the message in a known upper-bound time.

Multi-core systems that share the same system clock and the cores of a GPU working in lock-step exhibit the properties of the synchronous model (albeit arguably, such systems might not be considered a distributed system).

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.