Overview

Fischer (1983)Michael J. Fischer. The consensus problem in unreliable distributed systems (a brief survey). In Proceedings of the 1983 International FCT-Conference on Fundamentals of Computation Theory, pages 127-40, London, UK, 1983. Springer-Verlag. mentions that the kinds of protocols that can be deployed to solve the consensus problem “depend heavily on the assumptions made about the model of computation and the kind of faults to which it is prone.” There are different models of distributed systems working on processes that communicate by sending messages over networks. These models are based on fundamental properties and make relevant assumptions about their characteristics and environments. In order to get an efficient algorithm for consensus, one has to understand and consider some aspects of a system’s behavior.

Failure model

When the consensus problem was first sketched by Pease, Shostak, and LamportMarshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. J. ACM, 27(2):228-34, April 1980., they made no explicit differentiation or limitation of the failure model, thus allowing processes to exhibit arbitrary failures. But in the design of such systems and algorithms, determining which kinds of faults to tolerate is an important point. In general, an assumption on the types of failures and also on the timing model is necessary. Failure models may involve crash or Byzantine failuresThe term “Byzantine” for arbitrary or malicious behavior was introduced by Lamport, Shostak, and Pease (1982) in their outstanding paper “The Byzantine Generals Problem”. and timing models that reach from synchronous to asynchronous. These considerations affect the access protocol to be employed and hence may have a major impact on performance.

The aspects of distributed systems that are crucial for the choice of fundamental models are the following:

  • Fail-stop failures: This kind of failure can only exhibit crash failures. The process halts, and remains halted, but this state allows other correct processes to detect its failure, such as using timeouts. In this model, processes may crash in the middle of a step, meaning a process may send a message to only a subset of all processes before crashing.
  • Crash failures: A crash occurs when a process halts and never recovers. The process operates correctly until the crash but afterward stays completely inactive. This kind of failure is not detectable by other processes.
  • Omission failures (channel): In this case, a communication channel fails, causing a message of a process inserted to never arrive at another process.
  • Timing failures: Timing failures occur when synchrony assumptions aren’t fulfilled. This kind of failure is irrelevant in asynchronous systems.
  • Authenticated Byzantine failures: In this model, processes may show Byzantine behavior, but faulty processes aren’t able to forge messages of other processes, since messages are authenticated.
  • Byzantine (arbitrary) failures: Judmayer et al. (2017)Aljosha Judmayer, Nicholas Stifter, Katharina Krombholz, Edgar Weippl, Elisa Bertino, and Ravi Sandhu. Blocks and Chains: Introduction to Bitcoin, Cryptocurrencies, and Their Consensus Mechanisms. Synthesis Lectures on Information Security, Privacy, and Trust. San Rafael, CA, 2017. Morgan & Claypool. describe Byzantine failures as failures that “allow a component to act arbitrarily, and possibly maliciously, from its expected behavior. This includes duplicating or changing message contents, sending unsolicited messages.” Thus, Byzantine failures occur in an unpredictable manner and can exhibit temporarily or permanently all possible faulty behaviors. Such processes are able to forge messages of other processes or act maliciously against the protocol.

This list builds up a hierarchy of the seriousness of failures. While fail-stop failures are based on very strong assumptions, Byzantine failures are based on the weakest assumptions and thus may exhibit any failure previously classified. Hence, we can conclude formally (Stefan Poledna (2007)Stefan Poledna. Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. The Springer International Series in Engineering and Computer Science. Springer US, 2007. Springer., Aljosha Judmayer et al. (2017)Aljosha Judmayer, Nicholas Stifter, Katharina Krombholz, Edgar Weippl, Elisa Bertino, and Ravi Sandhu. Blocks and Chains: Introduction to Bitcoin, Cryptocurrencies, and Their Consensus Mechanisms. Synthesis Lectures on Information Security, Privacy, and Trust. San Rafael, CA, 2017. Morgan & Claypool., George Coulouris et al. (2013)George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair. Distributed Systems. International computer science series. Boston: Addison-Wesley, 2013. Pearson Education Limited. and Michael J. Fischer (1983)Michael J. Fischer. The consensus problem in unreliable distributed systems (a brief survey). In Proceedings of the 1983 International FCT-Conference on Fundamentals of Computation Theory, pages 127-40, London, UK, 1983. Springer-Verlag.):

Proposition 1

Byzantine failuresAuthenticated Byzantine failuresTiming failuresOmission failures (channel)Crash failuresFail-stop failures.\text {Byzantine failures} \supset \text {Authenticated Byzantine failures} \supset \text {Timing failures} \supset \text {Omission failures (channel)} \supset \text {Crash failures} \supset \text {Fail-stop failures}.

Hence, Byzantine failures are considered to be the most difficult class of failures among the failure models, whereas fail-stop failures build the simplest model since they can only exhibit a crash failure, which can be detected by other processes. This is why building fault-tolerant systems considering Byzantine failures is the most difficult system approach.

For reasons of simplification, we assume from now on that all processes are connected by reliable channels, meaning every link between them works properly. Hence, there are two main types of failures that a process may undergo, which are considered in the modeling, namely crash failures (including fail-stop failures) and Byzantine failures. The most common models are heavily based on these two failures. This is a legitimized approach, as Coulouris et al. (2013)George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair. Distributed Systems. International computer science series. Boston: Addison-Wesley, 2013. Pearson Education Limited. justify: “Although a communication channel can exhibit the omission failures,” most failure models do not consider them since “it is possible to build a communication service that masks some of those failures” (cf. this section). In this case, reliable communication is assumed.

According to Coulouris et al. (2013)George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair. Distributed Systems. International computer science series. Boston: Addison-Wesley, 2013. Pearson Education Limited., we can define reliable communication in the following way:

Get hands-on with 1200+ tech skills courses.