Protocols for Maintaining Fault Tolerance: Part II

Understand protocols to maintain fault tolerance in distributed systems by integrating new or repaired state machine replicas. Learn methods for handling logical clocks, real-time clocks, fail-stop failures, and Byzantine failures. This lesson covers how to ensure consistent states and stable request processing in replicated state machines.

So far, we’ve only discussed removing faulty elements; we haven’t yet explored adding repaired or new elements to the system. Let's see how we can successfully integrate a new or repaired component into a system of state machine replicas.

Integrating repaired elements

It is not enough for the element being added to be non-faulty. It must also be in the right state to behave consistently with other components. Let's start by introducing some notation:

We define $e[r_i]$ as the state of a non-faulty element $e$ after processing requests $r_0$ through $r_i$. An element $e$ that joins a configuration after request $r_{join}$ must be in state $e[r_{join}]$ at the moment it joins; only then will it behave consistently with the other elements and successfully become part of the system.
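To make the notation concrete, here is a minimal Python sketch. It assumes a hypothetical deterministic state machine, `CounterMachine`, whose requests are `(key, delta)` pairs; the class, the helper `state_after`, and the sample requests are illustrative and not part of the lesson. The sketch compares a replica that has run since $r_0$ with one joiner that starts in $e[r_{join}]$ and another that starts in an older state.

```python
from dataclasses import dataclass, field


@dataclass
class CounterMachine:
    """Hypothetical deterministic state machine: a key -> integer store."""
    state: dict = field(default_factory=dict)

    def process(self, request):
        """Apply one (key, delta) request and return the new value for key."""
        key, delta = request
        self.state[key] = self.state.get(key, 0) + delta
        return self.state[key]


def state_after(requests, i):
    """Compute e[r_i]: the state after processing r_0 through r_i."""
    machine = CounterMachine()
    for request in requests[: i + 1]:
        machine.process(request)
    return dict(machine.state)


requests = [("x", 1), ("y", 5), ("x", 2), ("y", -1), ("x", 10)]
join = 2  # the new element joins after request r_2

# A replica that has been running since r_0.
veteran = CounterMachine()
for r in requests[: join + 1]:
    veteran.process(r)

# One joiner starts in e[r_join]; another starts in the older state e[r_{join-1}].
joiner = CounterMachine(state_after(requests, join))
stale = CounterMachine(state_after(requests, join - 1))

# All three process the requests that arrive after the join point.
remaining = requests[join + 1:]
veteran_out = [veteran.process(r) for r in remaining]
joiner_out = [joiner.process(r) for r in remaining]
stale_out = [stale.process(r) for r in remaining]

print(veteran_out == joiner_out)  # joiner in e[r_join] agrees with the veteran
print(veteran_out == stale_out)   # joiner in an older state diverges
```

In this sketch, the joiner that starts in $e[r_{join}]$ produces the same outputs as the long-running replica, while the one that starts from an older state produces a different output as soon as a request touches the part of the state it missed.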

An element is self-stabilizing if its current state is completely defined by a fixed number of previously processed inputs, say $k$. For such an element, it is enough to let it run until it has processed $k$ inputs, after which it will be in state $e[r_{join}]$. Elements that are not self-stabilizing need different treatment. In the following discussion, we will discuss two such cases:
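As an illustration of self-stabilization, the following Python sketch assumes a hypothetical element, `WindowSum`, whose state is exactly its last $k$ inputs. The class and the sample inputs are invented for this example. A newcomer that starts with arbitrary state is fed the same inputs as a long-running replica.

```python
from collections import deque


class WindowSum:
    """Hypothetical self-stabilizing element: its state is the last k inputs."""

    def __init__(self, k, initial=()):
        # deque(maxlen=k) silently drops the oldest entry once k are stored.
        self.window = deque(initial, maxlen=k)

    def process(self, value):
        """Record one input and return the sum over the current window."""
        self.window.append(value)
        return sum(self.window)


k = 3
inputs = [4, 8, 15, 16, 23, 42]

# A replica that has already processed the first three inputs.
veteran = WindowSum(k)
for v in inputs[:3]:
    veteran.process(v)

# A newcomer that starts with arbitrary (incorrect) state.
newcomer = WindowSum(k, initial=[99, 99, 99])

# Both now process the same k inputs.
veteran_out = [veteran.process(v) for v in inputs[3:]]
newcomer_out = [newcomer.process(v) for v in inputs[3:]]

print(list(veteran.window) == list(newcomer.window))  # states match after k inputs
print(veteran_out, newcomer_out)
```

The newcomer's first outputs differ from the veteran's because its window still holds leftover values. Once it has processed $k$ inputs, the leftover values have been pushed out and the two states are identical, so no explicit state transfer is needed for this kind of element.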

Logical clocks and fail-stop failures

When logical clocks are used and only fail-stop failures are assumed, the state of a single running state machine replica $sm_i$ is all we need to integrate an element. The state of $sm_i$ will be correct because, under the fail-stop assumption, a replica that is still running is non-faulty. Let's consider the following three cases, in which the integrated element is an output device, a client, or a state machine replica:

...
