Protocols for Maintaining Fault Tolerance: Part I

Explore how state machine replication protocols maintain fault tolerance by managing faulty replicas and configurations. Understand the conditions for replacing replicas, updating system components, and the role of configurators in detecting failures. This lesson helps you grasp how distributed systems tolerate faults to ensure correct outputs under various failure modes.

We'll cover the following...

Modeling replica replacement
- Replacing replicas on failures
- Replacing output devices and clients
Managing a system of state machine replicas

Our protocols for $t$ fault tolerance guarantee that the system will not fail as long as no more than $t$ replicas fail. To rely on this guarantee, we must ensure that the number of faulty nodes in an ensemble of replicas never exceeds $t$. We can do this by replacing faulty replicas with non-faulty ones. Let's discuss this formally.

Modeling replica replacement

We define $P(\tau)$ as the total number of nodes running state machine replicas in an ensemble of replicas and $F(\tau)$ as the number of faulty nodes in that ensemble at time $\tau$. The number of non-faulty nodes, $P(\tau) - F(\tau)$, must be greater than a certain threshold to guarantee that our system will produce the correct output. Here is how we can formally define this combining condition:

$$P(\tau) - F(\tau) > Enuf$$

Here, $Enuf = P(\tau)/2$ when Byzantine failures are possible, and $Enuf = 0$ when only fail-stop failures are possible.

If the condition above holds, our system will provide the correct output. This is ensured by having the minimum number of non-faulty nodes present in the system, which depends on the failure type. For Byzantine failures, we need a majority, which means more than half of the total nodes, so the number of non-faulty nodes must be an integer greater than $P(\tau)/2$. We only need one non-faulty node for fail-stop failures, which ...
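The combining condition described above (the number of non-faulty nodes must exceed $Enuf$) can be checked with a few lines of code. The following is a minimal sketch, not part of any protocol implementation: the function names `enuf` and `combining_condition_holds` and their parameters are hypothetical, chosen here for illustration only. It evaluates the condition for a single snapshot in time, given the total node count $P(\tau)$ and the faulty node count $F(\tau)$.

```python
from fractions import Fraction


def enuf(total_nodes: int, byzantine: bool) -> Fraction:
    """Threshold from the text: P/2 for Byzantine failures, 0 for fail-stop.

    A Fraction keeps P/2 exact when the node count is odd.
    """
    return Fraction(total_nodes, 2) if byzantine else Fraction(0)


def combining_condition_holds(total_nodes: int, faulty_nodes: int,
                              byzantine: bool) -> bool:
    """Return True if P - F > Enuf for one snapshot of the ensemble.

    All names here are hypothetical and used only for illustration.
    """
    if not 0 <= faulty_nodes <= total_nodes:
        raise ValueError("faulty_nodes must be between 0 and total_nodes")
    non_faulty = total_nodes - faulty_nodes
    return non_faulty > enuf(total_nodes, byzantine)


# Byzantine failures: 5 nodes with 2 faulty leaves 3 non-faulty, and 3 > 2.5.
print(combining_condition_holds(5, 2, byzantine=True))
# Byzantine failures: 4 nodes with 2 faulty leaves 2 non-faulty, and 2 > 2 fails.
print(combining_condition_holds(4, 2, byzantine=True))
# Fail-stop failures: a single non-faulty node is enough, since 1 > 0.
print(combining_condition_holds(3, 2, byzantine=False))
```

Under these assumptions, a monitoring component could evaluate such a check whenever a failure is detected and trigger a replica replacement before the condition is violated.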
