Enable Fault Tolerance and Failure Detection

Understand how distributed key-value stores maintain availability and durability during node failures. Design sloppy quorum and hinted handoff mechanisms to handle temporary node outages. Apply Merkle trees for anti-entropy synchronization and gossip protocols for decentralized failure detection.

We'll cover the following...

Handle temporary failures
Handle permanent failures
- Anti-entropy with Merkle trees
Promote membership in the ring to detect failures
Conclusion

Handle temporary failures

Many distributed systems use a strict read/write quorum, where an operation must receive responses from a minimum number of replicas before it can proceed. If enough replicas are unavailable and the quorum cannot be satisfied, the operation fails, reducing availability. To maintain availability during such failures, the system can use a sloppy quorum.

In a sloppy quorum, the first $\text{n}$ healthy nodes from the preference list handle read and write operations. These nodes may not be the designated owners in the consistent hash ring, but they ensure the request is processed.

Example: Consider a configuration where $\text{n = 3}$ . If Node $\text{A}$ is unavailable during a write, the request is sent to the next healthy node, Node $\text{D}$ . Node $\text{D}$ processes the request and stores a hint indicating the data belongs to $\text{A}$ . Once $\text{A}$ recovers, $\text{D}$ forwards the data to $\text{A}$ and removes the local copy.

Handle permanent failures

When a node fails and is replaced or recovers after a long outage, replicas must synchronize to restore full durability. Merkle trees are used to detect inconsistencies between replicas and limit data transfer during anti-entropy synchronization.

A Merkle tree (or hash tree) organizes data hashes hierarchically:

Leaves: Contain hashes of individual key values.
Parent nodes: Contain hashes of their children.

A Merkle tree allows nodes to verify subtrees independently. If the root hashes match, the datasets represented by the trees are identical. If the root hashes differ, the nodes recursively compare child hashes to identify the specific key ranges that diverge. This avoids transferring the entire dataset during comparison.

The following slides explain how Merkle trees work:

Anti-entropy with Merkle trees

Each node maintains a Merkle tree for the key ranges it hosts. To synchronize, two nodes exchange the root hashes of their trees:

Compare root hashes: If they match, the data is consistent, and no further action is needed.
Traverse tree: If hashes differ, the nodes recurse down the left and right children.
Synchronize: The nodes identify the specific leaves that differ and synchronize only the missing or inconsistent data.

The following slides illustrate this process.

Note: The ranges defined in the slides are hypothetical for illustration purposes.

Promote membership in the ring to detect failures

Nodes may go offline temporarily, and in some cases may not return. The system should not immediately rebalance partitions or repair replicas when a single node fails, since most outages are transient. Node addition and removal in the consistent hashing ring should be performed only after confirming a sustained failure or a planned membership change.

Planned commissioning and decommissioning of nodes result in membership changes. These changes are recorded as membership history. Each node persists this membership history locally, and the ring reconciles it across nodes using a gossip protocol. A gossip-based protocol also maintains an eventually consistent view of membership. When two nodes randomly choose one another as their peer, both nodes can efficiently synchronize their persisted membership histories.

Let’s learn how a gossip-based protocol works by considering the following example. Say node $\text{A}$ starts up for the first time, and it randomly adds nodes $\text{B}$ and $\text{E}$ to its token set. The token set has virtual nodes in the consistent hash space and maps nodes to their respective token sets. This information is stored locally on the disk space of the node.

Now, node $A$ handles a request that results in a change, so it communicates this to $\text{B}$ and $\text{E}$ . Another node, $\text{D}$ , has $\text{C}$ and $\text{E}$ in its token set. It makes a change and tells $\text{C}$ and $\text{E}$ . The other nodes do the same process. This way, every node eventually knows about every other node’s information. It’s an efficient way to share information asynchronously, and it doesn’t take up a lot of bandwidth.

Test your knowledge!

Can the gossip-based protocol fail in a system using consistent hashing, and if so, why?

No, the gossip-based protocol never fails as long as nodes are part of the consistent hashing ring.

Yes, it can fail if virtual nodes from the same physical server join the ring independently, causing logical partitioning and inconsistent updates.

Yes, it can fail if the hash function used in consistent hashing produces duplicate keys.

Yes, it can fail if the system uses a single physical node for all virtual nodes.

1 / 2

In this decentralized model, nodes explicitly broadcast when they join or leave the ring. If a node fails to communicate with its peers for a predefined time, the peers mark it as dead.

Conclusion

A fault-tolerant key-value store must handle both temporary and permanent failures. The system uses hinted handoff to preserve availability during brief outages and Merkle trees to synchronize replicas after long-term failures. A gossip protocol maintains a decentralized, eventually consistent view of node membership and health, without introducing a single point of failure.