Enable Fault Tolerance and Failure Detection
Understand how distributed key-value stores maintain availability and durability during node failures. Design sloppy quorum and hinted handoff mechanisms to handle temporary node outages. Apply Merkle trees for anti-entropy synchronization and gossip protocols for decentralized failure detection.
Handle temporary failures
Many distributed systems use a strict read/write quorum, where an operation must receive responses from a minimum number of replicas before it can proceed. If enough replicas are unavailable and the quorum cannot be satisfied, the operation fails, reducing availability. To maintain availability during such failures, the system can use a sloppy quorum.
In a sloppy quorum, the first n healthy nodes on the consistent hashing ring handle the read and write operations, rather than strictly the first n nodes in the key's preference list. If a replica node is unavailable, the next healthy node accepts the write on its behalf, along with a hint recording which node the data was intended for.
Example: Consider a configuration where a write destined for node A arrives while A is temporarily down. The coordinator sends the write to the next healthy node, say D, together with a hint that the data belongs to A. When A comes back online, D transfers the hinted data to A and can then delete its local copy.
This approach is called hinted handoff. It ensures that reads and writes are fulfilled even if a node faces a temporary failure.
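The hinted handoff mechanism can be sketched as follows. This is a minimal illustration, not a real store's implementation; the class, method names, and node names are hypothetical:

```python
from collections import defaultdict

class HintedHandoffNode:
    """Sketch of a node that accepts writes on behalf of a failed
    peer and hands them off when the peer recovers (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.store = {}                   # key -> value this node owns
        self.hints = defaultdict(list)    # intended node -> [(key, value)]

    def write(self, key, value, intended_node=None):
        if intended_node is None:
            self.store[key] = value       # normal write to own store
        else:
            # Sloppy quorum: accept the write plus a hint naming the
            # node that should ultimately hold this replica.
            self.hints[intended_node].append((key, value))

    def handoff(self, recovered_node):
        """Deliver hinted writes once the intended node is back."""
        for key, value in self.hints.pop(recovered_node.name, []):
            recovered_node.write(key, value)

# Node A is down, so D accepts the write with a hint for A.
a, d = HintedHandoffNode("A"), HintedHandoffNode("D")
d.write("user:42", "alice", intended_node="A")
d.handoff(a)   # A recovers; D forwards the hinted write
```

After the handoff, the key lives on its intended node and D no longer carries the hint, so availability was preserved without permanently misplacing the replica.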
Note: To handle catastrophic data center failures (e.g., power outages or natural disasters), you must ensure replication extends across distinct data centers.
Handle permanent failures
When a node fails and is replaced or recovers after a long outage, replicas must synchronize to restore full durability. Merkle trees are used to detect inconsistencies between replicas and limit data transfer during anti-entropy synchronization.
A Merkle tree (or hash tree) organizes data hashes hierarchically:
Leaves: Contain hashes of individual key values.
Parent nodes: Contain hashes of their children.
A Merkle tree allows nodes to verify subtrees independently. If the root hashes match, the datasets represented by the trees are identical. If the root hashes differ, the nodes recursively compare child hashes to identify the specific key ranges that diverge. This avoids transferring the entire dataset during comparison.
Anti-entropy with Merkle trees
Each node maintains a Merkle tree for the key ranges it hosts. To synchronize, two nodes exchange the root hashes of their trees:
Compare root hashes: If they match, the data is consistent, and no further action is needed.
Traverse tree: If hashes differ, the nodes recurse down the left and right children.
Synchronize: The nodes identify the specific leaves that differ and synchronize only the missing or inconsistent data.
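The steps above can be sketched in Python. The tree construction and diff routine below are a simplified illustration (the helper names are ours, and the example assumes a power-of-two leaf count for clean pairing):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaf_hashes):
    """Build a Merkle tree bottom-up; each level hashes adjacent pairs.
    Returns all levels, with the root at levels[-1][0]."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:                 # duplicate last hash if odd
            prev = prev + [prev[-1]]
        levels.append([h(prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(levels_a, levels_b):
    """Recursively compare two same-shaped trees; return indices of
    leaves whose hashes differ (only those need synchronization)."""
    def recurse(level, idx):
        if levels_a[level][idx] == levels_b[level][idx]:
            return []                     # subtrees identical: stop here
        if level == 0:
            return [idx]                  # divergent leaf found
        left = recurse(level - 1, 2 * idx)
        right_idx = 2 * idx + 1
        right = (recurse(level - 1, right_idx)
                 if right_idx < len(levels_a[level - 1]) else [])
        return left + right
    return recurse(len(levels_a) - 1, 0)

# Two replicas of the same key range, with one stale value on B.
keys = ["k0", "k1", "k2", "k3"]
replica_a = {"k0": b"v0", "k1": b"v1", "k2": b"v2", "k3": b"v3"}
replica_b = dict(replica_a, k2=b"stale")

tree_a = build_merkle([h(replica_a[k]) for k in keys])
tree_b = build_merkle([h(replica_b[k]) for k in keys])

stale = []
if tree_a[-1][0] != tree_b[-1][0]:        # root hashes differ: recurse
    stale = [keys[i] for i in diff_leaves(tree_a, tree_b)]
```

Only the keys in `stale` are transferred, so the comparison cost is proportional to the amount of divergence rather than to the size of the key range.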
Advantages: Merkle trees minimize disk I/O and network bandwidth because only inconsistent data is transferred.
Disadvantages: When a node joins or leaves the system, key ranges change, requiring the trees for the affected ranges to be recalculated, which can be expensive.
Next, we need a mechanism to detect when nodes fail so the system can react.
Promote membership in the ring to detect failures
Nodes may go offline temporarily, and in some cases may not return. The system should not immediately rebalance partitions or repair replicas when a single node fails, since most outages are transient. Node addition and removal in the consistent hashing ring should be performed only after confirming a sustained failure or a planned membership change.
Planned commissioning and decommissioning of nodes result in membership changes, which each node records and persists locally as a membership history. A gossip-based protocol reconciles these histories across the ring and maintains an eventually consistent view of membership: in each round, a node randomly selects a peer, and the two nodes efficiently synchronize their persisted membership histories.
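Membership reconciliation by gossip can be sketched as below. This is an illustrative model, assuming each membership entry carries a version number so that the higher version wins during a merge; the class and node names are hypothetical:

```python
class GossipNode:
    """Sketch of gossip-based membership reconciliation. Each node
    keeps a membership history as {node_name: (version, status)};
    newer versions win when two histories are merged (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.membership = {name: (1, "alive")}

    def record(self, node, version, status):
        # Keep an entry only if it is newer than what we already have.
        current_version, _ = self.membership.get(node, (0, None))
        if version > current_version:
            self.membership[node] = (version, status)

    def gossip_with(self, peer):
        """One gossip round: both nodes exchange histories and keep
        the newest entry for every known node."""
        for node, (version, status) in list(peer.membership.items()):
            self.record(node, version, status)
        for node, (version, status) in list(self.membership.items()):
            peer.record(node, version, status)

a, b, c = GossipNode("A"), GossipNode("B"), GossipNode("C")
a.record("X", 5, "joined")       # A learns that node X joined the ring
a.gossip_with(b)                 # A and B reconcile their histories
b.gossip_with(c)                 # the update keeps spreading
```

After two rounds, node C knows about X even though it never spoke to A directly, which is how the update eventually reaches every node in the ring.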
Let’s learn how a gossip-based protocol works by considering the following example. Say node A records a membership change, for example, that a new node has joined the ring. During the next gossip round, A randomly selects node B as its peer and sends it a summary of its membership history.
Now, node B compares that summary against its own history, pulls the entries it is missing, and pushes back any entries that A lacks. After the exchange, both nodes share the same membership view, and the change continues to propagate through subsequent gossip rounds.
Test your knowledge!
Can the gossip-based protocol fail in a system using consistent hashing, and if so, why?
No, the gossip-based protocol never fails as long as nodes are part of the consistent hashing ring.
Yes, it can fail if virtual nodes from the same physical server join the ring independently, causing logical partitioning and inconsistent updates.
Yes, it can fail if the hash function used in consistent hashing produces duplicate keys.
Yes, it can fail if the system uses a single physical node for all virtual nodes.
In this decentralized model, nodes explicitly broadcast when they join or leave the ring. If a node fails to communicate with its peers for a predefined time, the peers mark it as dead.
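A simple timeout-based detector along these lines might look as follows. This is a sketch under stated assumptions: the threshold value, class, and method names are ours, and real systems often use more sophisticated detectors (e.g., adaptive, accrual-style thresholds):

```python
import time

class FailureDetector:
    """Sketch of timeout-based failure detection: peers that have not
    sent a heartbeat within `timeout` seconds are marked dead.
    The threshold and API are illustrative assumptions."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_heard = {}      # peer -> timestamp of last heartbeat

    def heartbeat(self, peer, now=None):
        """Record that we heard from a peer (e.g., via a gossip round)."""
        self.last_heard[peer] = time.monotonic() if now is None else now

    def dead_peers(self, now=None):
        """Return peers whose last heartbeat exceeded the timeout."""
        now = time.monotonic() if now is None else now
        return [p for p, t in self.last_heard.items()
                if now - t > self.timeout]

fd = FailureDetector(timeout=10.0)
fd.heartbeat("B", now=0.0)
fd.heartbeat("C", now=0.0)
fd.heartbeat("B", now=8.0)        # B keeps checking in; C goes silent
```

Because every node runs its own detector over the peers it gossips with, failure detection stays decentralized and introduces no single point of failure.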
Conclusion
A fault-tolerant key-value store must handle both temporary and permanent failures. The system uses hinted handoff to preserve availability during brief outages and Merkle trees to synchronize replicas after long-term failures. A gossip protocol maintains a decentralized, eventually consistent view of node membership and health, without introducing a single point of failure.