Detecting Failures [merged with fault tolerance]

Learn how to make our proposed key-value store detect failures.

We want our nodes to detect failures, so let’s see how we can add failure detection to our proposed design.

Promoting membership in ring to detect failures

Nodes can go offline for short periods, but they may also go offline indefinitely. A single node going down should not, by itself, trigger rebalancing of partition assignments or repair of unreachable replicas. Hence, the addition and removal of nodes from the ring should be done carefully.

Planned commissioning and decommissioning of nodes result in membership changes. These changes form a history that each node records persistently on its own storage and that the ring members reconcile using a gossip protocol. A gossip-based protocol also maintains an eventually consistent view of membership. When two nodes randomly choose one another as peers, both can efficiently synchronize their persisted membership histories.
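To make the reconciliation step concrete, here is a minimal Python sketch of two peers synchronizing their membership histories. The lesson does not specify how conflicts are resolved, so everything here is an assumption made for illustration: the `MembershipHistory` class, the per-node version counter, and the rule that the entry with the higher version wins.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipHistory:
    """Hypothetical per-node record of membership changes.

    Maps node_id -> (version, status), where status is "joined" or "left".
    The version counter is an assumption used to decide which entry is newer.
    """
    entries: dict = field(default_factory=dict)

    def record(self, node_id, version, status):
        # Keep the entry only if it is newer than what we already hold.
        current = self.entries.get(node_id)
        if current is None or version > current[0]:
            self.entries[node_id] = (version, status)

    def merge(self, other):
        # Fold every entry of the peer's history into our own.
        for node_id, (version, status) in other.entries.items():
            self.record(node_id, version, status)


def reconcile(a, b):
    """Synchronize two peers so both end up with the same history."""
    a.merge(b)  # a now holds the newest entry for every known node
    b.merge(a)  # b copies anything it was missing from a


# Node A has seen B join; node D has seen B leave later and C join.
history_a = MembershipHistory()
history_a.record("A", 1, "joined")
history_a.record("B", 1, "joined")

history_d = MembershipHistory()
history_d.record("B", 2, "left")
history_d.record("C", 1, "joined")

reconcile(history_a, history_d)
print(history_a.entries == history_d.entries)
```

After the exchange, both peers hold the same view, including the newer record that B has left the ring. A real system would also persist each change to disk before acknowledging it, which this sketch omits.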

Let’s understand how a gossip-based protocol works by considering the following example. Say node A starts up for the first time and adds nodes B and E to its token set. The token set holds virtual nodes in the consistent hash space and maps nodes to their respective token sets. This information is stored locally on the node’s disk. Now node A handles a request that results in a change, so it communicates the change to B and E. Another node, D, has C and E in its token set. It makes a change and tells C and E. The other nodes do the same. This way, every node eventually learns about every other node. Gossip is an efficient way to share information asynchronously, and it does not take up much bandwidth either.
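The example above can be simulated with a short Python sketch. It is an illustration under stated assumptions, not the protocol used by any particular system: each node's view is modeled as a plain set of node names, one peer is chosen at random per round, and the two sides swap their full views. The function names `gossip_round` and `run_gossip`, the fixed random seed, and the round limit are all hypothetical.

```python
import random


def gossip_round(views, rng):
    """Each node picks one known peer at random and the two swap views."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue  # this node does not know anyone else yet
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        # Both sides leave the exchange with the combined view.
        views[node] = set(merged)
        views[peer] = set(merged)


def run_gossip(views, seed=0, max_rounds=1000):
    """Gossip until every node knows every other node.

    Returns the number of rounds taken, or None if max_rounds is reached.
    """
    rng = random.Random(seed)
    everyone = set(views)
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no
    return None


# Initial state from the example: A knows B and E, D knows C and E,
# and the remaining nodes only know themselves.
views = {
    "A": {"A", "B", "E"},
    "B": {"B"},
    "C": {"C"},
    "D": {"D", "C", "E"},
    "E": {"E"},
}

rounds = run_gossip(views)
print("converged after rounds:", rounds)
```

Node E acts as the bridge here: once it has talked to both A and D, knowledge of the whole ring spreads to everyone else. Each node contacts only one peer per round, which is why the bandwidth cost stays low even as the ring grows.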
