Enabling Fault Tolerance and Failure Detection
Learn how we will make key-value store fault-tolerant and able to detect failure.
Handling temporary failures
Typically, distributed systems use a quorum-based approach to handle failures. A quorum is the minimum number of votes required for a distributed transaction to proceed with an operation. If a server is part of the consensus and it becomes down, then we cannot perform the required operation. It affects the availability and durability of our system.
We will use a “sloppy quorum” instead of n
healthy nodes from the preference list handle all read and write operations. The n
healthy nodes may not always be the first n
nodes discovered when moving clockwise in the consistent hash ring.
Let’s consider the following configuration with n
= 3. If node A
is briefly unavailable or unreachable during a write operation, the request will be sent to the next healthy node from the preference list, which is node D
in this case. It ensures the desired availability and durability. After processing the request, the Node D
will include a hint as to which node was the intended receiver (in this case, A
). Once node A
is up and running again, node D
will send the request information to A
so it can update its data. Upon completion of the transfer, D
will remove this item from its local storage without affecting the total number of replicas in the system.
Create a free account to access the full course.
By signing up, you agree to Educative's Terms of Service and Privacy Policy