Failure Detection Strategies

Learn about different failure detection strategies in a distributed environment.

Introduction

In a distributed environment with multiple replicas, failure detection of replicas is a critical component. A failure detector is a component that determines the liveness of a particular node. In other words, it detects whether the node is alive or if it has crashed. Unfortunately, detecting failures in a distributed environment is difficult, as it is impossible to determine whether the replica has crashed or is responding slowly due to network congestion or load.

Strategies

There are multiple strategies to detect failures in replicas:

  • Ping

  • Heartbeat

  • Phi Accrual failure detection

Ping

Ping is a mechanism to send a message to replicas in order to see if they’re still alive. Liveness is indicated by the replica responding with a successful response within a specific time:

  • If the process cannot connect with the replica, it treats the replica as crashed.

  • Similarly, if the replica is slow in responding to the process and can’t respond within the specified time, the process still considers the replica dead.

The ping framework is a boolean decision framework that decides the liveness of a replica based on a single request. However, if the replica is busy processing and the ping request times out, the process still considers the replica dead. Hence, the ping framework relies on the frequency of pings and timeout.

Heartbeat

Heartbeat is similar to the ping framework but is initiated by the replica. In a heartbeat framework, the replica periodically sends messages to the process monitoring the replica’s liveness. If the process does not receive the message from the replica within a given time, the process considers the replica dead.

The downsides to the heartbeat framework are the same as those of the ping framework. Heartbeat is also a boolean decision framework. If the replica is busy processing and doesn’t send a heartbeat request within the specified time, the process considers the replica dead.

Phi Accrual failure detection

While ping and heartbeat frameworks are boolean decision frameworks, Phi Accrual failure detection captures the suspicion level of a replica being dead. Furthermore, the Phi Accrual failure detector predicts the probability of a replica node being alive or dead.

Phi Accrual failure detector maintains a sliding window that stores the arrival time of heartbeats from the replica. It uses this information to predict the approximate arrival time of the next heartbeat from the replica. Once the heartbeat arrives, it compares the arrival time with the approximation it made to compute the suspicion level that indicates the certainty of the detector about the failure.

If the suspicion level reaches a certain threshold, the replica is considered dead. The detector can adjust to changing network conditions by adjusting the suspicion scale.

The detector includes three components:

  • Collector: Collects arrival time information through heartbeats from the replicas.

  • Interpreter: Computes the probability of the arrival of the next heartbeat and uses this information to compute the suspicion level based on the actual arrival time.

  • Action: If the value exceeds the threshold, it marks the replica as dead based on the suspicion level.

Get hands-on with 1200+ tech skills courses.