Chain Of Failure

Explore how small faults propagate into errors and failures in distributed systems, and how tight coupling makes that propagation more likely. Learn the concepts of faults, errors, and failures, and understand why anticipating every failure event is challenging. Examine strategies to manage fault propagation and prepare systems to remain stable despite inevitable cracks.

Independent events

Underneath every system outage is a chain of events like this. One small issue leads to another, which leads to another. Looking at the entire chain of failure after the fact, the failure seems inevitable. If you tried to estimate the probability of that exact chain of events occurring, it would look incredibly improbable. But it looks improbable only if you consider the probability of each event independently. A coin has no memory, so each toss has the same probability, independent of previous tosses.

The events in a chain of failure are not like coin tosses. A failure in one point or layer actually increases the probability of other failures. If the database becomes slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
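
To make the difference concrete, here is a minimal sketch in Python that estimates the probability of a three-link chain in two ways: once by treating each event as independent, and once by using the probability of each event given that the previous one has already happened. All of the numbers and event names are hypothetical, chosen only for illustration; they are not measurements from any real system.

```python
from math import prod

# Hypothetical numbers, chosen only for illustration.
# Probability of each event in a given hour when viewed in isolation.
marginal = {
    "db_slow": 0.01,
    "app_out_of_memory": 0.01,
    "site_outage": 0.01,
}

# Hypothetical probability of each event once the previous link has occurred.
conditional = {
    "db_slow": 0.01,           # first link: nothing precedes it
    "app_out_of_memory": 0.5,  # given that the database is already slow
    "site_outage": 0.8,        # given that the app servers are out of memory
}


def chain_probability(step_probabilities):
    """Multiply the per-step probabilities of a chain of events."""
    return prod(step_probabilities)


# Multiplying marginals is only valid if the events are independent.
independent_estimate = chain_probability(marginal.values())

# Multiplying conditionals (the chain rule) accounts for the coupling.
coupled_estimate = chain_probability(conditional.values())

print(f"assuming independence:   {independent_estimate:.6f}")
print(f"accounting for coupling: {coupled_estimate:.6f}")
print(f"ratio: about {coupled_estimate / independent_estimate:.0f}x")
```

With these made-up inputs, the estimate that accounts for coupling is thousands of times larger than the one that assumes independence. The exact ratio depends entirely on the assumed numbers; the point is only that multiplying isolated probabilities understates the risk when one failure makes the next more likely.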

Chain of events

Here’s some common terminology we can use to be precise about these chains of events:

Fault

A fault is a condition that creates an incorrect internal state in our software. A fault may be due to a latent bug that gets triggered, or it may be due to an unchecked condition at a boundary or external ...