Cascading Failures

Start of the problem

System failures start with a crack. That crack comes from some fundamental problem. Maybe there’s a latent bug that some environmental factor triggers. Or there could be a memory leak, or some component just gets overloaded. Mechanisms that slow or stop the crack are the topic of the next chapter. Absent those mechanisms, the crack can progress and even be amplified by structural problems.

What is a cascading failure?

A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.

An example of a cascading failure

An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database will experience problems of some kind. What happens next depends on how the caller is written. If the caller handles the failure badly, then the caller will also start to fail, resulting in a cascading failure. Nearly every enterprise or web system looks like a set of services grouped into distinct farms or clusters, arranged in layers. Just as we draw trees upside-down with their roots pointing to the sky, our problems cascade upward through those layers.
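To make "depends on how the caller is written" concrete, here is a minimal Python sketch of two callers sitting one layer above a failing database. Everything in it is hypothetical and exists only for illustration: the `DatabaseDown` exception, the `query_database` stand-in, and both caller functions are assumptions, not part of any real driver or framework.

```python
class DatabaseDown(Exception):
    """Hypothetical error raised when the database cluster is unreachable."""


def query_database(cluster_up: bool) -> list:
    # Stand-in for a real database call; the flag simulates the cluster's state.
    if not cluster_up:
        raise DatabaseDown("database cluster is dark")
    return ["row-1", "row-2"]


def fragile_caller(cluster_up: bool) -> dict:
    # No error handling: a crack in the database layer becomes a crack here,
    # and the exception keeps travelling up to whoever called this function.
    rows = query_database(cluster_up)
    return {"status": "ok", "rows": rows}


def contained_caller(cluster_up: bool) -> dict:
    # Handles the failure and returns a degraded response, so the crack
    # stops at this layer instead of cascading upward.
    try:
        rows = query_database(cluster_up)
    except DatabaseDown:
        return {"status": "degraded", "rows": []}
    return {"status": "ok", "rows": rows}
```

With the cluster down, `fragile_caller(False)` raises `DatabaseDown` into its own caller, while `contained_caller(False)` returns a response marked as degraded. A real caller can fail in subtler ways than an unhandled exception, so treat this only as an illustration of the two outcomes described above.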
