Distributed Systems: Building Software for the Real World


Stopping Crack Propagation

Learn what caused the airline incident and some of the solutions that could have stopped the crack from propagating.

We'll cover the following:

Failure modes of the airline incident
Why the callers were blocked
Propagation of crack
Larger scale solutions to contain cracks

Partitioning of servers
Request/Reply architecture

Failure modes of the airline incident

Let’s see how designing for failure modes applies to the grounded airline from before. The airline’s Core Facilities project had not planned out its failure modes. The crack started with the improper handling of the SQLException, but it could have been stopped at many other points. Let’s look at some examples, from low-level detail to high-level architecture. Because the pool was configured to block requesting threads when no resources were available, it eventually tied up all request-handling threads. This happened independently in each application server instance.
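To make the mechanism concrete, here is a minimal sketch in Python, used only for brevity. `BlockingPool`, `leaky_query`, `safe_query`, and `PoolExhausted` are hypothetical names, not the airline's code. An exception that escapes before the connection is returned leaks that connection; once every connection has leaked, a `checkout()` with no timeout blocks each caller indefinitely, which is how the request-handling threads were tied up.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no connection frees up within the caller's wait timeout."""


class BlockingPool:
    """Hypothetical fixed-size pool; checkout() blocks while the pool is empty."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def checkout(self, timeout=None):
        # timeout=None waits forever: the calling thread is stuck until some
        # other thread returns a connection, which never happens after a leak.
        try:
            return self._free.get(block=True, timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("no connection available") from None

    def checkin(self, conn):
        self._free.put(conn)

    def available(self):
        return self._free.qsize()


def leaky_query(pool, fail):
    """Returns the connection only on the happy path."""
    conn = pool.checkout()
    if fail:
        # Stand-in for the mishandled SQLException: it escapes before
        # checkin(), so the connection is never returned to the pool.
        raise RuntimeError("simulated SQLException")
    pool.checkin(conn)


def safe_query(pool, fail, timeout=1.0):
    """Always returns the connection, even when the work raises."""
    conn = pool.checkout(timeout=timeout)
    try:
        if fail:
            raise RuntimeError("simulated SQLException")
    finally:
        pool.checkin(conn)


if __name__ == "__main__":
    pool = BlockingPool(size=2)
    for _ in range(2):
        try:
            leaky_query(pool, fail=True)
        except RuntimeError:
            pass
    print("free connections:", pool.available())  # 0 after two leaks
    try:
        # A short timeout is used here only so the demo does not hang;
        # a third leaky_query() call would block forever at this point.
        pool.checkout(timeout=0.1)
    except PoolExhausted:
        print("pool exhausted")
```

The difference that matters is in `safe_query`: the `finally` clause returns the connection on every path, so an exception fails one request instead of draining the pool for everyone.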

The pool could have been configured to create more connections if it was exhausted. It also could have ...
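The option of growing the pool when it is exhausted can be sketched the same way. `GrowablePool`, its `initial_size` and `max_size` parameters, and the `connect` factory are illustrative assumptions, not any real library's API. When no free connection is waiting, the pool opens a new one, up to a cap; only at the cap does the caller wait, and in this sketch that wait is bounded.

```python
import itertools
import queue
import threading


class PoolExhausted(Exception):
    """Raised when the pool is at max_size and nothing frees up in time."""


class GrowablePool:
    """Hypothetical pool that opens extra connections instead of blocking."""

    def __init__(self, initial_size, max_size, connect):
        if not 0 <= initial_size <= max_size:
            raise ValueError("need 0 <= initial_size <= max_size")
        self._connect = connect  # factory that opens one new connection
        self._max_size = max_size
        self._created = 0
        self._lock = threading.Lock()
        self._free = queue.Queue()
        for _ in range(initial_size):
            self._free.put(self._new_connection())

    def _new_connection(self):
        conn = self._connect()  # count it only if opening succeeded
        self._created += 1
        return conn

    def checkout(self, timeout=1.0):
        # 1. Reuse a free connection if one is waiting.
        try:
            return self._free.get_nowait()
        except queue.Empty:
            pass
        # 2. Pool is exhausted: grow it while still under the cap.
        #    (The lock also serializes connect() calls; fine for a sketch.)
        with self._lock:
            if self._created < self._max_size:
                return self._new_connection()
        # 3. At the cap: wait a bounded time rather than forever.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted(f"all {self._max_size} connections in use") from None

    def checkin(self, conn):
        self._free.put(conn)

    @property
    def created(self):
        return self._created


if __name__ == "__main__":
    ids = itertools.count()
    pool = GrowablePool(initial_size=1, max_size=3, connect=lambda: f"conn-{next(ids)}")
    held = [pool.checkout() for _ in range(3)]
    print(held, pool.created)  # ['conn-0', 'conn-1', 'conn-2'] 3
    try:
        pool.checkout(timeout=0.1)
    except PoolExhausted as exc:
        print("exhausted:", exc)
```

One caveat for this design: if connections are leaking, growth only postpones exhaustion until `max_size` (or the database's own connection limit) is reached, so it buys time rather than sealing the crack.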
