Stopping Crack Propagation

Learn the causes of the airline incident failure and some of the measures that could have stopped the crack from propagating.

Failure modes of the airline incident

Let’s see how the design of failure modes applies to the grounded airline from before. The airline’s Core Facilities (CF) project had not planned out its failure modes. The crack started with the improper handling of the SQLException, but it could have been stopped at many other points. Let’s look at some examples, from low-level detail to high-level architecture.

Because the resource pool was configured to block requesting threads when no resources were available, it eventually tied up all request-handling threads. This happened independently in each application server instance.

The pool could have been configured to create more connections if it was exhausted. It also could have been configured to block callers for a limited time, instead of blocking forever when all connections were checked out. Either of these would have stopped the crack from propagating.
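
As a minimal sketch of the second option, a checkout can wait for a limited time and then fail fast. The BoundedPool class, its method names, and the wait time below are hypothetical, chosen only for illustration; they are not the pool the airline used.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical fixed-size pool whose callers block for a bounded time only. */
class BoundedPool<T> {
    private final BlockingQueue<T> idle;
    private final long maxWaitMillis;

    /** resources must be non-empty; maxWaitMillis is the longest a caller may block. */
    BoundedPool(Collection<T> resources, long maxWaitMillis) {
        this.idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Waits at most maxWaitMillis for a free resource, then gives up. */
    T checkOut() throws InterruptedException, TimeoutException {
        T resource = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
        if (resource == null) {
            // Fail fast so the request-handling thread is released
            // instead of waiting forever on an exhausted pool.
            throw new TimeoutException("no resource free within " + maxWaitMillis + " ms");
        }
        return resource;
    }

    /** Returns a resource to the pool so another caller can use it. */
    void checkIn(T resource) {
        idle.offer(resource);
    }
}
```

With a pool like this, an exhausted pool turns into a quick, visible error for a few callers rather than a silent pile-up of blocked request threads.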

Why the callers were blocked

At the next level up, a problem with one call in CF caused the calling applications on other hosts to fail. Because CF exposed its services as Enterprise JavaBeans (EJBs), it used RMI. By default, RMI calls will never time out. In other words, the callers blocked waiting to read their responses from CF’s EJBs. The first 20 callers to each instance received exceptions: an SQLException wrapped in an InvocationTargetException wrapped in a RemoteException, to be precise. After that, the calls started blocking.

Propagation of the crack

The client could have been written to set a timeout on the RMI sockets. For example, it could have installed a socket factory that calls Socket.setSoTimeout() on all new sockets it creates. At some point, CF could also have decided to build HTTP-based web services instead of EJBs. Then the client could have set a timeout on its HTTP requests. The clients might also have written their calls so that blocked threads could be jettisoned, rather than having the request-handling thread itself make the external integration call. None of these were done, so the crack propagated from CF to all systems that used CF.
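
Two of these options can be sketched in a few lines of Java. This is an illustrative sketch under stated assumptions, not the code of the actual clients: the class names TimeoutSocketFactory and BoundedCall and all timeout values are hypothetical. The first class is an RMI socket factory that sets a read timeout on every socket it creates. The second runs the remote call on a worker thread so the request-handling thread can give up after a deadline.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical RMI socket factory that bounds connect and read times. */
class TimeoutSocketFactory extends RMISocketFactory {
    private final int connectTimeoutMillis;
    private final int readTimeoutMillis;

    TimeoutSocketFactory(int connectTimeoutMillis, int readTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
        this.readTimeoutMillis = readTimeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // A blocked read now fails with SocketTimeoutException instead of hanging.
            socket.setSoTimeout(readTimeoutMillis);
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
        } catch (IOException e) {
            socket.close();
            throw e;
        }
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }
}

/** Hypothetical helper that lets the request thread abandon a slow call. */
class BoundedCall {
    static <T> T call(ExecutorService workers, Callable<T> remoteCall, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> pending = workers.submit(remoteCall);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on the worker; the request-handling thread moves on.
            pending.cancel(true);
            throw e;
        }
    }
}
```

A client would install the factory once at startup with RMISocketFactory.setSocketFactory(new TimeoutSocketFactory(2000, 5000)), before making any remote calls. The two techniques complement each other: cancelling the Future only interrupts the worker, and a thread blocked in a plain socket read does not respond to interruption, so the socket timeout is still what eventually frees the worker thread.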

Larger scale solutions to contain cracks

Partitioning of servers

At a still larger scale, the CF servers themselves could have been partitioned into more than one service group.
