Fail fast

If slow responses are worse than no response, the worst must surely be a slow failure response. Can there be any bigger waste of system resources than burning cycles and clock time only to throw away the result? If the system can determine in advance that it will fail at an operation, it’s always better to fail fast. That way, the caller doesn’t have to tie up any of its capacity waiting and can get on with other work. How can the system tell whether it will fail? Do we need Deep Learning? Don’t worry, you won’t need to hire a cadre of data scientists.

What violates the Fail Fast pattern

It’s actually much more mundane than that. There’s a large class of “resource unavailable” failures. For example, when a load balancer receives a connection request but none of the servers in its service pool is functioning, it should immediately refuse the connection. Some configurations have the load balancer queue the connection request for a while in the hope that a server will become available shortly. This violates the Fail Fast pattern. The application or service can tell from the incoming request or message roughly which database connections and external integration points will be needed. The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is the software equivalent of a chef’s mise en place: gathering everything the request will need before starting the work. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.
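A minimal sketch of such a pre-flight check, in Java, might look like the following. The CircuitBreaker and ConnectionChecker interfaces and the ResourceUnavailableException class are hypothetical stand-ins for whatever your pooling and circuit breaker library actually provides; the point is only that every required resource is verified before any real work begins.

    // Sketch of a "mise en place" check that runs before the transaction starts.
    // The interfaces below are illustrative placeholders, not a specific framework.

    import java.util.List;

    interface CircuitBreaker {
        boolean isOpen();      // open = calls to this integration point are currently failing
        String name();
    }

    interface ConnectionChecker {
        boolean canCheckOut(); // true if the pool can supply a connection right now
        String name();
    }

    final class ResourceUnavailableException extends RuntimeException {
        ResourceUnavailableException(String message) { super(message); }
    }

    public class PreflightCheck {
        private final List<ConnectionChecker> requiredPools;
        private final List<CircuitBreaker> requiredIntegrations;

        public PreflightCheck(List<ConnectionChecker> requiredPools,
                              List<CircuitBreaker> requiredIntegrations) {
            this.requiredPools = requiredPools;
            this.requiredIntegrations = requiredIntegrations;
        }

        /** Fail fast: refuse the request before any real work starts. */
        public void verifyOrFail() {
            for (ConnectionChecker pool : requiredPools) {
                if (!pool.canCheckOut()) {
                    throw new ResourceUnavailableException(
                            "Connection pool exhausted: " + pool.name());
                }
            }
            for (CircuitBreaker breaker : requiredIntegrations) {
                if (breaker.isOpen()) {
                    throw new ResourceUnavailableException(
                            "Circuit breaker open for: " + breaker.name());
                }
            }
            // Only now is it worth starting the transaction.
        }
    }

A request handler would call verifyOrFail() as its very first step, so a missing resource is reported immediately instead of after the transaction is already partway done.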

Another way to fail fast in a web application is to perform basic parameter checking in the servlet or controller that receives the request, before talking to the database. This is a good reason to move some parameter checking out of domain objects and into something like a Query object.
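As a rough illustration, those basic checks can live in a small query object that the controller constructs before it touches the database. The names below (SearchQuery, InvalidRequestException) are made up for this sketch, not taken from any particular framework.

    // Sketch: validate cheap, request-level parameters before any resource is reserved.

    final class InvalidRequestException extends RuntimeException {
        InvalidRequestException(String message) { super(message); }
    }

    final class SearchQuery {
        private final String customerId;
        private final int pageSize;

        SearchQuery(String customerId, int pageSize) {
            // Cheap checks first: no database connection, no domain objects.
            if (customerId == null || customerId.isBlank()) {
                throw new InvalidRequestException("customerId is required");
            }
            if (pageSize < 1 || pageSize > 500) {
                throw new InvalidRequestException("pageSize must be between 1 and 500");
            }
            this.customerId = customerId;
            this.pageSize = pageSize;
        }

        String customerId() { return customerId; }
        int pageSize() { return pageSize; }
    }

Because construction fails on bad input, the controller can return a “bad request” response without ever checking out a connection or loading domain objects.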

Case study: Trouble with the renderer

One of my more interesting projects was for a studio photography company. Part of the project involved working on the software that rendered images for high-resolution printing. The previous generation of this software had a problem that generated more work for humans downstream: if color profiles, images, backgrounds, or alpha masks weren’t available, it “rendered” a black image full of zero-valued pixels. This black image went into the printing pipeline and was printed, wasting paper, chemicals, and time. Quality checkers would pull the black image and send it back to the people at the beginning of the process for diagnosis, debugging, and correction.

Ultimately, they would fix the problem (usually by calling developers to the printing facility) and remake the bad print. Since the order was already late getting out the door, they would expedite the remake, meaning it interrupted the pipeline of work and went to the head of the line. When my team started on the rendering software, we applied the Fail Fast pattern. As soon as the print job arrived, the renderer checked for the presence of every font (missing fonts caused a similar remake, but not because of black images), image, background, and alpha mask. It preallocated memory, so it couldn’t fail an allocation later. The renderer reported any such failure to the job control system immediately, before it wasted several minutes of compute time. Best of all, broken orders would be pulled from the pipeline, avoiding the case of having partial orders waiting at the end of the process.

Once we launched the new renderer, the software-induced remake rate dropped to zero. Orders could still be remade because of other quality problems such as dust in the camera, poor exposure, or bad cropping, but at least our software wasn’t the cause. The only thing we didn’t preallocate was disk space for the final image. We violated the Steady State pattern under the direction of the customer, who indicated that he had his own rock-solid purging process. Turns out the purging process was one guy who occasionally deleted a bunch of files by hand. Less than one year after we launched, the drives filled up. Sure enough, the one place we broke the Fail Fast principle was the one place our renderer failed to report errors before wasting effort. It would render the images, spending several minutes of compute time, and only then throw an exception.

Fail fast benefits

Even when failing fast, be sure to report a system failure (resources not available) differently from an application failure (parameter violations or invalid state). Reporting a generic error message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit “Reload” three or four times. The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, failing fast can help avert impending cascading failures. It also helps maintain capacity when the system is under stress because of partial failures.
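One way to keep the two failure families distinct is to give them separate exception types and map them to different responses. The sketch below assumes an HTTP-facing service; the class names and status codes are illustrative, not prescribed by the pattern.

    // Sketch: separate exception families so callers can tell "the system is in
    // trouble" (retry later, may trip a circuit breaker) from "your input is wrong"
    // (retrying will not help). Names and status codes are illustrative assumptions.

    abstract class SystemFailure extends RuntimeException {        // resources not available
        SystemFailure(String message) { super(message); }
    }

    abstract class ApplicationFailure extends RuntimeException {   // parameter violations, invalid state
        ApplicationFailure(String message) { super(message); }
    }

    final class PoolExhaustedFailure extends SystemFailure {
        PoolExhaustedFailure(String poolName) { super("Connection pool exhausted: " + poolName); }
    }

    final class MissingParameterFailure extends ApplicationFailure {
        MissingParameterFailure(String param) { super("Missing required parameter: " + param); }
    }

    final class FailureMapper {
        /** Map a failure to an HTTP status that tells the caller how to react. */
        static int toHttpStatus(RuntimeException failure) {
            if (failure instanceof SystemFailure) {
                return 503;   // Service Unavailable: transient, safe to retry later
            }
            if (failure instanceof ApplicationFailure) {
                return 400;   // Bad Request: fix the input; retrying the same call won't help
            }
            return 500;       // Unknown failure: treat conservatively
        }
    }

With a split like this, an upstream circuit breaker can count 503s against its failure threshold while ignoring 400s, so a user mashing “Reload” with bad input never trips it.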

Tips to remember

Avoid slow responses and fail fast

If the system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes the problem their problem.

Reserve resources and verify integration points early

In the theme of “don’t do useless work,” make sure a transaction can be completed before starting it. If critical resources aren’t available (for example, a popped Circuit Breaker on a required callout), don’t waste work by getting to that point. The odds of the situation changing between the beginning and the middle of the transaction are slim.

Use fail fast for input validation

Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate() just to find out that a required parameter wasn’t entered.
