Fail Gracefully

Explore strategies to handle code failures gracefully in production environments. Understand the importance of order of operations, transactions, failure injection, and testing methods like monkey testing to ensure robust and reliable software.

We'll cover the following...

Failure of the code
Order of operations
Transactions
Failure injection
Test monkeys

Failure of the code

What happens when the code fails? Even if we write our part perfectly, there are all kinds of conditions that could cause the overall system to fail. For example:

A rogue mail daemon on a computer, busy sending offers of great wealth from some foreign country, consumes all the RAM and swap. Our next call to malloc() returns ETOOMUCHSPAM.
Java Update 134.001 fills up the system’s hard drive. We call write(), and the system returns ESWITCHTODECAF.
We try to pull data off a tape, but the tape robot is on a ship at sea. Rolling waves cause the robot to drop the tape, and the driver returns EROBOTDIZZY.
Cosmic rays flip a bit in memory, causing a memory access to return 0x10000001 instead of 0x1, and we discover that this makes for a very bad parameter to pass into memcpy() after it returns EMEMTRASHED.

You may think, “Yeah, right”, but all of these cases actually happened. (Yes, I had to fix a tape robot controller because it would drop tapes on a Navy ship.) Our code cannot naively assume that the world around it is sane—the world will take every opportunity to prove it wrong.

How our code fails is just as important as how it works. We may not be able to fix the failure, but if nothing else, our code should strive to fail gracefully.

Order of operations

In many textbook and course programs, the environment is a clean slate, and the program runs to completion. In many messy, non-textbook programs, the environment is a rugby match of threads and resources, all seemingly trying to beat each other into submission.

Let’s consider the following example:

We create a list of customer names and addresses that will be fed to a label printer. Our code gets passed a customer ID and a database connection, so we need to query the database for what we need. We create a linked list whose add() method looks like this:
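A minimal sketch of such an add() method follows. It is an illustration under assumptions, not a definitive listing: the Customer node, the constructor, and the Mutex guarding the list are hypothetical, and the database object is assumed to expose the query(id, field) call used by the test later in this lesson.

```ruby
# Hypothetical list node; the field names are assumptions for illustration.
Customer = Struct.new(:name, :address, :next)

class ShippingQueue
  attr_reader :head

  def initialize(database)
    @database = database
    @mutex = Mutex.new
    @head = nil
  end

  # Naive version: the list is modified before the new node is complete,
  # and the lock is held while the database queries run.
  def add(customer_id)
    @mutex.synchronize do
      old_head = @head
      @head = Customer.new                                    # list changed here
      @head.name = @database.query(customer_id, :name)        # may raise
      @head.address = @database.query(customer_id, :address)  # may raise
      @head.next = old_head                                   # skipped if a query raised
    end
  end
end
```

If either query raises, @head already points at a half-built node whose next was never set, so every customer added earlier is no longer reachable.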

(Yes, I know this example is contrived. Bear with me.)

This code works in the happy path: the new element is put at the head of the list, it gets filled in, and everything is happy. But what if one of those queries to the database raises an exception? Let’s take a look at the code again.

This code doesn’t fail gracefully. In fact, it does collateral damage by allowing a database failure to destroy the customer list. The culprit is the order of operations:

The list’s @head and @head.next are absolutely vital to the list’s integrity. These shouldn’t be changed until everything else is ready.
We should fully construct the new object before inserting it into the list.
We should not hold the lock during operations that could block. Assume there are other threads that want to read the list.
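The three points above can be sketched as a reordered add(). As before, this is a minimal sketch under assumptions: the Customer node, the constructor, and the Mutex are hypothetical names, and the database is assumed to expose query(id, field).

```ruby
# Hypothetical list node; the field names are assumptions for illustration.
Customer = Struct.new(:name, :address, :next)

class ShippingQueue
  attr_reader :head

  def initialize(database)
    @database = database
    @mutex = Mutex.new
    @head = nil
  end

  def add(customer_id)
    # 1. Do the slow, failure-prone work first, without holding the lock.
    #    If a query raises here, the list has not been touched.
    customer = Customer.new
    customer.name = @database.query(customer_id, :name)
    customer.address = @database.query(customer_id, :address)

    # 2. Only now take the lock, and hold it just long enough to link
    #    the fully constructed node in at the head.
    @mutex.synchronize do
      customer.next = @head
      @head = customer
    end
  end
end
```

A database failure now propagates to the caller as before, but the list is left exactly as it was, and readers on other threads are never blocked behind a query.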

Failure injection

So far, we’ve talked about how our code responds to likely failures. For testing purposes, how do we ensure code responds well when an essential resource dies, passes on, is no more, ceases to be, pushes up daisies, and becomes an ex-resource?

The solution is to inject failures using an automated test harness. This is easiest with a mock object framework because we can instruct the mock to return good data several times and then return something bogus or throw an exception. Likewise, in the test, we can assert that the code under test raises the appropriate exception.

Let’s revisit our list update problem. Here’s some test code that simulates a valid database response for key 1 and a failure on the query for key 2:

Ruby

require 'test/unit'
# Mocha 1.0 and later; older versions used: require 'mocha'
require 'mocha/test_unit'
# ShippingQueue, the class under test, is assumed to be loaded already.

class ListUpdateTest < Test::Unit::TestCase
  def test_database_failure
    database = mock()
    database.expects(:query).with(1, :name).returns('Anand')
    database.expects(:query).with(1, :address).returns('')
    # Inject the failure: with no arguments, raises throws a RuntimeError
    database.expects(:query).with(2, :name).raises

    q = ShippingQueue.new(database)
    q.add(1)

    # The injected failure should propagate to the caller
    assert_raise(RuntimeError) do
      q.add(2)
    end

    # List is still fine
    assert_equal 'Anand', q.head.name
    assert_nil q.head.next
  end
end

Test monkeys

In monkey testing, a test monkey feeds random inputs to an application or system to see how it behaves and whether it crashes. We can think through scenarios all day long and build tremendously robust code. Yet even the most fool-proof program can be foiled by a sufficiently talented fool. If we don’t have such a fool handy, the next best thing is a test monkey.

In my first job working on handheld computers, we had a program called Monkey that would inject random taps and drags into the UI layer as if they had come from the touch-screen. It was nothing fancier than that. We’d run Monkey until the system crashed.

Monkey may not have been a talented fool, but a whole bunch of monkeys tapping like mad, 24 hours a day, makes up for lack of talent. The crashes were things we couldn’t have envisioned, and that was the point.

In the same way, can we create a test harness that exercises the program with random (but valid) data? Let it run for thousands or millions of cycles, and there is no telling what might turn up. I used this technique on a recent project and discovered that once in a blue moon, a vendor API function would return “unknown” for the state of a virtual machine. What does that mean, they don’t know the state? I had no idea the function could return that, and the program crashed when it happened. Once again, my test monkey helped me catch an error.
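A test monkey along these lines can be very small. The sketch below is illustrative only: next_action, fake_vendor_state, and the set of states are hypothetical stand-ins for the code under test and the vendor API, not anything from a real library. The monkey feeds random states for many cycles and records every crash together with the input that caused it; a fixed seed makes a run repeatable.

```ruby
# Hypothetical code under test: maps a VM state reported by a vendor API
# to an action. It was written without knowing :unknown could occur.
ACTIONS = { running: :monitor, stopped: :start, paused: :resume }.freeze

def next_action(state)
  ACTIONS.fetch(state) # raises KeyError for a state nobody planned for
end

# Hypothetical stand-in for the vendor API: almost always returns a
# documented state, and roughly once in a thousand calls returns :unknown.
def fake_vendor_state(rng)
  rng.rand(1000).zero? ? :unknown : ACTIONS.keys.sample(random: rng)
end

# The monkey: run many cycles with random input and collect every crash
# instead of stopping at the first one.
def run_monkey(cycles, seed)
  rng = Random.new(seed)
  crashes = []
  cycles.times do |i|
    state = fake_vendor_state(rng)
    begin
      next_action(state)
    rescue StandardError => e
      crashes << { cycle: i, input: state, error: e.class }
    end
  end
  crashes
end

crashes = run_monkey(100_000, 42)
puts "#{crashes.size} crashes, first: #{crashes.first.inspect}"
```

Recording the seed and the failing input matters as much as finding the crash: rerunning with the same seed reproduces the same sequence, which turns a once-in-a-blue-moon failure into something we can debug.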
