Fail Gracefully

Understand how to fail gracefully while coding.

Failure of the code

What happens when the code fails? Even if we write our part perfectly, there are all kinds of conditions that could cause the overall system to fail. For example:

  • A rogue mail daemon on a computer, busy sending offers of great wealth from some foreign country, consumes all the RAM and swap. Our next call to malloc() returns ETOOMUCHSPAM.

  • Java Update 134.001 fills up the system’s hard drive. We call write(), and the system returns ESWITCHTODECAF.

  • We try to pull data off a tape, but the tape robot is on a ship at sea. Rolling waves cause the robot to drop the tape, and the driver returns EROBOTDIZZY.

  • Cosmic rays flip a bit in memory, causing a memory access to return 0x10000001 instead of 0x1, and we discover that this makes for a very bad parameter to pass into memcpy() after it returns EMEMTRASHED.

You may think, “Yeah, right”, but all of these cases actually happened. (Yes, I had to fix a tape robot controller because it would drop tapes on a Navy ship.) Our code cannot naively assume that the world around it is sane—the world will take every opportunity to prove it wrong.

How our code fails is just as important as how it works. We may not be able to fix the failure, but if nothing else, our code should strive to fail gracefully.

Order of operations

In many textbook and course programs, the environment is a clean slate, and the program runs to completion. In many messy, non-textbook programs, the environment is a rugby match of threads and resources, all seemingly trying to beat each other into submission.

Let’s consider the following example:

We create a list of customer names and addresses that will be fed to a label printer. Our code gets passed a customer ID and a database connection, so we need to query the database for what we need. We create a linked list whose add() method looks like this:

def add(customer_id) # BAD BAD BAD, see text
  begin
    @mutex.lock
    old_head = @head
    @head = Customer.new
    @head.name = @database.query(customer_id, :name)
    @head.address = @database.query(customer_id, :address)
    @head.next = old_head
  ensure
    @mutex.unlock
  end
end

(Yes, I know this example is contrived. Bear with me.)

This code works in the happy path: the new element is put at the head of the list, it gets filled in, and everything is happy. But what if one of those queries to the database raises an exception? Look at the code again with that in mind.

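This code doesn’t fail gracefully. In fact, it does collateral damage by allowing a database failure to destroy the customer list: by the time a query raises, @head already points at a half-built Customer whose next link was never set, so every element that used to follow it is unreachable. The culprit is the order of operations: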

  • The list’s @head and @head.next links are absolutely vital to the list’s integrity. These shouldn’t be changed until everything else is ready.
  • We should fully construct the new object before inserting it into the list.
  • We should not hold the lock during operations that could block. Assume there are other threads that want to read the list.
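
Applying those three points, a gentler add() might look like the sketch below. The original only shows the add() method, so the rest is assumed for illustration: the ShippingQueue class name is borrowed from the test later in this section, Customer is a plain Struct, and FakeDatabase is a hypothetical stand-in that raises for any ID it doesn’t know.

```ruby
# Hypothetical record type; the text only mentions name, address, and next.
Customer = Struct.new(:name, :address, :next)

class ShippingQueue
  attr_reader :head

  def initialize(database)
    @database = database
    @mutex = Mutex.new
    @head = nil
  end

  def add(customer_id)
    # 1. Do the slow, failure-prone work first, with no lock held.
    #    If either query raises, the list has not been touched.
    customer = Customer.new
    customer.name = @database.query(customer_id, :name)
    customer.address = @database.query(customer_id, :address)

    # 2. Only now take the lock, and hold it just long enough to
    #    link the finished object in.
    @mutex.synchronize do
      customer.next = @head
      @head = customer
    end
  end
end

# Stand-in database for illustration: raises KeyError for unknown IDs.
class FakeDatabase
  ROWS = { 1 => { name: 'Anand', address: '12 Example Street' } }.freeze

  def query(customer_id, field)
    ROWS.fetch(customer_id).fetch(field)
  end
end

queue = ShippingQueue.new(FakeDatabase.new)
queue.add(1)
begin
  queue.add(2) # the name query raises before the list is touched
rescue KeyError
  # The failure still reaches the caller, but the list survives it.
end
```

The exception is not swallowed; the caller still has to decide what to do about the missing customer. The difference is that the list it comes back to is the same list it had before the call.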

Transactions

In the above section, the example had only one essential bit of state that needed to stay consistent. What about cases where there’s more than one? Let’s consider the classic example of moving money between two bank accounts:

savings.deduct(100)
checking.deposit(100)

What happens if the database fails right after the money has been deducted, so that the deposit into checking fails? Where did the money go? Perhaps we will try to solve that case by putting it back into the savings account:

savings.deduct(100) # Happily works
begin
  checking.deposit(100) # Fails: database went down!
rescue
  begin
    # Put money back
    savings.deposit(100) # Fails: database still dead
  rescue
    # Now what???
  end
end

But that doesn’t help if the second deposit() fails, too.

The tool we need here is a transaction. Its purpose is to allow several operations, potentially to several objects, to be either fulfilled completely or rolled back.

Transactions, in this case, would allow our previous example to look like this:

t = Transaction.new(savings, checking)
t.start
# Inject failure
checking.expects(:deposit).with(100).raises
begin
  savings.deduct(100)
  checking.deposit(100)
  t.commit
rescue
  t.rollback
end

We’ll usually find transactions in databases because our example scenario is exceedingly common in that field. We may find variations on this theme in various places where systems require an all-or-nothing interlock.
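
To make the all-or-nothing idea concrete, here is a toy, in-memory sketch. Both classes are assumptions made for illustration: Account is a hypothetical stand-in for the savings and checking objects, and this Transaction simply remembers each balance at start and puts it back on rollback. A real database transaction does far more (it has to survive crashes, for one), so treat this as a picture of the interface rather than an implementation to rely on.

```ruby
# Hypothetical in-memory account; a real one would be backed by a database.
class Account
  attr_accessor :balance

  def initialize(balance)
    @balance = balance
  end

  def deduct(amount)
    raise ArgumentError, 'insufficient funds' if amount > @balance
    @balance -= amount
  end

  def deposit(amount)
    @balance += amount
  end
end

# Hypothetical Transaction: snapshots every account's balance at start
# and restores the snapshot on rollback.
class Transaction
  def initialize(*accounts)
    @accounts = accounts
    @snapshot = nil
  end

  def start
    @snapshot = @accounts.map(&:balance)
  end

  def commit
    @snapshot = nil # changes stand; nothing left to undo
  end

  def rollback
    return if @snapshot.nil?
    @accounts.zip(@snapshot) { |account, balance| account.balance = balance }
    @snapshot = nil
  end
end

savings = Account.new(500)
checking = Account.new(50)

# Simulate the database dying during the deposit into checking.
checking.define_singleton_method(:deposit) do |_amount|
  raise IOError, 'database went down'
end

t = Transaction.new(savings, checking)
t.start
begin
  savings.deduct(100)
  checking.deposit(100)
  t.commit
rescue IOError
  t.rollback # savings gets its 100 back; checking was never changed
end
```

The rollback here cannot fail because it only touches memory. That is exactly the guarantee we lean on a database to provide when the state lives somewhere less forgiving.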

Failure injection

So far, we’ve talked about how our code responds to likely failures. For testing purposes, how do we ensure code responds well when an essential resource dies, passes on, is no more, ceases to be, pushes up daisies, and becomes an ex-resource?

The solution is to inject failures using an automated test harness. This is easiest with a mock object framework because we can instruct the mock to return good data several times and then return something bogus or throw an exception. Likewise, in the test itself, we assert that the code under test raises the appropriate exception.
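
If no mock framework is at hand, the same idea can be sketched by hand. FlakyDatabase below is a hypothetical test double, not part of any library: it answers normally for a fixed number of calls and then raises on every call after that. The choice of IOError is also an assumption; use whatever exception the real resource would raise.

```ruby
# Hypothetical test double: answers normally a fixed number of times,
# then raises on every call after that.
class FlakyDatabase
  def initialize(rows, good_calls)
    @rows = rows
    @remaining = good_calls
  end

  def query(customer_id, field)
    raise IOError, 'injected failure: database unavailable' if @remaining <= 0

    @remaining -= 1
    @rows.fetch(customer_id).fetch(field)
  end
end

# Allow exactly one good call, then start failing.
database = FlakyDatabase.new({ 1 => { name: 'Anand', address: '' } }, 1)

database.query(1, :name) # first call succeeds

begin
  database.query(1, :address) # second call hits the injected failure
rescue IOError => e
  puts "caught: #{e.message}"
end
```

Varying the number of good calls lets a test place the failure at each step of an operation in turn, which is a cheap way to find the step where the order of operations goes wrong.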

Let’s revisit our list update problem. Here’s some test code that simulates a valid database response for key 1 and a failure on the query for key 2:

require 'rubygems'
require 'test/unit'
require 'mocha/test_unit' # older Mocha releases used: require 'mocha'

class ListUpdateTest < Test::Unit::TestCase
  def test_database_failure
    database = mock()
    database.expects(:query).with(1, :name).returns('Anand')
    database.expects(:query).with(1, :address).returns('')
    database.expects(:query).with(2, :name).raises # (1)
    q = ShippingQueue.new(database)
    q.add(1)
    assert_raise(RuntimeError) do # (2)
      q.add(2)
    end
    # List is still fine
    assert_equal 'Anand', q.head.name # (3)
    assert_nil q.head.next
  end
end

  1. The raises expectation injects a RuntimeError on the query for key 2.
  2. The call to q.add(2) raises, and assert_raise expects it (and traps the exception).
  3. The last two assertions verify that the list is still intact, as if q.add(2) had never been called.

Failure injection of this sort allows us to think through and verify each potential scenario of doom. We can test in this manner just as often as we test the happy path.

Test monkeys

Monkey testing, which relies on what are sometimes called test monkeys, feeds random inputs to an application or system to check its behavior and see whether it crashes. We can think through scenarios all day long and build tremendously robust code, yet even the most foolproof program can be foiled by a sufficiently talented fool. If we don’t have such a fool handy, the next best thing is a test monkey.

In my first job working on handheld computers, we had a program called Monkey that would inject random taps and drags into the UI layer as if they had come from the touch-screen. It was nothing fancier than that. We’d run Monkey until the system crashed.

Monkey may not have been a talented fool, but a whole bunch of monkeys tapping like mad, 24 hours a day, makes up for lack of talent. The crashes were things we couldn’t have envisioned, and that was the point.

In the same way, can we create a test harness that tries to use the program with random (but valid) data? Let it run for thousands or millions of cycles; we never know what might turn up. I used this technique on a recent project and discovered that once in a blue moon, a vendor API function would return “unknown” for the state of a virtual machine. What does that mean, they don’t know the state? I had no idea the function could return that, so the program crashed when it happened. Once again, my test monkey helped me catch an error.
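
A bare-bones monkey along these lines might look like the following sketch. Everything in it is made up for illustration: action_for stands in for the code under test, vendor_state stands in for the vendor API (including its rare “unknown” answer), and the 1-in-1000 odds are arbitrary.

```ruby
# Hypothetical code under test: maps a VM state to an action.
# It forgets that the vendor can also report :unknown.
def action_for(state)
  case state
  when :running then :monitor
  when :stopped then :start
  when :paused  then :resume
  else raise ArgumentError, "unhandled state: #{state}"
  end
end

# Hypothetical stand-in for the vendor API: almost always returns a
# familiar state, but once in a blue moon returns :unknown.
def vendor_state(rng)
  return :unknown if rng.rand(1000).zero?

  %i[running stopped paused].sample(random: rng)
end

# The monkey: exercise the code with random but valid inputs, many times
# over, and collect whatever blows up instead of stopping at the first crash.
def run_monkey(cycles, seed)
  rng = Random.new(seed) # a fixed seed makes any failure reproducible
  failures = Hash.new(0)
  cycles.times do
    state = vendor_state(rng)
    begin
      action_for(state)
    rescue StandardError => e
      failures[e.message] += 1
    end
  end
  failures
end

failures = run_monkey(100_000, 42)
failures.each { |message, count| puts "#{count}x #{message}" }
```

Seeding the random number generator is the one design choice worth keeping in any real version: when the monkey finds a crash at three in the morning, we want to be able to replay the exact sequence that caused it.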