Learn about the limits of humans and automation in controlling errors, what governors are and how to use them, and wrap up with key takeaways.

Human vs. Automation

In the Force Multiplier lesson, we looked into an outage that Reddit suffered. As a quick reminder, Reddit’s configuration management system restarted the component of its infrastructure that scales server instances up and down. The restart happened in the middle of a ZooKeeper migration, so the autoscaler read a partial configuration and decided to shut down nearly every machine instance at Reddit.

The flip side of that coin is a job scheduler that spins up too many computational instances in order to process a queue before a deadline. The work still can’t get done fast enough, and, to add insult to injury, the cloud provider’s invoice that month is written in scientific notation. Automation has no judgment. When it goes wrong, it tends to go wrong really quickly. By the time a human perceives the problem, it’s a question of recovery rather than intervention. How can we allow human intervention without putting a human in the loop for everything? We should use automation for things humans are bad at: repetitive tasks and fast response. We should use humans for what automation is bad at: perceiving the whole situation at a higher level.


Believe it or not, we can look to eighteenth-century technology for an answer. Before the era of steam engines, power came from muscles (human or animal). Steam engineers quickly discovered that it is possible to run machines so fast that the metal breaks. Parts fly apart from tension, or they seize up under compression. Bad things happen to the machines and to anyone nearby. The solution was the governor. A governor limits the speed of an engine. Even if the source of power could drive it faster, the governor prevents it from running at unsafe RPMs.

Role of governors

We can create governors to slow the rate of actions. Reddit did this with its autoscaler by adding logic that says it can only shut down a certain percentage of instances at a time. A governor is stateful and time-aware. It knows what actions have been taken over a period of time. It should also be asymmetric. Most actions have a “safe” direction and an “unsafe” one. Shutting down instances is unsafe. Deleting data is unsafe. Blocking client IP addresses is unsafe.
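The shape of such a governor can be sketched as a sliding-window rate limiter that is stateful (it remembers recent actions), time-aware (old actions age out), and asymmetric (only the unsafe direction meets resistance). All names and thresholds here are illustrative assumptions, not Reddit's actual implementation:

```python
import time
from collections import deque

class ShutdownGovernor:
    """Hypothetical governor: permit shutting down at most a fixed
    fraction of the fleet within a sliding time window."""

    def __init__(self, fleet_size, max_fraction=0.05, window_seconds=300):
        self.fleet_size = fleet_size
        self.max_fraction = max_fraction
        self.window = window_seconds
        self.shutdowns = deque()  # timestamps of recent shutdown approvals

    def permit_shutdown(self, now=None):
        now = time.time() if now is None else now
        # Time-aware: forget shutdowns that have aged out of the window.
        while self.shutdowns and now - self.shutdowns[0] > self.window:
            self.shutdowns.popleft()
        limit = self.fleet_size * self.max_fraction
        if len(self.shutdowns) >= limit:
            return False  # unsafe direction: apply resistance
        self.shutdowns.append(now)  # stateful: record the approved action
        return True

    def permit_startup(self):
        # Asymmetric: the "safe" direction meets no resistance here.
        return True
```

With a fleet of 100 and a 5% cap, the sixth shutdown request inside a five-minute window is refused, no matter how insistent the automation is.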

We will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad. That means actions may also be safe within a defined range but unsafe outside the range. Our AWS budget may allow for a thousand EC2 instances, but if the autoscaler starts heading toward two thousand, then it needs to slow down. We can think about this U-shaped curve as defining the response curve for the governor. Inside the safe zone, the actions are fast. Outside the range, the governor applies increasing resistance.
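One way to express that response curve is a delay function: zero inside the safe range, growing steeply outside it. The function name, the ramp factor, and the scaling are all assumptions made for illustration:

```python
def governor_delay(desired, safe_min, safe_max,
                   base_delay=1.0, ramp=2.0):
    """Seconds to wait before applying an action that would move the
    system to `desired`. Zero inside [safe_min, safe_max]; grows
    exponentially with the overshoot outside it."""
    if safe_min <= desired <= safe_max:
        return 0.0
    # Distance past the nearest edge of the safe range.
    distance = safe_min - desired if desired < safe_min else desired - safe_max
    # Express the overshoot as a fraction of the safe range's width.
    overshoot = distance / (safe_max - safe_min)
    return base_delay * ramp ** (overshoot * 10)
```

With a safe range of 0 to 1,000 instances, scaling to 1,500 incurs a 32-second delay per step, and heading toward 2,000 incurs roughly 17 minutes, which effectively halts the autoscaler long before the budget does.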

The whole point of a governor is to slow things down enough for humans to get involved. Naturally that means connecting to monitoring both to alert humans that there’s a situation and to give them enough visibility to understand what’s happening.
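The wiring to monitoring might look like the sketch below: before the governor slows an action, it surfaces that fact so humans get both an alert and the time to act on it. The helper name and log message are assumptions; a real system would raise a metric or page someone rather than just log:

```python
import logging
import time

logger = logging.getLogger("governor")

def throttled_action(action, delay, context):
    """Run `action`, but if the governor imposed a delay, tell the
    monitoring system why before waiting it out."""
    if delay > 0:
        # Illustrative: in production this would emit a metric or page.
        logger.warning("Governor engaged: delaying %s by %.0fs (%s)",
                       action.__name__, delay, context)
        time.sleep(delay)
    return action()
```

The key design point is that the alert fires when resistance begins, not after the action completes, so the human sees the situation while intervention is still possible.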

Tips to remember

Slow things down to allow intervention

When things are about to go off the rails, we often find automation tools pushing the throttle to its limit. Humans are better at situational thinking, so we need to create opportunities for them to intervene.

Apply resistance in the unsafe direction

Some actions are inherently unsafe. Shutting down, deleting, and blocking things are all likely to interrupt service. Automation will make them go fast, so we should apply a governor to provide humans with time to intervene.

Consider a response curve

Actions may be safe within a defined range. Outside that range, they should encounter increasing resistance that slows down the rate at which they can occur.

Wrapping up

Even shockingly unlikely combinations of circumstances will eventually occur. If you ever catch yourself saying, “The odds of that happening are astronomical,” or some similar utterance, consider this: a single small service handling ten million requests per day over three years accumulates 10,950,000,000 chances for something to go wrong. That’s more than ten billion opportunities for bad things to happen. Astronomical observations indicate there are four hundred billion stars in the Milky Way galaxy, and astronomers consider a number “close enough” if it’s within a factor of 10. Astronomically unlikely coincidences happen all the time.

Failures are inevitable. Our systems, and those we depend on, will fail in ways large and small. Stability antipatterns amplify transient events. They accelerate cracks in the system. Avoiding the antipatterns does not prevent bad things from happening, but it will help minimize the damage when bad things do occur.

Judiciously applying these stability patterns results in software that stays up, come hell or high water. The key to applying these patterns successfully is judgment. Examine the software’s requirements cynically. View other enterprise systems with suspicion and distrust. Any of them can betray the system. Identify the threats, and apply stability patterns appropriate to each threat. Our production environments don’t much resemble just a desktop or laptop computer any more. Everything is different, from network configuration and performance to security restrictions and runtime limits. In the next part of this book, we’ll look at design for production operations.
