Adopting Your Own Monkey

Learn about vulnerabilities uncovered by Chaos Monkey, the prerequisites for chaos engineering, limiting chaos tests, and defining a healthy system state.

Vulnerabilities uncovered by Chaos Monkey

When Chaos Monkey launched, most developers were surprised by how many vulnerabilities it uncovered. Even services that had been in production for ages turned out to have subtle configuration problems. Some of them had cluster membership rosters that grew without bounds. Old IP addresses would stay on the list even though their owners would never be seen again, or worse, the same IP would come back as a different service!
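The stale-roster problem above can be avoided by expiring members that stop heartbeating. A minimal sketch, assuming an in-process roster with a time-to-live per member (names are illustrative, not from any real cluster library):

```python
import time

class MembershipRoster:
    """Tracks cluster members and prunes ones that stop heartbeating.

    Hypothetical sketch; real systems use gossip protocols or a
    coordination service rather than an in-process dictionary.
    """
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.last_seen = {}  # ip -> timestamp of last heartbeat

    def heartbeat(self, ip, now=None):
        self.last_seen[ip] = now if now is not None else time.time()

    def live_members(self, now=None):
        now = now if now is not None else time.time()
        # Drop entries whose owner has not been seen within the TTL
        self.last_seen = {ip: t for ip, t in self.last_seen.items()
                          if now - t <= self.ttl}
        return sorted(self.last_seen)
```

With expiry in place, an IP that disappears falls off the roster instead of lingering forever, and a returning IP simply re-registers as whatever it is now.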


Prerequisites for chaos engineering

First of all, chaos engineering efforts must not kill the company or its customers. In a sense, Netflix had it easy. Customers are familiar with pressing the play button again if it doesn’t work the first time. They’ll forgive just about anything except cutting off the end of Stranger Things. If every single request in the system is irreplaceably valuable, then chaos engineering is not the right approach.

The whole point of chaos engineering is to disrupt things in order to learn how the system breaks. We must be able to break the system without breaking the bank!

Limiting chaos test exposure

We also want a way to limit the exposure of a chaos test. Some people talk about the “blast radius,” meaning the magnitude of bad experiences both in terms of the sheer number of customers affected and the degree to which they’re disrupted. To keep the blast radius under control, we often want to choose the affected customers based on a set of criteria. It may be as simple as “every 10,000th request will fail” at the start, but we’ll soon need more sophisticated selections and controls.
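Selection criteria like these can be expressed as a small filter that every request passes through before a fault is injected. A hypothetical sketch, assuming illustrative field names and criteria rather than any real chaos tool's API:

```python
import random

class ChaosFilter:
    """Decides whether a given request is eligible for fault injection.

    Hypothetical sketch: the criteria and field names are illustrative.
    """
    def __init__(self, failure_rate=1 / 10_000, allowed_regions=None):
        self.failure_rate = failure_rate        # fraction of requests to fail
        self.allowed_regions = allowed_regions  # None means every region

    def should_inject(self, request):
        # Keep the blast radius small: skip regions outside the test
        if self.allowed_regions is not None and \
                request.get("region") not in self.allowed_regions:
            return False
        # Never touch requests flagged as irreplaceable
        if request.get("critical"):
            return False
        # The "every 10,000th request" starting point, done probabilistically
        return random.random() < self.failure_rate
```

Raising `failure_rate` or widening `allowed_regions` grows the blast radius deliberately rather than accidentally, which is the point of the control.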

We’ll need a way to track a user and a request through the tiers of our system, and a way to tell whether the whole request was ultimately successful. That trace serves two purposes. If the request succeeds, then we’ve uncovered some redundancy or robustness in the system, and the trace will tell us where the redundancy saved the request. If the request fails, the trace will show us where that happened, too.
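Such a trace can be as simple as one request ID with each hop's outcome appended as the request moves through the tiers. A minimal sketch, assuming illustrative service names and fields:

```python
import uuid

def new_trace():
    # One trace per end-to-end request
    return {"trace_id": uuid.uuid4().hex, "hops": []}

def record_hop(trace, service, ok):
    # Each tier appends its own outcome as the request passes through
    trace["hops"].append({"service": service, "ok": ok})

def request_succeeded(trace):
    # The request succeeded if its final hop did
    return bool(trace["hops"]) and trace["hops"][-1]["ok"]

def saved_by_redundancy(trace):
    # Success despite an earlier failed hop means redundancy absorbed the fault
    return request_succeeded(trace) and any(not h["ok"] for h in trace["hops"][:-1])
```

A trace where one database replica failed but a second one answered shows both purposes at once: the request succeeded, and we can see exactly which hop saved it.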

Defining healthy systems

We also have to know what “healthy” looks like, and from what perspective. Is our monitoring good enough to tell when failure rates go from 0.01 percent to 0.02 percent for users in Europe but not in South America? Beware that measurements themselves may fail when things get weird, especially if monitoring shares the same network infrastructure as production traffic. Also, as Charity Majors, CEO of Honeycomb.io, says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.
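The per-region sensitivity described above amounts to comparing each region's failure rate against its own baseline. A minimal sketch with an illustrative threshold; real monitoring would use confidence intervals rather than a raw ratio:

```python
def regressed_regions(baseline, current, factor=2.0):
    """Return regions whose failure rate grew by at least `factor`.

    `baseline` and `current` map region -> failure rate as a fraction.
    Hypothetical sketch; thresholds and region names are illustrative.
    """
    flagged = []
    for region, base_rate in baseline.items():
        cur_rate = current.get(region, 0.0)
        if base_rate > 0 and cur_rate >= base_rate * factor:
            flagged.append(region)
    return sorted(flagged)
```

A jump from 0.01 percent to 0.02 percent in Europe would be flagged, while an unchanged South America would not, which is exactly the distinction the monitoring needs to make.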

Finally, make sure to have a recovery plan. The system may not automatically return to a healthy state when we turn off the chaos. So we will need to know what to restart, disconnect, or otherwise clean up when the test is done.
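One way to make the recovery plan executable is to register a cleanup action alongside every fault injected, then run them in reverse order when the test ends. A hypothetical sketch; the fault descriptions are illustrative:

```python
class ChaosExperiment:
    """Pairs every injected fault with an explicit cleanup step.

    Hypothetical sketch; fault names below are illustrative.
    """
    def __init__(self):
        self._cleanups = []
        self.log = []

    def inject(self, description, cleanup):
        self.log.append(f"inject: {description}")
        self._cleanups.append((description, cleanup))

    def recover(self):
        # Undo faults in reverse order, like unwinding a stack
        while self._cleanups:
            description, cleanup = self._cleanups.pop()
            cleanup()
            self.log.append(f"cleanup: {description}")
```

Because nothing gets injected without a paired cleanup, the list of things to restart, disconnect, or clean up is never a matter of memory.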

Designing the experiment

Let’s say we’ve got great measurements in place. Our A/B testing system can tag a request as part of a control group or a test group. It’s not quite time to randomly kill some boxes yet. First we need to design the experiment, beginning with a hypothesis. The hypothesis behind Chaos Monkey was, “Clustered services should be unaffected by instance failures.” Observations quickly invalidated that hypothesis. Another hypothesis might be, “The application is responsive even under high latency conditions.”
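Control/test tagging of this kind is commonly done by hashing a stable identifier into buckets, so the same user lands in the same group on every request. A minimal sketch, assuming an illustrative bucket count and function name rather than any specific A/B framework's API:

```python
import hashlib

def assign_group(user_id, test_fraction=0.01):
    """Deterministically put a user in the 'test' or 'control' group.

    `test_fraction` is the share of users exposed to the chaos test.
    Hypothetical sketch; parameters are illustrative.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000   # 10,000 evenly sized buckets
    return "test" if bucket < test_fraction * 10_000 else "control"
```

Deterministic assignment matters for the experiment: a user who flip-flopped between groups mid-session would contaminate both the control and test measurements.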

As the hypothesis is formed, think about it in terms of invariants that the system is expected to uphold even under turbulent conditions. Focus on externally observable behavior, not internals. There should be some healthy steady state that the system maintains as a whole.
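In code, such invariants reduce to assertions over externally observable metrics that should hold for the test group just as they do for the control group. A hypothetical sketch with illustrative metric names and thresholds:

```python
def steady_state_holds(metrics, max_error_rate=0.001, max_p99_latency_ms=500):
    """Check externally observable invariants, not internal state.

    `metrics` maps metric name -> observed value; the names and
    thresholds here are illustrative, not from any real system.
    """
    return (metrics["error_rate"] <= max_error_rate
            and metrics["p99_latency_ms"] <= max_p99_latency_ms)
```

If this check passes for the control group but fails for the test group under injected faults, the hypothesis is invalidated and we have learned something about the system.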
