The Principles and the Process

The principles

Build a hypothesis around steady-state

We usually want to build a hypothesis around the steady-state behavior. That means we want to define what our system, or a part of it, looks like. Then, we want to perform some potentially damaging actions on the network, applications, nodes, or any other component of the system. These actions are, most of the time, very destructive. We want to create violent situations and confirm that our state, the steady-state hypothesis, still holds. In other words, we want to validate that our system is in a specific state, perform some actions, and finish with the same validation to confirm that the state of our system did not change.
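To make the idea concrete, a steady-state hypothesis can be written down as a set of measurable claims, each with a tolerance. The following is only a minimal sketch: the `Probe` class, the probe names, and the hard-coded measurements are assumptions made for illustration, not part of any particular chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    """One measurable claim about the system (hypothetical structure)."""
    name: str
    measure: Callable[[], float]  # returns the current value of the measurement
    low: float                    # lowest acceptable value
    high: float                   # highest acceptable value

    def holds(self) -> bool:
        # The claim holds when the measured value is inside the tolerance.
        return self.low <= self.measure() <= self.high


def steady_state_holds(probes: List[Probe]) -> bool:
    """The hypothesis holds only if every probe is within its tolerance."""
    return all(probe.holds() for probe in probes)


# Stand-in measurements; a real probe would query the application
# or a metrics store instead of returning a constant.
hypothesis = [
    Probe("healthy_replicas", lambda: 3.0, low=2.0, high=3.0),
    Probe("error_rate", lambda: 0.002, low=0.0, high=0.01),
]

print(steady_state_holds(hypothesis))
```

The same `steady_state_holds` check would be run both before and after the destructive actions, so that any difference can be attributed to those actions.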

Simulate real-world events

We want to base chaos engineering on real-world events. It would be pointless to test things that are not likely to happen. Instead, we want to focus on replicating events that are likely to happen in our system. Our applications will go down, our networking will be disrupted, and our nodes will not be fully available all the time. We want to check how our system behaves in these situations.

Run experiments in production

We want to run chaos experiments in production. As I mentioned before, we could do it in a non-production system, but that is mostly for practice and for gaining confidence in chaos experiments. We want to experiment in production because that’s the “real” system. That’s the system at its best, and our real users are interacting with it. If we perform chaos experiments only in staging or integration environments, we cannot get a real picture of how the system in production behaves.

Automate experiments and run them continuously

We want to automate our experiments to run continuously. It would be pointless to run an experiment only once because we cannot know in advance when the system will be in the conditions under which the experiment produces a negative effect. Therefore, we should run the experiments continuously. That can mean every hour, every few hours, every day, every week, or every time some event happens in our cluster. Maybe we want to run experiments every time we deploy a new release or every time we upgrade the cluster. In other words, experiments are either scheduled to run periodically, or they are executed as part of continuous delivery pipelines.
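The two triggers described above, a periodic schedule and a cluster event, can be combined into a single decision. This is a rough sketch under stated assumptions: the `should_run` function and the event names in `TRIGGER_EVENTS` are hypothetical, and in practice the decision would usually be delegated to a scheduler or a delivery pipeline.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical event names; use whatever your pipeline or cluster emits.
TRIGGER_EVENTS = {"release_deployed", "cluster_upgraded"}


def should_run(
    now: datetime,
    last_run: Optional[datetime],
    interval: timedelta,
    event: Optional[str] = None,
) -> bool:
    """Decide whether an experiment is due."""
    if event in TRIGGER_EVENTS:
        return True  # event-driven: run as part of the delivery pipeline
    if last_run is None:
        return True  # never executed before
    return now - last_run >= interval  # periodic schedule


now = datetime(2024, 1, 1, 12, 0)
hourly = timedelta(hours=1)
print(should_run(now, now - timedelta(minutes=30), hourly))
print(should_run(now, now - timedelta(minutes=30), hourly, "release_deployed"))
```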

Minimize blast radius

Finally, we want to reduce the blast radius. In the beginning, we want to start small and keep the blast radius of the things that might explode relatively small. Over time, as our confidence in our work increases, we might expand that radius. Eventually, we might reach a level where we’re doing experiments across the whole system, but that comes later. At the start, we want our scope to be tiny.
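One way to keep the scope tiny is to cap how many instances a single experiment is allowed to touch. The sketch below is an illustration only: `pick_targets`, its parameters, and the pod names are assumptions, not the interface of any existing tool.

```python
import math
import random
from typing import List, Optional


def pick_targets(
    instances: List[str],
    fraction: float,
    max_targets: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset to disrupt, bounded by a fraction and a hard cap."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if max_targets < 1:
        raise ValueError("max_targets must be at least 1")
    if not instances:
        return []
    wanted = math.ceil(len(instances) * fraction)
    # Never exceed the hard cap or the number of available instances.
    count = min(wanted, max_targets, len(instances))
    return random.Random(seed).sample(instances, count)


pods = [f"pod-{i}" for i in range(10)]  # hypothetical instance names

# Start small: a single instance.
print(pick_targets(pods, fraction=0.1))

# Later, with more confidence: a larger fraction, still capped at three.
print(pick_targets(pods, fraction=0.5, max_targets=3))
```

Expanding the blast radius then becomes a matter of raising `fraction` and `max_targets` gradually rather than changing the experiment itself.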

The summary of the principles we discussed is as follows.

  • Build a hypothesis around steady-state
  • Simulate real-world events
  • Run experiments in production
  • Automate experiments and run them continuously
  • Minimize blast radius

The process

Now that we have defined chaos engineering and the principles behind it, we can turn our attention towards the process. The process is repetitive.

To begin, we want to define a steady-state hypothesis. We want to know how the system looks before and after some actions. We want to confirm the steady-state, and then simulate some real-world events. After the events, we want to confirm the steady-state again. We also want to collect metrics, observe dashboards, and have alerts that notify us when our system misbehaves. Ultimately, we’re trying very hard to disrupt the steady-state, and the less damage we’re able to do, the more confidence we will have in our system.


The summary of the process we discussed is as follows.

  1. Define the steady-state hypothesis
  2. Confirm the steady-state
  3. Produce or simulate “real world” events
  4. Confirm the steady-state
  5. Use metrics, dashboards, and alerts to confirm that the system as a whole is behaving correctly
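The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: `run_experiment`, `ExperimentResult`, and the in-memory `system` dictionary are hypothetical stand-ins, and a real experiment would act on an actual cluster and read metrics from a monitoring system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: Optional[bool] = None  # None when the experiment was aborted
    metrics: Dict[str, float] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return self.steady_before and self.steady_after is True


def run_experiment(
    steady_state: Callable[[], bool],    # step 1: the hypothesis, as a check
    disrupt: Callable[[], None],         # step 3: the real-world event
    collect_metrics: Callable[[], Dict[str, float]],  # step 5
) -> ExperimentResult:
    # Step 2: confirm the steady-state before doing any damage.
    if not steady_state():
        # The system is already unhealthy, so do not disrupt it further.
        return ExperimentResult(False, None, collect_metrics())
    # Step 3: produce or simulate the event.
    disrupt()
    # Step 4: confirm the steady-state again.
    after = steady_state()
    # Step 5: gather metrics alongside the verdict.
    return ExperimentResult(True, after, collect_metrics())


# A toy "system" with three replicas; the hypothesis is that at least two remain.
system = {"replicas": 3}
result = run_experiment(
    steady_state=lambda: system["replicas"] >= 2,
    disrupt=lambda: system.update(replicas=system["replicas"] - 1),
    collect_metrics=lambda: {"replicas": float(system["replicas"])},
)
print(result.passed, result.metrics)
```

Aborting when the first check fails is a design choice in this sketch: there is little to learn from disrupting a system that is already outside its steady state.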

In the next lesson, we will go over a checklist of the chaos experiments that we will carry out.
