Chaos Engineering 101: principles, process, and examples

Nov 02, 2020 - 15 min read
Amanda Fawcett
editor-page-cover

As the web has grown increasingly complex alongside technologies like cloud computation, distributed systems, and microservices, system failures are harder to predict. To prevent outages, companies large and small have turned to chaos engineering as a solution.

Chaos engineering lets you predict and identify potential failures by breaking things on purpose. This way, you can find and fix failures before they become outages. Chaos engineering is a growing trend for DevOps and IT teams. Even companies like Netflix and Amazon use these principles in product development.

If you are new to chaos engineering, you’re in the right place. Today, we will introduce its principles in depth and show you how to get started with Kubernetes.

We will learn:



Learn how to destroy your systems productively.

Learn the principles of chaos engineering with Kubernetes with this deep dive into chaos experiments, such as destroying a network, draining nodes, testing availability, and more.

The DevOps Toolkit: Kubernetes Chaos Engineering



What is Chaos Engineering?

Chaos engineering is a discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. With chaos engineering, we intentionally try to break our system under certain stresses to determine potential outages, locate weakness, and improve resiliency.

Chaos engineering is different from software testing or fault injection. Chaos engineering is used for all sorts of requirements and unpredictable situations, including traffic spikes, race conditions, and more.

With chaos engineering, we are trying to learn how an entire system reacts when an individual component is failing.

For example, chaos engineering can help answer functionality questions like these:

  • What happens when a service is not accessible, one way or another?
  • What is the result of outages when an application receives too much traffic or when it is not available?
  • Will we experience cascading errors when a single point of failure crashes an app?
  • What happens when our application goes down?
  • What happens when there is something wrong with networking?

History: Chaos Engineering was first developed at Netflix in 2008 when their subscription streaming service was transitioned to the public cloud. Netflix’s engineers noted that they needed new ways of testing this system for resiliency.

Chaos Monkey was created in 2010 for that purpose. Since then, chaos engineering has grown, and companies like Google, Facebook, Amazon, and Microsoft have implemented similar testing models.


Benefits of Chaos Engineering

Chaos engineering offers many benefits that other forms of software testing or failure testing cannot. Failure tests can only examine a single condition in a binary breakdown. This doesn’t allow us to test a system under unprecedented or unexpected stresses.

Chaos engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With chaos engineering, we can fix issues and gain new insights about an application for future improvements.

Chaos experiments help to reduce failures and outages while improving our understanding of our system design. Chaos engineering improves a service’s availability and durability, so customers are less disrupted by outages. Chaos engineering can also help prevent revenue losses and lower maintenance costs at the business level.


Chaos Engineering Tools

Before we start defining and running chaos experiments, we need to pick a tool. Chaos engineering is not yet a segment of the market that is well established and developed. Nevertheless, there are several tools we can pick from.

One of the most notable tools for chaos engineering is Simian Army, developed by Netflix. Simian Army is best for services in the cloud and AWS. It can generate failures and detect abnormalities. Chaos Monkey from Netflix is a resiliency tool for instances of random failures.

PowerfulSeal is a powerful tool for testing Kubernetes clusters, and Litmus can be used for stateful workloads on Kubernetes. Pumba is used with Docker for chaos testing and network emulation. Gremlin offers a Chaos Engineering platform that now supports testing on Kubernetes clusters.

Chaos Dingo is commonly used for Microsoft Azure, and Chaos HTTP Proxy can be used to introduce failures into HTTP requests.


svg viewer

Principles and Process of Chaos Engineering

As more teams have conducted experiments over the years, they’ve learned how to most effectively apply chaos engineering approaches to their systems. These best practices have become the core principles of chaos engineering. Let’s discuss the core principles of chaos engineering that every team should implement in their experiments.


Build a hypothesis around steady-state

You want to build a hypothesis around a steady-state behavior. Then, you want to perform potentially damaging actions on the network latency, applications, nodes, or any other component of the system.

You want to create violent situations to confirm that our steady-state hypothesis holds. you aim to validate that when our system is in a specific state, it performs certain actions, and finishes with the same validation to confirm that the state did not change.


Simulate real-world events

You want to do chaos engineering based on real-world events. In other words, only replicate events that are likely to happen in our system. This includes an application crash, network disruption will go down, or node failure.


Run experiments in production

You want to run chaos experiments in production. you want to experiment in production since that is the “real” system. If you perform chaos experiments only during staging or integration, you cannot get a real picture of how the system in production behaves.


Automate experiments and run them continuously

You want to automate our experiments to run continuously or be executed as part of continuous delivery pipelines. This could mean every hour, every few hours, every day, every week, or every time some event is happening in our system. You also want to run experiments every time you are deploying a new release.


Minimize blast radius

You should reduce the blast radius of our experiments. When you start with chaos experiments, you want to start small and build up as you gain confidence in a system. Eventually, you should do experiments across the whole system.


Summary of Principles

  • Build a hypothesis around a steady-state
  • Simulate real-world events
  • Run experiments in production
  • Automate experiments and run them continuously
  • Minimize blast radius

Chaos Engineering Process

The general process for chaos engineering looks as follows:

  • Define a steady-state hypothesis: You need to start with an idea of what can go awry. Start with a failure to inject and predict an outcome for when it is running live.
  • Confirm the steady-state and simulate some real-world events: Perform tests using real-world scenarios to see how your system behaves under particular stress conditions or circumstances.
  • Confirm the steady-state again: We need to confirm what changes occurred, so checking it again gives us insights into system behavior.
  • Collect metrics and observe dashboards: You need to measure your system’s durability and availability. It is best practice to use key performance metrics that correlate with customer success or usage. We want to measure the failure against our hypothesis by looking at factors like impact on latency or requests per second.
  • Make changes and fix issues: After running an experiment, you should have a good idea of what is working and what needs to be altered. Now we can identify what will lead to an outage, and we know exactly what breaks the system. So, go fix it, and try again with a new experiment.