As the web has grown more complex with technologies like cloud computing, distributed systems, and microservices, system failures have become harder to predict. To prevent outages, companies large and small have turned to chaos engineering.
Chaos engineering lets you predict and identify potential failures by breaking things on purpose. This way, you can find and fix failures before they become outages. Chaos engineering is a growing trend for DevOps and IT teams. Even companies like Netflix and Amazon use these principles in product development.
If you are new to chaos engineering, you’re in the right place. Today, we will introduce its principles in depth and show you how to get started with Kubernetes.
Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. With chaos engineering, we intentionally try to break our system under certain stresses to uncover potential outages, locate weaknesses, and improve resiliency.
Chaos engineering is different from software testing or fault injection. It covers a much wider range of conditions and unpredictable situations, including traffic spikes, race conditions, and more.
With chaos engineering, we are trying to learn how an entire system reacts when an individual component is failing.
For example, chaos engineering can help answer functionality questions like these:
History: Chaos engineering was first developed at Netflix in 2008, when the company moved its subscription streaming service to the public cloud. Netflix’s engineers realized that they needed new ways of testing this system for resiliency.
Chaos Monkey was created in 2010 for that purpose. Since then, chaos engineering has grown, and companies like Google, Facebook, Amazon, and Microsoft have implemented similar testing models.
Chaos engineering offers many benefits that other forms of software testing or failure testing cannot. A failure test typically examines a single condition and produces a binary pass-or-fail result. This doesn’t allow us to test a system under unprecedented or unexpected stresses.
Chaos engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With chaos engineering, we can fix issues and gain new insights about an application for future improvements.
Chaos experiments help to reduce failures and outages while improving our understanding of our system design. Chaos engineering improves a service’s availability and durability, so customers are less disrupted by outages. Chaos engineering can also help prevent revenue losses and lower maintenance costs at the business level.
Before we start defining and running chaos experiments, we need to pick a tool. The market for chaos engineering tools is not yet well established. Nevertheless, there are several tools we can pick from.
One of the most notable tools for chaos engineering is Simian Army, developed by Netflix. Simian Army is best suited to cloud services, particularly on AWS. It can generate failures and detect abnormalities. Chaos Monkey, also from Netflix and originally part of the Simian Army, is a resiliency tool that randomly terminates instances to test how the system copes with failure.
PowerfulSeal is a powerful tool for testing Kubernetes clusters, and Litmus can be used for stateful workloads on Kubernetes. Pumba is used with Docker for chaos testing and network emulation. Gremlin offers a Chaos Engineering platform that now supports testing on Kubernetes clusters.
Chaos Dingo is commonly used for Microsoft Azure, and Chaos HTTP Proxy can be used to introduce failures into HTTP requests.
As more teams have conducted experiments over the years, they’ve learned how to apply chaos engineering to their systems most effectively. These best practices have become the core principles of chaos engineering. Let’s discuss the principles that every team should follow in their experiments.
You want to build a hypothesis around a steady-state behavior. Then, you want to perform potentially damaging actions on the network, applications, nodes, or any other component of the system.
You want to create disruptive conditions to confirm that your steady-state hypothesis holds. In practice, you first validate that the system is in a specific state, then perform certain actions, and finish with the same validation to confirm that the state did not change.
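The pattern of validating, acting, and validating again can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API of any chaos tool: `run_experiment`, `ExperimentResult`, and the in-memory `replicas` dictionary are hypothetical names, and a real experiment would probe a live service instead of a dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool  # did the probe pass before the action?
    action_ran: bool     # was the failure actually injected?
    steady_after: bool   # did the same probe pass after the action?

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.action_ran and self.steady_after


def run_experiment(probe: Callable[[], bool],
                   action: Callable[[], None]) -> ExperimentResult:
    # 1. Validate the steady state before touching anything.
    if not probe():
        # The system is already unhealthy, so do not inject a failure.
        return ExperimentResult(steady_before=False, action_ran=False,
                                steady_after=False)
    # 2. Perform the potentially damaging action.
    action()
    # 3. Repeat the same validation to see whether the state changed.
    return ExperimentResult(steady_before=True, action_ran=True,
                            steady_after=probe())


# Hypothetical system: three replicas of one application (True = healthy).
replicas = {"app-0": True, "app-1": True, "app-2": True}


def service_is_available() -> bool:
    # Assumed steady state: at least one replica can serve traffic.
    return any(replicas.values())


def kill_one_replica() -> None:
    # Assumed failure: a single replica goes down.
    replicas["app-0"] = False


result = run_experiment(service_is_available, kill_one_replica)
print(result.hypothesis_held)  # True
```

If the probe fails before the action runs, the sketch stops without injecting anything, since an experiment on a system that is already unhealthy says little about its resiliency.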
You want to do chaos engineering based on real-world events. In other words, only replicate events that are likely to happen in your system. This includes application crashes, network disruptions, and node failures.
You want to run chaos experiments in production, since that is the “real” system. If you perform chaos experiments only in staging or integration environments, you cannot get a real picture of how the system in production behaves.
You want to automate your experiments to run continuously or as part of continuous delivery pipelines. This could mean every hour, every few hours, every day, every week, or every time some event occurs in your system. You also want to run experiments every time you deploy a new release.
You should reduce the blast radius of your experiments. When you start with chaos experiments, you want to start small and build up as you gain confidence in the system. Eventually, you should run experiments across the whole system.
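One simple way to keep the blast radius small is to cap how many components a single experiment may touch. The sketch below is a hypothetical helper, not part of any chaos tool: `pick_targets`, its `fraction` and `max_targets` parameters, and the pod names are assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def pick_targets(candidates: Sequence[str],
                 fraction: float,
                 max_targets: int,
                 seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of candidates, bounded by a fraction and a hard cap."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if max_targets < 1:
        raise ValueError("max_targets must be at least 1")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    # Aim for the requested fraction, but select at least one target
    # and never more than the cap or the number of candidates.
    wanted = max(1, math.floor(len(candidates) * fraction))
    count = min(max_targets, wanted, len(candidates))
    return rng.sample(list(candidates), count)


# Hypothetical pod names.
pods = [f"app-{i}" for i in range(10)]

# Start small: 10% of the pods, never more than 2.
small = pick_targets(pods, fraction=0.1, max_targets=2, seed=42)

# Widen the radius once earlier experiments have built confidence.
larger = pick_targets(pods, fraction=0.5, max_targets=3, seed=42)

print(len(small), len(larger))  # 1 3
```

Raising `fraction` and `max_targets` gradually mirrors the advice above: begin with a single component and expand toward the whole system as confidence grows.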
Summary of Principles
- Build a hypothesis around a steady-state
- Simulate real-world events
- Run experiments in production
- Automate experiments and run them continuously
- Minimize blast radius
The general process for chaos engineering looks as follows: