What is Chaos Engineering?
A piece of software can start as a single web page and mature into a full-fledged website serving thousands of users. As software grows, it often turns into a complex distributed system. Before deployment, it is difficult to check how the software will behave under chaotic or unreasonable conditions, so it is hard to build confidence in the deployed system.
To counter this, Netflix pioneered an approach called Chaos Engineering.
Chaos Engineering is the discipline of experimenting on a system in production to build confidence in its capability to withstand turbulent conditions.
Principles of Chaos Engineering
Establish the steady state of the system under normal conditions. The system's overall throughput, error rates, latency percentiles, and similar metrics can all represent steady-state behavior. The hypothesis is that this steady state will persist under abnormal conditions. The better the steady state holds up during an experiment, the more confidence you can have in the system.
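As a rough illustration of what a steady-state check might look like, the sketch below reduces raw request samples to two metrics (error rate and 99th-percentile latency) and compares an experiment run against a baseline. The names SteadyState, summarize, and within_steady_state, as well as the tolerance values, are assumptions made for this example and do not come from any particular tool.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SteadyState:
    error_rate: float       # fraction of requests that failed
    p99_latency_ms: float   # 99th-percentile latency in milliseconds


def summarize(latencies_ms, errors):
    """Reduce raw samples to steady-state metrics.

    `latencies_ms` is a list of request latencies and `errors` a list of
    booleans (True = request failed). At least two latency samples are needed.
    """
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = quantiles(latencies_ms, n=100)[98]
    return SteadyState(error_rate=sum(errors) / len(errors), p99_latency_ms=p99)


def within_steady_state(baseline, observed,
                        max_error_increase=0.01, max_latency_ratio=1.2):
    """Return True if `observed` stays within the (assumed) tolerances."""
    error_ok = observed.error_rate - baseline.error_rate <= max_error_increase
    latency_ok = observed.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok
```

In practice the tolerances would be chosen per service; the point is only that the steady state is expressed as measurable numbers before any chaos is introduced.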
Chaos can be introduced by any real-world event that could potentially change the system's state. Consider events that correspond to hardware failures (like servers dying), software failures (like malformed responses), and non-failure events (like a spike in traffic or a scaling event).
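One simple way to simulate such events in code is to wrap a call to a dependency so that it occasionally fails. The following is a minimal sketch under stated assumptions: inject_faults, ServerDown, and the rate parameters are hypothetical names invented for this example, and it covers only the two failure kinds (a dead server and a malformed response), not traffic spikes or scaling events.

```python
import random


class ServerDown(Exception):
    """Stands in for a hardware failure such as a dead server."""


MALFORMED = "<<malformed response>>"  # placeholder for a corrupted payload


def inject_faults(call, rng, crash_rate=0.0, malformed_rate=0.0):
    """Wrap `call` so that a fraction of invocations fail.

    `rng` is a random.Random instance; passing a seeded one makes runs
    reproducible. The two rates are probabilities between 0 and 1.
    """
    def wrapped(*args, **kwargs):
        roll = rng.random()  # uniform in [0, 1)
        if roll < crash_rate:
            raise ServerDown("injected fault: server unavailable")
        if roll < crash_rate + malformed_rate:
            return MALFORMED
        return call(*args, **kwargs)
    return wrapped


# Usage: 10% simulated crashes and 5% malformed responses.
fetch_profile = inject_faults(lambda user: {"user": user},
                              random.Random(42),
                              crash_rate=0.10, malformed_rate=0.05)
```

The caller of fetch_profile is then observed to see whether it degrades gracefully when the dependency misbehaves.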
Devise experiments carefully so that you can monitor how these events affect the steady-state metrics. To ensure that the system is exercised authentically and that the results are relevant to the currently deployed system, Chaos Engineering strongly prefers to experiment directly on production traffic.
Running experiments one after another by hand is tedious, but data from a large number of runs is needed to generate meaningful insights. It is therefore imperative to automate both the experiments and the analysis of their results. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
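The orchestration-plus-analysis idea can be sketched in a few lines: run each scenario many times and summarize how often the steady state held. The function run_suite and the scenario interface (a zero-argument callable returning True when the steady state held) are assumptions for illustration, not the API of an existing chaos tool.

```python
def run_suite(scenarios, runs_per_scenario):
    """Run every scenario repeatedly and report how often steady state held.

    `scenarios` maps a scenario name to a zero-argument callable that performs
    one experiment and returns True if the steady state held (hypothetical
    interface chosen for this sketch).
    """
    if runs_per_scenario <= 0:
        raise ValueError("runs_per_scenario must be positive")

    report = {}
    for name, experiment in scenarios.items():
        # Orchestration: execute the experiment the requested number of times.
        held = sum(1 for _ in range(runs_per_scenario) if experiment())
        # Analysis: summarize the outcome for this scenario.
        report[name] = {
            "runs": runs_per_scenario,
            "held": held,
            "hold_rate": held / runs_per_scenario,
        }
    return report
```

A scheduler could call run_suite periodically and flag any scenario whose hold_rate drops below an agreed threshold.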
Since the experiments use production traffic, customers may experience unusual delays and abnormal behavior. It is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimized and contained, unlike Homer Simpson.
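Two common ways to contain the fallout are to expose only a small slice of users to an experiment and to stop the experiment as soon as it hurts too much. The sketch below illustrates both; in_experiment, should_abort, and their thresholds are hypothetical names and values chosen for this example.

```python
import hashlib


def in_experiment(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` percent of users in the experiment.

    Hashing the user ID keeps a given user consistently in or out of the
    experiment group across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_abort(errors: int, requests: int,
                 max_error_rate: float, min_requests: int = 100) -> bool:
    """Return True once the experiment group's error rate exceeds the limit.

    No decision is made before `min_requests` requests have been observed,
    so a single early failure does not end the experiment.
    """
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate
```

For example, an experiment might start with in_experiment(user_id, 1.0) to touch about 1% of users and call should_abort after every batch of requests, rolling back the injected fault when it returns True.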
For more on the topic, listen to the podcast in which Netflix engineers Haley Tucker and Aaron Blohowiak discuss Chaos Engineering.