The Simian Army

Explore how Netflix’s Simian Army applies chaos engineering techniques to test and improve the resilience of distributed systems. Learn about the role of tools like Chaos Monkey in automating failure induction and recovery, and how organizations can manage chaos testing through opt-in and opt-out processes to enhance system robustness.

We'll cover the following...

Chaos Monkey
Robustness
Opt-in or opt-out?

Chaos Monkey

Probably the best-known example of chaos engineering is Netflix’s Chaos Monkey. Every once in a while, the monkey wakes up, picks an autoscaling cluster, and kills one of its instances. The cluster should recover automatically. If it doesn’t, then there’s a problem, and the team that owns the service has to fix it.
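The wake, pick, kill, and recover loop described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not Netflix's actual implementation: the `Cluster` class, its `heal` method (a stand-in for a real autoscaler), and the `unleash_monkey` function are hypothetical names invented for this example, and no cloud API is called.

```python
import random
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ID generator so every simulated instance gets a unique name.
_ids = count(1)


@dataclass
class Cluster:
    """A toy autoscaling cluster (hypothetical; not a real cloud resource)."""
    name: str
    desired_size: int
    instances: list = field(default_factory=list)

    def __post_init__(self):
        # Start the cluster at its desired size.
        self.heal()

    def heal(self):
        """Stand-in for an autoscaler: launch instances until the desired size is met."""
        while len(self.instances) < self.desired_size:
            self.instances.append(f"{self.name}-i{next(_ids)}")

    def is_healthy(self):
        return len(self.instances) == self.desired_size


def unleash_monkey(clusters, rng):
    """Pick one cluster at random and kill one of its instances."""
    # Only clusters that still have running instances are eligible.
    candidates = [c for c in clusters if c.instances]
    cluster = rng.choice(candidates)
    victim = rng.choice(cluster.instances)
    cluster.instances.remove(victim)
    return cluster, victim


if __name__ == "__main__":
    rng = random.Random()
    clusters = [Cluster("api", 3), Cluster("billing", 2)]

    cluster, victim = unleash_monkey(clusters, rng)
    print(f"killed {victim} in cluster {cluster.name}")

    # In a real system recovery happens on its own; here we trigger it by hand.
    cluster.heal()
    print("recovered" if cluster.is_healthy() else "needs attention from the owning team")
```

In this sketch recovery is triggered explicitly by calling `heal`. In a real deployment the point of the exercise is that nobody calls anything: the cluster either returns to its desired size on its own or exposes a problem for the owning team to fix.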

The Chaos Monkey tool was born during Netflix’s migration to Amazon Web Services (AWS) and a microservice architecture. As services proliferated, engineers found that availability could be jeopardized by an increasing number of components. Unless they found a way to make the whole service immune to component failures, they would be doomed. So every cluster needed to autoscale and recover from failure ...
