Automation Goes Really Fast

Learn about outages involving autoscaling services, the difference between human operators and automation, and the drawbacks of automation.

AWS postmortem

Another fascinating bit of information shows up in Amazon’s AWS postmortem:

“While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
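The two safeguards described in the quote, a slower removal rate and a floor on remaining capacity, can be illustrated with a minimal sketch. This is not Amazon's tool; the function name `allowed_removal` and the parameters `min_required` and `max_per_step` are hypothetical, chosen only for illustration.

```python
def allowed_removal(current: int, requested: int,
                    min_required: int, max_per_step: int) -> int:
    """Return how many units may be removed in one step.

    Hypothetical guard combining two limits:
      - rate limit: never remove more than max_per_step at once
      - floor: never take capacity below min_required
    """
    if requested < 0 or max_per_step < 0:
        raise ValueError("requested and max_per_step must be non-negative")
    # Capacity that can be given up without crossing the floor.
    headroom = max(current - min_required, 0)
    # The smallest of the three limits wins.
    return min(requested, max_per_step, headroom)


if __name__ == "__main__":
    # An operator asks to remove 90 of 100 servers; the guard allows 10.
    print(allowed_removal(current=100, requested=90,
                          min_required=40, max_per_step=10))
```

With a guard like this, a mistyped or oversized request is trimmed to a small step, and repeated requests stop once the floor is reached.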

Reddit outage

This part stuck out because it closely resembled the outage that Reddit.com suffered in August 2016. After that outage, Reddit reported that the event was precipitated by its autoscaling service. The autoscaler read a partially migrated ZooKeeper database, which claimed that Reddit needed only a tiny fraction of the servers it was running, and dutifully shut down the rest.

The role of automation

A common thread running through these outages is that the automation is not being used simply to enact the will of a human administrator. Rather, it is more like industrial robotics: the control plane senses the current state of the system, compares it to the desired state, and effects changes to bring the current state in line with the desired state.
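The sense, compare, and act cycle can be sketched as a small loop. This is an illustrative model only, not the code of any real autoscaler; `reconcile_step`, `run_control_loop`, and the `read_desired` callback are hypothetical names, and capacity is reduced to a single server count.

```python
from typing import Callable, List


def reconcile_step(current: int, desired: int) -> int:
    """Compare current and desired state; return the signed change needed."""
    return desired - current


def run_control_loop(current: int,
                     read_desired: Callable[[], int],
                     max_iterations: int = 10) -> List[int]:
    """Repeatedly sense, compare, and act until the states match.

    read_desired stands in for whatever source the control plane trusts
    (for example, a configuration database). Returns the server count
    observed after each change, starting with the initial count.
    """
    history = [current]
    for _ in range(max_iterations):
        delta = reconcile_step(current, read_desired())
        if delta == 0:
            break  # current state already matches desired state
        current += delta  # act: add or remove servers
        history.append(current)
    return history


if __name__ == "__main__":
    # If the source of truth wrongly reports that 5 servers are enough,
    # the loop obediently shrinks a fleet of 100 in a single step.
    print(run_control_loop(100, lambda: 5))
```

Note that the loop has no notion of whether the desired state is plausible. It acts on whatever `read_desired` reports, which is the behavior the Reddit outage illustrates.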
