Automation Goes Really Fast
Explore the challenges of automation in distributed systems by analyzing real-world outages caused by overly rapid automated processes. Understand the role of the control plane in managing system capacity, and discover how integrating human judgment with automation can help maintain system stability and prevent critical failures.
AWS postmortem
Another fascinating bit of information shows up in Amazon’s AWS postmortem:
“While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
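The two safeguards named in the quote, slower removal and a minimum capacity floor, can be sketched in a few lines. This is only an illustration of the idea, not Amazon's actual tool: the `Subsystem` type, the `allowed_removal` function, and the `max_per_step` parameter are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem tracked by a capacity tool."""
    name: str
    active_servers: int
    min_required: int  # assumed capacity floor for this subsystem


def allowed_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> int:
    """Return how many servers may be removed in a single step."""
    if requested < 0 or max_per_step < 0:
        raise ValueError("requested and max_per_step must be non-negative")
    # Safeguard 1: never take the subsystem below its minimum required capacity.
    headroom = max(subsystem.active_servers - subsystem.min_required, 0)
    # Safeguard 2: cap how much capacity is removed in any one step.
    return min(requested, max_per_step, headroom)
```

With 100 active servers and a floor of 90, a request to remove 50 servers is trimmed to 5 when `max_per_step` is 5, and to 10 when `max_per_step` is 20, because the floor becomes the binding limit. A subsystem already at or below its floor gets 0, so a mistyped request cannot drain it.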
Reddit outage
This part stuck out because it closely resembled the outage that Reddit.com suffered in August 2016 ...