Automation Goes Really Fast
Explore the challenges of automation in distributed systems by analyzing real-world outages caused by overly rapid automated processes. Understand the role of the control plane in managing system capacity, and discover how integrating human judgment with automation can help maintain system stability and prevent critical failures.
We'll cover the following...
AWS postmortem
Another fascinating detail appears in Amazon’s AWS postmortem:
“While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to ...
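The idea of removing capacity more slowly, behind a safeguard, can be illustrated with a minimal sketch. This is not Amazon's tool: the names RemovalGovernor, min_capacity, max_step_percent, and remove_capacity are all hypothetical, and the two rules used here (a per-step percentage cap and a minimum capacity floor) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RemovalGovernor:
    """Hypothetical safeguard limiting how much capacity one step may remove."""
    min_capacity: int      # assumed policy: never drop below this many servers
    max_step_percent: int  # assumed policy: largest share removable per step

    def allowed_removal(self, current: int, requested: int) -> int:
        """Return how many servers may be removed in this step."""
        if requested <= 0:
            return 0
        # Cap the step at a percentage of what is currently running.
        step_cap = current * self.max_step_percent // 100
        # Never go below the configured floor.
        headroom = max(current - self.min_capacity, 0)
        return min(requested, step_cap, headroom)


def remove_capacity(current: int, requested: int,
                    governor: RemovalGovernor) -> Tuple[int, int]:
    """Apply one governed step; return (new capacity, removal still pending)."""
    allowed = governor.allowed_removal(current, requested)
    return current - allowed, requested - allowed


governor = RemovalGovernor(min_capacity=60, max_step_percent=10)
print(remove_capacity(100, 80, governor))  # (90, 70)
```

A request to remove 80 of 100 servers is throttled to 10 in the first step, and the remainder stays pending. A large request therefore plays out over many steps, which leaves time for a human or a monitoring check to notice a mistake and stop it.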