System Failure, Not Human Error
Learn about the Amazon outage, why it should be treated as a system failure rather than a human error, and how anomalies can be interpreted.
Amazon outage
Amazon clearly states:
"[a]n authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Parsing that just a little bit, we can understand that someone mistyped a command. First and foremost, whoever that was has our deepest sympathies. I felt that shock and horror when I realized that I, personally, had just caused an outage. It’s a terrible feeling. But there’s much more that we should learn from this.
Human error
Take a moment to read or reread that postmortem. The words “human error” don’t appear anywhere. It’s hard to overstate the importance of that. This is not a case of humans failing the system. It’s a case of the system failing humans. The administrative tools and playbooks allowed this error to happen. They amplified a minor error into enormous consequences. We must regard this as a system failure. The term “system” here means the whole system, S3 plus the control plane software and human processes to manage it all.
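One way the tooling itself can keep a mistyped argument from cascading is to enforce a blast-radius limit inside the command rather than relying on the operator to get every input right. The sketch below is a minimal, hypothetical example of that idea; the `remove_servers` helper, the limits, and the pool model are assumptions for illustration, not Amazon's actual tooling. It refuses any single invocation that would remove more than a small fraction of a subsystem's capacity or drop it below a safe floor.

```python
# Hypothetical guardrail for a capacity-removal command.
# Names and thresholds are illustrative, not Amazon's actual tooling.

MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a pool at once
MIN_REMAINING_SERVERS = 20    # never drop a pool below this floor


class BlastRadiusError(Exception):
    """Raised when a request would exceed the tool's safety limits."""


def remove_servers(pool: list[str], requested: list[str]) -> list[str]:
    """Validate a removal request against blast-radius limits, then apply it."""
    unknown = [s for s in requested if s not in pool]
    if unknown:
        raise BlastRadiusError(f"Unknown servers (possible typo?): {unknown}")

    if len(requested) > max(1, int(len(pool) * MAX_REMOVAL_FRACTION)):
        raise BlastRadiusError(
            f"Refusing to remove {len(requested)} of {len(pool)} servers; "
            f"limit is {MAX_REMOVAL_FRACTION:.0%} per invocation."
        )

    remaining = len(pool) - len(requested)
    if remaining < MIN_REMAINING_SERVERS:
        raise BlastRadiusError(
            f"Removal would leave only {remaining} servers; "
            f"minimum allowed is {MIN_REMAINING_SERVERS}."
        )

    return [s for s in pool if s not in requested]
```

With limits like these, a mistyped input fails loudly instead of quietly taking out a subsystem, and the operator is pushed toward smaller, reviewable steps.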
The second thing to note is that the playbook involved here had apparently been used before. But it hadn’t previously resulted in front-page news. Why not? For whatever reason, it worked before. We should try to learn from the successes as well as the failures. When the playbook was previously used, were the conditions different? There could be variations in any of the following:
- Who executed it? Did someone verify their work?
- Were there revisions to the playbook? Sometimes error-checking steps get relaxed over time.
- What feedback did the underlying system provide? Feedback may have helped avert previous problems.
Observing anomalies
We tend to have postmortem reviews of incidents with bad outcomes. Then we look for causes, and any anomaly gets labeled as either a root cause or a contributing factor. But many times those same anomalies are present during ordinary operations, too. We give them more weight after an outage because we have the benefit of hindsight.
We also have many opportunities to learn from successful operations. Anomalies are present all the time, but most of the time they don’t cause outages. Let’s devote some effort to learning from those. Hold postmortems for successful changes. See what variations or anomalies happened. Find out what nearly failed. Did someone type an incorrect command but catch it before executing? That’s a near miss. Find out how they caught it. Find out what safety net could have helped them catch it or stop it from doing harm.
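One simple safety net of that kind is a dry-run mode that shows the operator exactly what a command would do before it does anything. The snippet below is only a sketch under assumed names (`plan_removal`, `confirm_and_remove`, and the prompt behavior are all hypothetical): it prints the plan and requires explicit confirmation before executing.

```python
# Hypothetical dry-run / confirmation wrapper; names are illustrative only.

def plan_removal(pool: list[str], requested: list[str]) -> None:
    """Show what a removal command would do without doing it."""
    print(f"Would remove {len(requested)} of {len(pool)} servers:")
    for server in requested:
        print(f"  - {server}")


def confirm_and_remove(pool: list[str], requested: list[str]) -> list[str]:
    """Require the operator to confirm the plan before anything changes."""
    plan_removal(pool, requested)
    answer = input("Type 'yes' to proceed: ")
    if answer.strip().lower() != "yes":
        print("Aborted; nothing was changed.")
        return pool
    return [s for s in pool if s not in requested]
```

A wrapper like this turns a near miss into a visible, recorded event: the mistyped command shows up in the printed plan, the operator aborts, and there is something concrete to learn from afterward.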