Postmortem

Explore how to conduct a detailed postmortem investigation after an airline system outage. Understand how to analyze logs, verify configurations, and identify root causes to prevent repeated failures. This lesson helps you grasp the challenges of troubleshooting and managing perception during critical distributed system incidents.

Looking into the problem

At 10:30 a.m. Pacific Time, eight hours after the outage started, our account representative, Tom (not his real name), called for a postmortem.

In operations, “post hoc, ergo propter hoc,” Latin for “you touched it last,” turns out to be a good starting point most of the time. It’s not always right, but it certainly provides a place to begin looking. In fact, when Tom called me, he asked me to fly there to find out why the database failover caused this outage. Once I was airborne, I started reviewing the problem ticket and preliminary ...