The Outage
Explore how to manage a major airline system outage by prioritizing service restoration and diagnosing critical dependencies. Understand how targeted interventions, such as restarting specific application servers, can quickly recover check-in kiosks and IVR systems during high-demand periods.
Services stopped
At about 2:30 a.m., all the check-in kiosks went red on the monitoring console. Every single one, everywhere in the country, stopped servicing requests at the same time.
Red signals
A few minutes later, the IVR servers went red too. Not exactly panic time, but pretty close, because 2:30 a.m. Pacific time is 5:30 a.m. Eastern time, which is prime time for commuter flight check-in on the Eastern seaboard. The operations center immediately opened a Severity 1 case and got the local team on a conference call.
Restore services
In any incident, the first priority is always to restore service. Restoring service takes precedence over investigation. If we can collect some data ...