Waiting for Godot

Learn how poor management, features, and a careless QA process can cause a system to fail.

The start of the incident

It isn’t enough to write the code. Nothing is done until it runs in production. Sometimes the path to production is a smooth and open highway. Other times, especially with older systems, it’s a muddy track festooned with potholes, bandits, and checkpoints with border guards. This was one of the bad ones.

I turned my grainy eyes toward the clock on the wall. The hands pointed to 1:17 a.m. On the Polycom, someone was reporting status. It was a DBA. One of the SQL scripts hadn’t worked right, but he had “fixed” it by running it under a different user ID.

The wall clock didn’t mean much at the time. Our Lamport clock was still stuck a little before midnight. The playbook had a row that said SQL scripts would finish at 11:50 p.m. We were still on the SQL scripts, so logically we were still at 11:50 p.m. Before dawn, we needed our playbook time and solar time to converge in order for this deployment to succeed.

