Waiting for Godot
Understand the challenges faced during software deployment in production environments. Learn how coordinated efforts, playbook processes, and monitoring help manage incidents and improve deployment efficiency. The lesson highlights the importance of refining deployment practices to reduce errors, save costs, and increase system reliability.
The start of the incident
It isn’t enough to write the code. Nothing is done until it runs in production. Sometimes the path to production is a smooth and open highway. Other times, especially with older systems, it’s a muddy track festooned with potholes, bandits, and checkpoints with border guards. This was one of the bad ones. I turned my grainy eyes toward the clock on the wall. The hands pointed to 1:17 a.m. On the Polycom, someone was reporting status. It was a DBA. One of the SQL scripts hadn’t worked right, but he had “fixed” it by running it under a different user ID.
The wall clock didn’t mean much at the time. Our Lamport clock was still stuck a little before midnight. The playbook had a row that said SQL scripts would finish at 11:50 p.m. We were still on the SQL scripts, so logically we were still at 11:50 p.m. Before dawn, we needed our playbook time and solar time to converge in order for this deployment to succeed.
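The idea of playbook time can be sketched in a few lines of Python. This is only an illustration, not a real Lamport clock and not the team's actual tooling: the `PlaybookRow` class, the `playbook_time` and `lag` helpers, and the calendar dates are all hypothetical. Only the 3 p.m., 11:50 p.m., and 1:17 a.m. times come from the story. Playbook time is taken to be the planned finish time of the first row that is not yet done, so it stands still no matter how far the wall clock advances.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PlaybookRow:
    """One row of a (hypothetical) deployment playbook."""
    task: str
    planned_finish: datetime  # when the playbook says this row should be done
    done: bool = False


def playbook_time(rows):
    """Return the planned finish time of the first unfinished row.

    Returns None when every row is done.
    """
    for row in rows:
        if not row.done:
            return row.planned_finish
    return None


def lag(rows, wall_clock):
    """Return how far the wall clock has run ahead of playbook time."""
    current = playbook_time(rows)
    if current is None:
        return timedelta(0)
    return max(wall_clock - current, timedelta(0))


# Assumed dates; the row names are paraphrased from the narrative.
rows = [
    PlaybookRow("Go or no-go meeting", datetime(2024, 1, 1, 15, 0), done=True),
    PlaybookRow("SQL scripts finish", datetime(2024, 1, 1, 23, 50)),
]
now = datetime(2024, 1, 2, 1, 17)  # the wall clock reading from the story
behind = lag(rows, now)
print(playbook_time(rows).strftime("%H:%M"), "behind by", behind)
```

Until the SQL scripts row is marked done, `playbook_time` keeps returning 11:50 p.m., and `lag` grows with every minute on the wall clock. For the deployment to succeed, that lag has to shrink back to zero before dawn.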
The first row in the playbook started yesterday afternoon with a round of status reports from each area:
- Dev
- QA
- Content
- Merchants
- Order management, and so on
The go or no-go meeting
Somewhere on the first page of the playbook, we had a “go or no-go” meeting at 3 p.m. Everyone gave the deployment a go, although QA said that they ...