Comparing options

Learn about unknown behavior of applications, scheduling requests, and recovery-oriented computing.

Conway’s law

Brainstorming ensued. Numerous proposals were thrown up and shot down, generally because the application code’s behavior under those circumstances was unknown. It quickly became clear that the only answer was to stop making so many requests to check schedule availability. With the weekend’s marketing campaign centered around free home delivery, we knew requests from the users were not about to slow down. We had to find a way to throttle the calls. The order management system had no way to do that. We saw a glimmer of hope when we looked at the code for the store. It used a subclass of the standard resource pool to manage connections to order management.

In fact, it had a separate connection pool just for scheduling requests. I’m not sure why the code was designed with a separate connection pool for that, probably an example of Conway’s law, but it saved the day and the retail weekend. Because it had a component just for those connections, we could use that component as our throttle.

If the developers had added an enabled property, it would have been simple to set that to false. Maybe we could do the next best thing, though. A resource pool with a zero maximum is effectively disabled anyway. I asked the developers what would happen if the pool started returning null instead of a connection. They replied that the code would handle that and present the user with a polite message stating that delivery scheduling was not available for the time being. Good enough.
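The idea of a pool with a zero maximum acting as an off switch can be sketched as follows. This is a minimal illustration, not the store's actual code: the names SchedulingPool, Connection, delivery_options, and UNAVAILABLE_MESSAGE are all hypothetical, and the sketch assumes the calling code checks for a missing connection the way the developers described.

```python
import queue
from typing import Optional

# Hypothetical wording; the real message shown to users is not given in the text.
UNAVAILABLE_MESSAGE = "Delivery scheduling is not available right now."


class Connection:
    """Hypothetical stand-in for a connection to order management."""


class SchedulingPool:
    """Hypothetical bounded pool dedicated to scheduling requests."""

    def __init__(self, max_size: int, checkout_block_time: float = 0.0):
        self.checkout_block_time = checkout_block_time
        self._free: "queue.Queue[Connection]" = queue.Queue()
        for _ in range(max_size):
            self._free.put(Connection())

    def checkout(self) -> Optional[Connection]:
        """Return a connection, or None if none is free within the block time."""
        try:
            if self.checkout_block_time > 0:
                return self._free.get(timeout=self.checkout_block_time)
            return self._free.get_nowait()
        except queue.Empty:
            return None

    def checkin(self, conn: Connection) -> None:
        self._free.put(conn)


def delivery_options(pool: SchedulingPool) -> str:
    """Caller that degrades gracefully when the pool hands back nothing."""
    conn = pool.checkout()
    if conn is None:
        return UNAVAILABLE_MESSAGE
    try:
        # Placeholder for the real availability call to order management.
        return "delivery windows loaded"
    finally:
        pool.checkin(conn)


if __name__ == "__main__":
    print(delivery_options(SchedulingPool(max_size=0)))
    print(delivery_options(SchedulingPool(max_size=2)))
```

With a maximum of zero and a block time of zero, every checkout fails immediately, so no request threads pile up waiting and no calls reach the struggling back end.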

Does the condition respond to treatment?

One of my Perl scripts could set the value of any property on any component. As an experiment, I used the script to set max for that resource pool (on just one DRP) to zero, and I set checkoutBlockTime to zero. Nothing happened. No change in behavior at all. Then I remembered that max has an effect only when the pool is starting up. I used another script, one that could invoke methods on the component, to call its stopService() and startService() methods. That DRP started handling requests again! Of course, because only one DRP was responding, the load manager started sending every single page request to that one DRP. It was crushed like the last open beer stand at a World Cup match. But at least we had a strategy.
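Why the first attempt did nothing can be shown with a small sketch. It assumes a component that reads its max property only while starting, which is what the text describes. The class ResourcePool and the helpers set_property and recycle are hypothetical stand-ins for the real component and the Perl scripts; only the names max, checkoutBlockTime, stopService, and startService come from the story, rendered here in Python style.

```python
class ResourcePool:
    """Hypothetical component whose `max` is consulted only at startup."""

    def __init__(self, max_connections: int = 10, checkout_block_time: float = 2.0):
        # Configurable properties, changeable at any time.
        self.max = max_connections
        self.checkout_block_time = checkout_block_time
        # Runtime state, derived from the properties when the service starts.
        self._capacity = 0
        self._running = False

    def start_service(self) -> None:
        # The property value is captured here and nowhere else.
        self._capacity = self.max
        self._running = True

    def stop_service(self) -> None:
        self._capacity = 0
        self._running = False

    def effective_capacity(self) -> int:
        return self._capacity


def set_property(component: object, name: str, value: object) -> None:
    """Stand-in for the script that sets any property on any component."""
    setattr(component, name, value)


def recycle(component: ResourcePool) -> None:
    """Stand-in for the script that invokes stop and start on a component."""
    component.stop_service()
    component.start_service()


if __name__ == "__main__":
    pool = ResourcePool()
    pool.start_service()

    set_property(pool, "max", 0)
    set_property(pool, "checkout_block_time", 0)
    print(pool.effective_capacity())  # still the old capacity

    recycle(pool)
    print(pool.effective_capacity())  # new value picked up on restart
```

Changing the property alters only the stored configuration. The running pool keeps its old capacity until the component is stopped and started again, and that restart touches one component rather than the whole server.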

Recovery-Oriented Computing

The Recovery-Oriented Computing (ROC) project was a joint Berkeley and Stanford research project. The project’s founding principles are as follows:

  • Failures are inevitable, in both hardware and software.
  • Modeling and analysis can never be sufficiently complete. A priori prediction of all failure modes is not possible.
  • Human action is a major source of system failures.

Their research runs contrary to much of the prior work in system reliability. Whereas most work focuses on eliminating the sources of failure, ROC accepts that failures will inevitably happen—a major theme in this course! Their investigations aim to improve survivability in the face of failures.

The concepts of ROC were ahead of their time in 2005. Now they seem natural in the world of microservices, containers, and elastic scaling.

I ran my scripts, this time with the flag that said “all DRPs.” They set max and checkoutBlockTime to zero and then recycled the service. The ability to restart components, instead of entire servers, is a key concept of recovery-oriented computing. Although we didn’t have the level of automation that ROC proposes, we were able to recover service without rebooting the world. If we had needed to change the configuration files and restart all the servers, it would have taken more than six hours under that level of load. Dynamically reconfiguring and restarting just the connection pool took less than five minutes (once we knew what to do). Almost immediately after my scripts finished, we saw user traffic getting through. Page latency started to drop. About ninety seconds later, the DRPs went green in SiteScope. The site was back up and running.
