Outage Amplification

Learn about force multiplier, outage amplification, autoscaler, service discovery system, platform management service, control plane.

Force Multiplier

Like a lever, automation allows administrators to make large movements with less effort. It’s a force multiplier.

Reddit outage

On August 11, 2016, link aggregator Reddit.com suffered an outage. It was unavailable for approximately 90 minutes and had degraded service for about another 90 minutes 8^{8}. In their postmortem, Reddit admins described a conflict between deliberate, manual changes and their automation platform:

  1. First, the admins shut down their autoscaler service so that they could upgrade a ZooKeeper cluster 9^{9}.

  2. Sometime into the upgrade process, the package management system detected the autoscaler was off and restarted it.

  3. The autoscaler came back online and read the partially migrated ZooKeeper data. The incomplete ZooKeeper data reflected a much smaller environment than was currently running.

  4. The autoscaler decided that too many servers were running. It therefore shut down many application and cache servers. This is the start of the downtime.

  5. Sometime later, the admins identified the autoscaler as the culprit. They overrode the autoscaler and started restoring instances manually. The instances came up, but their caches were empty. They all made requests to the database at the same time, which led to a dogpile on the database. Reddit was up but unusably slow during this time.

  6. Finally, the caches warmed sufficiently to handle typical traffic. The long nightmare ended and users resumed downvoting everything they disagreed with. In other words, normal activity resumed.

Autoscaler was down

The most interesting aspect of this outage is that it emerged from a conflict between the automation platform’s and administrator’s beliefs about the expected state of the system. When the package management system reactivated the autoscaler, it had no way to know that the autoscaler was expected to be down. Likewise, the autoscaler had no way to know that its source of truth (ZooKeeper) was temporarily unable to report the truth. The automation systems were stuck between two conflicting sets of instructions.

Service discovery system

A similar condition can occur with service discovery systems. A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems. When things are running normally, they work as shown in the figure below.

Get hands-on with 1200+ tech skills courses.