Facebook, WhatsApp, Instagram, Oculus Outage - 2021-10-04

Learning from a major Facebook outage.

On October 4, 2021 at 15:39 UTC, the social network Facebook and its subsidiaries (Messenger, Instagram, WhatsApp, Mapillary, Oculus) experienced a global outage that lasted about six hours. The popular media covered the failure prominently (for example, NYT reported: “Gone in Minutes, Out for Hours: Outage Shakes Facebook”). According to one estimate, the outage cost Facebook about $100 million in lost revenue, and billions more in market value as the company’s stock declined.

Let us walk through the sequence of events that caused this global outage.

Sequence of Events

  • A routine maintenance system needed to assess the spare capacity on Facebook’s backbone network.
  • Due to a configuration error, the maintenance system disconnected all of Facebook’s data centers from each other on the backbone network. A separate automated configuration-review tool was supposed to catch such mistakes, but it missed this problem.
  • Facebook’s authoritative Domain Name System (DNS) servers had a health-check rule: if they could not reach Facebook’s internal data centers, they would stop answering client DNS queries by withdrawing their network routes.
  • When the network routes (on which Facebook’s authoritative DNS servers were hosted) were withdrawn, the cached mappings of human-readable names to IPs soon timed out at public DNS resolvers everywhere. (When a client resolves www.facebook.com, the DNS resolver first asks one of the root DNS servers, which returns the list of authoritative DNS servers for .com. The resolver then asks one of those, which returns the IPs of the authoritative DNS servers for facebook.com. After the route withdrawal, those servers were unreachable, so resolution failed; the caching-resolver sketch after this list illustrates the effect.)
  • At this point, no one could reach Facebook or its subsidiaries.
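
The interplay between route withdrawal and DNS caching explains why the failure became visible worldwide only as caches expired. Below is a minimal, self-contained Python sketch of a caching resolver with TTLs, assuming the authoritative servers simply become unreachable once their routes are withdrawn; the class, the TTL value, and the example IP are illustrative assumptions, not Facebook’s actual infrastructure or record values.

```python
import time

# Minimal sketch of a caching DNS resolver (illustrative only, not real DNS code).
# Assumption: the authoritative servers become unreachable once their routes are withdrawn.

CACHE_TTL = 300  # seconds; an assumed TTL, not Facebook's actual record TTL

class CachingResolver:
    def __init__(self, authoritative_reachable=True):
        self.cache = {}                          # name -> (ip, expiry_time)
        self.authoritative_reachable = authoritative_reachable

    def query_authoritative(self, name):
        # In reality this walks root -> .com -> facebook.com authoritative servers.
        if not self.authoritative_reachable:
            raise TimeoutError(f"authoritative servers for {name} unreachable (routes withdrawn)")
        return "157.240.0.35"                    # placeholder IP for illustration

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and cached[1] > now:           # cache hit, still within TTL
            return cached[0]
        ip = self.query_authoritative(name)      # cache miss or expired: go upstream
        self.cache[name] = (ip, now + CACHE_TTL)
        return ip

resolver = CachingResolver()
t0 = time.time()
print(resolver.resolve("www.facebook.com", t0))       # works and fills the cache

resolver.authoritative_reachable = False               # backbone routes withdrawn
print(resolver.resolve("www.facebook.com", t0 + 60))   # still answered from cache

try:
    resolver.resolve("www.facebook.com", t0 + 600)     # TTL expired -> hard failure
except TimeoutError as e:
    print("resolution failed:", e)
```

The real effect was the same in spirit: public resolvers kept answering from their caches for a short while, and the outage became total only as TTLs expired around the world.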

Analysis

  • Withdrawal or addition of network routes is a relatively common activity. However, a confluence of bugs (first a faulty configuration, and then a bug in an audit tool that failed to detect the problem) triggered a chain of events resulting in cascading failures (where one failure triggers another, ultimately bringing the whole system down).
  • It might seem curious that it took six hours to restore the service. Wasn’t it easy to re-announce the withdrawn routes? At Facebook’s scale, rarely is anything done manually; automated systems make such changes. Those internal tools probably relied on the DNS infrastructure, so with all data centers cut off from the backbone, it would have been virtually impossible to use them, and manual intervention became necessary. Manually bootstrapping a system of this scale is not easy, and the usual physical and digital security mechanisms that were in place made manual intervention a slow process.
  • In retrospect, it might seem odd that the authoritative DNS servers would disconnect themselves just because the internal data centers were unreachable. This is another example of a very rare event (none of the data centers being reachable) occurring and triggering a further failure.
  • Facebook has been an early advocate of automating network configuration changes, effectively arguing that software can do a better job of running a network than humans (who are more prone to errors). But software can have bugs, such as this one.

Lessons Learned

  • There can be hidden single points of failure in complex systems. Probably the best defense against such faults is to keep the operations team ready for such an occurrence through regular training. Thinking clearly under high-stress situations becomes necessary to deal with such events.
  • As systems get bigger, they become more complex and exhibit emergent behaviours. To understand the overall behaviour of the system, it might not be sufficient to understand the behaviour of its components, and cascading failures can arise. This is one reason to keep the system design as simple as possible for the current needs and to evolve the design slowly. Unfortunately, there is no silver bullet for this problem beyond accepting the possibility, monitoring continuously, being able to resolve issues when they arise, and learning from failures by improving the system.
  • Some third-party services rely on Facebook for single sign-on. When the outage occurred, those third-party services were up and running, but their clients were unable to use them because Facebook’s login facility was also unavailable. This is another example of assuming that some service will always remain available, and of a hidden single point of failure.
  • A few services are so robustly designed and perfected over time that their clients start assuming that the service is, and always will be, 100% available. DNS is one such service: it is very carefully crafted, and designers often assume it will never fail. Hosting DNS with independent third-party providers might be one way to guard against such problems; DNS allows multiple authoritative servers, and an organization can place them with different operators in different locations (see the provider-diversity sketch after this list). However, DNS at Facebook’s scale is not simple: it is tightly coupled to their backbone infrastructure and changes frequently. Delegating such a piece to an independent third party is expensive and might reveal internal service details, so there is a trade-off between business and robustness needs.
  • There can be some surprising trade-offs. An example here is the need for data security versus the need for rapid manual repair. So many physical and digital safeguards were in place that manual intervention was slow. This is a catch-22-like situation: lowering the security requirements can cause immense trouble, and a slow repair during such an event can also hurt the company. The hope is that the need for such repair is a very rare event.
  • Failures of large players disrupt the whole Internet. Third-party public resolvers (for example, from Google and Cloudflare) saw a surge in load due to unsuccessful DNS retries.
  • Restarting a large service is not as easy as flipping a switch. When the load suddenly dropped to almost zero (because clients could not reach the service), turning everything back on at once could mean an uptick of many megawatts in power use, which might cause issues for the electric grid. Complex systems usually have a steady state, and if they leave that steady state, care must be taken to bring them back gradually (see the ramp-up sketch after this list).
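
To make the provider-diversity point concrete, here is a small Python sketch of an audit that checks whether a zone’s authoritative name servers are spread across independent operators. The zone name, name-server hostnames, and the crude operator heuristic are hypothetical examples, not Facebook’s real DNS setup.

```python
# Sketch: audit that a zone's authoritative name servers span independent operators.
# The NS hostnames below are hypothetical; a real audit would pull them from the zone.

def operator_of(ns_hostname: str) -> str:
    """Crude heuristic: treat the registrable suffix (last two labels) as the operator."""
    return ".".join(ns_hostname.rstrip(".").split(".")[-2:])

def audit_ns_diversity(zone: str, ns_records: list, min_operators: int = 2) -> None:
    operators = {operator_of(ns) for ns in ns_records}
    if len(operators) < min_operators:
        print(f"WARNING: all authoritative DNS for {zone} is run by {sorted(operators)} "
              f"-- a single operator/backbone failure can take the zone offline.")
    else:
        print(f"OK: {zone} is served by {len(operators)} independent operators: {sorted(operators)}")

# Hypothetical single-operator setup (everything depends on one backbone):
audit_ns_diversity("example.com", ["a.ns.example.com", "b.ns.example.com"])

# Hypothetical diversified setup (in-house plus an independent third-party provider):
audit_ns_diversity("example.com", ["a.ns.example.com", "ns1.thirdparty-dns.net"])
```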
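
On the restart point, one common pattern is to re-admit traffic gradually instead of all at once, shedding the excess so that power draw and cache warm-up stay bounded. The following Python sketch illustrates the idea; the ramp duration, step size, and probabilistic admission are illustrative assumptions, not Facebook’s actual recovery procedure.

```python
import random

# Sketch: gradually re-admit traffic after a restart instead of turning everything on at once.
# Ramp duration and step are illustrative assumptions.

RAMP_MINUTES = 60          # assumed time to go from 0% to 100% admission
STEP_MINUTES = 10          # assumed re-evaluation interval

def admission_fraction(minutes_since_restart: int) -> float:
    """Fraction of incoming requests to serve; the rest are shed (clients retry later)."""
    return min(1.0, minutes_since_restart / RAMP_MINUTES)

def handle_request(minutes_since_restart: int) -> str:
    if random.random() < admission_fraction(minutes_since_restart):
        return "served"
    return "shed (retry later)"   # load shedding keeps power and cache-warmup spikes bounded

for t in range(0, RAMP_MINUTES + 1, STEP_MINUTES):
    sample = [handle_request(t) for _ in range(1000)]
    served = sample.count("served")
    print(f"t={t:2d} min: admitting ~{admission_fraction(t):.0%}, served {served}/1000 sampled requests")
```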

Exercise

What might you do to safeguard against the kind of cascading faults Facebook experienced?

Possible Solutions

  • Network verification has recently gained momentum and shown promise in catching bugs early. Such tools use an abstract model of the infrastructure.
  • We might have more than one layer of auditing. A second layer might use a simulator to make sure that, after a configuration change, critical network infrastructure remains available and reachable from multiple global vantage points (see the reachability-check sketch after this list).
  • Every effort should be made to reduce the scope of a configuration change to avoid cascading effects.
  • Critical infrastructure might be programmed in such a way that if something bad happens, it can return to the last known good state (though that is easier said than done, given the sheer number of such components); see the rollback sketch after this list.
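
As an illustration of this kind of auditing, here is a small Python sketch that models the network as a graph, applies a proposed change, and rejects it if the authoritative DNS nodes become unreachable from a set of external vantage points. The topology, node names, and the proposed change are all hypothetical.

```python
from collections import deque

# Sketch: pre-deployment check that a proposed change keeps critical nodes reachable.
# Topology, node names, and the proposed change are hypothetical.

topology = {
    "vantage-us":  {"backbone-1"},
    "vantage-eu":  {"backbone-2"},
    "backbone-1":  {"vantage-us", "backbone-2", "dns-a"},
    "backbone-2":  {"vantage-eu", "backbone-1", "dns-b"},
    "dns-a":       {"backbone-1"},
    "dns-b":       {"backbone-2"},
}

def reachable(graph, start):
    """Breadth-first search: all nodes reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def apply_change(graph, links_to_remove):
    """Return a copy of the graph with the given bidirectional links removed."""
    new = {n: set(neigh) for n, neigh in graph.items()}
    for a, b in links_to_remove:
        new[a].discard(b)
        new[b].discard(a)
    return new

def audit(graph, vantage_points, critical_nodes):
    for vp in vantage_points:
        missing = set(critical_nodes) - reachable(graph, vp)
        if missing:
            return f"REJECT: {sorted(missing)} unreachable from {vp}"
    return "ACCEPT: all critical nodes reachable from every vantage point"

# A proposed "maintenance" change that cuts both backbone links to the DNS nodes:
proposed = apply_change(topology, [("backbone-1", "dns-a"), ("backbone-2", "dns-b")])

print(audit(topology, ["vantage-us", "vantage-eu"], ["dns-a", "dns-b"]))  # ACCEPT
print(audit(proposed, ["vantage-us", "vantage-eu"], ["dns-a", "dns-b"]))  # REJECT
```

Real network-verification tools work on far richer models (routing policies, ACLs, BGP announcements), but the accept/reject decision has the same shape.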
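
For the last-known-good idea, here is a minimal Python sketch of a configuration apply path that snapshots the currently working configuration, applies the new one, and automatically reverts if a post-change health check fails. The configuration format and the health check are placeholders for whatever the real system would use.

```python
# Sketch: apply a configuration change with automatic rollback to the last known good state.
# The config format and health check are placeholders.

class ConfigManager:
    def __init__(self, initial_config, health_check):
        self.active = initial_config          # last known good configuration
        self.health_check = health_check      # callable: config -> bool

    def apply(self, new_config):
        previous = self.active                # snapshot before touching anything
        self.active = new_config
        if self.health_check(self.active):
            return "applied"
        self.active = previous                # health check failed: revert automatically
        return "rolled back to last known good"

# Placeholder health check: the backbone must keep at least one route to the DNS servers.
def dns_still_reachable(config) -> bool:
    return config.get("routes_to_dns", 0) > 0

mgr = ConfigManager({"routes_to_dns": 4}, dns_still_reachable)
print(mgr.apply({"routes_to_dns": 2}))   # applied
print(mgr.apply({"routes_to_dns": 0}))   # rolled back to last known good
print(mgr.active)                        # {'routes_to_dns': 2}
```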
