Facebook, WhatsApp, Instagram, Oculus Outage - 2021-10-04

Learning from a major Facebook outage.

On October 4, 2021 at 15:39 UTC, the social network Facebook and its subsidiaries (Messenger, Instagram, WhatsApp, Mapillary, Oculus) experienced a global outage that lasted about six hours. The popular media covered the failure prominently (for example, NYT reported: “Gone in Minutes, Out for Hours: Outage Shakes Facebook”). According to one estimate, the outage cost Facebook about $100 million in lost revenue, and billions more in market value as the company’s stock declined.

Let us walk through the sequence of events that caused this global outage.

Sequence of Events

  • A routine maintenance system needed to assess the spare capacity on Facebook’s backbone network.
  • Due to a configuration error, the maintenance system disconnected all of Facebook’s data centers from each other on the backbone network. A separate automated configuration-review tool was supposed to catch such mistakes, but it missed this problem.
  • Facebook’s authoritative Domain Name System (DNS) servers had a health-check rule: if they could not reach Facebook’s internal data centers, they would stop answering client DNS queries by withdrawing their network routes.
  • When the network routes (on which Facebook’s authoritative DNS servers were hosted) were withdrawn, the cached mappings of human-readable names to IPs soon timed out at public DNS resolvers everywhere. (When a client resolves www.facebook.com, the DNS resolver first asks one of the root DNS servers, which returns the list of authoritative DNS servers for .com. The resolver then asks one of those, which returns the IPs of the authoritative DNS servers for facebook.com. After the route withdrawal, those servers were unreachable, so resolution failed; the caching-resolver sketch after this list illustrates the effect.)
  • At this point, no one could reach Facebook or its subsidiaries.
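
The interplay between route withdrawal and DNS caching explains why the failure became visible worldwide only as caches expired. Below is a minimal, self-contained Python sketch of a caching resolver with TTLs, assuming the authoritative servers simply become unreachable once their routes are withdrawn; the class, the TTL value, and the example IP are illustrative assumptions, not Facebook’s actual infrastructure or record values.

```python
import time

# Minimal sketch of a caching DNS resolver (illustrative only, not real DNS code).
# Assumption: the authoritative servers become unreachable once their routes are withdrawn.

CACHE_TTL = 300  # seconds; an assumed TTL, not Facebook's actual record TTL

class CachingResolver:
    def __init__(self, authoritative_reachable=True):
        self.cache = {}                          # name -> (ip, expiry_time)
        self.authoritative_reachable = authoritative_reachable

    def query_authoritative(self, name):
        # In reality this walks root -> .com -> facebook.com authoritative servers.
        if not self.authoritative_reachable:
            raise TimeoutError(f"authoritative servers for {name} unreachable (routes withdrawn)")
        return "157.240.0.35"                    # placeholder IP for illustration

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and cached[1] > now:           # cache hit, still within TTL
            return cached[0]
        ip = self.query_authoritative(name)      # cache miss or expired: go upstream
        self.cache[name] = (ip, now + CACHE_TTL)
        return ip

resolver = CachingResolver()
t0 = time.time()
print(resolver.resolve("www.facebook.com", t0))       # works and fills the cache

resolver.authoritative_reachable = False               # backbone routes withdrawn
print(resolver.resolve("www.facebook.com", t0 + 60))   # still answered from cache

try:
    resolver.resolve("www.facebook.com", t0 + 600)     # TTL expired -> hard failure
except TimeoutError as e:
    print("resolution failed:", e)
```

The real effect was the same in spirit: public resolvers kept answering from their caches for a short while, and the outage became total only as TTLs expired around the world.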

Analysis

  • Withdrawal or addition of network routes is a relatively common activity. However, a confluence of bugs (first a faulty configuration, and then a bug in an audit tool that failed to detect the problem) triggered a chain of events resulting in cascading failures (where one failure triggers another, ultimately bringing the whole system down).
  • It might seem curious that it took six hours to restore the service. Wasn’t it easy to re-announce the withdrawn routes? At Facebook’s scale, rarely is anything done manually; automated systems make such changes. Those internal tools probably relied on the DNS infrastructure, so with all data centers cut off from the backbone, it would have been virtually impossible to use them, and manual intervention became necessary. Manually bootstrapping a system of this scale is not easy, and the usual physical and digital security mechanisms that were in place made manual intervention a slow process.
  • In retrospect, it might seem odd that the authoritative DNS servers would disconnect themselves just because the internal data centers were unreachable. This is another example of a very rare event (none of the data centers being reachable) occurring and triggering a further failure.
  • Facebook has been an early advocate of automating network configuration changes, effectively arguing that software can do a better job of running a network than humans (who are more prone to errors). But software can have bugs, such as this one.

Lessons Learned

  • There can be hidden single points of failure in complex systems. Probably the best defense against such faults is to keep the operations team ready for such an occurrence through regular training. Thinking clearly under high-stress situations becomes necessary to deal with such events.
  • As systems get bigger, they become more complex and exhibit emergent behaviours. To understand the overall behaviour of the system, it might not be sufficient to understand the behaviour of its components, and cascading failures can arise. This is one reason to keep the system design as simple as possible for the current needs and to evolve the design slowly. Unfortunately, there is no silver bullet for this problem beyond accepting the possibility, monitoring continuously, being able to resolve issues when they arise, and learning from failures by improving the system.
  • Some third-party services rely on Facebook for single sign-on. When the outage occurred, those third-party services were up and running, but their clients were unable to use them because Facebook’s login facility was also unavailable. This is another example of assuming that some service will always remain available, and of a hidden single point of failure.
  • A few services are so robustly designed and perfected over time that their clients start assuming that the service is, and always will be, 100% available. DNS is one such service: it is very carefully crafted, and designers often assume it will never fail. Hosting DNS with independent third-party providers might be one way to guard against such problems; DNS allows multiple authoritative servers, and an organization can place them with different operators in different locations (see the provider-diversity sketch after this list). However, DNS at Facebook’s scale is not simple: it is tightly coupled to their backbone infrastructure and changes frequently. Delegating such a piece to an independent third party is expensive and might reveal internal service details, so there is a trade-off between business and robustness needs.
  • There can be some surprising trade-offs. An example here is the need for data security versus the need for rapid manual repair. So many physical and digital safeguards were in place that manual intervention was slow. This is a catch-22-like situation: lowering the security requirements can cause immense trouble, and a slow repair during such an event can also hurt the company. The hope is that the need for such repair is a very rare event.
  • Failures of large players disrupt the whole Internet. Third-party public resolvers (for example, from Google and Cloudflare) saw a surge in load due to unsuccessful DNS retries.
  • Restarting a large service is not as easy as flipping a switch. When the load suddenly dropped to almost zero (because clients could not reach the service), turning everything back on at once could mean an uptick of many megawatts in power use, which might cause issues for the electric grid. Complex systems usually have a steady state, and if they leave that steady state, care must be taken to bring them back gradually (see the ramp-up sketch after this list).
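
To make the provider-diversity point concrete, here is a small Python sketch of an audit that checks whether a zone’s authoritative name servers are spread across independent operators. The zone name, name-server hostnames, and the crude operator heuristic are hypothetical examples, not Facebook’s real DNS setup.

```python
# Sketch: audit that a zone's authoritative name servers span independent operators.
# The NS hostnames below are hypothetical; a real audit would pull them from the zone.

def operator_of(ns_hostname: str) -> str:
    """Crude heuristic: treat the registrable suffix (last two labels) as the operator."""
    return ".".join(ns_hostname.rstrip(".").split(".")[-2:])

def audit_ns_diversity(zone: str, ns_records: list, min_operators: int = 2) -> None:
    operators = {operator_of(ns) for ns in ns_records}
    if len(operators) < min_operators:
        print(f"WARNING: all authoritative DNS for {zone} is run by {sorted(operators)} "
              f"-- a single operator/backbone failure can take the zone offline.")
    else:
        print(f"OK: {zone} is served by {len(operators)} independent operators: {sorted(operators)}")

# Hypothetical single-operator setup (everything depends on one backbone):
audit_ns_diversity("example.com", ["a.ns.example.com", "b.ns.example.com"])

# Hypothetical diversified setup (in-house plus an independent third-party provider):
audit_ns_diversity("example.com", ["a.ns.example.com", "ns1.thirdparty-dns.net"])
```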
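
On the restart point, one common pattern is to re-admit traffic gradually instead of all at once, shedding the excess so that power draw and cache warm-up stay bounded. The following Python sketch illustrates the idea; the ramp duration, step size, and probabilistic admission are illustrative assumptions, not Facebook’s actual recovery procedure.

```python
import random

# Sketch: gradually re-admit traffic after a restart instead of turning everything on at once.
# Ramp duration and step are illustrative assumptions.

RAMP_MINUTES = 60          # assumed time to go from 0% to 100% admission
STEP_MINUTES = 10          # assumed re-evaluation interval

def admission_fraction(minutes_since_restart: int) -> float:
    """Fraction of incoming requests to serve; the rest are shed (clients retry later)."""
    return min(1.0, minutes_since_restart / RAMP_MINUTES)

def handle_request(minutes_since_restart: int) -> str:
    if random.random() < admission_fraction(minutes_since_restart):
        return "served"
    return "shed (retry later)"   # load shedding keeps power and cache-warmup spikes bounded

for t in range(0, RAMP_MINUTES + 1, STEP_MINUTES):
    sample = [handle_request(t) for _ in range(1000)]
    served = sample.count("served")
    print(f"t={t:2d} min: admitting ~{admission_fraction(t):.0%}, served {served}/1000 sampled requests")
```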

Exercise

What might you do to safeguard against the kind of cascading faults Facebook experienced?

Possible Solutions

  • Network verification has recently gained momentum and shown promise in catching bugs early. Such tools use an abstract model of the infrastructure.
  • We might have more than one layer of auditing. A second layer might use a simulator to make sure that, after a configuration change, critical network infrastructure remains available and reachable from multiple global vantage points (see the reachability-check sketch after this list).
  • Every effort should be made to reduce the scope of a configuration change to avoid cascading effects.
  • Critical infrastructure might be programmed in such a way that if something bad happens, it can return to the last known good state (though that is easier said than done, given the sheer number of such components); see the rollback sketch after this list.
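
As an illustration of this kind of auditing, here is a small Python sketch that models the network as a graph, applies a proposed change, and rejects it if the authoritative DNS nodes become unreachable from a set of external vantage points. The topology, node names, and the proposed change are all hypothetical.

```python
from collections import deque

# Sketch: pre-deployment check that a proposed change keeps critical nodes reachable.
# Topology, node names, and the proposed change are hypothetical.

topology = {
    "vantage-us":  {"backbone-1"},
    "vantage-eu":  {"backbone-2"},
    "backbone-1":  {"vantage-us", "backbone-2", "dns-a"},
    "backbone-2":  {"vantage-eu", "backbone-1", "dns-b"},
    "dns-a":       {"backbone-1"},
    "dns-b":       {"backbone-2"},
}

def reachable(graph, start):
    """Breadth-first search: all nodes reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def apply_change(graph, links_to_remove):
    """Return a copy of the graph with the given bidirectional links removed."""
    new = {n: set(neigh) for n, neigh in graph.items()}
    for a, b in links_to_remove:
        new[a].discard(b)
        new[b].discard(a)
    return new

def audit(graph, vantage_points, critical_nodes):
    for vp in vantage_points:
        missing = set(critical_nodes) - reachable(graph, vp)
        if missing:
            return f"REJECT: {sorted(missing)} unreachable from {vp}"
    return "ACCEPT: all critical nodes reachable from every vantage point"

# A proposed "maintenance" change that cuts both backbone links to the DNS nodes:
proposed = apply_change(topology, [("backbone-1", "dns-a"), ("backbone-2", "dns-b")])

print(audit(topology, ["vantage-us", "vantage-eu"], ["dns-a", "dns-b"]))  # ACCEPT
print(audit(proposed, ["vantage-us", "vantage-eu"], ["dns-a", "dns-b"]))  # REJECT
```

Real network-verification tools work on far richer models (routing policies, ACLs, BGP announcements), but the accept/reject decision has the same shape.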
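
For the last-known-good idea, here is a minimal Python sketch of a configuration apply path that snapshots the currently working configuration, applies the new one, and automatically reverts if a post-change health check fails. The configuration format and the health check are placeholders for whatever the real system would use.

```python
# Sketch: apply a configuration change with automatic rollback to the last known good state.
# The config format and health check are placeholders.

class ConfigManager:
    def __init__(self, initial_config, health_check):
        self.active = initial_config          # last known good configuration
        self.health_check = health_check      # callable: config -> bool

    def apply(self, new_config):
        previous = self.active                # snapshot before touching anything
        self.active = new_config
        if self.health_check(self.active):
            return "applied"
        self.active = previous                # health check failed: revert automatically
        return "rolled back to last known good"

# Placeholder health check: the backbone must keep at least one route to the DNS servers.
def dns_still_reachable(config) -> bool:
    return config.get("routes_to_dns", 0) > 0

mgr = ConfigManager({"routes_to_dns": 4}, dns_still_reachable)
print(mgr.apply({"routes_to_dns": 2}))   # applied
print(mgr.apply({"routes_to_dns": 0}))   # rolled back to last known good
print(mgr.active)                        # {'routes_to_dns': 2}
```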
