AWS Widespread Outage 2021-12-07

This five-hour outage made headlines in the popular media:

https://www.ft.com/content/6eff4cae-5df7-4746-a068-d4f6a18ba285

"From angry Adele fans to broken robot vacuums: AWS outage ripples through US

Millions hit by suspension of services that use cloud computing, including Amazon deliveries

It started early on Tuesday morning. Robot vacuums ceased sucking, WiFi cameras stopped watching and eager Tinder daters were left unable to “swipe right” on their smartphone apps.

An outage at Amazon Web Services, the cloud arm of Amazon, had rippled through the online economy, crippling services used by millions of people.

Among the most distraught were fans of the British singer Adele who had been hoping to snap up the first tickets to her upcoming Las Vegas residency.

“Due to an Amazon Web Services (AWS) outage impacting companies globally,” the ticket seller Ticketmaster explained, “all Adele Verified Fan Presales originally scheduled for today have been moved to tomorrow.”

The disruption highlighted the degree to which many of the internet’s most popular services rely on cloud computing infrastructure handled by a very small number of large companies.

According to Gartner, 80 per cent of the cloud market is handled by just five companies. Amazon, with a 41 per cent share of the cloud computing market, is the very biggest.

“They’ve had some very large outages,” said Servaas Verbiest from Sungard Availability Services, a company that provides “disaster recovery” for multiple cloud platforms. “What makes AWS more exposed is the sheer volume of business they have.”

Within Amazon itself on Tuesday, the unthinkable occurred: grounded delivery drivers were unable to load packages and deliver to customers’ doorsteps, just as the peak Christmas season begins to step up.

Drivers at multiple facilities across the country were sent home with pay. With little to do, many of them logged on to social media to enjoy the moment while it lasted — some dreading whatever workload may await once systems were back up and running.

An “impairment of several network devices” in one of its server regions — US-EAST-1 — was the “root cause” of the disruption, Amazon said in a message posted to the AWS status page, which monitors the operational health of its global network of interconnected computers.

Amazon did not comment on the disruption to its deliveries.

Business Insider quoted an internal memo detailing a flood of traffic from an “as yet unknown source”.

Publicly, the company logged the first issues at 9.37am US Pacific time on Tuesday morning, though users of affected services had complained of problems before then. By 3pm, AWS said it had been able to mostly restore service.

Several of the sites first affected appeared to have been able to reroute traffic to alternative servers. Whether or not outages created longer-lasting problems for companies depended on the degree to which executives prioritised diversifying their cloud computing providers, added Verbiest.

“If you’ve embraced the ecosystem, and you’ve got everything in AWS, you’re in a sit-and-wait scenario,” he said.

While high-profile outages can be a boon for competitors such as Google and Microsoft, Verbiest stressed the bar to switching service providers was high.

“It’s difficult to say that one outage is going to sway people to one cloud platform or another, because every cloud provider has outages. It’s just about how long are they and how do they resolve them when they happen?”

In November 2020, the US-EAST-1 region was also at the heart of an AWS outage affecting many of the same websites. In that case, a fault with an Amazon system called Kinesis was said to be the culprit.

This time, according to DownDetector.com, which identifies websites and services that are struggling or failing to load, affected companies included McDonald’s, PayPal-owned payments service Venmo, delivery service DoorDash and video conferencing platform Zoom.

The disruption to Amazon Prime Video and Amazon Music would appear to benefit Netflix and Spotify. However, both rivals also use AWS and were similarly affected.

iRobot, the creators of the autonomous Roomba vacuum, apologised to users who could not log into the device’s app.

One apparent Roomba owner quipped on Twitter: “My wife is going to kill me if the foyers aren’t mopped before she gets home.”

"

From Bloomberg

The following article raised some interesting points:

  • Using replicated services from multiple cloud providers might sound like a good idea, but doing so is expensive, and many services are not easily portable.
  • If architected poorly, using multiple cloud providers might expose an organization to the outages of every provider.
  • Apple also uses AWS for its services, but there was no mention of Apple service outages due to this AWS outage. That hints that Apple might be using multiple data centers (while others are probably choosing the cheapest one to save money).
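
The second point can be made concrete with some back-of-the-envelope availability arithmetic. The following is a minimal sketch, assuming that provider failures are independent and that each provider is up 99.9% of the time. Both are illustrative assumptions, not figures from the article or from any provider, and the function names are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Poorly architected multi-cloud: the service needs EVERY provider up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Full replicas: the service is up if AT LEAST ONE provider is up."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected yearly downtime implied by an availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


# Assumed figure for illustration only: three providers, each 99.9% available.
providers = [0.999, 0.999, 0.999]

single = providers[0]
coupled = serial_availability(providers)
replicated = redundant_availability(providers)

for label, value in [("single provider", single),
                     ("needs all three", coupled),
                     ("any one of three", replicated)]:
    print(f"{label:>17}: availability={value:.9f}, "
          f"downtime={downtime_hours_per_year(value):.4f} h/year")
```

Under these assumptions, a design that needs all three providers at once accrues roughly three times the yearly downtime of a single provider, while true replicas reduce it to a negligible amount. Real outages are rarely independent, so the numbers are an aid to intuition only.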

" Hi, it’s Spencer in Seattle. An Amazon Web Services outage wreaked havoc across the internet. But first…

Today’s top tech news:

  • Instagram’s chief will appear before Congress Wednesday
  • Google sues two Russians it claims help run a massive botnet
  • Meta is reorganizing its research department after scrutiny over its findings

The day the internet paused

Amazon Web Services had one of its worst outages ever on Tuesday. The problems cascaded through the company’s retail operation, which uses software and apps that run on AWS, during the holiday shopping rush. Vans sat idle. Drivers were sent home. Packages piled up at the worst possible time.

But that was just where the damage started. The failure also took down a sizable portion of the internet. Here were just a few of the problems:

  • Visitors to Walt Disney Co. theme parks had difficulty checking in online.
  • Ticketmaster had to postpone selling seats to Adele’s 2022 tour.
  • Traders at home couldn’t use Robinhood or cryptocurrency platform Coinbase.
  • Streaming services like Netflix and Disney+ were down.
  • Tinder went dark.

It wasn’t all bad. One colleague told me her app for the Equinox gym crashed, so she pivoted from push-ups to multiple cups of chocolate pudding.

Tuesday offered the kind of jolt that reminds us how many products and services are centralized in common data centers run by just a handful of big tech companies like Amazon.com Inc., Microsoft Corp. and Alphabet Inc.’s Google. It’s also a reminder of the vulnerability of much of the infrastructure underpinning day-to-day life online.

It could still be a few days before Amazon discovers and reveals precisely what went wrong. But by Tuesday night, the company said it had resolved a network device issue that led to the outage. More information should follow, since most of the industry discloses the causes of big failures to help avoid repeats. For example, in 2017 a major AWS outage was attributed days later to an employee who goofed while trying to fix a bug in a billing system.

What happens next?

Tuesday’s disruptions will likely reinvigorate industry debate around “multi-cloud” strategies, an idea that a company should duplicate its services across multiple cloud computing providers so no one crash puts your company out of commission. Some, like Forrester analyst Brent Ellis, believe that will help companies side-step big web outages. “It’s a decision large enterprises have to make or they’ll inevitably be in a situation where they’re down for several hours,” he said.

But other experts, like Corey Quinn, chief cloud economist at the Duckbill Group, believe that such precautions won’t work. “The idea of being in multiple clouds for resilience is a red herring,” Quinn said. “They end up going down three times as often because they now have exposure to everyone’s outages, not just AWS’s.”

As companies try to figure out how to protect themselves from future outages, the fallout of Tuesday’s chaos for Amazon itself will likely be limited. Packages will get back on track. And customers will keep using AWS—along with other cloud providers that have suffered downtime—because it’s often still a cheaper, more reliable alternative than running similar operations in-house.

In the end, periodic cloud failures have started to feel a little like a large-scale, digital version of a regular power outage: A momentary crisis, then the lights flicker back on, and you get back to your routine. —Spencer Soper "

Lamport quip

Outages like the one above remind us of Lamport’s famous quip:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

ToDo

The AWS outage in US-EAST-1 on 2021-12-07 affected Amazon warehouses as well. The initial root cause was reported as a problem with some network devices. We will need to wait for the detailed postmortem report to decide whether it is worthwhile to include in our course: https://www.cnet.com/tech/services-and-software/aws-outage-means-major-sites-are-down-in-some-east-coast-cities/
