5 Ways to Improve Resilience in the Cloud

In a world increasingly dependent on the cloud, every engineer should know how to design for resilience.

7 mins read

Apr 25, 2025

Your cloud system will fail. It's inevitable.

It even happens to the biggest tech companies.

In 2011, AWS suffered a major outage in one of its Northern Virginia Availability Zones, bringing down big names like Reddit and Quora. Amid the outage, one company managed to keep its services running: Netflix.

How did Netflix do it? They anticipated failure and built for it from the start. They had already tested their infrastructure’s resilience using a tool called Chaos Monkey, which randomly terminates instances in production to ensure the system can withstand instance failures without impacting customers.
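To make the idea concrete, here is a minimal sketch of a chaos experiment in the spirit of Chaos Monkey. It is a toy in-memory simulation, not Netflix's tool or its API: the `Instance`, `Service`, `terminate_random_instance`, and `run_chaos_experiment` names are hypothetical, and the `min_healthy` threshold is an assumed stand-in for "the service can still serve customers."

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """A hypothetical compute instance in the pool."""
    instance_id: str
    healthy: bool = True


@dataclass
class Service:
    """A hypothetical service backed by interchangeable instances."""
    instances: List[Instance]
    min_healthy: int  # assumed: fewest healthy instances needed to serve traffic

    def healthy_instances(self) -> List[Instance]:
        return [inst for inst in self.instances if inst.healthy]

    def is_available(self) -> bool:
        return len(self.healthy_instances()) >= self.min_healthy


def terminate_random_instance(service: Service, rng: random.Random) -> Optional[Instance]:
    """Mark one randomly chosen healthy instance as terminated."""
    candidates = service.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def run_chaos_experiment(service: Service, rounds: int, rng: random.Random) -> int:
    """Terminate one instance per round; return the number of rounds survived."""
    survived = 0
    for _ in range(rounds):
        terminate_random_instance(service, rng)
        if not service.is_available():
            break  # the service could not absorb this failure
        survived += 1
    return survived


if __name__ == "__main__":
    pool = [Instance(f"i-{n}") for n in range(5)]
    service = Service(instances=pool, min_healthy=3)
    rounds_survived = run_chaos_experiment(service, rounds=5, rng=random.Random())
    print(f"Survived {rounds_survived} random terminations")
```

With five instances and a floor of three, this pool absorbs two terminations before availability is lost. In a real system the interesting part is what the simulation leaves out: whether replacement capacity comes back automatically before the next failure arrives.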

This case study shows that resilience isn’t about luck; it’s engineered. And in a world increasingly dependent on the cloud, every engineer should know how to design for resilience.

Written By:

Fahim ul Haq
