Economical Fault Tolerance

Design a disaster recovery plan for cases where we can compromise on availability.

Something is always failing in a large distributed system. One important aspect of safeguarding against failures is storing data durably. Not all data is created equal: some is primary data (sometimes called the ground truth), and some is derived data (sometimes called soft state). Because we can regenerate soft state from primary data, we might not need to store it in replicated form.
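The distinction between primary data and soft state can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real storage service is built: the `ObjectStore` class and its `records`, `index`, `crash`, and `rebuild_index` names are hypothetical, and an in-memory list stands in for durably replicated primary data.

```python
from collections import defaultdict


class ObjectStore:
    """Toy store (hypothetical): `records` is primary data, `index` is soft state."""

    def __init__(self):
        # Primary data (ground truth). In a real system this would be
        # written to durable, replicated storage; here it is just a list.
        self.records = []
        # Derived data (soft state): key -> positions in `records`.
        # It is kept in one place only, because it can be regenerated.
        self.index = defaultdict(list)

    def put(self, key, value):
        self.records.append((key, value))
        self.index[key].append(len(self.records) - 1)

    def get(self, key):
        # Lookups depend on the index, so they return nothing useful
        # until the index has been rebuilt after a failure.
        return [self.records[pos][1] for pos in self.index.get(key, [])]

    def crash(self):
        # Simulate losing the soft state while the primary data survives.
        self.index = defaultdict(list)

    def rebuild_index(self):
        # Regenerate the soft state with a full scan of the primary data.
        index = defaultdict(list)
        for pos, (key, _value) in enumerate(self.records):
            index[key].append(pos)
        self.index = index


store = ObjectStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)
store.crash()            # index is gone, records are intact
store.rebuild_index()    # lookups work again after the scan
print(store.get("a"))
```

In this sketch the rebuild is a full scan, so its cost grows with the amount of primary data, and lookups are unavailable until it finishes. That is the trade: we save the cost of replicating the index and pay for it with availability during recovery.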

If the soft state is large, regenerating it can take a long time. AWS’s S3 service is an example. In February 2017, a cascading event (https://aws.amazon.com/message/41926/) made it necessary to restart parts of S3 and regenerate object indices. Those parts of S3 had not been restarted for many years, and the process took many hours.
