Enabling Fault Tolerance and Failure Detection

Learn how to make a key-value store fault-tolerant and able to detect failures.

Handling temporary failures

Typically, distributed systems use a quorum-based approach to handle failures. A quorum is the minimum number of votes required for a distributed transaction to proceed with an operation. If a server that is part of the consensus goes down, we cannot perform the required operation. This affects the availability and durability of our system.
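The availability problem described above can be illustrated with a minimal sketch. The function name `strict_quorum_reached`, the member names, and the required vote count of 3 are all assumptions made for illustration; the vote count is chosen so that a single unreachable member blocks the operation, as the paragraph describes.

```python
from typing import Iterable, Set


def strict_quorum_reached(members: Iterable[str],
                          reachable: Set[str],
                          required_votes: int) -> bool:
    """Return True if enough of the fixed members can vote.

    Only the designated members count. An unreachable member
    is simply a missing vote; nobody else may stand in for it.
    """
    votes = sum(1 for member in members if member in reachable)
    return votes >= required_votes


# Hypothetical setup: three fixed members, all three votes required.
members = ["A", "B", "C"]

all_up = strict_quorum_reached(members, {"A", "B", "C"}, required_votes=3)
a_down = strict_quorum_reached(members, {"B", "C"}, required_votes=3)

print(all_up)  # operation can proceed
print(a_down)  # operation is blocked while A is down
```

Under these assumed numbers, the operation stalls as soon as one member becomes unreachable, which is the behavior a sloppy quorum is meant to avoid.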

We will use a “sloppy quorum” instead of strict quorum membership. With strict quorum membership, a leader usually manages the communication among the participants of the consensus. The participants send an acknowledgment after committing a successful write, and the leader responds to the client once it receives these acknowledgments. The drawback is that the participants are easily affected by network outages. If the leader is temporarily down and the participants cannot reach it, they declare the leader dead, and a new leader has to be elected. Frequent elections hurt performance because the system spends more time picking a leader than doing actual work.

In the sloppy quorum, the first n healthy nodes from the preference list handle all read and write operations. These n healthy nodes may not always be the first n nodes encountered when moving clockwise in the consistent hash ring.

Let’s consider the following configuration with n = 3. If node A is briefly unavailable or unreachable during a write operation, the request will be sent to the next healthy node from the preference list, which is node D ...
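The node selection in a sloppy quorum can be sketched as follows. This is a minimal illustration, not the lesson's implementation: the function name `sloppy_quorum_nodes`, the five-node ring layout, and the assumption that the key's preference list starts at node A are all hypothetical.

```python
from typing import List, Set


def sloppy_quorum_nodes(ring: List[str],
                        start: int,
                        healthy: Set[str],
                        n: int) -> List[str]:
    """Walk the ring clockwise from `start` and return the
    first n healthy nodes, skipping any that are unavailable."""
    chosen: List[str] = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node in healthy:
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen


# Hypothetical ring in clockwise order; the key maps to position 0 (node A).
ring = ["A", "B", "C", "D", "E"]

all_healthy = sloppy_quorum_nodes(ring, 0, {"A", "B", "C", "D", "E"}, n=3)
a_down = sloppy_quorum_nodes(ring, 0, {"B", "C", "D", "E"}, n=3)

print(all_healthy)  # ['A', 'B', 'C']
print(a_down)       # ['B', 'C', 'D']
```

With node A unavailable, the walk skips it and picks up node D as the third replica, matching the n = 3 scenario described above.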

