Failure in the World of Distributed Systems

Let's see why failures occur in distributed systems, and how we can detect them.

We'll cover the following...

One reason for failure
One mechanism to detect failure
- Trade-off for the small timeout value
- Trade-off for the large timeout value
Failure detector
- Properties that categorize failure detectors
- A perfect failure detector

Identifying failure in a distributed system is challenging because of the characteristics described in the Difficulties Designing Distributed Systems lesson. One of them is the asynchronous nature of the network.

One reason for failure

Because the network in a distributed system is asynchronous, it is very hard to differentiate between a node that has crashed and a node that is just slow to respond to requests. From the point of view of the node waiting for a reply, both cases look the same: no response has arrived yet.

One mechanism to detect failure

Timeouts are the main mechanism we can use ...
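To make the timeout idea concrete, here is a minimal sketch in Python. It is a simulation under stated assumptions, not a real network client: the `Node` class, the `reply_delay` field, and the `suspect_failed` function are hypothetical names introduced only for illustration. A crashed node is modeled as one whose reply never arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical node model used only for this illustration."""
    name: str
    # Seconds until a reply arrives; None means the node crashed and never replies.
    reply_delay: Optional[float]


def suspect_failed(node: Node, timeout: float) -> bool:
    """Return True if no reply would arrive within `timeout` seconds.

    The detector only observes whether a reply arrived in time.
    It cannot see why a reply is missing.
    """
    return node.reply_delay is None or node.reply_delay > timeout


healthy = Node("healthy", reply_delay=0.05)
slow = Node("slow", reply_delay=2.0)
crashed = Node("crashed", reply_delay=None)


def verdicts(timeout: float) -> dict:
    """Map each node name to whether the detector suspects it has failed."""
    return {n.name: suspect_failed(n, timeout) for n in (healthy, slow, crashed)}


small = verdicts(0.5)  # short timeout
large = verdicts(5.0)  # long timeout
print("timeout=0.5s:", small)
print("timeout=5.0s:", large)
```

With the 0.5-second timeout, the slow node is suspected even though it is still alive. With the 5-second timeout, the slow node is no longer suspected, but the detector has to wait the full 5 seconds before it can suspect the crashed node. In both cases the detector cannot tell a crashed node from one that is slower than the timeout.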
