Introduction

Explore the complexities of writing code for distributed systems that run across multiple networked computers. Understand partial failures, fault tolerance, and the trade-offs between supercomputers and commodity hardware clusters. Gain insights into designing reliable software atop unreliable infrastructure.

Introduction

Writing code for a single node is fairly straightforward. The moment we switch to code that runs on multiple computers connected by a network (a distributed system), faults and failures can occur in many more ways, and they are often nondeterministic and unpredictable. For example:

  • Misconfiguration of network switches

  • Accidental power cycles

  • Power distribution unit (PDU) failures

  • Backbone failures for the entire datacenter

  • Power failure for the entire datacenter

Distributed systems also suffer from partial failures, in which one part of the system fails while the rest keeps running. A distributed system may continue to work intermittently ...
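To make the idea of a partial failure concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the lesson: the `Node` class, the `NodeUnavailable` exception, the `read_from_cluster` helper, and the node names are all hypothetical, and a raised exception stands in for a real network fault.

```python
from dataclasses import dataclass


class NodeUnavailable(Exception):
    """Raised when a (simulated) node cannot be reached."""


@dataclass
class Node:
    # Hypothetical stand-in for a remote machine; no real networking happens.
    name: str
    healthy: bool = True

    def read(self, key: str) -> str:
        # A real call would cross the network and could also hang or time out.
        # Here an unhealthy node simply raises.
        if not self.healthy:
            raise NodeUnavailable(self.name)
        return f"{key}@{self.name}"


def read_from_cluster(nodes, key):
    """Query every node and report successes and failures separately."""
    results, failed = {}, []
    for node in nodes:
        try:
            results[node.name] = node.read(key)
        except NodeUnavailable:
            failed.append(node.name)
    return results, failed


# One of three nodes is down: the system as a whole still answers,
# but only partially.
cluster = [Node("node-a"), Node("node-b", healthy=False), Node("node-c")]
results, failed = read_from_cluster(cluster, "user:42")
print("answered:", results)
print("failed:", failed)
```

On a single machine, a fault usually takes the whole program down with it. In this sketch the caller instead receives answers from two nodes and an error from the third, and has to decide what to do with an incomplete picture.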