Introduction
Explore the complexities of writing code for distributed systems that run across multiple networked computers. Understand partial failures, fault tolerance, and the trade-offs between supercomputers and commodity hardware clusters. Gain insights into designing reliable software atop unreliable infrastructure.
Writing code for a single node is fairly straightforward. The moment we switch to writing code that runs on multiple computers connected by a network (a distributed system), faults and failures can occur in many more ways, and they are nondeterministic and unpredictable. For example:
Misconfiguration of network switches
Accidental power cycles
Power distribution unit (PDU) failures
Backbone failures for the entire datacenter
Power failure for the entire datacenter
Distributed systems also suffer from partial failures, in which one part of the system fails while the rest keeps running. A distributed system may continue to work intermittently ...
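To make the idea of a partial failure concrete, here is a minimal Python sketch. It is an illustration only, not a model of any real cluster: the names Node, broadcast, and inject_faults are hypothetical, and a failed node is simulated simply by raising ConnectionError. The point it shows is that one request sent to several nodes can succeed on some and fail on others, so the caller has to handle a mixed result rather than a clean success or a clean failure.

```python
import random


class Node:
    """A hypothetical node that can be down independently of its peers."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        # Assumption: an unhealthy node is modeled as an unreachable one.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"{self.name} handled {request}"


def broadcast(nodes, request):
    """Send a request to every node; return (succeeded, failed) keyed by node name."""
    succeeded, failed = {}, {}
    for node in nodes:
        try:
            succeeded[node.name] = node.handle(request)
        except ConnectionError as err:
            failed[node.name] = str(err)
    return succeeded, failed


def inject_faults(nodes, failure_probability, rng):
    """Mark each node unhealthy independently with the given probability."""
    for node in nodes:
        node.healthy = rng.random() >= failure_probability


# One of three nodes is down: the system as a whole is neither fully up nor fully down.
cluster = [Node("a"), Node("b", healthy=False), Node("c")]
succeeded, failed = broadcast(cluster, "ping")
print("succeeded:", sorted(succeeded))
print("failed:", sorted(failed))
```

Calling inject_faults(cluster, 0.3, random.Random()) before each broadcast makes the set of failed nodes change from run to run, which hints at why such failures are hard to reproduce and reason about.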