Introduction
Explore the complexities of distributed systems by understanding common challenges such as network misconfigurations, partial failures, and power issues. Learn the design trade-offs between using supercomputers and commodity hardware clusters, and how systems are built to operate reliably despite hardware unreliability.
Introduction
Writing code for a single node is fairly straightforward, but the moment we switch to writing code that runs on multiple computers connected by a network (a distributed system), faults and failures can occur in numerous ways, many of them nondeterministic and unpredictable. For example:
Misconfiguration of network switches
Accidental power cycles
Power distribution unit (PDU) failures
Backbone failures for the entire datacenter
Power failure for the entire datacenter
Distributed systems also suffer from partial failures, where one part of the system fails while the rest keeps running. A distributed system may continue to work intermittently ...
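To make the idea of a partial failure concrete, here is a minimal sketch in Python. It is a toy, in-process simulation rather than real networking code: the `Node` class, the `NodeUnreachable` exception, and the `broadcast` helper are all hypothetical names invented for this illustration, and an "unhealthy" node simply raises an exception to stand in for a crash or a timeout.

```python
from dataclasses import dataclass


class NodeUnreachable(Exception):
    """Raised when a simulated node does not answer (hypothetical)."""


@dataclass
class Node:
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        # A real node would be reached over the network. Here an unhealthy
        # node raises to stand in for a crash or a timed-out request.
        if not self.healthy:
            raise NodeUnreachable(self.name)
        return f"{self.name} handled {request}"


def broadcast(nodes, request):
    """Send the request to every node; collect successes and failures."""
    succeeded, failed = [], []
    for node in nodes:
        try:
            succeeded.append(node.handle(request))
        except NodeUnreachable:
            failed.append(node.name)
    return succeeded, failed


# One of three nodes is down: the system as a whole is neither fully
# working nor fully failed.
cluster = [Node("node-a"), Node("node-b", healthy=False), Node("node-c")]
succeeded, failed = broadcast(cluster, "ping")
print(succeeded)  # ['node-a handled ping', 'node-c handled ping']
print(failed)     # ['node-b']
```

The caller gets useful answers from two nodes and an error from the third, so it has to decide what to do with an incomplete result. In a real system the situation is harder than this sketch suggests, because a slow node and a dead node can look the same from the caller's side.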