Introduction
We'll cover the following
Introduction
Writing code on a single node is fairly straightforward but the moment we switch to writing code that runs on multiple computers connected by a network (distributed systems), the number of ways faults and failures can occur is numerous, nondeterministic and unpredictable. For example:
Misconfiguration of network switches
Accidental power cycles
Power distribution unit (PDU) failures
Backbone failures for the entire datacenter
Power failure for the entire datacenter
Distributed systems also suffer from partial failures, where a part of the system experiences failure but not the entire system. A distributed system may continue to work intermittently when experiencing a partial failure, which makes reasoning about such systems all the more difficult.
Building Large Scale Systems
There are two choices when it comes to designing systems that can solve problems beyond the resource capabilities of a single-node-run-of-the-mill computer. Either we can build a very high end expensive machine that has several times the compute and network resources of a single node (think supercomputers) or that we could use a collection of commodity computers connected by a network to solve large scale problems. Both options have their own programming models in which the problems are expressed and then solved.
Get hands-on with 1200+ tech skills courses.