Writing code on a single node is fairly straightforward but the moment we switch to writing code that runs on multiple computers connected by a network (distributed systems), the number of ways faults and failures can occur is numerous, nondeterministic and unpredictable. For example:

  • Misconfiguration of network switches

  • Accidental power cycles

  • Power distribution unit (PDU) failures

  • Backbone failures for the entire datacenter

  • Power failure for the entire datacenter

Distributed systems also suffer from partial failures, where a part of the system experiences failure but not the entire system. A distributed system may continue to work intermittently when experiencing a partial failure, which makes reasoning about such systems all the more difficult.

Building Large Scale Systems

There are two choices when it comes to designing systems that can solve problems beyond the resource capabilities of a single-node-run-of-the-mill computer. Either we can build a very high end expensive machine that has several times the compute and network resources of a single node (think supercomputers) or that we could use a collection of commodity computers connected by a network to solve large scale problems. Both options have their own programming models in which the problems are expressed and then solved.

Get hands-on with 1200+ tech skills courses.