Some More Things to Discover

Besides the topics highlighted in the previous lesson, here are some more things worth discovering.

At the risk of being unfair to other systems and material out there, we would like to mention CockroachDB as one system with a lot of public material demonstrating how its developers have used theoretical concepts in practice. Some concrete examples are the implementation of pipelined consensus and a parallelized version of two-phase commit that requires a single round-trip instead of two before acknowledging a commit. Some resources that contain a lot of practical information on building and operating distributed systems are the Amazon Builders' Library and the papers by Hamilton (J. Hamilton, “On Designing and Deploying Internet-Scale Services,” Proceedings of the 21st Large Installation System Administration Conference (LISA ’07), 2007) and Brewer (E. A. Brewer, “Lessons from Giant Scale Services,” IEEE Internet Computing, Volume 5, No. 4, 2001), which share the lessons of practitioners who have built large-scale systems.

The chapters on practices and patterns discussed how systems can deal with failure. Unfortunately, two types of failure are frequently neglected when building or operating distributed systems, even though they are quite common:

  • Gray failures (P. Huang et al., “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems,” Proceedings of the 16th Workshop on Hot Topics in Operating Systems, 2017)
  • Partial failures (C. Lou, P. Huang, and S. Smith, “Understanding, Detecting and Localizing Partial Failures in Large System Software,” 17th USENIX Symposium on Networked Systems Design and Implementation, 2020)

Gray failures do not manifest cleanly as a binary up-or-down signal. They are more subtle and can be observed differently by different parts of a system. Partial failures are those in which only some parts of a system fail, yet the consequences are as serious as a full failure of the system, sometimes because of a defect in the design.
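To make the idea of differing observations concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the `Node` class, its `heartbeat` and `read` methods, and the notion of a degraded shard are assumptions, not taken from the cited papers. It shows a monitor that sees a healthy node while one client sees most of its requests fail.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical storage node where some shards are degraded."""
    bad_shards: set = field(default_factory=set)

    def heartbeat(self) -> bool:
        # The liveness probe only checks that the process answers.
        return True

    def read(self, shard: int) -> bool:
        # Reads that touch a degraded shard fail; all others succeed.
        return shard not in self.bad_shards


def error_rate(node: Node, shards: list) -> float:
    """Fraction of failed reads as observed by a single client."""
    failures = sum(1 for s in shards if not node.read(s))
    return failures / len(shards)


node = Node(bad_shards={3})
monitor_view = node.heartbeat()             # what the failure detector sees
client_a = error_rate(node, [0, 1, 2, 0])   # never touches shard 3
client_b = error_rate(node, [3, 3, 1, 3])   # mostly touches shard 3
print(monitor_view, client_a, client_b)     # True 0.0 0.75
```

A binary health check reports the node as up, client A agrees, and client B experiences a 75% error rate. In this sketch, only a signal that includes the clients' point of view would reveal the problem.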

These types of failures can be very common in distributed systems because of their many moving parts. They can have serious consequences, so it is essential for people who build and run distributed systems to internalize these concepts and look out for them in the systems they build and operate.

Note: Another important topic that we did not cover at all is the formal verification of systems.

Many formal verification techniques and tools can be used to prove safety and liveness properties of systems. TLA+ (M. A. Kuppe, L. Lamport, and D. Ricketts, “The TLA+ Toolbox,” arXiv:1912.10633, 2019) is one of the most commonly used across the software industry, and Amazon is among the companies that have publicly described using it.

It is important to note that users of these formal verification methods have publicly acknowledged that the methods have not only helped them discover bugs in their designs but have also helped them reason significantly better about the behavior of their systems.
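To give a feel for how such tools find design bugs, here is a small sketch of explicit-state checking of a safety property. It is written in plain Python, not in TLA+, and only loosely mirrors what a model checker does: enumerate every reachable state and test an invariant in each. The toy protocol is an assumption made up for this example: two processes share a lock, but checking the lock and taking it are separate steps.

```python
from collections import deque

# State: (pc of process 0, pc of process 1, lock_held)
INITIAL = ("idle", "idle", False)


def next_states(state):
    """Yield every state reachable in one step by either process."""
    pcs, lock = list(state[:2]), state[2]
    for i in (0, 1):
        new_pcs = pcs.copy()
        if pcs[i] == "idle" and not lock:
            new_pcs[i] = "ready"        # observed the lock as free
            yield (new_pcs[0], new_pcs[1], lock)
        elif pcs[i] == "ready":
            new_pcs[i] = "critical"     # takes the lock in a separate step
            yield (new_pcs[0], new_pcs[1], True)
        elif pcs[i] == "critical":
            new_pcs[i] = "idle"         # releases the lock
            yield (new_pcs[0], new_pcs[1], False)


def mutual_exclusion(state):
    """Safety invariant: both processes are never critical at once."""
    return not (state[0] == "critical" and state[1] == "critical")


def check(initial, invariant):
    """Breadth-first search; return a shortest violating trace or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


trace = check(INITIAL, mutual_exclusion)
for step in trace or []:
    print(step)
```

The search returns a five-state trace in which both processes observe the lock as free before either one takes it, so both end up in the critical section. This sketch checks only a safety invariant on a tiny state space. Liveness properties and realistic designs need the machinery that dedicated tools provide.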
