Fault Tolerance

Learn about fault tolerance in distributed systems.

Nobody should transform their perfectly working simple system into a distributed one without convincing reasons. There has to be a thorough discussion and evaluation before finally deciding to go for a distributed system.

On the other hand, building a distributed system correctly does provide us with some advantages which we would not have in a simple system.

In this chapter, we’ll explore the core goals that you need to keep in mind to build a distributed system the correct way.

Let’s start with fault tolerance.

Fault tolerance in distributed systems

Anything that can go wrong, will go wrong.

Edward A. Murphy Jr. (1918-1990)

In the world of distributed systems, Murphy’s Law is not just a saying, it’s a fact.

Things go wrong in all kinds of ways.

To build a robust system, your system needs to be able to handle adverse scenarios. And a robust system like this does not come easily—or for free. It requires energy, effort, and problem-solving. As a developer, you will have to put extensive effort into achieving fault tolerance.

To explain just what fault tolerance is, let’s think of a scenario.

It’s Monday morning, the beginning of a brand new week. You’re working from a local coffee shop, sipping your favorite morning beverage, and feeling well-rested, fresh and totally motivated to do some awesome work!

Hours later, as you are finishing up your work for the day, all of a sudden your screen goes black. Your laptop’s power is gone! And it won’t turn on no matter what you do!

Well, it turns out your motherboard crashed. You need to repair it or get a new laptop. Whatever you do, your important work is blocked for some time.

What you are experiencing is a fault in your laptop. In this case, it’s probably a system failure that is permanent.

Now imagine if the same thing happens in the backend of the My Cool App.

The backend of the My Cool App consists of many nodes now. Multiple server programs are running on tens of machines. Then there are database replicas to serve read requests efficiently. You have also set up a data sync mechanism so that the database nodes are always consistent.

Now, what if something goes wrong in one or more of your system nodes?

Note: From a distributed system’s perspective, when things go wrong in any part of the system, we say that there are faults.

A few examples of things that could go wrong in the system of My Cool App:

  • One of the nodes running the server program is suddenly handling too many connections. As a result, the node runs out of the RAM it needs to function correctly and the node crashes.
  • Network communication between two database nodes dropped which means the data sync among your database nodes has stopped.
  • Hardware components like a disk drive or logic board chips got damaged in any of the nodes of your system.
  • Your server nodes are unable to handle a sudden spike in traffic, resulting in crashes.
  • There was a new feature added to My Cool App. The code that was deployed contained a nullPointerException bug which caused the crash of server programs running on the nodes.

And there are many more possibilities not included in the above list.

If there is a fault in any part of a single-node system, it makes sense that your whole backend would be down. But when you build a distributed system, there’s an expectation that the system should be capable of tolerating faults in different parts of the system.

The idea of being able to tolerate faults in a distributed system is called fault tolerance.

Why we need fault tolerance

Even if a server is burning to the ground, the system is working just fine
Even if a server is burning to the ground, the system is working just fine

If you get some experience working with distributed systems, you’ll discover that things just go wrong for the weirdest reasons.

Hardware fails. Software has bugs. Networks drop or crash. Things just go wrong.

Maybe everything in your system is perfectly fine. But then you’ll realize that some dependency you have in your system faces an issue. Maybe an external API is changed and the API contract is invalid.

The point is, you can never assume everything will be okay all the time. Things will go wrong. Faults will happen.

This is why we need fault tolerance in our distributed systems. Otherwise, one small bump in any part of the system will bring the whole thing down—And surely nobody wants that. If this type of full-system crash is allowed to happen, then there is no point in building distributed systems in the first place.

Key takeaways

  • If there is something in your system not functioning as expected, you possibly have a fault.
  • Distributed systems have to be able to tolerate faults so that they can continue to function even if faults appear.
  • As a system owner, you should always expect faults and remain prepared accordingly. In the case of handling faults, prevention is better than cures.
  • Distributed systems have to be fault tolerant by nature.