How can we persist states?

Note: Having fault-tolerance and high availability is of no use if we lose the application state during rescheduling.

Having a state is unavoidable, and we need to preserve it no matter what happens to our applications, servers, or even a whole data center.

The way to preserve the state of our applications depends on their architecture. Some store data in memory and rely on periodic backups. Others are capable of synchronizing data between multiple replicas so that a loss of one instance does not result in loss of data. Most, however, rely on the disk to store their state. We’ll focus on that group of stateful applications.

If we are to build fault-tolerant systems, we need to make sure that a failure of any part of the system is recoverable. Since speed is of the essence, we cannot rely on manual operations to recover from failures. Even if we could, no one wants to sit in front of a screen, waiting for something to fail, only to bring it back to its previous state.

Kubernetes failure handling

We already saw that Kubernetes will, in most cases, recover from a failure of an application, of a server, or even of a whole data center. It’ll reschedule Pods to healthy nodes. We also experienced how AWS and kOps accomplish more or less the same effect on the infrastructure level. Auto Scaling Groups will recreate failed nodes, and since they are provisioned with kOps startup processes, new instances will have everything they need, and they will join the cluster.

The only thing that prevents us from saying that our system is (mostly) highly available and fault-tolerant is that we have not yet solved the problem of persisting state across failures. That’s the subject we’ll explore next.

We’ll try to preserve our data no matter what happens to our stateful applications or the servers where they run.

Creating a Kubernetes cluster

We’ll start by creating a cluster similar to the one we used in the previous chapter.
