Backing Up Swarms

Let’s learn how to back up a swarm so that it can be recovered in the event of corruption.

Recovering a swarm from a backup is an extremely rare scenario. However, business-critical environments should always be prepared for worst-case scenarios.

Why do you need backups?

You might be asking why backups are necessary if the control plane is already replicated and highly available (HA).

To answer that question, consider a scenario where a malicious actor deletes all of the Secrets on a swarm. HA cannot help here, because the Secrets are deleted from the cluster store that is automatically replicated to all manager nodes. The highly available replicated cluster store actually works against you by quickly propagating the delete operation. You are left with two options: recreate the deleted objects from copies kept in a source code repository, or attempt to recover your swarm from a recent backup.

Managing your swarm and applications declaratively is a great way to prevent the need to recover from a backup. For example, storing configuration objects outside of the swarm in a source code repository will enable you to redeploy things like networks, services, secrets, and other objects. However, managing your environment declaratively and strictly using source control repositories requires discipline.

Anyway, let’s see how to back up a swarm.

Getting started with backups

Swarm configuration and state are stored in /var/lib/docker/swarm on every manager node. The configuration includes Raft log keys, overlay networks, Secrets, Configs, Services, and more. A swarm backup is a copy of all the files in this directory.

As the contents of this directory are replicated to all managers, you can, and should, perform backups from multiple managers. However, as you have to stop the Docker daemon on the node you are backing up, it’s a good idea to perform the backup from non-leader managers. This is because stopping Docker on the leader will initiate a leader election. You should also perform the backup at a quiet time for the business, as stopping a manager can increase the risk of the swarm losing quorum if another manager fails during the backup.
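The steps described above (stop Docker, copy the swarm directory, restart Docker) can be sketched as a small shell function. This is a minimal sketch rather than a production procedure: it assumes the node uses systemd to manage Docker, that it is run as root on a non-leader manager, and the archive path in BACKUP_FILE is a hypothetical name chosen for illustration.

```shell
# Directory named in the text; it holds the swarm configuration and state.
SWARM_DIR=/var/lib/docker/swarm
# Hypothetical destination for the archive; override it to suit your environment.
BACKUP_FILE="${BACKUP_FILE:-/tmp/swarm-backup.tar.gz}"

backup_swarm() {
  # Stop the Docker daemon on this manager (assumes systemd).
  systemctl stop docker || return 1

  # Archive the swarm directory, stored relative to / so it can be
  # unpacked back into the same location later.
  tar -czf "$BACKUP_FILE" -C / "${SWARM_DIR#/}"
  local rc=$?

  # Restart Docker even if the archive step failed, so the manager rejoins the swarm.
  systemctl start docker
  return "$rc"
}
```

Calling backup_swarm on one manager at a time keeps the other managers available to maintain quorum while the backup runs.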

Warning!

The procedure we’re about to follow is designed for demonstration purposes and you’ll need to tweak it for your production environment. It also creates a couple of swarm objects so that a later step can prove the restore operation worked.

The following operation carries risks. You should also perform test backup and restore operations regularly and verify the outcomes.

Backing up

The first step is to create the following two objects so that you can later prove the restore operation worked:

  • An overlay network called “Unimatrix-01”
  • A Secret called “missing drones” containing the text “Seven of Nine”
