Exploring High Availability and Fault Tolerance of a Cluster

Explore how to validate high availability and fault tolerance in a Kubernetes cluster by simulating node failures. Learn to manage worker node instances with AWS EC2 and kOps, observe automatic recovery through Auto Scaling Groups, and understand the processes that restore a cluster to its desired state after an instance termination.

A cluster is not reliable unless it is fault-tolerant. kOps is designed to provide fault tolerance, but we're going to validate that ourselves anyway.

Terminating a worker node

Let’s retrieve the list of worker node instances.

Shell
aws ec2 describe-instances | jq -r \
    ".Reservations[].Instances[] \
    | select(.SecurityGroups[].GroupName==\"nodes.$NAME\").InstanceId"

We use aws ec2 describe-instances to retrieve all the instances (five in total). The output is piped to jq, which keeps only the instances that belong to the security group dedicated to worker nodes (nodes.$NAME) and prints their IDs.
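
If jq is not available, the same result can likely be obtained with the AWS CLI alone. This is our own sketch rather than part of the lesson; it relies on the documented instance.group-name filter and a --query expression:

Shell
# A sketch equivalent to the jq pipeline above (not from the lesson).
# instance.group-name filters instances by security group name.
aws ec2 describe-instances \
    --filters "Name=instance.group-name,Values=nodes.$NAME" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text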

The output is as follows:

Shell
i-063fabc7ad5935db5
i-04d32c91cfc084369

We’ll terminate one of the worker nodes. To do that, we’ll pick a random one and retrieve its ID. ...
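One plausible way to do this, sketched here under our own assumptions (the INSTANCE_ID variable name is ours, and shuf is the GNU coreutils tool for random selection), is to reuse the query from before and pick one line at random:

Shell
# A sketch of the step described above: grab the worker node IDs
# and pick one at random (INSTANCE_ID is a name we chose).
INSTANCE_ID=$(aws ec2 describe-instances | jq -r \
    ".Reservations[].Instances[] \
    | select(.SecurityGroups[].GroupName==\"nodes.$NAME\").InstanceId" \
    | shuf -n 1)

# Terminate the chosen instance; the Auto Scaling Group should
# eventually replace it to restore the cluster's desired state.
aws ec2 terminate-instances --instance-ids $INSTANCE_ID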