Exploring High Availability and Fault Tolerance of a Cluster
Explore how to validate high availability and fault tolerance in a Kubernetes cluster by simulating node failures. Learn to manage worker node instances with AWS EC2 and kOps, observe automatic recovery through Auto Scaling Groups, and understand the processes that restore a cluster to its desired state after an instance termination.
A cluster is not reliable unless it is fault-tolerant. kOps is designed to provide that fault tolerance, but we'll validate it anyway.
Terminating a worker node
Let’s retrieve the list of worker node instances.
We use aws ec2 describe-instances to retrieve all the instances (five in total). The output is sent to jq, which filters them by the security group dedicated to worker nodes.
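As a minimal sketch of that filtering step, the helper below reads the JSON produced by aws ec2 describe-instances on standard input and prints the IDs of the instances attached to the worker-node security group. The function name worker_instance_ids is hypothetical, and the group name nodes.<cluster name> is an assumption based on the usual kOps naming; adjust it if the cluster uses a different group.

```shell
# Hypothetical helper: print the IDs of the instances that belong to the
# worker-node security group. Expects `aws ec2 describe-instances` JSON on stdin.
# Assumption: the security group is named "nodes.<cluster name>".
worker_instance_ids() {
    jq -r --arg group "nodes.$1" '
        .Reservations[].Instances[]
        | select(any(.SecurityGroups[]?; .GroupName == $group))
        | .InstanceId'
}

# Usage (requires AWS credentials; NAME is assumed to hold the cluster name):
#   aws ec2 describe-instances | worker_instance_ids "$NAME"
```

Passing the group name through --arg keeps shell quoting out of the jq filter, and any(...) makes sure each instance is listed once even if it belongs to several security groups.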
The output is as follows:
We’ll terminate one of the worker nodes. To do that, we’ll pick a random one and retrieve its ID. ...
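One way to pick a random instance could look like the sketch below. It is an illustration under stated assumptions, not necessarily the method used in this lesson: pick_random_line is a hypothetical helper, the instance IDs in the usage comment are placeholders, and INSTANCE_ID is an assumed variable name.

```shell
# Hypothetical helper: read instance IDs on stdin (one per line) and print
# one of them, chosen at random. Prints nothing if the input is empty.
pick_random_line() {
    awk 'BEGIN { srand() }
         { lines[NR] = $0 }
         END { if (NR > 0) print lines[int(rand() * NR) + 1] }'
}

# Usage sketch; the IDs below are placeholders, not real instances:
#   INSTANCE_ID=$(printf '%s\n' i-placeholder1 i-placeholder2 | pick_random_line)
#   echo "$INSTANCE_ID"
#
# Destructive, so left commented out; requires AWS credentials:
#   aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```

Because awk seeds srand() from the current time, two calls within the same second may return the same line. That is acceptable here, since only one instance needs to be chosen.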