Alerting on Unschedulable or Failed Pods
Knowing whether our applications are having trouble responding quickly to requests, whether they are being bombarded with more requests than they can handle, whether they produce too many errors, and whether they are saturated, is of no use if they are not running at all. Even if our alerts detect that something is wrong by notifying us that there are too many errors or that response times are slow due to an insufficient number of replicas, we should still be informed if, for example, one or even all of the replicas fail to run. In the best-case scenario, such a notification would provide additional info about the cause of an issue. In a much worse situation, we might find out that one of the replicas of the DB is not running. That would not necessarily slow it down, nor would it produce any errors. However, it would put us in a situation where data cannot be replicated (additional replicas are not running), and we might face a total loss of its state if the last standing replica fails as well.
There are many reasons why an application might fail to run. There might not be enough unreserved resources in the cluster. Cluster Autoscaler will deal with that problem if we have it. But there are many other potential issues. Maybe the image of the new release is not available in the registry. Or perhaps the Pods are requesting PersistentVolumes that cannot be claimed. As you might have guessed, the list of things that might cause our Pods to fail, be unschedulable, or be in an unknown state is almost infinite.
We cannot address all of the causes of problems with Pods individually. However, we can be notified if the phase of one or more Pods is Failed, Unknown, or Pending. Over time, we might extend our self-healing scripts to address some of the specific causes of those statuses. For now, our best first step is to be notified if a Pod is in one of those phases for a prolonged period of time (e.g., fifteen minutes). Alerting as soon as the status of a Pod indicates a problem would be silly because that would generate too many false positives. We should get an alert and choose how to act only after waiting for a while, thus giving Kubernetes time to fix the issue. We should perform reactive actions only if Kubernetes fails to remedy the situation.
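As a sketch, such an alert could be expressed as a Prometheus rule along the following lines. The rule name, labels, and annotations are illustrative assumptions, and the kube_pod_status_phase metric assumes kube-state-metrics is running in the cluster; the fifteen-minute `for` clause is what gives Kubernetes time to self-heal before we get notified.

```yaml
groups:
- name: pods
  rules:
  # Fires only if a Pod has been in the Failed, Unknown, or Pending
  # phase for fifteen minutes straight.
  - alert: ProblematicPods
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Failed|Unknown|Pending"}) > 0
    for: 15m
    labels:
      severity: notify
    annotations:
      summary: "A Pod could not be scheduled or failed to run"
```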
Over time, we’ll notice some patterns in the alerts we’re receiving. When we do, alerts should be converted into automated responses that will remedy selected issues without our involvement. We already explored some of the low hanging fruits through
Cluster Autoscaler. For now, we’ll focus on receiving alerts for all other cases, and failed and unschedulable Pods are a few of those. Later on, we might explore how to automate responses. But, that moment is not now, so we’ll move forward with yet another alert that will result in a notification to Slack.
Let’s open Prometheus’s graph screen. Please type the expression that follows and click the Execute button.
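The output described below matches the Pod-phase metric exposed by kube-state-metrics, so the expression is presumably along these lines (an assumption, since the expression itself is not shown in this excerpt):

```promql
kube_pod_status_phase
```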
The output shows us each of the Pods in the cluster. If you take a closer look, you’ll notice that there are five results for each Pod, one for each of the five possible phases (Pending, Running, Succeeded, Failed, and Unknown). If you focus on the phase field, you’ll see that there is an entry for Unknown. So, each Pod has five results, but only one has the value 1, while the values of the other four are all set to 0.
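Since only the phases that indicate trouble interest us, we can narrow the query down. A sketch of such an expression (again assuming the kube_pod_status_phase metric from kube-state-metrics) might be:

```promql
kube_pod_status_phase{phase=~"Failed|Unknown|Pending"} > 0
```

Any series this returns would represent a Pod currently in one of the problematic phases, which is exactly the condition an alert could be built around.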