Alerting on Unschedulable or Failed Pods

In this lesson, we will see how to set up alerts for unschedulable or failed Pods.

We'll cover the following...

- Cause of unschedulable or failed pods
- Generating an alert after a while
  - Retrieve the number of Pods in each of the phases
  - Intentionally failing a pod
- Creating another alert to notify us when pods fail

Cause of unschedulable or failed pods

Knowing whether our applications are slow to respond to requests, whether they are being bombarded with more requests than they can handle, whether they produce too many errors, and whether they are saturated is of no use if they are not even running. Even if our alerts detect that something is wrong by notifying us that there are too many errors or that response times are slow due to an insufficient number of replicas, we should still be informed if, for example, one or even all of the replicas failed to run. In the best-case scenario, such a notification would provide additional information about the cause of an issue. In a much worse situation, we might find out that one of the replicas of the database is not running. That would not necessarily slow it down, nor would it produce any errors. However, it would put us in a situation where data cannot be replicated (the additional replicas are not running), and we might face a total loss of its state if the last standing replica fails as well.

There are many reasons why an application might fail to run. There might not be enough unreserved resources in the cluster. Cluster Autoscaler will deal with that problem if we are running it. But there are many other potential issues. Maybe the image of the new release is not available in the registry. Or perhaps the Pods are requesting PersistentVolumes that cannot be claimed. As you might have guessed, the list of things that might cause our Pods to fail, be unschedulable, or end up in an unknown state is almost endless.

Generating an alert after a while

We cannot address all of the causes of problems with Pods individually. However, we can be notified if the phase of one or more Pods is Failed, Unknown, or Pending. Over time, we might extend our self-healing scripts to address some of the specific causes of those statuses. For now, our best first step is to be notified if a Pod is ...
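To make the idea concrete, here is a minimal Python sketch of the check described above: count the Pods in each problematic phase, and report only those that have stayed there for a while. It is an illustration and not the lesson's actual alert definition. The list-of-dictionaries input, the field names `name`, `phase`, and `phase_since`, the function names, and the 60-second grace period are all hypothetical. In a real cluster, this information would come from the metrics we collect.

```python
from collections import Counter

# Phases we want to be notified about, as listed in the text above.
PROBLEM_PHASES = {"Failed", "Unknown", "Pending"}


def count_problem_pods(pods):
    """Return the number of Pods in each problematic phase.

    `pods` is a hypothetical list of dicts, each with a "phase" key.
    """
    return Counter(
        pod["phase"] for pod in pods if pod["phase"] in PROBLEM_PHASES
    )


def stale_problem_pods(pods, now, grace_seconds=60):
    """Return names of Pods stuck in a problematic phase for too long.

    A Pod is briefly Pending whenever it starts, so we only report Pods
    that entered their current phase at least `grace_seconds` ago.
    `phase_since` and `now` are timestamps in seconds (assumed fields).
    """
    return [
        pod["name"]
        for pod in pods
        if pod["phase"] in PROBLEM_PHASES
        and now - pod["phase_since"] >= grace_seconds
    ]
```

The grace period is what makes this an alert "after a while": a Pod that has just been created and is still Pending is not reported, while one that has been Pending, Failed, or Unknown for longer than the threshold is.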
