Alerting on Error-related Issues
In this lesson, we will discuss the issues related to the Error Key Metric.
Monitor the rate of errors compared to the total number of requests
We should always be aware of whether our applications or the system are producing errors. However, we cannot start panicking at the first occurrence of an error, since that would generate so many notifications that we’d likely end up ignoring them. Errors happen often, and many are caused by issues that are fixed automatically or by circumstances outside our control. If we were to act on every error, we’d need an army of people working 24/7 on issues that often do not need to be fixed. As an example, entering “panic” mode because of a single response with a status code in the 500 range would almost certainly produce a permanent crisis.
Instead, we should monitor the rate of errors compared to the total number of requests and react only if it passes a certain threshold. After all, if an error persists, that rate will increase. On the other hand, if it stays low, the issue was either fixed automatically by the system (e.g., Kubernetes rescheduled the Pods from the failed node) or it was an isolated case that does not repeat.
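The idea of reacting to a ratio rather than to individual errors can be sketched in a few lines of shell. This is only an illustration: the `should_alert` helper, the sample counts, and the 5% threshold are assumptions made up for this example, not values prescribed by the lesson.

```shell
# Hypothetical helper: succeeds (exit code 0) when the share of failed
# requests exceeds the threshold. Arguments are integers:
#   $1 = number of error responses
#   $2 = total number of requests
#   $3 = threshold as a percentage
should_alert() {
  local errors=$1 total=$2 threshold_pct=$3
  # Without traffic there is no ratio to evaluate.
  [ "$total" -gt 0 ] || return 1
  # errors/total > threshold/100 is equivalent to errors*100 > threshold*total,
  # which keeps the comparison in integer arithmetic.
  [ $((errors * 100)) -gt $((threshold_pct * total)) ]
}

# One failure in a hundred requests stays below an assumed 5% threshold.
should_alert 1 100 5 && echo "alert" || echo "ok"
# Twelve failures in a hundred requests pass it.
should_alert 12 100 5 && echo "alert" || echo "ok"
```

A single failed request does not trigger anything here, while a sustained share of failures does, which is the behavior we want from an alert.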
Retrieve requests and separate them by their statuses
Our next mission is to retrieve requests and separate them by their statuses. If we can do that, we should be able to calculate the rate of errors.
We’ll start by generating a bit of traffic.
for i in {1..100}; do
curl "http://$GD5_ADDR/demo/hello"
done
open "http://$PROM_ADDR/graph"
We sent a hundred requests and opened Prometheus's graph screen.
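Before looking at the server-side metrics, it can help to see the same separation by status from the client side. The sketch below is an assumption-laden variation of the loop above: `tally_statuses` is a hypothetical helper, and the commented usage relies on curl's `--write-out "%{http_code}"` option to print only the status code of each response.

```shell
# Hypothetical helper: reads one status code per line from standard input
# and prints each distinct code followed by the number of times it occurred.
tally_statuses() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage against the demo application (requires GD5_ADDR to be set):
# for i in {1..100}; do
#   curl --silent --output /dev/null --write-out "%{http_code}\n" \
#     "http://$GD5_ADDR/demo/hello"
# done | tally_statuses
```

This only reflects what a single client observed. The Ingress controller metrics cover all traffic, which is why the rest of the lesson relies on Prometheus.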
Let’s see whether the nginx_ingress_controller_requests metric we used previously provides the statuses of the requests.
Please type the expression that follows, and press the Execute button.
nginx_ingress_controller_requests
We can see all the data recently scraped by Prometheus. If we pay closer attention to the labels, we can see that, among others, there is status. We can use it to calculate the percentage of requests that resulted in errors (e.g., those with statuses in the 500 range) out of the total number of requests. We already saw that we can use the ingress label to separate calculations per application, assuming that we are interested only in those that are public-facing.
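Combining the status and ingress labels, one possible expression for the error ratio is sketched below. It is an assumption about how the calculation could look, not necessarily the exact expression used later: the `5..` status pattern, the 5 minute window, and the `query_error_ratio` helper are illustrative choices. The helper sends the expression to the Prometheus HTTP API instead of the graph screen, reusing the PROM_ADDR variable from earlier.

```shell
# Sketch: rate of responses with 5xx statuses divided by the rate of all
# responses, calculated separately for each Ingress.
# The "5.." pattern and the 5m window are assumptions for illustration.
QUERY='sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)'

# Hypothetical helper: runs the expression as an instant query through
# the Prometheus HTTP API and prints the JSON response.
query_error_ratio() {
  curl --silent --get "http://$PROM_ADDR/api/v1/query" \
    --data-urlencode "query=$QUERY"
}

# Usage (requires a reachable Prometheus): query_error_ratio
```

The same expression can be pasted into the graph screen. If no requests failed during the window, the numerator has no series and the query returns an empty result rather than zero.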