Using Internal Metrics to Debug Potential Issues

We’ll send requests with slow responses again so that we get back to the point where we started this chapter.

for i in {1..20}; do
    DELAY=$(( RANDOM % 10000 ))
    curl "http://$GD5_ADDR/demo/hello?delay=$DELAY"
done

open "http://$PROM_ADDR/alerts"

We sent twenty requests, each with a random delay of up to ten seconds (the delay value is in milliseconds). Then we opened the Prometheus alerts screen.
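To confirm that the delays behave as described, the loop can also print how long each request took. What follows is a minimal sketch rather than part of the original exercise: the random_delay and send_slow_requests function names are made up for illustration, while -s, -o, and -w (with %{http_code} and %{time_total}) are standard curl options. It assumes GD5_ADDR is set as before, and it does nothing otherwise.

```shell
#!/usr/bin/env bash

# Hypothetical helper: a random delay in milliseconds, from 0 to 9999.
random_delay() {
    echo $(( RANDOM % 10000 ))
}

# Hypothetical helper: send the given number of requests and print, for each,
# the delay we asked for, the HTTP status code, and the total time curl measured.
send_slow_requests() {
    local count=$1
    local i=1
    local delay
    while [ "$i" -le "$count" ]; do
        delay=$(random_delay)
        curl -s -o /dev/null \
            -w "delay=${delay}ms status=%{http_code} total=%{time_total}s\n" \
            "http://$GD5_ADDR/demo/hello?delay=$delay"
        i=$(( i + 1 ))
    done
}

# Only talk to the cluster when the address from the earlier setup is available.
if [ -n "${GD5_ADDR:-}" ]; then
    send_slow_requests 20
fi
```

If the application honors the delay parameter, the total time reported for each request should be roughly the requested delay plus network overhead.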

A while later, the AppTooSlow alert should fire (remember to refresh your screen), and we have a (simulated) problem that needs to be solved. Before we start panicking and do something hasty, we’ll try to find the cause of the issue.

Please click the expression of the AppTooSlow alert.

Issue with nginx_ingress_controller_request_duration_seconds

We are redirected to the graph screen with the expression from the alert pre-populated. Feel free to click the Execute button, even though it will not tell you much beyond the fact that the application was fast and then slowed down for no apparent reason. You will not be able to gather more details from that expression. You will not know whether the application is slow on all methods or whether only a specific path responds slowly, nor will you see any other application-specific details.

Simply put, the nginx_ingress_controller_request_duration_seconds metric is too generic. It served us well as a way to notify us that the application’s response time increased, but it does not provide enough information about the cause of the issue. For that, we’ll switch to the http_server_resp_time metric that Prometheus retrieves directly from the go-demo-5 replicas.

Switch to http_server_resp_time metric

Please type the expression that follows, and press the Execute button.

sum(rate(
    http_server_resp_time_bucket{
        le="0.1",
        kubernetes_name="go-demo-5"
    }[5m]
)) /
sum(rate(
    http_server_resp_time_count{
        kubernetes_name="go-demo-5"
    }[5m]
))

Switch to the Graph tab, if you’re not there already.

That expression is very similar to the queries we wrote before when we were using the nginx_ingress_controller_request_duration_seconds_sum metric. We are dividing the rate of requests that fall into the 0.1 seconds bucket by the rate of all requests, so the result is the fraction of responses that took 0.1 seconds or less.
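The same ratio can also be retrieved outside the UI through the /api/v1/query endpoint of the Prometheus HTTP API. What follows is a minimal sketch, not a step from the original exercise: the fast_ratio_query function name and its threshold argument are hypothetical, and the script assumes PROM_ADDR is still set (it does nothing otherwise).

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the expression that divides the rate of requests
# in the given bucket (the le label) by the rate of all requests for go-demo-5.
fast_ratio_query() {
    local le=$1
    printf 'sum(rate(http_server_resp_time_bucket{le="%s",kubernetes_name="go-demo-5"}[5m])) / sum(rate(http_server_resp_time_count{kubernetes_name="go-demo-5"}[5m]))' "$le"
}

# Query Prometheus only when its address is known.
# -G sends the URL-encoded data as a query string instead of a request body.
if [ -n "${PROM_ADDR:-}" ]; then
    curl -s -G "http://$PROM_ADDR/api/v1/query" \
        --data-urlencode "query=$(fast_ratio_query 0.1)"
fi
```

The response is JSON. A value close to 1 would mean that nearly all requests finished within the threshold. The threshold passed to the function has to match one of the le buckets the application actually exposes; otherwise, the query returns no data.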

In my case (screenshot below), we can see that the share of fast responses dropped twice. That coincides with the simulated slow requests we sent earlier.
