Eliminating Repeating Alerts

Learn how to eliminate repeating alerts in the given data.

In the previous lesson, we eliminated potential anomalies with less than 10 entries. Using thresholds, we were able to remove some non-interesting anomalies.

Let’s have a look at the data for status code 400 after applying the threshold:

status_code400400period2020-08-01 18:00:00+002020-08-01 17:59:00+00entries2412zscore6.8237772054730687.445849602151508alerttt

The first alert happened at 17:59, and a minute later, the z-score was still high with a large number of entries, and so we classified the next rows at 18:00 as an anomaly as well.

If you think of an alerting system, we want to send an alert only when an anomaly first happens. We do not want to send an alert every minute until the z-score comes back below the threshold. In this case, we only want to send one alert at 17:59. We do not want to send another alert a minute later at 18:00.

Let’s remove alerts where the previous period was also classified as an alert:

WITH calculations_over_window AS (
SELECT
status_code,
period,
entries,
AVG(entries) OVER status_window as mean_entries,
STDDEV(entries) OVER status_window as stddev_entries
FROM
server_log_summary
WINDOW status_window AS (
PARTITION BY status_code
ORDER BY period
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
)
),
with_zscore AS (
SELECT
*,
(entries - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
FROM
calculations_over_window
),
with_alert AS (
SELECT
*,
entries > 10 AND zscore > 3 AS alert
FROM
with_zscore
),
with_previous_alert AS (
SELECT
*,
LAG(alert) OVER (PARTITION BY status_code ORDER BY period) AS previous_alert
FROM
with_alert
)
SELECT
status_code,
period,
entries,
zscore,
alert
FROM
with_previous_alert
WHERE
alert AND NOT previous_alert
ORDER BY
period DESC;

By eliminating alerts that were already triggered we get a very small list of anomalies that may have happened during the day. Looking at the results we can see what anomalies we would have discovered:

  • Anomaly in status code 400 at 17:59: We also found that one earlier.
400 status code entries
400 status code entries
  • Anomaly in status code 500: We spotted this one on the chart when we started.
500 status code entries
500 status code entries
  • Anomaly in status code 404: This is a hidden anomaly that we did not know about until now.
404 status code entries
404 status code entries

The query can now be used to fire alerts when it encounters an anomaly.