Monitoring and Alerting with Prometheus
Understand why monitoring and alerting with Prometheus matter in chaos engineering. This lesson covers how to observe metrics, set up dashboards, and implement alerting systems like AlertManager to detect system failures reliably. Learn why mastering these tools is essential before running chaos experiments in Kubernetes environments.
As I already mentioned, the critical ingredient that Chaos Toolkit does not provide is notification when a part of the system fails. Steady-state hypotheses focus on what we already know, and they are usually limited to a single application, network, storage, or node. By their nature, their scope is narrow.
As you already know, we do need a proper monitoring system. We need to gather the metrics, and we are already doing that ...