Design of a Monitoring System

Define the scope and requirements for a distributed monitoring system. Identify the key signals to track, such as service health, network latency and packet loss, and host-level hardware errors. Outline a high-level architecture built from metric collectors, a time-series database for storage, and a query API.

We'll cover the following...

Requirements
Building blocks we will use
High-level design

Requirements

The monitoring system must track the following:

Crashes of critical local processes.
Resource usage anomalies (CPU, memory, disk, network) in specific processes.
Overall server health, including load averages and resource consumption.
Hardware faults, such as memory failures or disk degradation.
Connectivity to critical external services, such as network file systems.
Data center hardware status, including network switches and load balancers.
Power consumption at the server, rack, and data center levels.
Power events affecting servers, racks, or the data center.
Routing information and DNS status.
Network latency within and across data centers.
Network status at peering points.
Global service health across data centers (e.g., CDN performance).
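
To make the server-level items in this list concrete, here is a minimal Python sketch of a collector that samples two of them: load averages and disk consumption. It uses only the standard library. The `Sample` type, the `collect_host_samples` function, and the metric names such as `host.load1` are hypothetical names chosen for illustration, not part of any particular monitoring product.

```python
import os
import shutil
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One measurement taken on a host at a point in time."""
    name: str         # hypothetical metric name, e.g. "host.load1"
    value: float
    timestamp: float  # seconds since the Unix epoch


def collect_host_samples(path: str = "/") -> list[Sample]:
    """Collect a few server-health signals using only the standard library."""
    now = time.time()
    samples = []

    # Load averages are only exposed on Unix-like systems, and the call
    # can fail with OSError if the values cannot be obtained.
    if hasattr(os, "getloadavg"):
        try:
            load1, load5, load15 = os.getloadavg()
        except OSError:
            pass
        else:
            samples.append(Sample("host.load1", load1, now))
            samples.append(Sample("host.load5", load5, now))
            samples.append(Sample("host.load15", load15, now))

    # Disk consumption as a fraction of total capacity for one mount point.
    usage = shutil.disk_usage(path)
    samples.append(Sample("host.disk_used_ratio", usage.used / usage.total, now))
    return samples


if __name__ == "__main__":
    for sample in collect_host_samples():
        print(f"{sample.name}={sample.value:.3f}")
```

A real collector would cover many more of the signals listed above (process crashes, hardware faults, network latency) and would ship its samples to storage rather than print them. The sketch only shows the shape of a single measurement.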

Automated monitoring detects anomalies in these signals and either notifies the alert manager or updates a dashboard. Cloud providers publish comparable health dashboards as public status pages:

AWS: https://health.aws.amazon.com/health/status
Azure: https://status.azure.com/en-us/status
Google: https://status.cloud.google.com/
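
The anomaly-handling flow described before the list of status pages can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive design: the `Rule` and `Sinks` types, the `evaluate` function, and the fixed-threshold rule are all hypothetical, and the policy of sending only critical anomalies to the alert manager while recording every result on the dashboard is one possible choice among many.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A fixed-threshold anomaly rule (hypothetical, for illustration)."""
    metric: str
    threshold: float
    critical: bool  # assumption: only critical anomalies notify the alert manager


@dataclass
class Sinks:
    """In-memory stand-ins for the alert manager and the dashboard."""
    alerts: list = field(default_factory=list)
    dashboard: dict = field(default_factory=dict)


def evaluate(readings: dict, rules: list, sinks: Sinks) -> None:
    """Compare each reading with its rule and route the result."""
    for rule in rules:
        value = readings.get(rule.metric)
        if value is None:
            continue  # no data for this metric in this evaluation round

        anomalous = value > rule.threshold
        # Every evaluated metric is reflected on the dashboard.
        sinks.dashboard[rule.metric] = "anomalous" if anomalous else "ok"
        # Only critical anomalies are sent to the alert manager.
        if anomalous and rule.critical:
            sinks.alerts.append(f"{rule.metric}={value} exceeds {rule.threshold}")


if __name__ == "__main__":
    sinks = Sinks()
    evaluate(
        readings={"host.load1": 9.5, "host.disk_used_ratio": 0.42},
        rules=[
            Rule("host.load1", threshold=8.0, critical=True),
            Rule("host.disk_used_ratio", threshold=0.9, critical=False),
        ],
        sinks=sinks,
    )
    print(sinks.alerts)
    print(sinks.dashboard)
```

In a production system, the threshold check would likely be replaced by richer detection (rates of change, baselines over time), and the two sinks would be separate services rather than in-memory objects.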