Design of a Monitoring System

Define the scope and requirements for a distributed monitoring system. Identify key signals to track, such as service health, network latency and packet loss, and host-level hardware errors. Outline a high-level architecture with metric collectors, a time-series database for storage, and a query API.
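The three architectural pieces named above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names (TimeSeriesStore, Collector), the in-memory list standing in for a real time-series database, and the range query standing in for a query API. A real system would likely differ.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Sample:
    """One measurement of a metric at a point in time."""
    timestamp: float
    value: float


class TimeSeriesStore:
    """Hypothetical in-memory stand-in for a time-series database."""

    def __init__(self) -> None:
        self._series: Dict[str, List[Sample]] = defaultdict(list)

    def write(self, metric: str, value: float,
              timestamp: Optional[float] = None) -> None:
        # Use the current wall-clock time unless the caller supplies one.
        ts = time.time() if timestamp is None else timestamp
        self._series[metric].append(Sample(ts, value))

    def query(self, metric: str, start: float, end: float) -> List[Sample]:
        """Query API: return samples with start <= timestamp <= end."""
        return [s for s in self._series.get(metric, [])
                if start <= s.timestamp <= end]


class Collector:
    """Hypothetical collector: reads each source and writes to the store."""

    def __init__(self, store: TimeSeriesStore,
                 sources: Dict[str, Callable[[], float]]) -> None:
        self.store = store
        self.sources = sources

    def collect_once(self, timestamp: Optional[float] = None) -> None:
        for metric, read in self.sources.items():
            self.store.write(metric, read(), timestamp)


if __name__ == "__main__":
    store = TimeSeriesStore()
    collector = Collector(store, {"cpu_fraction": lambda: 0.42})
    collector.collect_once()
    print(store.query("cpu_fraction", 0.0, time.time()))
```

Keeping the collector, the store, and the query path separate means each can be scaled or replaced independently, which is the main point of the layered design.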

Requirements

The monitoring system must track the following:

  • Crashes of critical local processes.

  • Anomalous resource usage (CPU, memory, disk, network) by specific processes.

  • Overall server health, including load averages and resource consumption.

  • Hardware faults, such as memory failures or disk degradation.

  • Connectivity to critical external services, like network file systems.

  • Data center hardware status, including network switches and load balancers.

  • Power consumption at the server, rack, and data center levels.

  • Power events affecting servers, racks, or the data center.

  • Routing information and DNS status.

  • Network latency within and across data centers.

  • Network status at peering points.

  • Global service health across data centers (e.g., CDN performance).
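As a rough illustration of the server-health item in the list above, the sketch below reads two host-level signals and compares them against thresholds. The function names, signal names, and threshold values are assumptions chosen for the example, and os.getloadavg is available only on Unix-like systems.

```python
import os
import shutil
from typing import Dict, List


def read_host_signals(path: str = "/") -> Dict[str, float]:
    """Read the 1-minute load average per CPU and the disk usage fraction."""
    load_1m, _load_5m, _load_15m = os.getloadavg()  # Unix-like systems only
    disk = shutil.disk_usage(path)
    return {
        "load_1m_per_cpu": load_1m / (os.cpu_count() or 1),
        "disk_used_fraction": disk.used / disk.total,
    }


def find_anomalies(signals: Dict[str, float],
                   thresholds: Dict[str, float]) -> List[str]:
    """Return the names of signals whose value exceeds its threshold."""
    return sorted(
        name for name, value in signals.items()
        if name in thresholds and value > thresholds[name]
    )


if __name__ == "__main__":
    # Threshold values are illustrative assumptions, not recommendations.
    thresholds = {"load_1m_per_cpu": 1.0, "disk_used_fraction": 0.9}
    signals = read_host_signals()
    for name in find_anomalies(signals, thresholds):
        print(f"ALERT: {name}={signals[name]:.2f} exceeds {thresholds[name]}")
```

A static threshold is the simplest possible check. Signals such as network latency or power consumption would usually need a baseline or a trend comparison instead.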

Automated monitoring detects anomalies in these signals and either notifies the alert manager or updates a dashboard. Cloud providers publish similar health status pages: