System Design - 15 min design solution

It’s challenging to know what’s happening at the hardware or application level when our infrastructure is distributed across multiple locations and includes many servers. Components can fail, response latency can exceed its target, hardware can become overloaded or unreachable, and containers can run out of resources, among other problems. With many services running in such an infrastructure, any of them can go wrong.

When one service goes down, it can cause other services to crash, leaving the application unavailable to users. If we don’t find out early what went wrong, debugging the system manually can take a lot of time and effort. Moreover, for larger services, we need to ensure that they operate within the agreed service-level agreements (SLAs). We need to catch important trends and signals of impending failure as early warnings so that any issues can be addressed.

## Requirements

Let's sum up what we want our monitoring system to do for us:



* Monitor critical local processes on a server for crashes.

* Monitor any anomalies in the use of CPU/memory/disk/network bandwidth by a process on a server.

* Monitor overall server health, including CPU, memory, disk, network bandwidth, and average load.

* Monitor hardware component faults on a server, such as memory failures and failing or slowing disks.

* Monitor the server’s ability to reach critical services outside the server, such as network file systems.

* Monitor all network switches, load balancers, and any other specialized hardware inside a data center.

* Monitor power consumption at the server, rack, and data center levels.

* Monitor any power events on the servers, racks, and data center.

* Monitor routing information and DNS for external clients.

* Monitor the latency of network links and paths within and across data centers.

* Monitor network status at the peering points.

* Monitor overall service health that might span multiple data centers, for example, a CDN and its performance.
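
To make the server-health items in the list above more concrete, here is a minimal Python sketch of collecting a few metrics on one server and comparing them against limits. It is only an illustration under stated assumptions: the `Threshold` class, the function names, the metric names (`disk_used_pct`, `load_per_cpu`), and the limit values are all hypothetical, and a real agent would cover far more of the requirements above.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical upper limit for one metric (values below are illustrative)."""
    name: str
    limit: float


def collect_server_metrics(path: str = "/") -> dict[str, float]:
    """Gather a few server-health metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_pct": 100.0 * usage.used / usage.total}
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            load_1min, _, _ = os.getloadavg()
            metrics["load_per_cpu"] = load_1min / (os.cpu_count() or 1)
        except OSError:
            pass  # load average unavailable; skip this metric
    return metrics


def find_violations(metrics: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one message per metric that exceeds its assumed limit."""
    return [
        f"{t.name}={metrics[t.name]:.2f} exceeds {t.limit}"
        for t in thresholds
        if t.name in metrics and metrics[t.name] > t.limit
    ]


if __name__ == "__main__":
    # Assumed limits: 90% disk usage and a 1-minute load of 1.5 per CPU core.
    limits = [Threshold("disk_used_pct", 90.0), Threshold("load_per_cpu", 1.5)]
    for message in find_violations(collect_server_metrics(), limits):
        print(message)
```

Fixed thresholds like these are the simplest possible check. They are easy to reason about but need per-server tuning, which is one reason a monitoring system usually also looks at trends over time.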


We want automated monitoring that identifies anomalies in the system and then notifies the alert manager or shows the status on a dashboard. For example, cloud service providers publish the health status of their services:


* AWS: https://health.aws.amazon.com/health/status
* Azure: https://status.azure.com/en-us/status
* Google: https://status.cloud.google.com/
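
The idea of automated monitoring that spots an anomaly and then informs an alert manager or a dashboard can be sketched in a few lines of Python. This is a rough illustration, not a prescribed design: the `AnomalyDetector` class, the `route` function, the rolling z-score rule, and the default parameters are all assumptions chosen for the example, and plain lists stand in for the alert manager and the dashboard.

```python
from collections import deque
from statistics import mean, stdev


class AnomalyDetector:
    """Flags a value that deviates strongly from a rolling window of recent samples."""

    def __init__(self, window: int = 30, z_limit: float = 3.0, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # oldest samples drop out automatically
        self.z_limit = z_limit               # assumed cutoff in standard deviations
        self.min_samples = min_samples       # wait for a baseline before judging

    def observe(self, value: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.z_limit
            else:
                # A perfectly flat baseline: any change counts as a deviation.
                is_anomaly = value != mu
        self.samples.append(value)
        return is_anomaly


def route(metric: str, value: float, detector: AnomalyDetector,
          alerts: list, dashboard: list) -> None:
    """Send every sample to the dashboard and only anomalies to the alert list."""
    dashboard.append((metric, value))
    if detector.observe(value):
        alerts.append(f"anomaly in {metric}: {value}")


if __name__ == "__main__":
    detector, alerts, dashboard = AnomalyDetector(), [], []
    for latency_ms in [100, 102, 98, 101, 99, 100, 500]:
        route("latency_ms", latency_ms, detector, alerts, dashboard)
    print(alerts)
```

Note that this sketch also adds anomalous samples to the window, so a sustained shift gradually becomes the new baseline. Whether that is desirable depends on the metric being monitored.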

