Design Server-side Monitoring Service

Learn to design a monitoring service to monitor server-side errors.

Monitoring

It’s challenging to know what’s happening at the hardware or application level when our infrastructure is distributed across multiple locations and includes many servers. Components can fail, response latency can overshoot, hardware can become overloaded or unreachable, and containers can run out of resources, among other problems. Multiple services are running in such an infrastructure, and anything can go awry.

When one of the services goes down, it can cause other services to crash, and as a result, the application becomes unavailable to users. If we don’t find out early what went wrong, it could take us a lot of time and effort to debug the system manually. Moreover, for larger services, we need to ensure that our services are operating within our service-level agreements. We need to catch important trends and signals of impending failures as early warnings so that any concerns or issues can be addressed.

Monitoring helps in analyzing such complex infrastructure where something is constantly failing. Monitoring distributed systems entails gathering, interpreting, and displaying data about the interactions between processes that are running at the same time. It assists in debugging, testing, performance evaluation, and having a bird’s-eye view over multiple services.

We will learn to design a monitoring service that focuses on server-side errors. These errors are usually visible to monitoring services because they occur on the servers themselves. In HTTP, such errors are reported with 5xx response status codes.
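To make the 5xx idea concrete, here is a minimal sketch of how a monitoring service might classify response codes and compute a server-side error rate. The function names `is_server_error` and `server_error_rate` are hypothetical and chosen only for illustration; they are not part of any design described in this lesson.

```python
def is_server_error(status_code):
    """Return True if the HTTP status code is in the 5xx (server error) class."""
    return 500 <= status_code <= 599


def server_error_rate(status_codes):
    """Return the fraction of responses that were server-side (5xx) errors."""
    codes = list(status_codes)
    if not codes:
        # No traffic observed, so there is nothing to report as failing.
        return 0.0
    return sum(is_server_error(code) for code in codes) / len(codes)
```

For instance, a service could feed this function the status codes seen over the last minute and raise an alert when the rate crosses a limit it has chosen.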

Requirements

Let’s sum up what we want our monitoring system to do for us:

  • Monitor critical local processes on a server for crashes.

  • Monitor any anomalies in the use of CPU/memory/disk/network bandwidth by a process on a server.

  • Monitor overall server health, such as CPU, memory, disk, network bandwidth, average load, and so on.

  • Monitor hardware component faults on a server, such as memory failures, failing or slowing disk, and so on.

  • Monitor the server’s ability to reach out-of-server critical services, such as network file systems and so on.

  • Monitor all network switches, load balancers, and any other specialized hardware inside a data center.

  • Monitor power consumption at the server, rack, and data center levels.

  • Monitor any power events on the servers, racks, and data center.

  • Monitor routing information and DNS for external clients.

  • Monitor network links and paths’ ...
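As a rough illustration of the server health requirement above, the following sketch collects two metrics with Python's standard library and compares them against thresholds. The metric names, the threshold values, and the helpers `collect_metrics` and `find_violations` are assumptions made for this example only; a real monitoring service would cover far more signals than this.

```python
import os
import shutil

# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {
    "disk_used_pct": 90.0,  # percent of disk space in use
    "load_per_cpu": 1.5,    # 1-minute load average divided by CPU count
}


def collect_metrics(path="/"):
    """Collect a small snapshot of server health for the given mount point."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_pct": 100.0 * usage.used / usage.total}
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            one_minute_load = os.getloadavg()[0]
            metrics["load_per_cpu"] = one_minute_load / (os.cpu_count() or 1)
        except OSError:
            pass
    return metrics


def find_violations(metrics, thresholds=THRESHOLDS):
    """Return (name, value, limit) for every metric above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in sorted(metrics.items())
        if name in thresholds and value > thresholds[name]
    ]
```

Calling `find_violations(collect_metrics())` on a server returns the metrics that currently exceed their limits. Keeping collection separate from evaluation means the threshold logic can be tested without touching real hardware.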
