Introduction to Distributed Monitoring

This lesson explains the role of distributed monitoring in system design for preventing cascading failures and outages. It defines the two primary fault categories, server-side errors and client-side errors, and describes how effective monitoring reduces operational costs and improves system reliability.

Need for monitoring

A single service failure can disrupt the execution of dependent systems. To prevent cascading failures, monitoring provides early warnings and helps identify the root cause of faults.

Consider a scenario where a user uploads a video, intro-to-system-design, to YouTube:

  • The UI service (server A) receives the video and passes data to service 2 (server B).

  • Service 2 writes to the database and stores the video in blob storage.

  • Service 3 (server C) manages replication between database X and database Y.
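To make the dependency chain in this scenario concrete, the following is a minimal Python sketch, not part of the original lesson. The component names, the dependency edges in `DEPENDS_ON` (for example, that service 2 writes to database X), and the helper functions `impacted_by` and `root_causes` are all assumptions chosen for illustration. The sketch shows how a monitoring system could use a dependency map to tell which components a failure affects and which failing component is the likely root cause.

```python
# Hypothetical dependency map for the upload scenario. Each key depends on
# the components listed in its value. The edges are assumptions made for
# illustration, not details given in the lesson.
DEPENDS_ON = {
    "ui_service": ["service_2"],
    "service_2": ["database_x", "blob_storage"],
    "service_3": ["database_x", "database_y"],
    "database_x": [],
    "database_y": [],
    "blob_storage": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return the sorted components that transitively depend on a failed one."""
    impacted = set()
    changed = True
    # Repeat until no new component is marked, so failures propagate
    # through every level of the dependency chain.
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component in failed or component in impacted:
                continue
            if any(d in failed or d in impacted for d in deps):
                impacted.add(component)
                changed = True
    return sorted(impacted)


def root_causes(failing, depends_on=DEPENDS_ON):
    """Return failing components whose direct dependencies are all healthy."""
    return sorted(
        c for c in failing
        if not any(d in failing for d in depends_on.get(c, []))
    )
```

For example, under these assumed edges, `impacted_by({"blob_storage"})` reports that service 2 and the UI service are affected, and `root_causes({"ui_service", "service_2", "blob_storage"})` narrows three failing probes down to blob storage. A real monitoring system would derive the failing set from health checks or metrics rather than from a hard-coded set.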

If service 3 fails, ...