Introduction to Distributed Monitoring
This lesson explains the role of distributed monitoring in system design for preventing cascading failures and outages. It defines the two primary fault categories, server-side errors and client-side errors, and describes how effective monitoring reduces operational costs and improves system reliability.
Need for monitoring
A single service failure can disrupt the execution of dependent systems. To prevent cascading failures, monitoring provides early warnings and helps identify the root cause of faults.
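To make the idea of dependent systems concrete, here is a minimal sketch of how a monitoring tool might reason about failures. It assumes the tool has a map of which service calls which. The service names and both helper functions are hypothetical, chosen only for illustration. The first helper lists every service affected when one fails, and the second narrows a set of unhealthy services down to likely root causes.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "frontend": ["uploader"],
    "uploader": ["database", "blob_store"],
    "database": [],
    "blob_store": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a service to its callers.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


def root_causes(unhealthy, depends_on):
    """Return unhealthy services whose own dependencies are all healthy."""
    return {
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, []))
    }


if __name__ == "__main__":
    print(sorted(impacted_by("database", DEPENDS_ON)))
    print(sorted(root_causes({"frontend", "uploader", "database"}, DEPENDS_ON)))
```

In this sketch, a failure in `database` spreads to `uploader` and then to `frontend`. When all three report as unhealthy, only `database` is flagged as a candidate root cause. A real monitoring system would build such a map from traces or configuration rather than a hard-coded dictionary.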
Consider a scenario where a user uploads a video, intro-to-system-design, to YouTube:
The UI service (server A) receives the video and passes data to service 2 (server B).
Service 2 writes to the database and stores the video in blob storage.
Service 3 (server C) manages replication between database X and database Y.
If service 3 fails, ...