Observability in Distributed Systems: Metrics, Logs, and Traces

Explore how observability uses metrics, logs, and traces to reveal the internal state of distributed systems. Understand the distinct roles of these data types and how combining them supports diagnosing complex issues, improving incident response, and guiding architectural decisions for scalable, resilient systems.

We'll cover the following...

Introduction to observability
Core pillars of observability
Tools for observability
Connecting observability to System Design goals
Conclusion

When a single user request travels through a dozen microservices, pinpointing the source of an error becomes a significant challenge.

Traditional debugging methods, designed for monolithic applications, are insufficient in the complex, interconnected world of modern distributed systems. This complexity demands a deeper, more intuitive understanding of a system’s internal state.

This lesson explores observability, the practice that enables engineers to understand a system’s internal state by examining its external outputs. It is the key to moving from reactive problem-solving to proactive system improvement.

With that context in place, we can examine observability and its role in distributed systems more closely.

Introduction to observability

Observability is a property of a system that allows its internal state to be inferred from its external outputs. In other words, a highly observable system allows us to understand what is happening inside it by examining the data it emits.

Monitoring and observability are related but distinct concepts. Monitoring tells you when something is wrong, while observability helps you understand why it is wrong.

Monitoring is the practice of collecting and analyzing data against predefined thresholds to determine the health of a system. It focuses on known failure modes. For example, a monitoring check might query a /health endpoint and trigger an alert if the HTTP response code is not 200 OK. While this indicates that a predefined condition has been met, it does not provide insight into unexpected or novel issues.
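As a minimal sketch of such a check, the Python below uses only the standard library. The helper names `fetch_status` and `check_health` and the example URL are assumptions made for illustration and are not part of any particular monitoring tool.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return 0


def check_health(url: str, fetch: Callable[[str], int] = fetch_status) -> Optional[str]:
    """Return an alert message if the endpoint is unhealthy, otherwise None."""
    status = fetch(url)
    if status != 200:
        return f"ALERT: {url} returned status {status}"
    return None


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real service URL to try it.
    alert = check_health("http://localhost:8080/health")
    print(alert or "healthy")
```

The check can only answer the one question it was written for (is the status 200?). It reports that the condition was violated, but nothing about which dependency, code path, or input caused the failure.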

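Observability, in contrast, goes beyond predefined checks. It enables engineers to infer the internal state of a system from its external outputs, which makes it possible to investigate failures that nobody anticipated. These two terms mean the following: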

Internal state: This includes runtime information, such as the size of a request queue, the current value of a variable, the state of a connection pool, or active feature flags.
External outputs: These are the telemetry data the system emits, such as metrics, traces, and logs. ...
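To make the distinction concrete, here is a small Python sketch in which a worker's internal state (its queue and a counter) is surfaced as an external output. The `Worker` class, the field names, and the choice of a JSON log line are all illustrative assumptions. A real system would typically send this data through a dedicated telemetry library instead of `print`.

```python
import json
import uuid
from collections import deque
from typing import Optional


class Worker:
    """A toy job processor whose internal state is exposed through emitted telemetry."""

    def __init__(self) -> None:
        self.queue: deque = deque()  # internal state: pending jobs
        self.processed = 0           # internal state: running counter

    def submit(self, job: str) -> None:
        self.queue.append(job)

    def process_one(self, trace_id: Optional[str] = None) -> dict:
        """Process the next job and emit one structured record describing it."""
        trace_id = trace_id or uuid.uuid4().hex  # lets related records be correlated
        job = self.queue.popleft()
        self.processed += 1
        record = {
            "event": "job_processed",
            "trace_id": trace_id,
            "job": job,
            "queue_depth": len(self.queue),      # metric-like value
            "processed_total": self.processed,   # metric-like value
        }
        print(json.dumps(record))  # external output: one JSON line per job
        return record


if __name__ == "__main__":
    worker = Worker()
    worker.submit("resize-image")
    worker.submit("send-email")
    worker.process_one()
```

Someone reading only the emitted lines can tell how deep the queue was and how many jobs had completed at each step, without attaching a debugger to the process.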
