
Observability in Distributed Systems: Metrics, Logs, and Traces

Learn how to build resilient distributed systems using the core pillars of observability: metrics, logs, and traces.

When a single user request travels through a dozen microservices, pinpointing the source of an error becomes a significant challenge.

Traditional debugging methods, designed for monolithic applications, are insufficient in the complex, interconnected world of modern distributed systems. This complexity demands a deeper, more intuitive understanding of a system’s internal state.

This lesson explores observability, the practice that enables engineers to understand a system’s internal state by examining its external outputs. It is the key to moving from reactive problem-solving to proactive system improvement.

The three components of observability

With this mental model in place, we can now examine observability and its role in distributed systems in more depth.

Introduction to observability

Observability is a property of a system that allows its internal state to be inferred from its external outputs. In other words, a highly observable system allows us to understand what is happening inside it by examining the data it emits.

Monitoring and observability are related but distinct concepts. Monitoring tells you when something is wrong, while observability helps you understand why it is wrong.

Monitoring is the practice of collecting and analyzing data against predefined thresholds to determine the health of a system. It focuses on known failure modes. For example, a monitoring check might query a /health endpoint and trigger an alert if the HTTP response code is not 200 OK. While this indicates that a predefined condition has been met, it does not provide insight into unexpected or novel issues.
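As a rough illustration of such a check, the sketch below probes an endpoint and applies the single predefined rule described above (anything other than 200 OK raises an alert). The function names `probe_status` and `should_alert`, the timeout value, and the URL in the comment are assumptions made for this example, not part of any particular monitoring tool.

```python
from __future__ import annotations

import urllib.error
import urllib.request


def probe_status(url: str, timeout: float = 2.0) -> int | None:
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still useful.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar network errors.
        return None


def should_alert(status: int | None) -> bool:
    """Predefined threshold: any result other than 200 OK triggers an alert."""
    return status != 200


# Example usage (hypothetical URL):
#   status = probe_status("http://localhost:8080/health")
#   if should_alert(status):
#       print(f"ALERT: health check failed (status={status})")
```

Note what the check cannot do: it reports that the condition fired, but a 503 or a timeout says nothing about which dependency, queue, or code path caused it.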

Observability, in contrast, goes beyond predefined checks. It enables engineers to infer the internal state of a system from its external outputs, which makes it worth pinning down both terms:

  • Internal state: This includes runtime information, such as the size of a request queue, the current value of a variable, the state of a connection pool, or active feature flags.

  • External outputs: These are the telemetry data the system emits, such as metrics, traces, and logs.
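To make the relationship between the two concrete, here is a minimal sketch of a component that publishes a snapshot of its internal state as a structured log line. The class `OrderWorker`, its field names, and the flag names are all hypothetical and chosen only for illustration; a real service would more likely emit this data through a metrics or logging library.

```python
from __future__ import annotations

import json
from collections import deque


class OrderWorker:
    """Hypothetical service component with some internal runtime state."""

    def __init__(self, feature_flags: dict[str, bool], pool_size: int = 10) -> None:
        self.request_queue: deque[str] = deque()  # pending requests
        self.feature_flags = feature_flags        # flag name -> enabled?
        self.pool_size = pool_size                # connection pool capacity
        self.pool_in_use = 0                      # connections currently checked out

    def telemetry_snapshot(self) -> dict:
        """Translate internal state into data that can be emitted externally."""
        return {
            "queue_depth": len(self.request_queue),
            "pool_in_use": self.pool_in_use,
            "pool_size": self.pool_size,
            "active_flags": sorted(
                name for name, enabled in self.feature_flags.items() if enabled
            ),
        }

    def emit(self) -> str:
        """Write the snapshot as one JSON log line and return it."""
        line = json.dumps(self.telemetry_snapshot(), sort_keys=True)
        print(line)
        return line
```

With a line like this emitted on every request or at a fixed interval, an engineer can read the queue depth, pool usage, and active flags at the time of an incident without attaching a debugger or deploying new code.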

Note: Monitoring is for “known unknowns” (e.g., we set an alert for high CPU usage, a problem we anticipate). Observability is for “unknown unknowns” (e.g., debugging a novel issue we never anticipated).

A highly observable system is one where we can diagnose unforeseen problems without needing to ship new code to gather more information. This is the fundamental requirement for operating complex ...