Triage First

Learn all about logs, traces, circuit breakers, and bulkheads.

Whether services time out or dependencies collapse, things will break. Your goal isn’t just to fix them, but to ensure your team is equipped to do so without paging you on your well-deserved PTO.

As a Staff+ engineer, you have to lead triage calmly and design systems that are debuggable, observable, and resilient.

To do that, you’ll lean on four foundational pillars:

  1. Logs: Surface-level signals of what’s gone wrong.

  2. Traces: A map of how requests flow through distributed systems.

  3. Circuit breakers: Smart failure control for dependent systems.

  4. Bulkheads: Isolation mechanisms to prevent blast radius spread.

Master these, and you won’t just debug faster: you’ll design systems that heal faster.

1. Logs: The first line of sight

Logs give insight into what the system was doing right before something broke. Used properly, they help us reconstruct the timeline and locate where things went wrong.

Think of logs as the ultimate, always-on version of the original debugging technique: the humble print() statement.

Good logs are:

  • Structured (ideally JSON).

  • Context-rich, with timestamps, log levels, trace IDs, user IDs, and environment information.

  • Consistently formatted across services.

Quality > quantity with logs: Avoid over-logging or logging sensitive data.
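As a quick sketch, here’s how a service might emit a structured, context-rich log line in Node.js (pino is used here purely as an illustrative choice of logger; the lesson doesn’t prescribe one):

// Sketch only: assumes the pino logging library (npm install pino).
const pino = require('pino');

const logger = pino({ base: { service: 'auth', env: 'production' } });

// Structured fields are passed as an object; pino adds the timestamp and level.
logger.error(
  { trace_id: 'abc123', user_id: 'u-234', error_code: 'DB_TIMEOUT' },
  'Database timeout on user lookup'
);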

During triage, searching for logs with ERROR or WARN levels within a specific time range can help quickly identify root causes. 

Let’s look at examples to understand the importance of structured logs.

Unstructured logs

Users are getting 500 errors from the login page. The engineering team checks the logs and sees the following:

"Login failed."
"Something went wrong."

That’s it. No request IDs, no error types, no stack traces. The team’s flying blind.

Structured logs

With proper structured logging:

{
  "timestamp": "2025-09-17T10:00:02Z",
  "level": "ERROR",
  "trace_id": "abc123",
  "service": "auth",
  "message": "Database timeout on user lookup",
  "user_id": "u-234",
  "error_code": "DB_TIMEOUT"
}

Now the team can filter logs by level, trace_id, or error_code and immediately triangulate the failing service.
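For instance, a quick triage script along these lines (a sketch; the app.log file name and newline-delimited JSON layout are assumptions) could pull out the failing entries:

// Sketch only: reads newline-delimited JSON logs from a hypothetical app.log file.
const fs = require('fs');

const entries = fs.readFileSync('app.log', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line));

// Filter to errors on the suspect trace.
const suspects = entries.filter(
  (e) => e.level === 'ERROR' && e.trace_id === 'abc123'
);

console.log(suspects); // e.g., the "Database timeout on user lookup" entry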

2. Traces: Following the request’s journey

Logs alone aren’t enough when we have dozens of microservices.  

Distributed tracing complements logs by connecting dots across services. Traces show us the entire journey of a request, from the frontend to multiple backend services, with each operation recorded and visualized.

  • Each trace includes spans, which represent operations like API calls, database queries, or cache lookups. 

  • Spans include metadata like latency, status codes, and parent-child relationships.

Comparing successful and failing traces helps isolate failure points much faster than scrolling through logs alone. Using tools like OpenTelemetry, Jaeger, or Datadog, you can see where latency occurs, which service failed, and why.
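As a minimal sketch of what this looks like in code, here’s roughly how the auth service could wrap its user lookup in a span using the OpenTelemetry JavaScript API (this assumes an OpenTelemetry SDK and exporter are configured elsewhere; db is a stand-in for a real database client):

// Sketch only: assumes @opentelemetry/api plus a configured SDK/exporter elsewhere.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('auth-service');

// Stand-in for a real database client.
const db = { findUser: async (id) => ({ id, name: 'user1' }) };

async function lookupUser(userId) {
  return tracer.startActiveSpan('db.user_lookup', async (span) => {
    try {
      span.setAttribute('user.id', userId);           // metadata attached to the span
      return await db.findUser(userId);
    } catch (err) {
      span.recordException(err);                      // record the failure on the trace
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();                                     // span duration = observed latency
    }
  });
}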

3. Circuit breakers: Fail fast, don’t cascade

Imagine a scenario where your payment processor goes down. Without safeguards, your backend could hang indefinitely, burning CPU and threads. Circuit breakers prevent this by proactively stopping calls to a failing dependency after repeated failures.

Inspired by electrical circuits, this pattern uses three states:

  • Closed: Normal operation; requests pass through.

  • Open: Calls to the dependency are blocked and fail fast.

  • Half-open: A trial period to check whether the dependency has recovered.

Circuit breakers isolate failures and prevent cascading outages. Libraries like Resilience4j (Java), gobreaker (Go), and pybreaker (Python) are commonly used.

Cascading failure example

Your app calls a third-party weather API. It starts timing out.

Soon:

  • Threads get stuck waiting.

  • Memory usage spikes.

  • Other unrelated requests start failing.

This is called a cascading failure.

With the following circuit breaker in place (sketched here with the Node.js opossum library as one example):

// Assumes the Node.js opossum library and an async fetchWeather(city) function.
const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(fetchWeather, {
  errorThresholdPercentage: 50, // open once 50% of recent calls fail
  resetTimeout: 5000            // after 5 seconds, move to half-open and retry
});
breaker.fallback(() => ({ message: 'Weather data unavailable' }));

  • The breaker opens when 50% of recent requests fail within the rolling window (e.g., 5 failures out of 10 requests).

  • All further calls are short-circuited in favor of the breaker’s (optional) fallback (e.g., a “Weather data unavailable” message).

  • After the timeout period, the breaker half-opens, allowing limited requests through to check if the service is back online.

  • Most importantly, your service stays up. Users see a degraded but working experience.
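Calls then go through the breaker rather than hitting the API directly (still assuming the opossum-style API sketched above):

// breaker.fire() invokes fetchWeather; if the breaker is open, the registered fallback value is returned instead.
breaker.fire('Berlin')
  .then((weather) => console.log(weather))
  .catch((err) => console.error('Weather lookup failed:', err));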

4. Bulkheads: Isolate to contain damage

Bulkheading is a pattern derived from ship design—dividing a ship into watertight compartments so that a breach in one area doesn’t sink the whole vessel.

In software, it means isolating different parts of the system by:

  • Using separate thread pools for critical vs non-critical traffic.

  • Applying rate limits per endpoint.

  • Assigning compute quotas by customer tier.

For example, if admin and customer endpoints share the same thread pool and an admin report hangs, it could lock customers out of the app. With bulkheads in place, each component stays isolated, so a flood in one won’t sink the whole ship.
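As a minimal sketch (no particular library assumed), a bulkhead can be as simple as a separate concurrency limit per traffic class, so a hung admin report can never consume the capacity reserved for customer requests:

// Sketch only: a tiny concurrency limiter acting as a bulkhead per traffic class.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.queue = [];
  }

  async run(task) {
    if (this.active < this.maxConcurrent) {
      this.active++;
    } else {
      // Wait until a finishing task hands over its slot.
      await new Promise((resolve) => this.queue.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next();         // hand the slot directly to the next queued caller
      } else {
        this.active--;  // otherwise free the slot
      }
    }
  }
}

// Separate pools: a hung admin report can't starve customer traffic.
const customerPool = new Bulkhead(50);
const adminPool = new Bulkhead(5);

// e.g., adminPool.run(() => generateReport()); customerPool.run(() => handleLogin());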

Done right, bulkheads let your system degrade gracefully instead of crashing completely. Bulkheading is one of the most underrated resilience patterns out there.

John Quest: Log analysis drill

Objective: Analyze logs to find the root cause of a failure.

Sample data:

{"timestamp":"2025-09-17T10:00:01Z", "level":"INFO", "trace_id":"abc123", "service":"auth", "message":"Login attempt for user1"}
{"timestamp":"2025-09-17T10:00:02Z", "level":"ERROR", "trace_id":"abc123", "service":"auth", "message":"Database timeout on user lookup"}
{"timestamp":"2025-09-17T10:00:05Z", "level":"ERROR", "trace_id":"abc123", "service":"frontend", "message":"500 returned to user"}
{"timestamp":"2025-09-17T10:00:06Z", "level":"INFO", "trace_id":"def456", "service":"auth", "message":"Login attempt for user2"}
{"timestamp":"2025-09-17T10:00:07Z", "level":"INFO", "trace_id":"def456", "service":"auth", "message":"Login successful"}

Task:

  • Find the failing trace.

  • Identify the root cause.

To check your answer, click the “Solution” button.

Test your knowledge

Quiz: Staff+ Triaging

1. What is the main advantage of structured logging over plain text logs?

A. Easier to read by humans
B. Saves storage
C. Easier to filter and search programmatically
D. Eliminates the need for tracing

