How to Capture and Review LLM Traces for Reliable Evaluation
Explore how to reliably capture end-to-end large language model traces, including user inputs, system prompts, and intermediate steps. Learn best practices for storing immutable structured data and designing custom annotation tools that simplify reviewing traces. Gain skills to build a minimum viable evaluation setup for systematic error analysis and reliable LLM evaluation workflows.
The previous lesson focused on interpreting system behavior through traces and error analysis. That work assumes access to meaningful traces. In practice, this is often the first point of failure. Teams collect incomplete logs, lose intermediate steps, or lack sufficiently diverse examples to learn from. Without a reliable way to generate and capture traces, even well-designed evaluation frameworks stall.
This lesson focuses on deliberately producing traces. The first part covers how to capture complete, end-to-end traces from the system and review them efficiently using custom workflows and tools. The second part focuses on generating inputs, including synthetic data, that force the system to exercise different behaviors. By the end of the lesson, readers should know how to reliably create and work with the traces needed for error analysis and evaluation at scale.
How do I capture traces from my system?
As we have discussed, to evaluate a large language model (LLM) system meaningfully, you need more than outputs. You need a complete record of how each response was produced. Capturing traces means designing the system so that every user interaction can be inspected end-to-end. This is not primarily a tooling problem. It is a workflow and instrumentation problem.
At a minimum, every user interaction should correspond to a single trace. That trace should include:
The user input
Any system or developer prompts
All model calls
Retrieved context
Tool invocations and their results
The final response returned to the user
If any of these steps are missing, the evaluation becomes unreliable. You may know something went wrong, but you will not know where or why.
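As a concrete illustration, a trace can be represented as one small structured record per interaction. The sketch below uses Python dataclasses; the field names and step kinds are illustrative choices for this lesson, not a standard format.

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json
import uuid


@dataclass
class TraceStep:
    """One intermediate step: a model call, tool invocation, or retrieval."""
    kind: str  # e.g. "model_call", "tool_call", "retrieval"
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """A complete record of a single user interaction, end to end."""
    user_input: str
    system_prompt: str
    steps: list[TraceStep] = field(default_factory=list)
    final_response: str = ""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        # Serialize the whole interaction as one structured, self-contained record.
        return json.dumps(asdict(self), indent=2)
```

Each user interaction produces exactly one such record, which keeps the mapping from interactions to traces unambiguous.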
A common mistake is relying on partial logs, such as recording only the final prompt or the final output rather than the full execution path. This creates the illusion of observability while hiding the most important failures. If a retriever returns the wrong document, a tool produces an unexpected result, or an intermediate prompt subtly shifts behavior, none of that is visible in a partial log.
Note: A trace must capture the full execution, not just the endpoints.
What does a real trace look like in practice?
Consider a customer support chatbot that rewrites incoming emails in a professional tone. A complete trace for a single interaction includes the following:
The original customer email.
The system prompt, which defines tone and policy constraints.
A classification step that labels customer mood.
A retrieval step that fetches internal guidelines.
The rewritten email.
Any post-processing or safety checks.
If the final email escalates unnecessarily, the trace allows you to see whether the failure originated from misclassification, incorrect retrieval, or generation itself. Without the full trace, it is easy to blame the wrong component.
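To make this concrete, here is what a complete trace for that interaction might look like as structured data. The content and field names are invented for illustration and follow the sketch above.

```python
# Illustrative only: a hypothetical trace for the email-rewriting chatbot.
example_trace = {
    "trace_id": "a1b2c3d4",
    "user_input": "I was charged twice and nobody has answered my emails!",
    "system_prompt": "Rewrite the customer's email in a professional tone. "
                     "Follow the escalation policy in the retrieved guidelines.",
    "steps": [
        {"kind": "classification", "outputs": {"customer_mood": "frustrated"}},
        {"kind": "retrieval", "outputs": {"guidelines": ["billing-disputes", "tone-policy"]}},
        {"kind": "model_call", "outputs": {"draft": "Hello, I noticed a duplicate charge..."}},
        {"kind": "safety_check", "outputs": {"passed": True}},
    ],
    "final_response": "Hello, I noticed a duplicate charge on my account and would "
                      "appreciate your help resolving it.",
}
```

With every step present, a reviewer can see at a glance whether the mood classification, the retrieved guidelines, or the generation step caused an unnecessary escalation.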
Traces should be captured as immutable snapshots. Reconstructing traces later by stitching together logs is fragile and often misses context. Instead, design the system so that traces are created as interactions happen and stored as coherent records.
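One simple way to honor this, assuming traces are serialized as JSON-compatible dictionaries like the one above, is to write each trace to a write-once file the moment the interaction completes. The directory layout below is an arbitrary choice for the sketch.

```python
import json
import os
import time

TRACE_DIR = "traces"  # hypothetical local directory for trace snapshots


def save_trace_snapshot(trace: dict) -> str:
    """Persist a completed trace as a write-once JSON snapshot."""
    os.makedirs(TRACE_DIR, exist_ok=True)
    filename = f"{int(time.time())}_{trace['trace_id']}.json"
    path = os.path.join(TRACE_DIR, filename)
    # Mode "x" fails if the file already exists, so snapshots cannot be overwritten.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(trace, f, indent=2, ensure_ascii=False)
    return path
```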
Where do traces usually get lost?
Most trace failures occur at system boundaries. Tool calls may be logged in one place, model calls in another, and retrieval results in a separate location. When these pieces are not tied together, you lose the ability to reason about execution as a whole.
Another common issue is conditional logic. Branches taken only in edge cases are often under-instrumented. For example, fallback prompts, error handlers, or retries may not be recorded consistently. These paths are precisely where failures tend to cluster, which makes their absence especially costly.
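Below is a sketch of how a retry and fallback path can land in the same trace as the happy path. The `call_model` callable stands in for whatever client your system uses and is an assumption, not a real API.

```python
def generate_with_fallback(prompt: str, call_model, trace_steps: list) -> str:
    """Record every branch taken: successful calls, retries, and the fallback."""
    for attempt in range(2):  # one retry before giving up
        try:
            response = call_model(prompt)
            trace_steps.append(
                {"kind": "model_call", "attempt": attempt, "output": response}
            )
            return response
        except Exception as err:
            # The error path is instrumented just as carefully as the success path.
            trace_steps.append(
                {"kind": "model_error", "attempt": attempt, "error": str(err)}
            )
    fallback = "Sorry, something went wrong. A teammate will follow up shortly."
    trace_steps.append({"kind": "fallback_response", "output": fallback})
    return fallback
```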
Finally, traces are often sampled too aggressively or truncated for cost reasons. While this may be necessary in production at scale, avoid it early on. During development and evaluation, completeness matters more than efficiency. You can optimize storage and sampling later, once you understand what matters.
What principles lead to higher-quality traces?
Regardless of which tools or frameworks you use, a few principles consistently lead to better traces:
Treat traces as durable artifacts rather than ad hoc debug logs.
Capture traces at the system level, not inside individual components.
Ensure every model call, tool call, and retrieval step is traceable.
Prefer structured data over free-form text logs.
Make traces easy to inspect manually before worrying about dashboards.
Many tools are available to assist with trace collection. Frameworks like LangChain and LlamaIndex include built-in tracing and callback systems.
Observability platforms such as LangSmith, Arize Phoenix, Braintrust, and Weights & Biases offer trace storage, visualization, and evaluation features. These tools differ in scope and complexity, but they all share the same underlying principles: capturing every step, storing structured data, and facilitating easy inspection.
For now, focus on designing your system so that complete traces are possible at all; the choice of platform matters less than the discipline of capturing complete information.
How should I review traces?
For most teams, building a custom annotation tool is the right choice. It is often the single highest-leverage investment you can make in the evaluation workflow. With modern AI-assisted development tools, it is possible to build a tailored review interface in hours rather than weeks. Teams that take this approach tend to iterate much faster because the tool matches their data, domain, and workflow exactly.
Custom tools
Custom annotation tools are effective because they consolidate all relevant context into a single location. Instead of jumping between logs, dashboards, and spreadsheets, reviewers can see the full trace, metadata, and notes together. More importantly, custom tools can render outputs in a way that reflects the actual product. Emails look like emails, chat messages look like chat transcripts, and code looks like code. This reduces cognitive load and makes failures easier to spot.
Off-the-shelf annotation platforms
Off-the-shelf annotation platforms make sense when you need to coordinate dozens of distributed annotators, manage permissions, or meet enterprise access requirements. Even then, configuration overhead and workflow constraints tend to slow teams down. As a general rule, when a small group of domain experts handles most reviews, a custom tool often outperforms a generic one.
What makes a good custom interface for reviewing traces?
A good annotation interface keeps humans in a state of flow. The goal is not to support every possible feature, but to make reviewing many traces fast, clear, and mentally lightweight. The best interfaces are simple, domain-specific, and optimized for the reviewer’s actual task.
At a minimum, a good interface should:
Render traces in a domain-appropriate way, showing outputs as users would see them.
Support fast navigation and clear progress.
Make it easy to focus on likely failures through filtering or grouping.
Reduce context switching by keeping relevant metadata visible.
Stay minimal and add features only when they clearly reduce friction.
You don’t need to build a perfect tool upfront. Many teams start with notebook-based interfaces and gradually evolve them into lightweight applications as needs become clearer. What matters most is that reviewing traces feels easy enough that people do it regularly.
The purpose of an annotation tool is not to look impressive; it is to help humans see patterns, notice failures, and make decisions quickly. If reviewing fifty traces feels slow or exhausting, the tool is getting in the way. When done well, a custom interface fades into the background, allowing error analysis to drive progress.
How do you set up a minimum evaluation?
You do not need a sophisticated platform to start evaluating traces. The minimum viable setup is simpler than most teams expect: a method for storing traces as structured data and an interface that allows humans to review them quickly. Everything else, including dashboards, automated scoring, and integrations, can come later. What matters first is that you can capture complete traces and inspect them without friction.
What components make up a practical minimum setup?
A practical minimum setup has three components:
A trace format that captures every step of execution, including user input, system prompts, intermediate steps such as classification or retrieval, and the final output. Store these traces as structured JavaScript Object Notation (JSON) rather than free-form logs.
A simple interface that renders traces in a domain-appropriate way. Emails should look like emails, code should look like code, and conversations should look like conversations.
An annotation workflow that lets reviewers mark traces as pass or fail, categorize failures, and take notes, as sketched below. This does not require a database or a deployment pipeline. A local script that reads JSON files and renders them in a browser is sufficient to get started.
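As one possible starting point, the sketch below is an even simpler, terminal-based version of that idea: it reads trace files from a local directory, shows the input and output, skips traces that were already reviewed, and appends each judgment to an annotations file. The paths and field names are assumptions carried over from the earlier sketches.

```python
import glob
import json
import os

TRACE_DIR = "traces"                    # hypothetical directory of trace JSON files
ANNOTATIONS_PATH = "annotations.jsonl"  # append-only record of reviewer judgments


def review_traces() -> None:
    """Walk through stored traces and record pass/fail judgments with notes."""
    reviewed = set()
    if os.path.exists(ANNOTATIONS_PATH):
        with open(ANNOTATIONS_PATH, encoding="utf-8") as f:
            reviewed = {json.loads(line)["trace_id"] for line in f if line.strip()}

    for path in sorted(glob.glob(os.path.join(TRACE_DIR, "*.json"))):
        with open(path, encoding="utf-8") as f:
            trace = json.load(f)
        if trace.get("trace_id") in reviewed:
            continue  # crude progress tracking: skip traces reviewed earlier

        print("=" * 60)
        print("USER INPUT:\n", trace.get("user_input", ""))
        print("\nFINAL RESPONSE:\n", trace.get("final_response", ""))

        verdict = input("\nPass or fail? [p/f] ").strip().lower()
        note = input("Notes (optional): ").strip()
        with open(ANNOTATIONS_PATH, "a", encoding="utf-8") as f:
            record = {
                "trace_id": trace.get("trace_id"),
                "verdict": "pass" if verdict == "p" else "fail",
                "note": note,
            }
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    review_traces()
```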
Why does ease of review matter more than tooling complexity?
The goal of this setup is to make reviewing traces feel easy enough that people actually do it. If loading a trace requires switching between three tools, or if outputs render as raw text instead of formatted content, reviewers will avoid the work. Keyboard shortcuts for navigation, clear progress indicators, and the ability to filter by status all reduce friction. These details matter more than the completeness of features. A minimal tool that people use daily will teach you more than a sophisticated platform that sits idle.
The interactive tool above demonstrates these principles. It shows a three-panel layout: a trace list at the top that supports filtering and progress tracking, a detailed trace view in the center that renders outputs as users would see them, and an annotation panel at the bottom for recording judgments.
This is not a production system. It is a starting point. Many teams begin with something similar and evolve it as their needs become clearer. The key is to start reviewing traces now, using whatever setup you can build in a few hours, rather than waiting for perfect tooling.
What’s next?
Now that you can reliably capture traces and review them efficiently, the next challenge is scale. In practice, you will quickly accumulate more traces than you can manually inspect. Not all traces are equally informative, and not all failures are equally important. Reviewing everything is neither realistic nor necessary.
In the next lesson, we’ll focus on how to sample traces intelligently and decide which examples deserve closer attention. This includes strategies for selecting representative data, prioritizing likely failure cases, and using signals from production to guide review. The goal is to spend human attention where it creates the most learning, rather than spreading it thin across random examples.