
Traces and Error Analysis Explained

Explore how to read and analyze traces, which capture the full sequence of an LLM system's operations, to identify and categorize failure modes. Understand how error analysis helps prioritize issues and guides improvements in system design and prompts, laying the groundwork for scalable, effective AI evaluation.

In the evaluation framework introduced earlier, this lesson focuses on human-led evaluation. We are not yet using unit tests or model-based judges. Instead, a human reviews traces, identifies failures, and groups them into meaningful categories. This step establishes the shared understanding on which all subsequent evaluations are built.

The scope of evaluation in this lesson encompasses the end-to-end LLM system, rather than the base model in isolation. Traces reflect everything that shapes behavior, including prompts, retrieval, tools, and orchestration logic. At this stage, the goal is not to judge model quality, but to understand where the system fails and why. In practice, most issues uncovered at this stage lead to changes in prompts or system design rather than model selection.

Although this process starts manually, it is designed to scale. A small number of traces reviewed by a single domain expert is usually enough to surface the dominant failure modes. Those patterns later become candidates for automated checks or model-based evaluation. Human error analysis is the foundation that makes scaled evaluation principled rather than arbitrary.

This lesson focuses on two connected activities. First, it covers how to read traces, which are complete records of how the system moves from input to output. Second, it introduces error analysis: the process of reviewing many traces to identify patterns in how and why the system fails. Traces provide visibility. Error analysis turns that visibility into actionable insights.

What is a trace?

A trace is the complete record of what happens from a user’s first input to the system’s final response. It includes every user message, every assistant reply, any retrieved documents, tool calls, intermediate steps, and decisions made along the way. If your system uses multiple agents, tools, or retrieval steps, a trace captures all of that in one place. Instead of just seeing the final answer, you see how the system arrived there.
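To make this concrete, the sketch below shows one way a trace could be represented as data. It is a minimal illustration with hypothetical field names, not the schema of any particular tracing tool; the point is simply that every step from input to output is recorded in order, alongside the final answer. The steps loosely mirror the email-assistant example described next.

```python
from dataclasses import dataclass, field
from typing import Any

# A minimal, hypothetical representation of a trace. Real tracing tools use
# richer schemas, but the idea is the same: every step from input to output
# is recorded in order, not just the final answer.

@dataclass
class TraceStep:
    kind: str               # e.g. "user_message", "retrieval", "tool_call", "model_call"
    detail: dict[str, Any]  # inputs, outputs, or documents attached to this step

@dataclass
class Trace:
    trace_id: str
    steps: list[TraceStep] = field(default_factory=list)
    final_output: str = ""

# Illustrative trace for an email assistant asked to rewrite a complaint.
trace = Trace(
    trace_id="trace-001",
    steps=[
        TraceStep("user_message", {"text": "Rewrite this complaint to sound professional."}),
        TraceStep("retrieval", {"documents": ["internal account-termination policy"]}),
        TraceStep("model_call", {"note": "interpreted the customer's message as hostile"}),
    ],
    final_output="Polite-sounding reply that quotes enforcement language.",
)
```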

Example

For example, consider an email assistant rewriting a customer complaint to sound professional. The final output might appear polite, but the trace could show that the model retrieved an internal account-termination policy, misinterpreted the customer’s message as hostile, and selected examples that emphasized enforcement language. Without visibility into that sequence of steps, the failure appears to be a tone issue. With the trace, it becomes clear that the system followed a flawed path well before the final wording was generated.

Why traces matter

Traces matter because most LLM failures are not obvious from the final output alone. Two answers can look similar on the surface while being produced for very different reasons. One may be correct for the right reasons, while another is accidentally correct or fails in edge cases. Without traces, diagnosis becomes guesswork. With traces, teams can ask concrete questions, such as which document was retrieved, which tool was called, or which assumption caused the model to shift tone or logic.

Most systems internally break traces into smaller steps, sometimes called spans, such as individual model calls or retrieval operations. You do not need to worry about this structure early on. What matters is that you can reconstruct the full path from input to output and inspect each step in context.

Many tools help capture and inspect traces, including products such as LangSmith, Arize Phoenix, Braintrust, and others. You do not need to choose or understand these tools yet. They are not part of the course. Throughout this course, we introduce tooling only when it becomes useful for a specific task. For now, it is enough to understand what a trace is and why access to traces is essential for meaningful evaluation.

What is error analysis?

Traces give you visibility into what your system is doing, but visibility alone is not enough. Evaluation requires understanding patterns across many traces, not just inspecting individual executions. Error analysis turns raw traces into structured insight by systematically identifying how and why the system fails.

Why is error analysis so important?

Error analysis is the most important activity in LLM evaluation. Before choosing metrics, building evaluators, or adopting tools, you need to understand how your system actually fails in practice. It helps you identify failure modes that are specific to your application, users, and data. Without error analysis, evaluation efforts often default to generic metrics that do not improve real user outcomes.

Error analysis process

At a practical level, error analysis follows a simple but disciplined process (a short sketch of the records it produces follows the list):

  1. Create a dataset of representative traces: Start with real user interactions when possible. If you lack production data, generate synthetic data to bootstrap the process.

  2. Review traces and write open-ended notes: A human reviewer, ideally a single domain expert acting as a benevolent dictator, writes free-form observations about what went wrong.

  3. Focus on the first failure in each trace: Upstream errors often cause multiple downstream issues, so identifying the earliest failure avoids double counting symptoms.

  4. Group failures into a taxonomy: Cluster similar observations into distinct categories such as tone issues, incorrect retrievals, unsafe assumptions, or instruction misinterpretation.

  5. Count and prioritize failure modes: Measure how often each category appears and decide which failures are most important to address.

  6. Iterate until saturation: Continue reviewing traces until new examples stop revealing new failure modes, often after around one hundred traces.
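The sketch below illustrates the kind of lightweight record steps 2 through 4 produce: one open-ended note per reviewed trace, the earliest failure, and the category assigned later during grouping. The field names and categories are illustrative, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional

# One annotation per reviewed trace: a free-form note, the first failure,
# and the category assigned during grouping. Names here are illustrative.

@dataclass
class Annotation:
    trace_id: str
    notes: str                       # open-ended observation written while reading the trace
    first_failure: str               # the earliest point where the trace went wrong
    category: Optional[str] = None   # assigned later, during grouping (step 4)

annotations = [
    Annotation("trace-001", "Reply quotes the termination policy to the customer.",
               first_failure="retrieved internal policy document"),
    Annotation("trace-002", "Frustrated customer treated as hostile; reply is curt.",
               first_failure="misread emotional cue"),
]

# Step 4: once enough notes accumulate, similar observations are clustered
# into named failure modes.
annotations[0].category = "policy overreach"
annotations[1].category = "context misinterpretation"
```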

What does effective error analysis look like in practice?

As a concrete example, consider reviewing one hundred traces from an AI assistant that rewrites customer emails to sound professional. After open coding and grouping observations, a simple failure taxonomy might include:

  • Tone escalation: Responses appear polite but introduce threatening or punitive language.

  • Blame attribution: The assistant subtly shifts responsibility onto the customer without justification.

  • Policy overreach: Internal policies or enforcement actions appear in customer-facing messages.

  • Context misinterpretation: Emotional cues such as frustration are misread as hostility.

  • Instruction drift: The response partially ignores the user’s request, such as prioritizing firmness over professionalism.

After categorizing all traces, you might find that tone escalation occurs in 22 percent of cases, blame attribution in 15 percent, and policy overreach in 8 percent. This immediately shows where to focus. Instead of guessing which metric to track or which prompt to tweak, you now have a ranked list of real, user-impacting failure modes. These categories later become candidates for targeted evaluations, automated checks, or human review guidelines.
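Tallying the categories requires nothing sophisticated. The sketch below, using the illustrative counts from this example (with the remaining traces lumped into a placeholder category), ranks failure modes by frequency so the priorities are explicit:

```python
from collections import Counter

# Count and rank failure modes once every reviewed trace has a category label.
# The labels and counts are illustrative, mirroring the example above.
labels = (
    ["tone escalation"] * 22
    + ["blame attribution"] * 15
    + ["policy overreach"] * 8
    + ["other / no failure observed"] * 55   # placeholder for the remaining traces
)

counts = Counter(labels)
total = len(labels)

for category, count in counts.most_common():
    print(f"{category}: {count}/{total} ({count / total:.0%})")
```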

Error analysis is not a one-time task. Revisit it regularly as your system, data, and users evolve. Over time, you will develop intuition for where failures tend to occur and how to sample traces more efficiently. Most importantly, error analysis ensures that the evaluations you build later are grounded in real system behavior rather than abstract or counterproductive metrics.

What is a minimum viable evaluation setup?

A minimum viable evaluation setup starts with error analysis, not infrastructure. You do not need dashboards, complex metrics, or specialized tools to begin. Set aside thirty minutes to manually review twenty to fifty LLM outputs whenever you make a meaningful system change. This tight feedback loop keeps you grounded in real behavior rather than assumptions. For most small to medium-sized teams, the best setup relies on a single domain expert who defines what good and bad look like. Acting as a benevolent dictator, this expert provides consistent judgment based on a deep understanding of users and the domain, whether that expertise comes from psychology, law, customer support, or another field.

When should you involve more than one reviewer?

If evaluating a single interaction requires five subject-matter experts, the product scope is likely too broad. Larger organizations or products that span multiple domains may eventually require multiple annotators, but this should be the exception rather than the starting point. When multiple reviewers are involved, agreement can be measured using inter-annotator techniques. Even then, subjective judgment still plays a role, and a single expert is often sufficient in early stages. It is generally better to start simple and add complexity only when the domain clearly demands it.
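If a second reviewer does become necessary, agreement can be quantified before trusting the combined labels. The sketch below computes Cohen's kappa, one standard inter-annotator agreement statistic, on hypothetical labels from two reviewers of the same traces:

```python
from collections import Counter

# Cohen's kappa for two reviewers labeling the same traces. Kappa compares
# observed agreement with the agreement expected by chance given each
# reviewer's label frequencies. The labels below are illustrative.

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement if each reviewer labeled at random with their own frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))

    return (observed - expected) / (1 - expected)

reviewer_1 = ["tone escalation", "policy overreach", "no failure", "tone escalation"]
reviewer_2 = ["tone escalation", "blame attribution", "no failure", "tone escalation"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```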

What tools do you actually need to evaluate effectively?

Whenever possible, use coding tools to support the evaluation process. They make it easier to review traces, analyze outputs, visualize patterns, and iterate quickly, and many teams build lightweight annotation interfaces directly inside notebooks. The essential point is that effective evaluation does not require heavy tooling. It requires disciplined review and clear ownership of quality, which you can establish long before investing in sophisticated infrastructure.
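As one illustration of how lightweight this can be, the sketch below is a notebook-style annotation loop built from nothing more than print and input. The trace data is a hypothetical stand-in for however you load traces in your own setup:

```python
# A minimal notebook-style annotation loop: show each output, collect a
# free-form note, and keep the results in memory for later grouping.
traces = [
    {"id": "trace-001", "output": "We may be forced to terminate your account."},
    {"id": "trace-002", "output": "Thanks for flagging this; here is what we'll do next."},
]

notes = []
for trace in traces:
    print(f"--- {trace['id']} ---")
    print(trace["output"])
    notes.append({"trace_id": trace["id"],
                  "note": input("What went wrong, if anything? ")})
```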

What’s next?

The fastest way to undermine this lesson is to rush into metrics, dashboards, or automation before closely examining real traces. Keep the loop tight. Review examples frequently, write things down, and resist the urge to abstract too early. If you find yourself debating quality without pointing to concrete traces, that is usually a signal to slow down and return to error analysis.

The concepts in this lesson form the foundation for the rest of the course. Traces show what the system is doing, error analysis explains how to interpret it, and a minimum viable evaluation setup provides a way to act on those observations. All later topics, including metrics, LLM-as-judge approaches, automation, and production monitoring, build on this core. Without this foundation, later techniques tend to be brittle and misleading. With it, they are far more reliable and effective.