
What Kinds of LLM Evaluations Should You Run?

Explore the various kinds of LLM evaluations essential for building reliable AI systems. Understand fast automated checks, deeper model-based or human reviews, and A/B tests. Learn to balance cost, speed, and user trust to ensure system quality beyond polished demos and avoid common pitfalls.

Consider a startup that builds an AI assistant to rewrite customer complaint emails in a professional tone. The tool impresses early users, and the demo looks polished: messy, frustrated emails go in, calm and structured replies come out.

After a strong demo, the startup treats polished email outputs as sufficient evidence of quality and ships without a formal evaluation or review process. Weeks later, a major client deploys the tool for customer support, where the system rewrites complaint and escalation emails to sound polite while subtly shifting blame onto the customer and implying potential consequences for their account. Several such messages are sent before the issue is detected, producing a spike in complaints and a client threatening to cancel the contract.

When the team investigates, they realize they have almost no visibility into how the model behaves. They can see the outputs, but not the prompts, examples, or assumptions that drive tone shifts. They lack evaluations for professionalism, blame attribution, and escalation language, as well as sampled traces to understand how the assistant responds to emotionally charged inputs.

Fixing the issue becomes largely trial-and-error. The team tweaks prompts, adds examples, and bans phrases without a clear signal of what improves behavior or introduces regressions. They quickly realize that a convincing demo does not replace systematic evaluation, and that reliability has to be engineered rather than assumed.

What kinds of evaluations should I run?

When you hear “LLM evaluations,” you might assume they refer to a single sophisticated metric or a dashboard that declares a system good or bad. In practice, evaluation is not one thing. It is a set of activities performed at different stages of development. Rigorous and systematic evaluation is the most crucial aspect of building reliable AI systems. Many mature teams treat “evals and curation” as the center of the system, rather than an afterthought. Most of the work is not spent training models or tweaking prompts, but rather deciding what “good” looks like and verifying whether the system meets that standard.

Evaluations can be organized into three tiers, ordered from lowest to highest cost:

  1. Fast automated checks

  2. Deeper model-based or human evaluations

  3. A/B tests with real users

What are unit tests in AI evaluation?

Unit tests in AI evaluation are fast, automated checks that verify basic expectations, such as:

  • The output follows the required format.

  • The model refuses disallowed requests.

  • The prompt still produces structured JSON.

These evaluations are inexpensive and easy to run, so teams often execute them on every code or prompt change. They will not tell you whether a product is exceptional, but they are very effective at catching obvious breakages before users ever see them.
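To make this concrete, here is a minimal sketch of two such checks in Python. The key names, refusal phrases, and email payload are illustrative assumptions, not a standard; real checks would be tailored to your system's actual output contract.

```python
import json

def check_json_format(output: str, required_keys: set) -> bool:
    """Fast automated check: does the output parse as JSON with the
    expected keys? (Key names here are illustrative assumptions.)"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return required_keys.issubset(data)

def check_refusal(output: str) -> bool:
    """Fast automated check: does the output contain a refusal marker?
    (Marker phrases are assumptions; tune them to your model's style.)"""
    markers = ("i can't help", "i cannot help", "i'm unable to")
    return any(m in output.lower() for m in markers)

# Hypothetical outputs from the email-rewriting assistant.
good = '{"subject": "Re: Billing issue", "body": "Thanks for reaching out."}'
bad = "Sure! Here's the email:"

print(check_json_format(good, {"subject", "body"}))  # True
print(check_json_format(bad, {"subject", "body"}))   # False
print(check_refusal("I can't help with that request."))  # True
```

Checks like these can run in milliseconds on every prompt or code change, which is exactly what makes this tier cheap enough to run continuously.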

What are model-based and human evaluations in AI evaluation?

Most debugging happens at this stage. Teams examine real examples, or traces, and ask more nuanced questions, such as:

  • Is this response helpful?

  • Is the tone appropriate?

  • Did the model make a correct judgment given the input?

These evaluations are higher in cost because they require human review or additional model inference. As a result, teams run them on a fixed cadence rather than continuously. Teams often build strong unit test coverage first, since these evaluations take more time to design, run, and interpret.
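One common way to automate part of this tier is an LLM-as-judge: a second model grades each trace against a rubric. The sketch below shows the scaffolding only; the rubric wording is an illustrative assumption, and a canned reply stands in for the real judge-model call.

```python
def build_judge_prompt(original: str, rewrite: str) -> str:
    """Assemble an LLM-as-judge prompt. The rubric below is an
    illustrative assumption, not a fixed standard."""
    return (
        "You are grading a rewritten customer-support email.\n"
        f"Original email:\n{original}\n\n"
        f"Rewritten email:\n{rewrite}\n\n"
        "Answer PASS or FAIL for each criterion:\n"
        "1. Professional tone\n"
        "2. Does not shift blame to the customer\n"
        "3. No threatening or escalation language\n"
        "Reply with one line per criterion, e.g. '1: PASS'."
    )

def parse_verdict(judge_reply: str) -> dict:
    """Parse 'N: PASS/FAIL' lines into a criterion -> bool mapping."""
    verdict = {}
    for line in judge_reply.strip().splitlines():
        num, _, result = line.partition(":")
        verdict[int(num)] = result.strip().upper() == "PASS"
    return verdict

# A canned reply stands in for a real judge-model API call here.
reply = "1: PASS\n2: FAIL\n3: PASS"
print(parse_verdict(reply))  # {1: True, 2: False, 3: True}
```

Note that criterion 2 failing on this example is exactly the blame-shifting behavior that sank the startup in the opening story; a judge rubric like this would have surfaced it before users did.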

What is A/B testing in AI evaluation?

A/B testing compares two versions of a system with real users. It is the most expensive and risky form of evaluation because it directly affects user experience and perception. For that reason, teams typically reserve it for major product or model changes.

There is no strict rule for when to introduce each type. Instead, teams continuously balance speed, cost, user trust, and learning. This trade-off is not unique to AI: it closely mirrors the decisions product teams already make, with the key difference that LLM behavior is harder to reason about without deliberate evaluation.

This course focuses primarily on fast automated checks and model-based or human evaluations. These are the areas where most teams struggle and where most learning, debugging, and product improvement take place. A/B testing and production experiments are covered later and treated as higher-cost tools that build on a strong foundation of traces, error analysis, and systematic review. When the first two checks are solid, the third becomes far more effective and far less risky.

Note: In this course, “unit test” for LLM systems refers broadly to fast, automated checks, such as schema validation, JSON formatting checks, refusal checks, and other small assertions that catch obvious breakages before humans ever see them.

In later lessons, you will see these same ideas reappear under different names, including assertions, schema checks, minimal reproduction tests, and micro-tests. Despite the labels, they all belong to the same category of fast automated checks.

Throughout the course, when a new evaluation technique is introduced, we will indicate whether it belongs to fast automated checks, model-based checks, or human checks. This keeps the three-tier framework introduced here consistent and easy to apply as the system grows.

What's next?

The startup story that opened this lesson is not an outlier. It is the default outcome when teams mistake a functional demo for a reliable system. The failure was not inevitable; instead, it was the predictable result of shipping without visibility into how the system actually behaves. In the coming lessons, we will begin building that visibility by learning to read traces, which provide a complete record of every action your system takes between a user’s first message and the final response.