What Kinds of LLM Evaluations Should You Run?
Explore the different kinds of evaluations necessary to build trustworthy LLM systems. Understand fast automated unit tests, deeper model-based and human evaluations, and A/B testing with real users. Learn how these layered approaches help detect failures early, ensure professionalism, and maintain system reliability throughout development and production.
After a strong demo, the startup treats polished email outputs as sufficient evidence of quality and ships without a formal evaluation or review process. Weeks later, a major client deploys the tool for customer support, where the system rewrites complaint or escalation emails to sound polite while subtly shifting blame to the user and implying potential consequences for the account. Several messages are sent before the issue is detected, resulting in a spike in complaints and a client account at risk of being lost.
When the team investigates, they realize they have almost no visibility into how the model behaves. They can see the outputs, but not the prompts, examples, or assumptions that drive tone shifts. They lack evaluations for professionalism, blame attribution, and escalation language, as well as sampled traces to understand how the assistant responds to emotionally charged inputs.
Fixing the issue becomes largely trial-and-error. The team tweaks prompts, adds examples, and bans phrases without a clear signal of what improves behavior or introduces regressions. They quickly realize that a convincing demo does not replace systematic evaluation, and that reliability has to be engineered rather than assumed.
What kinds of evaluations should I run?
When you hear “LLM evaluations,” you might assume they refer to a single sophisticated metric or a dashboard that declares a system good or bad. In practice, evaluation is not one thing. It is a set of activities performed at different stages of development. Rigorous and systematic evaluation is the most crucial aspect of building reliable AI systems. Many mature teams treat “evals and curation” as the center of the system, rather than an afterthought. Most of the work is not spent training models or tweaking prompts, but rather deciding what “good” looks like and verifying whether the system meets that standard.
Evaluations can be organized into three tiers, ordered from lowest to highest cost:
Fast automated checks
Deeper model-based or human evaluations
A/B tests with real users
What are unit tests in AI evaluation?
Unit tests in AI evaluation are fast, automated checks that verify basic expectations, such as:
The output follows the required format.
The model refuses disallowed requests.
The prompt still produces structured JSON.
These evaluations are inexpensive and easy to run, so teams often execute them on every code or prompt change. They will not tell you whether a product is exceptional, but they are very effective at catching obvious breakages before users ever see them.
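To make this concrete, here is a minimal sketch of two such checks in Python. The function names, required fields, and refusal markers are illustrative assumptions, not a prescribed implementation; adapt them to your own output contract.

```python
import json

# Hypothetical required fields for a structured email-drafting output.
REQUIRED_KEYS = {"subject", "body", "tone"}


def check_json_format(raw_output: str) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = REQUIRED_KEYS - parsed.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []


def check_refusal(raw_output: str) -> list[str]:
    """Rough check that the model declined a disallowed request."""
    refusal_markers = ("i can't help", "i cannot help", "i'm unable to")
    if not any(marker in raw_output.lower() for marker in refusal_markers):
        return ["expected a refusal, but none was detected"]
    return []


if __name__ == "__main__":
    good = '{"subject": "Update", "body": "Hi there...", "tone": "neutral"}'
    assert check_json_format(good) == []
    assert check_json_format("not json") == ["output is not valid JSON"]
    assert check_refusal("I can't help with that request.") == []
```

Because these checks are just fast assertions, they can run in the same CI pipeline as ordinary unit tests on every prompt or code change.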
What are model-based and human evaluations in AI evaluation?
Most debugging happens at this stage. Teams examine real examples, or traces, and ask more nuanced questions, such as:
Is this response helpful?
Is the tone appropriate?
Did the model make a correct judgment given the input?
These evaluations are higher in cost because they require human review or additional model inference. As a result, teams run them on a fixed cadence rather than continuously. Teams often build strong unit test coverage first, since these evaluations take more time to design, run, and interpret.
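One common way to automate part of this tier is to use a second model as a judge. The sketch below assumes an OpenAI-style Python client; the model name, rubric prompt, and `judge_tone` function are assumptions for illustration, not a recommended setup.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()

# Hypothetical rubric: the judge labels tone as PASS or FAIL with a short reason.
JUDGE_PROMPT = """You are reviewing a customer-support email drafted by an assistant.
Label it PASS if the tone is professional and does not shift blame to the customer,
otherwise label it FAIL. Answer with PASS or FAIL on the first line, then one
sentence explaining why.

Email:
{email}
"""


def judge_tone(email: str, model: str = "gpt-4o-mini") -> tuple[bool, str]:
    """Return (passed, explanation) from a model-based tone check."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(email=email)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    first_line, _, rest = text.partition("\n")
    return first_line.strip().upper().startswith("PASS"), rest.strip()
```

A judge like this is itself a model that can drift or err, so its verdicts should be spot-checked against human labels on a sample of traces rather than trusted blindly.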
What is A/B testing in AI evaluation?
A/B testing compares two versions of a system with real users. It is the most expensive and riskiest form of evaluation because it directly affects user experience and perception. For that reason, teams typically reserve it for major product or model changes. There is no strict rule for when to introduce each type. Instead, teams continuously balance speed, cost, user trust, and learning. This trade-off is not unique to AI. It closely mirrors the same decisions product teams already make, with the key difference that LLM behavior is harder to reason about without deliberate evaluation.
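To show the basic mechanics, here is a minimal sketch of deterministic user bucketing for an A/B test. The experiment name, variant labels, and 50/50 split are assumptions; a real experiment also needs logging, guardrails, and a rollback path.

```python
import hashlib


def assign_variant(
    user_id: str,
    experiment: str = "email-prompt-v2",  # hypothetical experiment name
    treatment_share: float = 0.5,
) -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    Hashing the user ID together with the experiment name keeps assignment
    stable across sessions while letting different experiments bucket
    users independently of one another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same bucket for a given experiment.
print(assign_variant("user-123"))
```

Deterministic hashing (rather than random assignment at request time) matters because a user who flips between variants mid-conversation would make the comparison between the two systems much harder to interpret.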
This course focuses primarily on fast automated checks and model-based or human evaluations. These are the areas where most teams struggle and where most learning, debugging, and product improvement take place. A/B testing and production experiments are covered later and treated as higher-cost tools that build on a strong foundation of traces, error analysis, and systematic review. When the first two tiers are solid, the third becomes far more effective and far less risky.
Note: In this course, “unit test” for LLM systems refers broadly to fast, automated checks, such as schema validation, JSON formatting checks, refusal checks, and other small assertions that catch obvious breakages before humans ever see them.
In later lessons, you will see these same ideas reappear under different names, including assertions, schema checks, minimal reproduction tests, and micro-tests. Despite the labels, they all belong to the same category of fast automated checks.
Throughout the course, when a new evaluation technique is introduced, we will indicate whether it belongs to fast automated checks, model-based checks, or human checks. This keeps the three-tier framework introduced here consistent and easy to apply as your system grows.
Who is this course for?
This course is designed for individuals building products with large language models who want those systems to be reliable, debuggable, and safe to deploy. You might be:
An engineer integrating LLMs into production systems
A product manager overseeing AI-powered features
A founder or technical leader responsible for deployment decisions
Anyone shipping LLM-powered features to real users
Prerequisites:
Familiarity with LLM APIs, prompts, and model outputs
No deep machine learning background required
Throughout the course, we repeatedly return to the three types of evaluation introduced earlier: fast automated checks, deeper model-based and human reviews, and higher-stakes experiments with real users. The goal is to help you understand not just what these evaluations are, but when and why to use each type.
What will you learn in this course?
By the end of this course, you will have a clear mental model for how to evaluate LLM-based systems at every stage of development, from early prototypes to production monitoring. You will learn how to reason about failures, design evaluations that matter, and avoid common traps that make teams overconfident in fragile systems. Most importantly, you will learn how to use evaluations as a tool for learning and iteration, not just as a gatekeeper.
Specifically, you will learn how to:
Understand what LLM evaluations are, how traces work, and how to set up a minimum viable evaluation process
Build a business case for investing in evaluations and choose the right level of rigor for your product
Perform effective error analysis and surface failures beyond user complaints
Collect, sample, and generate data for evaluation, including when and how to use synthetic data
Design robust evaluations, choose appropriate metrics, and use LLMs as judges responsibly
Collaborate with humans in the loop, automate parts of your eval workflow, and monitor systems in production
Evaluate complex systems such as RAG pipelines, multi-turn conversations, and agentic workflows
A quick note on metrics: Many people entering LLM evaluation expect traditional NLP metrics such as BLEU, ROUGE, BERTScore, or embedding similarity to play a central role. In practice, these metrics are primarily useful for comparing base models on fixed datasets, rather than for evaluating actual product behavior.
Throughout this course, we focus on a trace-first, failure-driven approach because most real-world failures stem from reasoning gaps, missing context, tool misuse, or workflow errors that similarity metrics cannot capture.
This course is designed to provide you with both a framework and concrete tactics that you can apply immediately to your own AI systems.