Evaluation Criteria
Explore how to define and run automated, repeatable evaluation tests for AI agents using the Google ADK framework. Understand core evaluation concepts, including test case design and diverse scoring criteria that assess an agent's reasoning, response quality, factual grounding, and safety. Discover how to apply static and dynamic evaluation methods, including user simulation for conversational agents, to ensure reliability and trustworthiness in production-ready AI applications.
We have successfully designed, built, and refactored a multi-agent Research Assistant. Our agent team functions as planned. However, in any professional software development life cycle, a critical question follows a successful build: How do we prove that it works correctly, and how do we ensure it continues to work correctly as we make changes over time?
Manually testing the agent with a few queries is a good start, but it is not scalable, repeatable, or objective. To build trust in our system and adopt a true engineering discipline, we need a systematic approach to automated evaluation. This lesson introduces the Google ADK’s powerful, built-in evaluation framework. We will learn how to define and run repeatable, objective tests against our agent, allowing us to systematically measure its quality, correctness, and reliability.
Core concepts of ADK evaluation
The ADK’s evaluation system is driven by the adk eval command-line interface and is configured primarily through two types of JSON files. Understanding the role of each is the first step in creating a robust testing strategy.
The EvalSet
This file defines what to test. An EvalSet is a collection of one or more test cases, which are called EvalCase objects. Each test case specifies the inputs that will be sent to the agent during the test run. At a minimum, this includes the user’s initial prompt. A test case can also include other crucial data, such as the expected outcome (like a reference answer or a specific sequence of tool calls) and any context that should be used to judge the agent’s response.
Each turn in an EvalCase’s conversation is a rich data structure that can include:
User content: The user’s query or prompt for that turn.
Expected intermediate tool use trajectory: The list of tool calls we expect the agent to make, in a specific order, to correctly respond to the user’s query. This is the ground truth for the agent’s reasoning process.
Expected intermediate agent responses: In a multi-agent system, these are the natural language responses produced by worker agents as they are orchestrated by the main agent. Capturing these can be critical for verifying that the correct delegation path was taken.
Final response: The expected golden final text response from the agent for that turn.
Let’s examine the structure of a generic .evalset.json file to understand these components in practice.
// Note: Comments are for explanation; they should be removed for a valid JSON file.
{
  "eval_set_id": "generic_eval_set_id",
  "name": "Generic Evaluation Set Name",
  "description": "A description of what this evaluation set is for.",
  "eval_cases": [
    {
      "eval_id": "generic_eval_case_id",
      "conversation": [
        {
          "invocation_id": "unique_invocation_identifier",
          "user_content": { // The user content for this turn
            "parts": [{"text": "The user's initial prompt for this turn."}],
            "role": "user"
          },
          "final_response": { // The expected final response
            "parts": [{"text": "The expected 'golden' final text response from the agent."}],
            "role": "model"
          },
          "intermediate_data": {
            "tool_uses": [ // The expected intermediate tool use trajectory
              {
                "name": "expected_tool_name_1",
                "args": {"parameter_name": "parameter_value"}
              },
              {
                "name": "expected_tool_name_2",
                "args": {}
              }
            ],
            "intermediate_responses": [] // Holds expected intermediate agent responses
          }
        }
      ],
      "session_input": {
        "app_name": "your_app_name",
        "user_id": "your_user_id"
      }
    }
  ]
}
The EvalConfig
This file defines how to judge the agent’s performance. The EvalConfig specifies one or more evaluation criteria that the ADK will use to score the agent’s behavior against the test cases in the EvalSet. By separating the what from the how, we can run the same set of test cases against our agent but judge them using different criteria, depending on what aspect of performance we want to measure.
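To make this concrete, here is a minimal sketch of an evaluation config, assuming the commonly documented layout in which a top-level criteria object maps criterion names to passing thresholds. The tool_trajectory_avg_score criterion is discussed later in this lesson; response_match_score (a text-similarity check against the golden final response) appears here purely as an illustration, and both threshold values are arbitrary.

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}

Because the EvalSet and the EvalConfig live in separate files, we can point the adk eval command at the same eval set with different config files and score the same conversations under different criteria.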
With this understanding of the core configuration files, let’s now walk through the specific criteria we can use in our EvalConfig to judge our agent’s performance.
A comprehensive overview of evaluation criteria
The ADK provides a rich suite of built-in evaluation criteria, each designed to measure a different facet of an agent’s performance. Many of these criteria use a powerful LLM-as-a-Judge approach, where another LLM is used to score the agent’s behavior against a set of rules, providing a nuanced assessment that goes beyond simple string matching. Let’s explore the available criteria, grouped by what they measure.
Criteria for process and reasoning
These criteria do not look at the agent’s final answer. Instead, they focus on the process the agent followed to get there, making them essential for validating the agent’s internal logic.
Tool trajectory average score
The tool_trajectory_avg_score criterion is essential for verifying that the agent is following the correct plan. It compares the actual sequence of tools called by the agent against a list of expected calls that we provide in our EvalSet. It can be configured with ...