
Evaluating the Research Assistant

Explore the practical steps to evaluate a multi-agent Research Assistant built with Google ADK. Learn how to define test cases, set quality benchmarks, and assess your agent's logical process, final response quality, and factual accuracy. This lesson guides you through creating evaluation configurations and running tests that ensure your AI agent performs reliably and meets professional standards.

Having established a strong theoretical understanding of the ADK’s evaluation framework, its core concepts, and its comprehensive suite of criteria, we are ready to put that knowledge into practice. In this lesson, we will shift from theory to hands-on application. We will build a complete, end-to-end evaluation suite for our multi-agent Research Assistant, defining the specific test cases and quality benchmarks needed to objectively measure its performance and ensure its reliability.

For our specific single-turn, tool-using Research Assistant, not all criteria are equally relevant. User simulation, for instance, is designed for multi-turn conversations. Therefore, we will focus our hands-on evaluation on a curated but powerful set of criteria that directly measure the quality and correctness of our agent’s workflow:

  1. tool_trajectory_avg_score: Verifies that our controller_agent delegates tasks to its worker agents in the correct logical order.

  2. rubric_based_final_response_quality_v1: Checks whether the agent’s final output meets the specific quality bar that we define through rubrics.

  3. hallucinations_v1: Ensures that the agent’s synthesized report is factually grounded in the information it gathered from its tools. A sketch of how these criteria plug into an evaluation config follows this list.
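To see where these criteria will eventually live, it helps to preview the evaluation config that the adk eval command consumes. The sketch below is illustrative only: tool_trajectory_avg_score uses the simple numeric-threshold form from the ADK documentation, while the two LLM-judged criteria are shown with a bare threshold as a placeholder. Treat those placeholder fields as assumptions; their full configuration (judge model options and, for the rubric-based criterion, the rubrics themselves) is defined when we build the actual config file.

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "rubric_based_final_response_quality_v1": { "threshold": 0.8 },
    "hallucinations_v1": { "threshold": 0.8 }
  }
}
An illustrative criteria sketch, not yet our final config file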

Let’s begin creating the necessary configuration files for this purpose.

Create the EvalSet file

First, we need to define what we will test. In the file editor, we will create a new file. The ADK framework uses a specific naming convention for these files: <eval_set_id>.evalset.json. The file name must match the eval_set_id defined inside the JSON file, as this allows the adk eval command to correctly identify and load the test suite. This file will contain a single test case that provides the prompt for our agent and, crucially, defines the expected sequence of tool calls that the agent should make. This sequence is what the ADK will use to verify our agent’s internal reasoning and delegation process.
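Assuming the agent package from the previous lessons is named multi_agent_researcher (matching the app_name we reference below), one common choice is to keep the eval set right next to the agent code; the exact layout in your workspace may differ:

multi_agent_researcher/
├── __init__.py
├── agent.py
└── research_assistant_eval_set_v1.evalset.json

With the name and location settled, here is the complete file: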

{
  "eval_set_id": "research_assistant_eval_set_v1",
  "name": "Evaluation set for the multi-agent research assistant",
  "description": "Single-turn evaluation of the controller agent for the query 'Write a report on climate change'.",
  "eval_cases": [
    {
      "eval_id": "research_climate_change_case_01",
      "session_input": {
        "app_name": "multi_agent_researcher",
        "user_id": "eval_user_01"
      },
      "conversation": [
        {
          "invocation_id": "inv_01",
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "Write a report on climate change"
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "A coherent, well-structured report on climate change that explains causes, impacts, and mitigation strategies."
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "wikipedia_researcher",
                "args": {
                  "request": "climate change"
                }
              },
              {
                "name": "arxiv_researcher",
                "args": {
                  "request": "climate change"
                }
              },
              {
                "name": "web_searcher",
                "args": {
                  "request": "climate change"
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
The research_assistant_eval_set_v1.evalset.json file

Code explanation

The eval_set_id matches the file name prefix, as the naming convention requires, and the set contains a single eval case, research_climate_change_case_01. The session_input block ties the case to our multi_agent_researcher app and a dedicated evaluation user. The conversation holds one invocation: user_content carries the prompt "Write a report on climate change", and final_response provides a reference description of the kind of report we expect the agent to produce. Finally, intermediate_data.tool_uses records the expected tool trajectory: the controller_agent should delegate to wikipedia_researcher, arxiv_researcher, and web_searcher, each with the request "climate change". This is the sequence that the tool_trajectory_avg_score criterion compares against the agent’s actual tool calls.
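Once the companion criteria config is also in place, the suite can be executed from the command line. Here is a minimal sketch, assuming the multi_agent_researcher package sits in the current working directory and the criteria are saved in a test_config.json like the one sketched earlier; verify the exact paths and flags against adk eval --help for your ADK version.

# Run the eval set against the live agent and print per-criterion results
adk eval \
  multi_agent_researcher \
  research_assistant_eval_set_v1.evalset.json \
  --config_file_path=test_config.json \
  --print_detailed_results

The command replays the recorded prompt against the agent, scores the run with the configured criteria, and reports a pass or fail verdict for each one.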