Evaluating the Research Assistant

Explore the practical steps to evaluate a multi-agent Research Assistant built with Google ADK. Learn how to define test cases, set quality benchmarks, and assess your agent's logical process, final response quality, and factual accuracy. This lesson guides you through creating evaluation configurations and executing tests that ensure your AI agent performs reliably and meets professional standards.

Having established a strong theoretical understanding of the ADK’s evaluation framework, its core concepts, and its comprehensive suite of criteria, we are now ready to put that knowledge into practice. In this lesson, we will shift from theory to hands-on application. We will build a complete, end-to-end evaluation suite for our multi-agent Research Assistant, defining the specific test cases and quality benchmarks needed to objectively measure its performance and ensure its reliability.

For our specific single-turn, tool-using Research Assistant, not all criteria are equally relevant. User simulation, for instance, is designed for multi-turn conversations. Therefore, we will focus our hands-on evaluation on a curated but powerful set of criteria that directly measure the quality and correctness of our agent’s workflow (a rough sketch of how these fit into a config file follows the list):

  1. tool_trajectory_avg_score: verifies that our controller_agent delegates tasks to the worker agents in the correct logical order.

  2. rubric_based_final_response_quality_v1: checks whether the agent’s final output meets a specific quality bar that we define.

  3. hallucinations_v1: ensures the agent’s synthesized report is factually grounded in the information it gathered from its tools.
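Before walking through the files one by one, here is a rough preview of where these criteria will live. The sketch below writes a criteria config (saved here as test_config.json) as an annotated Python dict so each choice can be commented; the simple numeric threshold for tool_trajectory_avg_score follows the common ADK config pattern, while the nested fields shown for the two LLM-based criteria (threshold, judge_model_options, rubrics, rubric_id, text_property) and the judge model name are illustrative assumptions that should be confirmed against the current ADK documentation.

```python
import json

# Hypothetical sketch of the evaluation-criteria config for our Research Assistant.
# The float form for tool_trajectory_avg_score matches the usual ADK test_config.json
# pattern; the nested keys for the two LLM-based criteria below are assumptions made
# for illustration -- check the exact schema in the current ADK docs before relying on it.
criteria_config = {
    "criteria": {
        # Require an exact match (score 1.0) of the expected tool-call trajectory,
        # i.e. the order in which controller_agent delegates to the worker agents.
        "tool_trajectory_avg_score": 1.0,
        # LLM-as-judge check of the final report against rubrics we define.
        "rubric_based_final_response_quality_v1": {
            "threshold": 0.8,  # assumed field name
            "judge_model_options": {  # assumed field name
                "judge_model": "gemini-2.5-flash",  # illustrative judge model
                "num_samples": 5,
            },
            "rubrics": [  # assumed structure
                {
                    "rubric_id": "structure",
                    "rubric_content": {
                        "text_property": (
                            "The report is well organized, with a clear summary "
                            "of the findings gathered by the worker agents."
                        )
                    },
                },
            ],
        },
        # Flag claims in the final report that are not grounded in tool output.
        "hallucinations_v1": {"threshold": 0.8},  # assumed field name
    }
}

# Write the config next to the eval set so the evaluation run can pick it up.
with open("test_config.json", "w") as f:
    json.dump(criteria_config, f, indent=2)
```

With that overall shape in mind, we can now build the real files step by step.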

Let’s begin creating the necessary configuration files for this purpose.

Create the

...