Selection, Reflection, and Human Feedback
Explore the key processes of reward selection, reflection, and human feedback integration in an AI reward-learning system. Learn how to implement a deterministic selector agent, analyze reward performance with reflection agents, use human feedback to improve results, and manage iteration loops to optimize learning and decision-making.
In the previous lesson, the system finished evaluating reward candidates. For the current iteration, we have multiple trained policies, quantitative metrics, rollout visualizations, and a structured summary in ctx.session.state["candidate_results"].
At this stage, the system must answer a simple but critical question:
Which reward should we carry forward?
That responsibility belongs to SelectorAgent.
Selecting the best reward candidate
Reward selection is implemented in agents/selector_agent.py. This agent neither trains policies nor generates rewards. Its sole purpose is to:
Read the evaluation results.
Apply a selection rule.
Update the shared state with the chosen “best” reward.
Open the file and review the agent’s dependencies, starting with the imports:
```python
import json
from loguru import logger
from typing import AsyncGenerator
from google.adk.agents import BaseAgent
from google.adk.agents.invocation_context import InvocationContext
from google.adk.events import Event
```
From these imports alone, we can infer the agent’s role:
It reads structured data (json).
It does not depend on training or environment tools.
It interacts only with shared state and ADK.
This is intentional. Selection should be lightweight and deterministic.
Next, we define the SelectorAgent:
```python
class SelectorAgent(BaseAgent):
    async def _run_async_impl(
        self, ctx: InvocationContext
    ) -> AsyncGenerator[Event, None]:
```
As with previous agents, this is an ADK agent, executed inside the loop, driven entirely by shared state. By the time this agent runs, the evaluation for the iteration has already been completed.
The first thing the SelectorAgent does is retrieve evaluation results.
```python
candidate_results_json = ctx.session.state["candidate_results"]
candidate_results = json.loads(candidate_results_json)
```
This is why we stored results as JSON in the previous lesson: it’s serializable, stable across agents, and easy to inspect or log.
At this point, candidate_results is a list of dictionaries, one per candidate, containing scores, metrics, artifact paths, and reward code.
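For concreteness, here is a minimal sketch of what such a list might look like once parsed. The fields `candidate`, `score`, and `reward_code` are the ones the selector relies on; the remaining fields and values are illustrative placeholders, not the evaluator’s actual output.

```python
import json

# Hypothetical evaluator output: one dict per reward candidate.
# Only "candidate", "score", and "reward_code" are required by the selector;
# "metrics" and "artifacts" are illustrative extras.
candidate_results_json = json.dumps([
    {
        "candidate": 0,
        "score": 0.72,
        "reward_code": "def reward(obs, action): ...",
        "metrics": {"mean_return": 112.4},
        "artifacts": {"rollout_video": "outputs/candidate_0.mp4"},
    },
    {
        "candidate": 1,
        "score": 0.85,
        "reward_code": "def reward(obs, action): ...",
        "metrics": {"mean_return": 131.9},
        "artifacts": {"rollout_video": "outputs/candidate_1.mp4"},
    },
])

candidate_results = json.loads(candidate_results_json)
print(len(candidate_results))
```

Because the payload is plain JSON, any agent in the loop can parse it without sharing Python objects or class definitions.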
Next, we select the candidate with the highest score.
```python
best = max(candidate_results, key=lambda x: x["score"])
```
This line reflects a deliberate design choice: selection is purely evidence-based, with no heuristics, learned weighting, or LLM involvement. The selector trusts the evaluator’s scoring function and makes a clear, reproducible decision.
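One useful property of this rule: Python’s `max` returns the first maximal element, so even when two candidates tie on score, the choice is deterministic. A small standalone sketch with made-up scores:

```python
# Made-up candidate list; scores for candidates 1 and 2 deliberately tie.
candidate_results = [
    {"candidate": 0, "score": 0.72, "reward_code": "# reward v0"},
    {"candidate": 1, "score": 0.85, "reward_code": "# reward v1"},
    {"candidate": 2, "score": 0.85, "reward_code": "# reward v2"},
]

best = max(candidate_results, key=lambda x: x["score"])

# max() keeps the first maximal element, so ties resolve to the
# earliest candidate in list order: the selection is reproducible.
print(best["candidate"])  # -> 1
```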
Once a winner is chosen, we extract what downstream agents need.
```python
best_reward_code = best["reward_code"]
best_score = best["score"]
best_candidate_id = best["candidate"]
```
This is the minimal information required to inform reflection, seed the next iteration, and log progress.
Now we update the shared state so later agents can build on this decision.
```python
ctx.session.state["best_reward_code"] = best_reward_code
ctx.session.state["best_score"] = best_score
ctx.session.state["best_candidate_id"] = best_candidate_id
```
From this point on, the system has a current best reward, reflection agents can analyze it, and the next iteration can improve upon it.
Finally, we log the outcome for inspection.
```python
logger.info(
    f"[SelectorAgent] Selected candidate {best_candidate_id} "
    f"with score={best_score:.4f}"
)
```
This log entry becomes part of the execution trace we inspected in the first lesson of this chapter.
As with all ADK agents, we signal completion by yielding an event.
```python
yield Event(author=self.name, content=None)
```
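The selection logic itself is small enough to test in isolation. The sketch below reimplements it framework-free, with a plain dict standing in for `ctx.session.state` and made-up candidate data; it mirrors the agent’s behavior but is not the ADK implementation.

```python
import json
from typing import Any


def select_best_reward(state: dict[str, Any]) -> dict[str, Any]:
    """Framework-free sketch of SelectorAgent's selection step."""
    candidate_results = json.loads(state["candidate_results"])

    # Purely evidence-based: the highest evaluator score wins.
    best = max(candidate_results, key=lambda x: x["score"])

    # Publish the winner for downstream agents (reflection, next iteration).
    state["best_reward_code"] = best["reward_code"]
    state["best_score"] = best["score"]
    state["best_candidate_id"] = best["candidate"]
    return best


# Plain dict standing in for ctx.session.state, with illustrative data.
state = {
    "candidate_results": json.dumps([
        {"candidate": 0, "score": 0.72, "reward_code": "# reward v0"},
        {"candidate": 1, "score": 0.85, "reward_code": "# reward v1"},
    ])
}
winner = select_best_reward(state)
print(state["best_candidate_id"], state["best_score"])
```

Keeping the rule in a pure function like this makes it trivial to unit-test the selection policy without spinning up the agent loop.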
At this point, ...