Reward Generation and Evaluation Loop
Explore the implementation of a reward generation and evaluation loop in AI agents similar to Eureka. Learn to generate multiple reward candidates with LLMs, parse and safely compile code, train policies using PPO, evaluate candidates, and record detailed metrics and artifacts. Understand how this method transforms reward hypotheses into measurable performance to guide iterative improvements.
We'll cover the following...
- Implementing reward generation
- Implementing candidate evaluation
- Imports: Tools that do the heavy lifting
- CandidateEvaluatorAgent entry point: Reading iteration context
- Parsing candidates from LLM output (robustly)
- Fail safely when parsing produces nothing
- Per-candidate evaluation loop: Save code first, then try to execute it
- Compile + validate reward code in a sandbox
- Train and evaluate (delegated to tools/rl_runner.py)
- Save the rollout HTML (non-critical if it fails)
- Save metrics, training metadata, and policy params
- Record results and update the leaderboard
- Handle per-candidate failures without stopping the iteration
- Convert results into JSON and store them in shared state
- Summary
Implementing reward generation
Start with the first step in each loop iteration: reward generation. In this system, the RewardDesigner has a single job:
- It builds a prompt that includes the task spec and environment code (plus feedback from the previous iteration).
- It calls an OpenAI model to generate K reward candidates.
- It stores the raw generated text in the shared state so the evaluator can parse it next.
Everything we do below supports that flow.
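The flow above can be sketched in a few lines. This is a simplification, not the real agent: it assumes a plain dict for shared state and an injectable `llm_fn` callable instead of the ADK `BaseAgent` interface, so the loop can be exercised offline.

```python
# Sketch of the designer flow: build prompt -> call LLM -> store raw text.
# `llm_fn` and the dict-based shared state are assumptions for illustration.
CANDIDATE_DELIM = "### CANDIDATE ###"

def run_reward_designer(task_spec, env_code, k, llm_fn, shared_state,
                        feedback=None):
    """Build a prompt, ask the LLM for k candidates, stash the raw text."""
    parts = [
        f"Task spec:\n{task_spec}",
        f"Environment code:\n{env_code}",
    ]
    if feedback:  # iteration 2+: include previous results/reflection
        parts.append(f"Feedback from previous iteration:\n{feedback}")
    parts.append(
        f"Generate {k} reward candidates, each preceded by the line "
        f"{CANDIDATE_DELIM}"
    )
    prompt = "\n\n".join(parts)
    raw = llm_fn(prompt)                  # e.g., an OpenAI chat call
    shared_state["raw_candidates"] = raw  # the evaluator parses this next
    return raw

# Usage with a stubbed LLM, so no API key is needed:
fake_llm = lambda prompt: f"{CANDIDATE_DELIM}\ndef reward(s): return 0.0"
state = {}
run_reward_designer("hop forward", "class Env: ...", 1, fake_llm, state)
```

Injecting the LLM call also makes the designer trivially unit-testable, which matters once the loop runs many iterations unattended.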
Setting up the OpenAI client and prompt constants
Let’s start at the top of agents/llm_agents.py. Before defining the agent class, we set up the OpenAI client and a few constants that control output formatting.
```python
import os
import json
from openai import OpenAI
from pydantic import PrivateAttr
from loguru import logger
from google.adk.agents import BaseAgent
from google.adk.agents.invocation_context import InvocationContext
from google.adk.events import Event
from typing import AsyncGenerator

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

CANDIDATE_DELIM = "### CANDIDATE ###"
DESIGNER_SYSTEM = "You are a precise reward-function code generator for JAX/Brax environments."
```
Here’s what we’re doing (as implementers):
- We initialize `client` once at import time so the agent can reuse it for every iteration.
- `CANDIDATE_DELIM` is a parsing contract. The evaluator relies on this exact delimiter to split candidates.
- `DESIGNER_SYSTEM` narrows model behavior. We don't want explanations, markdown, or "tips." We want reward code only.
This is the first place where you see an important design pattern:
We enforce reliability by making the LLM output machine-parseable, not human-friendly.
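To make the contract concrete, here is a minimal sketch of how an evaluator can split raw LLM output on `CANDIDATE_DELIM` (the real evaluator in `agents/llm_agents.py` adds more validation; this shows only the delimiter mechanics):

```python
# Delimiter-based parsing: split on the exact contract string and
# drop empty chunks so leading/trailing delimiters are harmless.
CANDIDATE_DELIM = "### CANDIDATE ###"

def parse_candidates(raw_text: str) -> list[str]:
    """Split raw LLM output into candidate code strings, dropping empties."""
    chunks = raw_text.split(CANDIDATE_DELIM)
    return [c.strip() for c in chunks if c.strip()]

raw = f"""{CANDIDATE_DELIM}
def reward(state): return state.x_vel
{CANDIDATE_DELIM}
def reward(state): return state.x_vel - 0.1 * state.energy
"""
candidates = parse_candidates(raw)
```

Because the split string is exact, any drift in the delimiter (extra spaces, markdown fences) breaks parsing, which is why the system prompt forbids decorative output.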
Building the reward generation prompt
Next, we define _designer_prompt(...). This function is where we “program” the reward designer’s behavior.
```python
def _designer_prompt(task_spec, env_code, best_reward_code, reflection, K,
                     candidate_results: str | None = None):
    """Build the prompt for the Reward Designer.

    From iteration 2 onward, we include an explicit "Query with Feedback"
    section (policy training/eval results + reflection) similar to the
    Eureka paper diagram.
    """
```
This function supports two modes:
- Iteration 1 (no best reward yet) → generate initial candidates.
- Iteration 2+ (we have feedback) → improve the best reward so far.
Let’s look at how that branching is implemented.
Prompt mode: Improving an existing best reward
If a best reward exists, we instruct the model to improve it (not start over).
```python
if best_reward_code:
    improvement_instruction = f"""
IMPORTANT: You MUST generate {K} IMPROVED versions of the BEST REWARD SO FAR below.
- Each candidate should be a PROGRESSIVE IMPROVEMENT or VARIATION of the best reward
- Build upon the successful aspects identified in the feedback/reflection
- Try different approaches to address issues mentioned in the feedback/reflection
- Do NOT generate completely new rewards from scratch - they must be based on the best reward below
"""
```
This is a key Eureka-like design choice in this implementation:

- We treat the best reward as the current parent.
- We ask for variations, not resets.
- We keep improvement grounded in evidence (reflection + results).

...
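Putting the branch together, a simplified two-mode prompt builder might look like the sketch below. The function name and exact wording are illustrative, not the real `_designer_prompt`, which also includes the reflection and candidate-results sections:

```python
def designer_prompt_sketch(task_spec, env_code, best_reward_code, K):
    """Simplified two-mode prompt: improve mode vs. initial-generation mode."""
    if best_reward_code:
        # Iteration 2+: anchor generation on the current best reward
        instruction = (
            f"Generate {K} IMPROVED variations of the BEST REWARD SO FAR. "
            "Do NOT start from scratch."
        )
        context = f"BEST REWARD SO FAR:\n{best_reward_code}"
    else:
        # Iteration 1: no parent reward yet, ask for diverse candidates
        instruction = f"Generate {K} diverse initial reward candidates."
        context = ""
    return "\n\n".join(p for p in [
        f"Task: {task_spec}",
        f"Environment code:\n{env_code}",
        context,
        instruction,
    ] if p)

p1 = designer_prompt_sketch("hop forward", "class Env: ...", None, 4)
p2 = designer_prompt_sketch("hop forward", "class Env: ...",
                            "def reward(s): ...", 4)
```

The important property is that the improve-mode prompt always embeds the parent reward verbatim, so every candidate the model returns is conditioned on the current best.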