Reward Generation and Evaluation Loop

Explore the implementation of a reward generation and evaluation loop in AI agents similar to Eureka. Learn to generate multiple reward candidates with LLMs, parse and safely compile code, train policies using PPO, evaluate candidates, and record detailed metrics and artifacts. Understand how this method transforms reward hypotheses into measurable performance to guide iterative improvements.

Implementing reward generation

Start with the first step in each loop iteration: reward generation. In this system, the RewardDesigner has a single job:

  • It builds a prompt that includes the task spec, the environment code, and (from iteration 2 onward) feedback from the previous iteration.

  • It calls an OpenAI model to generate K reward candidates.

  • It stores the raw generated text in the shared state so the evaluator can parse it next.

Everything we do below supports that flow.
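That three-step flow can be sketched end to end as follows. This is a minimal illustration, not the source's API: `run_reward_designer`, the state-dict keys, and the injected `llm` callable are all hypothetical names, with the LLM call passed in as a plain function so the loop structure stays visible.

```python
def run_reward_designer(llm, state, K=4):
    """One loop iteration of the reward designer (sketch).

    llm:   a callable prompt -> raw text (in the real system, an OpenAI call).
    state: shared dict read and written across loop iterations.
    """
    # 1. Build the prompt from the task spec, env code, and any prior feedback.
    prompt = (
        f"TASK:\n{state['task_spec']}\n\n"
        f"ENV CODE:\n{state['env_code']}\n\n"
        f"FEEDBACK:\n{state.get('feedback', 'none')}\n\n"
        f"Generate {K} reward candidates, separated by '### CANDIDATE ###'."
    )
    # 2. Call the model to generate K candidates.
    raw = llm(prompt)
    # 3. Store the raw text so the evaluator can parse it next.
    state["raw_candidates"] = raw
    return state
```

The key design choice visible here is that the designer never interprets the output itself; it only writes the raw text into shared state for the evaluator.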

Setting up the OpenAI client and prompt constants

Let’s start at the top of agents/llm_agents.py. Before defining the agent class, we set up the OpenAI client and a few constants that control output formatting.

import os
import json
from typing import AsyncGenerator

from openai import OpenAI
from pydantic import PrivateAttr
from loguru import logger
from google.adk.agents import BaseAgent
from google.adk.agents.invocation_context import InvocationContext
from google.adk.events import Event

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

CANDIDATE_DELIM = "### CANDIDATE ###"
DESIGNER_SYSTEM = "You are a precise reward-function code generator for JAX/Brax environments."

Here’s what we’re doing (as implementers):

  • We initialize client once at import time so the agent can reuse it for every iteration.

  • CANDIDATE_DELIM is a parsing contract. The evaluator relies on this exact delimiter to split candidates.

  • DESIGNER_SYSTEM narrows model behavior. We don’t want explanations, markdown, or “tips.” We want reward code.

This is the first place where you see an important design pattern:

We enforce reliability by making the LLM output machine-parseable, not human-friendly.

Building the reward generation prompt

Next, we define _designer_prompt(...). This function is where we “program” the reward designer’s behavior.

def _designer_prompt(task_spec, env_code, best_reward_code, reflection, K, candidate_results: str | None = None):
    """
    Build the prompt for the Reward Designer.
    From iteration 2 onward, we include an explicit "Query with Feedback" section
    (policy training/eval results + reflection), similar to the Eureka paper diagram.
    """

This function supports two modes:

  • Iteration 1 (no best reward yet) → generate initial candidates

  • Iteration 2+ (we have feedback) → improve the best reward so far

Let’s look at how that branching is implemented.

Prompt mode: Improving an existing best reward

If a best reward exists, we instruct the model to improve it (not start over).

if best_reward_code:
    improvement_instruction = f"""IMPORTANT: You MUST generate {K} IMPROVED versions of the BEST REWARD SO FAR below.
- Each candidate should be a PROGRESSIVE IMPROVEMENT or VARIATION of the best reward
- Build upon the successful aspects identified in the feedback/reflection
- Try different approaches to address issues mentioned in the feedback/reflection
- Do NOT generate completely new rewards from scratch - they must be based on the best reward below"""

This is a key EUREKA-like design choice in this implementation:

  • We treat the best reward as the current parent.

  • We ask for variations, not resets.

  • We keep improvement grounded in evidence (reflection + results).