
Agent Design, Code Structure, and Output Demonstration

Understand how to design and orchestrate a reward learning agent inspired by NVIDIA's EUREKA system. Learn the iterative process of generating, evaluating, selecting, and reflecting on reward functions using Google ADK and Brax environments, and observe the system's outputs through rollout visualizations.

In the previous chapter, we studied NVIDIA’s EUREKA system, an agentic framework that uses large language models to automatically design and iteratively refine reward functions for reinforcement learning. While the original system operates in large-scale robotic environments and physics simulations tightly coupled to NVIDIA’s hardware, reproducing those conditions exactly in an instructional setting is impractical.

For a hands-on demonstration, we will instead implement a Eureka-like reward learning agent using:

  • Lightweight Brax environments, specifically HalfCheetah

  • Google’s Agent Development Kit (ADK) for agent orchestration

  • Free T4 GPU resources on Google Colab

HalfCheetah Brax environment

The goal is not to reproduce NVIDIA’s system at full scale, but to reimplement the core design principles behind EUREKA in a form that is computationally tractable, easier to reason about, and suitable for controlled experimentation.

In this lesson, we will focus on agent design and workflow, code structure, and observing the system’s outputs. Specifically, we will examine:

  • The agents involved in our Eureka-like system

  • The role and responsibility of each agent

  • The inputs and outputs flowing through the system

  • The orchestration pattern used to manage the iterative reward evolution loop

  • The overall project structure

Finally, we will run the complete system once and inspect its outputs (trained policies, generated reward functions, and rollout visualizations) to see the design in action before diving into the implementation details in the next lessons.

Before we begin: A few key terms

Before we implement the system, we’ll define a few reinforcement learning terms that we’ll use throughout the chapter. You don’t need a deep RL background; just enough intuition to follow the mechanics.

Policy

A policy is the decision-making component of a reinforcement learning agent. Given the current state of the environment, the policy decides what action to take next. In our case, the policy controls the HalfCheetah robot, deciding how each joint should move at every step. When we say “training a policy,” we mean optimizing this decision-making function so that the agent behaves better according to a reward signal.
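In code, a policy is simply a function from states to actions. The sketch below uses a deliberately trivial linear policy rather than a trained neural network, and assumes the classic MuJoCo HalfCheetah dimensions (a 17-dimensional observation and 6 actuated joints; Brax’s version of the environment may use slightly different sizes):

```python
import numpy as np

def make_policy(obs_dim: int, act_dim: int, seed: int = 0):
    """Build a toy linear policy: state -> joint torques.

    A trained policy would be a neural network optimized against the
    reward signal, but any state -> action mapping fits the definition.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(scale=0.1, size=(act_dim, obs_dim))

    def policy(state: np.ndarray) -> np.ndarray:
        # Squash raw torques into the valid action range [-1, 1].
        return np.tanh(weights @ state)

    return policy

# Hypothetical HalfCheetah-like dimensions: 17-dim observation, 6 joints.
policy = make_policy(obs_dim=17, act_dim=6)
action = policy(np.zeros(17))
print(action.shape)  # (6,)
```

“Training a policy” then means adjusting `weights` (or a network’s parameters) so that the actions it emits accumulate more reward.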

Reward function

A reward function assigns a numerical score to the agent’s behavior at each step. For example:

  • Moving forward might give a positive reward.

  • Falling over might give a negative reward.

  • Wasting energy might incur a penalty.

Designing this function is difficult, and that is exactly the problem EUREKA is trying to solve.

Rollout

A rollout is a recorded ...