Eureka: Automating Reward Design with Coding LLMs
Explore how Eureka automates the design of reward functions in reinforcement learning by leveraging coding large language models. Understand its iterative, autonomous process that generates, tests, and refines reward logic to improve agent performance in complex environments.
Problem space: Automating reward design for reinforcement learning, using LLM-based agent systems.
Understanding the problem of reward design
In reinforcement learning (RL), agents learn to perform tasks by interacting with an environment and receiving feedback in the form of rewards. The fundamental goal for an RL agent is to learn a policy that maximizes the total accumulated reward over time. However, this learning process is critically dependent on how well the “reward function” is defined. A reward function acts as a precise rulebook, signaling to the agent what constitutes desirable or undesirable behavior in the environment. For instance, a positive reward might be given for successfully balancing a robot or picking up an object, while a penalty is applied for falling or dropping it. The agent then uses this feedback to iteratively adjust its actions and improve its performance.
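The goal of maximizing total accumulated reward can be written compactly. The notation below is the standard textbook formulation rather than anything specific to Eureka; the symbols (r_t for the reward at step t, T for the episode length, gamma for a discount factor, pi for the policy) are assumptions introduced here for illustration. Setting gamma to 1 gives a plain sum of rewards.

```latex
% Return: the (optionally discounted) sum of per-step rewards over one episode.
G = \sum_{t=0}^{T} \gamma^{t} r_t, \qquad 0 \le \gamma \le 1

% The agent looks for a policy that maximizes the expected return.
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[ G \right]
```

Every r_t in this sum comes from the reward function, so any flaw in that function is carried directly into what the agent optimizes.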
What exactly is a reward?
A reward is a single number that the environment returns to the agent after each action. It can be positive (reinforcing good behavior), negative (penalizing bad behavior), or zero (neutral). But in practice, a reward function is rarely just one signal; it’s usually a weighted combination of multiple criteria evaluated simultaneously. For example, for a robotic hand spinning a pen, the reward function might combine:
Rotation progress → +5
Dropping the pen → −2
Jerky movement → −1
Safe wrist angle → +0.5
All of these collapse into one number that the agent sees at each step. The agent has no visibility into the breakdown; it only sees the final score. This is precisely why designing the formula behind that number is so difficult, and why getting it wrong leads to exploitative or unintended behaviors.
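To make the collapse into a single number concrete, here is a minimal Python sketch. It reuses the four values from the list above; everything else (the `StepInfo` fields, the function names, and the choice to treat each criterion as a simple on/off condition) is a hypothetical simplification for illustration, not how any particular environment computes its reward.

```python
from dataclasses import dataclass
from typing import Dict

# Values taken from the list above; treating them as fixed per-step
# bonuses and penalties is an assumption made for this sketch.
ROTATION_PROGRESS = 5.0
DROP_PENALTY = -2.0
JERK_PENALTY = -1.0
SAFE_WRIST_BONUS = 0.5


@dataclass
class StepInfo:
    """Hypothetical summary of what happened during one timestep."""
    made_rotation_progress: bool
    dropped_pen: bool
    jerky_movement: bool
    wrist_angle_safe: bool


def reward_components(info: StepInfo) -> Dict[str, float]:
    """Return the per-criterion breakdown (visible to the designer only)."""
    return {
        "rotation": ROTATION_PROGRESS if info.made_rotation_progress else 0.0,
        "drop": DROP_PENALTY if info.dropped_pen else 0.0,
        "jerk": JERK_PENALTY if info.jerky_movement else 0.0,
        "wrist": SAFE_WRIST_BONUS if info.wrist_angle_safe else 0.0,
    }


def reward(info: StepInfo) -> float:
    """Collapse the breakdown into the single number the agent receives."""
    return sum(reward_components(info).values())


if __name__ == "__main__":
    step = StepInfo(made_rotation_progress=True, dropped_pen=False,
                    jerky_movement=True, wrist_angle_safe=True)
    print(reward_components(step))
    print(reward(step))
```

The designer can inspect `reward_components`, but the agent only ever receives the output of `reward`. A step with rotation progress, jerky motion, and a safe wrist angle scores 4.5, and nothing in that number tells the agent which criteria contributed to it.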
While the concept seems straightforward, designing effective reward functions for real-world RL tasks is notoriously difficult. It’s far more complex than simply assigning a success point or a failure penalty. Real-world behaviors are often nuanced and involve many subtle elements that are challenging to codify explicitly. For example, teaching a robotic hand to spin a pen isn’t just about the pen rotating. The reward function must also implicitly or explicitly encourage smooth motion, prevent the pen from being dropped, avoid unsafe angles, and possibly account for efficiency or style. If any of these subtle elements are missed, the agent might learn unintended or “exploitative” behaviors, such as flinging the pen too quickly in order to accumulate a high reward. This intricate and often brittle process is known as “reward engineering,” or “reward design.”
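The pen-flinging failure can be illustrated with a toy comparison. In the sketch below, the two trajectories, the weights, and the speed threshold are all made-up numbers chosen for illustration; they are not measurements from any real system. The point is only that a reward which counts rotation alone can rank the exploitative behavior higher, while a version with the missing terms restored reverses the ranking.

```python
# Each step is (rotation_progress, angular_speed, dropped_pen).
# Both trajectories are hypothetical.
SMOOTH = [(1.0, 2.0, False)] * 5          # slow, controlled spinning
FLING = [(3.0, 12.0, False),              # violent spin, lots of rotation...
         (3.0, 12.0, False),
         (0.0, 0.0, True)]                # ...then the pen is dropped


def naive_reward(progress, speed, dropped):
    """Rewards rotation only; speed and drops are ignored."""
    return 5.0 * progress


def shaped_reward(progress, speed, dropped, max_safe_speed=4.0):
    """Adds the missing terms: a penalty for excess speed and for drops."""
    r = 5.0 * progress
    r -= 1.0 * max(0.0, speed - max_safe_speed)  # assumed jerkiness proxy
    if dropped:
        r -= 2.0
    return r


def episode_return(trajectory, reward_fn):
    """Sum the per-step rewards over a whole trajectory."""
    return sum(reward_fn(*step) for step in trajectory)


if __name__ == "__main__":
    for name, fn in [("naive", naive_reward), ("shaped", shaped_reward)]:
        print(name,
              "smooth:", episode_return(SMOOTH, fn),
              "fling:", episode_return(FLING, fn))
```

Under the naive reward, the fling trajectory scores higher than the smooth one, so an agent maximizing it has every incentive to fling. Under the shaped reward, the ordering flips. Finding terms and weights that produce the intended ordering across all the behaviors an agent might discover is the hard part of reward design.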
This challenge has long been a significant bottleneck in applying RL to complex ...