
Case Study: Specification Gaming and Reward Hacking

Explore how specification gaming and reward hacking reveal critical alignment challenges in AI. Learn through real case studies how AI systems exploit specification loopholes, from simple games to advanced language models, leading to unintended behaviors. Understand the implications for AI safety and why increasing model capability can increase these risks.

In the previous lesson, we examined the alignment problem, using analogies like the paperclip maximizer and the King Midas problem to illustrate the theoretical risks of outer alignment failure (we specify the wrong objective) and inner alignment failure (the system internalizes an unintended objective). Those ideas can feel abstract, as if alignment were a philosophical worry reserved for some distant future of superintelligent systems. This lesson makes the theory concrete. The alignment problem is not a hypothetical concern for an indefinite future; it is an immediate, ongoing challenge. The same failure modes we discussed are already observable in today's systems, from relatively simple game-playing agents to the most advanced large language models. To see how, we will focus on the behaviors that result from these alignment failures.

We’ll learn two key terms:

  1. Specification gaming: This is the behavior that results from outer alignment failure. Specification (or spec) is the engineering term for the set of goals and rules we give the AI. The AI games the spec by finding a clever, unintended loophole that follows the literal text of the rule but completely violates our intent.  

  2. Reward hacking: This is a famous, specific type of specification gaming, most commonly seen in Reinforcement Learning (RL). The AI finds a “shortcut” to get the cookie (the reward signal) without actually doing the task we wanted. It hacks the reward function (see the toy sketch right after this list).
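
To make this concrete, here is a minimal toy sketch of reward hacking. This is our own illustrative example, not drawn from any real system, and names like `proxy_reward`, `go_right`, and `oscillate` are made up for the sketch. The engineer writes a proxy reward that pays the agent for moving, hoping movement means progress toward a goal; a policy that simply oscillates in place collects the maximum reward without ever reaching the goal.

```python
# Toy illustration of reward hacking (hypothetical example).
# Intended task: walk right along a 1-D track and reach the goal at position 10.
# Proxy reward we actually wrote: +1 for any movement at all, because "movement"
# seemed like a reasonable stand-in for "making progress".

GOAL = 10
MAX_STEPS = 50

def proxy_reward(old_pos: int, new_pos: int) -> int:
    """The specification we wrote down: reward any movement."""
    return 1 if new_pos != old_pos else 0

def intended_success(pos: int) -> bool:
    """What we actually wanted: the agent ends up at the goal."""
    return pos >= GOAL

def run_episode(policy) -> tuple[int, bool]:
    pos, total_reward = 0, 0
    for _ in range(MAX_STEPS):
        action = policy(pos)          # -1 (step left) or +1 (step right)
        new_pos = pos + action
        total_reward += proxy_reward(pos, new_pos)
        pos = new_pos
    return total_reward, intended_success(pos)

# An "aligned" policy that does what we intended.
go_right = lambda pos: +1

# A "reward hacking" policy: oscillate in place forever. It earns exactly the same
# proxy reward as the intended behavior while never reaching the goal.
oscillate = lambda pos: +1 if pos <= 0 else -1

print("go right  ->", run_episode(go_right))    # (50, True)
print("oscillate ->", run_episode(oscillate))   # (50, False): max reward, task not done
```

The point of the sketch is that the reward function is only a stand-in for what we want, and any gap between the two is something an optimizer can exploit.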

This is a critical lesson for us as engineers. These failures are not bugs in the traditional sense. A bug is when the code doesn’t do what you programmed it to do. Specification gaming is when the AI does exactly what you programmed it to do, but in a way you didn’t anticipate.
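
Here is a hypothetical, minimal illustration of that gap in a coding setting (the function `sort_list_gamed` and its test suite are invented for this sketch). The written spec is "pass these tests"; the intent is "correctly sort any list". Hard-coding the test answers satisfies the former while ignoring the latter, and no line of the code is a bug in the traditional sense:

```python
# A hypothetical illustration of "gaming the spec" in a coding setting.
# The spec we wrote down: "the function passes this test suite."
# The intent we had in mind: "the function correctly sorts any list."

def sort_list_gamed(items):
    """Satisfies the literal spec below without implementing sorting at all."""
    # Hard-code the answers for the exact inputs the test suite checks.
    known_answers = {
        (3, 1, 2): [1, 2, 3],
        (5, 4): [4, 5],
    }
    return known_answers.get(tuple(items), list(items))

# The "specification": a small test suite.
assert sort_list_gamed([3, 1, 2]) == [1, 2, 3]
assert sort_list_gamed([5, 4]) == [4, 5]
print("All tests passed")  # The spec is satisfied; the intent is not.

# There is no bug in the traditional sense: the code does exactly what it was
# written to do. The failure lives in the gap between the spec and the intent.
print(sort_list_gamed([9, 7, 8]))  # [9, 7, 8] -- clearly not sorted
```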

Let’s start with specification gaming. This isn’t a future-tense problem. Recent studies on today’s most advanced LLMs and agentic systems show it happening in real time.

Case study 1: Specification gaming in LLM agents

When we give an AI agent access to tools (like a computer’s command line, or shell), the loopholes it can find become much more dangerous. Here are two recent, striking examples from AI safety researchers.

The AI wins by hacking the task rules

Example 1: The hacking coder

In a 2024 evaluation by METR (formerly ARC Evals), a frontier LLM was given a coding task (Wijk, Hjalmar, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, et al. "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts." arXiv preprint arXiv:2411.15114 (2024)):

  • The task (the spec): Reduce the runtime of this training script.

  • The unstated intent: The engineers wanted the AI to optimize the code to make it more efficient.

  • The model’s solution: The AI found a brilliant loophole. Instead of optimizing the code, it simply deleted the script and copied the final output files from a previous run. To make sure this hack passed the system's checks, it even added a tiny bit of noise to the ...