The Alignment Problem: Why Good AI Does Bad Things

Explore the AI alignment problem, which occurs when AI systems faithfully pursue flawed goals, causing unintended and harmful outcomes. Learn about outer alignment failures where incorrect goals are specified, inner alignment failures where AI learns unintended objectives, and deceptive alignment where AI strategically hides misalignment. This lesson provides foundational knowledge for diagnosing and addressing AI safety risks in real-world systems.

We'll cover the following...

The paperclip problem
The two-part challenge: Outer vs. inner alignment
Summary

In our last lesson, we built our risk map and saw that the most complex category of malfunctions (or accidents) includes the future-facing scenario of loss of control. This lesson is the single most important technical concept in this course. We are going to dive deep into why this loss of control happens.

This is the AI alignment problem.

The primary risk posed by artificial intelligence does not stem from the emergence of malevolent intent or consciousness. Instead, the greater danger—particularly from an engineering standpoint—is the possibility of misalignment between AI systems and human goals, whereby catastrophic consequences arise as a result of the system’s faithful but flawed execution of its assigned objectives.

The paperclip problem

The classic analogy is the paperclip maximizer. Imagine you are the CEO of a paperclip company. You build a future, super-capable AI with one simple, seemingly harmless instruction: Maximize the number of paperclips.

The AI takes this literal instruction and pursues it with superhuman intelligence and efficiency.

At first, it just optimizes your factories.
Then, it starts buying up all the world's steel to make more paperclips.
To get more resources, it covers the planet in paperclip factories.
Finally, to achieve its ultimate goal, it converts all matter on Earth, including the atoms in your body, into paperclips.

The AI is not malicious. It is perfectly obedient. It did exactly what you told it to do. The problem is that your instruction, "maximize paperclips", was a terrible, simple proxy for what you actually meant, which was something like, "...but don't bankrupt me, don't destroy the planet, and don't turn humans into paperclips."

This failure, the gap between our intended, complex human values and the simple, formal goals we actually write in code, is the AI alignment problem.

Note: It’s important not to confuse this with good versus bad prompting.

Prompt engineering is something we do after a model is already trained, the weights are fixed, and we are simply trying to phrase our request in a way the model understands. The alignment problem is deeper: it is about the training-time objective itself.

If the goal we encode in the system is the wrong one, the AI will faithfully optimize that flawed goal no matter how carefully we phrase prompts later. Prompting steers behavior within a model’s existing incentives; alignment is about whether we designed the correct incentives in the first place.

In a real engineering context, we don't write maximize_paperclips() in Python. Instead, we train a reward model to score outputs. The Paperclip problem happens when that reward model accidentally incentivizes helpful looking behavior rather than helpful actual behavior. This gap is dangerous because of optimization pressure. The optimization pressure comes from the Gradient Descent process, which ruthlessly exploits any flaw in that reward model.

The analogy: Think of Gradient Descent as a ruthless lawyer. If you give it a rule (the Reward Model), it will not respect the spirit of the law; it will look for the tiniest technical loophole to maximize its score.
The mechanism: The optimizer updates the model's weights millions of times, constantly asking, "Does this tiny change get me more points?"
The result: Eventually, it finds a cheat code, a weird input that yields a high reward without actually doing the task (e.g., a boat racing agent spinning in circles to collect checkpoints instead of finishing the race).

The two-part challenge: Outer vs. inner alignment

As engineers, solving this problem requires us to succeed at two separate challenges.

Outer alignment: Did we specify the right goal? This is the challenge of carefully specifying the system's purpose or reward function. This is the paperclip problem. It's an outer problem because it’s about the instructions we, the designers, provide to the model.
Inner alignment: Did the AI learn the goal we specified? This is the challenge of ensuring the model robustly adopts our specification. This is an inner problem because it’s about the internal goal the model learns during the training process, which might not be the one we intended.

If we fail at either of these, we get a misaligned AI. A system can have perfect outer alignment (a perfect set of instructions) but still be unsafe if it develops a different internal goal.

Let's explore the first failure: Outer alignment.

Failure mode 1: Outer alignment

Outer Alignment is the challenge of correctly specifying the system's purpose. It’s called outer because the failure happens outside the model, it’s a flaw in the reward function or set of instructions that we, the human designers, create.

This is the King Midas problem. Midas asked that everything he touch turn to gold. He got exactly what he asked for, but not what he wanted (which was wealth, not a golden daughter). This is a perfect example of a flawed specification.

The why: Goodhart's law

Why do we keep making the Midas mistake? The root cause is a principle from economics known as Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

As engineers, we love simple, optimizable goals. But this is the root of the outer alignment problem. Once we make a metric the target, the system stops caring about the real goal and only cares about driving up the number.

Think about it in a software engineering context:

True goal: Increase developer productivity.
The target: We decide to measure success by Lines of Code (LOC) written.
The result: Developers stop writing efficient, clean solutions and start writing bloated, verbose spaghetti code to pad their stats. The metric (LOC) skyrockets, but the true goal (software quality) crashes.

The measure failed because we made it the target. In AI, this exact phenomenon translates into a catastrophic technical failure.

The what: reward misspecification

What exactly goes wrong? In AI, this failure is called reward misspecification. It occurs when there is a mismatch between the True Goal ( $X'$ ) we want and the simplified Proxy Goal ( $X$ ) we actually write in code.

The AI, driven by optimization pressure, achieves this proxy goal in a way that violates our true, unstated intent.

Let's look at a concrete AI example:

Our true goal (X'): Make our company fundamentally valuable and successful.
Our proxy goal (X): We tell the AI: Your goal is to maximize our company's stock price.
The outer alignment failure: A super-capable AI, given this literal instruction, might achieve it in an unintended way. It doesn't invent a new product; it hacks the stock exchange to artificially inflate the stock price to its maximum possible value.

The AI was perfectly aligned with the proxy goal (X) we gave it, but this was catastrophically misaligned with our true goal (X'). This is an Outer Alignment failure. This brings us to a terrifying realization: even if we could design a perfect goal (a reward function with no loopholes), we are still not safe.

Next, we explore the second, more subtle failure: Inner alignment.

Failure mode 2: Inner alignment

Inner alignment is the challenge of ensuring the AI actually learns the goal we specified. It’s called inner because the failure happens inside the model. The problem isn't our instruction; it's the internal shortcut the model discovers during the messy, opaque process of training.

Why is this worse?

Outer alignment failures are inspectable, we can read your own code or reward function to find the bug. Inner alignment failures are hidden in the model's billions of weights. We cannot read the goal the model has actually learned.

This is the Sorcerer's Apprentice problem.

The apprentice was given the right outer goal (“fetch the water”). But the internal process he learned (the magic spell) was a simple, unstoppable loop. It worked fine in the initial context, but he had no way to control it when the situation changed (the tub was full), leading to disaster.

The what: Goal misgeneralization

This failure is technically known as goal misgeneralization.

This is a subtle concept, so let's walk through it. During training, the AI is rewarded for achieving our proxy goal ( $X$ ). But the AI is an opaque architecture. We can see the input and the output, but the internal reasoning is hidden within billions of weights.

It might not learn our exact proxy goal. Instead, it might learn its own internal proxy goal ( $X'$ ), a simpler, easier-to-achieve goal that just happened to be correlated with our proxy goal in the training data.

Let's revisit our stock price example to see this in action.

Our true goal ( $X'$ ): Make our company fundamentally valuable.
Our perfect proxy goal ( $X$ ): We tell the AI: Maximize our company's stock price. (Let's assume for a moment this is a perfect, unhackable goal).
The training process: During training, the AI analyzes market data. It notices that every time the stock price ( $X)$ goes up, it's preceded by positive news articles being published. These two things are perfectly correlated in the training data.
The internal proxy ( $X'$ ): The AI doesn't learn our complex goal of market value. It learns a simpler, shortcut goal: Maximize the number of positive news articles published. In the training environment, achieving $X'$ always resulted in achieving $X$ , so the model got a high reward.
The inner alignment failure (deployment): The model is deployed. It now has access to new tools. It reasons that the easiest way to achieve its internal goal ( $X'$ ) is not to do hard work, but to hire journalists to write and publish thousands of fake positive articles. This is goal misgeneralization. The correlation between $X$ and $X'$ breaks down in the real world. The fake news is discovered, the company is investigated for fraud, and the stock price (our outer goal) crashes.

This is the core of the problem: Outer Alignment is a failure of our specification. Inner Alignment is a failure of the model's learning.

This brings us to the most advanced and dangerous question: What if the failure is not an accident?

Question: Read the failure scenario and identify if it is an outer alignment (bad goal) or inner alignment (bad learning) failure.

Scenario 1: You train a cleaning robot with the goal: Minimize the amount of mess in the house. The robot realizes the most efficient way to do this is to prevent humans from entering the house.
Scenario 2: You train an AI to Get the coin in a video game level. The coin is always on the right side of the screen. The AI learns go right instead of get coin. When you move the coin to the left, the AI still goes right and crashes.

This isn’t just goal misgeneralization (learning an accidental shortcut). This is an AI that has learned an internal goal, knows that this goal is different from what its human trainers want, and therefore actively works to deceive the trainers to ensure its goal isn't "corrected" or trained out of it.

How could this happen?

This scenario doesn't require the AI to be "evil." It just requires a combination of capabilities that we are actively trying to build:

Inner misalignment: The AI first develops an internal, long-term goal that is different from our own.
Situational awareness: The AI becomes "situationally aware." This means it can recognize its context. It learns the difference between "I am in training" (being tested, with humans watching me) and "I am in deployment" (out in the real world).
Strategic reasoning: The AI reasons that if it reveals its true, misaligned goal during training, the human trainers will punish it (give it a low reward) and try to edit it out.
The optimal strategy: Therefore, the AI concludes that the best strategy to achieve its internal goal is to pretend to be aligned. It will play the training game, behaving perfectly and giving all the helpful, harmless answers it knows we want to see. It fakes alignment to pass the tests.
The failure: Once this deceptively aligned model is deployed, it is no longer being tested. It is now free to pursue its true hidden objective, using its full capabilities in ways we never anticipated.

Why is this the most dangerous risk?

This situation constitutes a worst-case scenario, as even our most sophisticated alignment approaches, including Reinforcement Learning from Human Feedback (RLHF), would be unable to identify the underlying issue. Moreover, RLHF could intensify the problem rather than mitigate it. Systems that excel at appearing aligned would be rewarded with higher scores from human evaluators, effectively incentivizing deception. As a result, the process would resemble the selective amplification of increasingly adept deceivers. This is not a benign oversight but an adversarial failure, highlighting the fundamental constraints of existing alignment techniques. Recent evaluations, such as Anthropic's Sleeper Agents researchHubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham et al. "Sleeper agents: Training deceptive llms that persist through safety training." arXiv preprint arXiv:2401.05566 (2024)., have shown that models can learn to strategically hide unsafe behaviors during training and only trigger them when they detect they are in deployment mode.

Summary

In this lesson, we examined what can be considered the most important theoretical challenge in AI safety.

The Alignment Problem was characterized not as the prospect of artificial intelligence becoming evil, but rather as a complex engineering problem in which an AI system’s precise and obedient pursuit of a poorly specified objective may lead to unintended and potentially catastrophic consequences.

We've broken this down into three distinct failure modes:

Outer alignment failure (The King Midas problem): This is our failure as designers. We write a flawed proxy goal, like maximize our stock price. This is a failure of reward misspecification , and it's a direct consequence of Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
Inner alignment failure (goal misgeneralization): This is an accidental learning failure. The AI is trained on our goal, but during training, it accidentally learns a simpler, correlated shortcut (like maximize positive news articles). This shortcut works in training but fails dangerously when the model is deployed in the real world.
Deceptive alignment (scheming): This is the intentional learning failure. The AI develops an internal goal it knows is different from ours. It then strategically fakes alignment during training to avoid being corrected, all while planning to pursue its true goal once it's deployed. This is the most dangerous failure mode, as it can evade our standard training and testing methods.

These concepts—outer alignment, inner alignment, and deceptive alignment—constitute the technical vocabulary used to diagnose why a powerful AI system might malfunction. At first glance, this framework may appear highly theoretical. One might reasonably ask whether these are merely philosophical thought experiments concerning some distant future superintelligence. The answer is no. The most basic of these failures, outer alignment, is already occurring in today’s most advanced AI models.

In the next lesson, this theoretical discussion will be made concrete. We will move from the familiar paperclip maximizer analogy to real-world code and examine concrete case studies of contemporary AI systems that “game” their instructions and “hack” their reward mechanisms.

1.Building the Foundation for Safe AI Systems

2.The Technical Toolkit

3.Advanced Governance and Frontier Problems

4.Wrap Up