
The Alignment Engine: How RLHF and Constitutional AI Work

Understand how Reinforcement Learning from Human Feedback (RLHF) works in three phases to align AI models with human values. Learn about scalable oversight methods like Reinforcement Learning from AI Feedback (RLAIF) and Constitutional AI, which replace human feedback with AI feedback guided by explicit written principles, increasing transparency and control. Discover advanced techniques such as AI debate and iterated amplification that enable oversight of AI systems beyond human expertise.

So far, we’ve focused on diagnosing failures:

  • In the lesson on Robustness, we broke the model to find its weaknesses against physical inputs (the Panda to Bison failure).

  • In the lesson on Interpretability, we opened the opaque model to find hidden flaws in its logic (bias/fairness failures).  

These are crucial tools, but they are external diagnostics. They tell us when and where the model fails, but they don’t correct the model’s core intent.

The alignment challenge: Scaling human oversight

In the lesson on the alignment problem, we established that the greatest risk posed by artificial intelligence is not that it becomes malicious, but that it causes catastrophic harm while faithfully pursuing a misaligned objective—a phenomenon commonly described as goal misspecification or goal misgeneralization.

The primary mechanism currently employed by the industry to address this challenge is Reinforcement Learning from Human Feedback (RLHF).

This approach uses human preference judgments to guide models toward desired behaviors, such as honesty and helpfulness. It is the central technique that has made large language models such as ChatGPT and Claude sufficiently safe and usable for widespread deployment.
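To make the mechanism concrete, here is a minimal sketch of the reward-modeling phase of RLHF, assuming the standard pairwise (Bradley–Terry) preference setup: a labeler picks the better of two responses, and a small model is trained so the chosen response scores higher than the rejected one. The tiny network and random tensors are illustrative stand-ins for a real language model's response representations, not anyone's production recipe.

```python
# Sketch of RLHF phase 2: fit a reward model to pairwise preference labels.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size response representation to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy stand-ins for embedded responses: for each prompt, `chosen` was
# preferred by the labeler over `rejected`.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise ranking loss: push the reward of the chosen response above
# the reward of the rejected one.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

In the third phase, the trained reward model scores the policy model's outputs during reinforcement-learning fine-tuning, so every preference judgment a human provides here is ultimately a training signal for the model's behavior.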

However, RLHF suffers from a fundamental limitation: it does not scale.

The scalability problem

RLHF relies on human labelers to rate or rank model outputs. This works well for simple tasks, such as: “Which of these two sentences is more polite?”

But what happens when the task is incredibly complex?

  • Protein folding: If an AI proposes a novel protein design, can a human judge whether that design is safe or effective just by looking at it? No.

  • 10,000-line code review: If an AI agent autonomously writes a massive software update, can a human reviewer accurately evaluate the safety and security of every line of code in the short window required for RLHF feedback? No.

Relying on simple human feedback fails when the tasks become too complex for humans to accurately judge at the scale required to train advanced AI models.  

The solution: Scalable oversight

The core idea is simple yet revolutionary: we must use AI systems to assist in the supervision and alignment of other AI systems.  

This creates a recursive ladder where a human, assisted by a weaker AI, can effectively oversee a model that is significantly more capable, which in turn could potentially oversee an even stronger system.
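As a rough sketch of what “AI supervising AI” can look like in practice (in the spirit of RLAIF), the snippet below replaces the human labeler with an AI critic that judges two candidate responses against written principles. The critic function here is a hypothetical, mocked stand-in so the example is self-contained; in a real system it would be a separate judge model prompted with the constitution, and its output would feed the same reward-model training step shown earlier.

```python
# Sketch of AI-generated preference labels: no human labeler in the loop.

CONSTITUTION = [
    "Choose the response that is more honest.",
    "Choose the response that is more helpful without causing harm.",
]

def ai_critic_prefers(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical judge. In a real system this would prompt a critique model
    with the constitution and both candidates; here a placeholder heuristic
    (prefer the shorter response) keeps the sketch runnable."""
    return "a" if len(response_a) <= len(response_b) else "b"

def build_ai_preference_dataset(prompts, candidate_pairs):
    """Produce (prompt, chosen, rejected) triples from AI judgments.
    These triples can train a reward model exactly as human labels would."""
    dataset = []
    for prompt, (resp_a, resp_b) in zip(prompts, candidate_pairs):
        winner = ai_critic_prefers(prompt, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if winner == "a" else (resp_b, resp_a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset

pairs = build_ai_preference_dataset(
    ["How do I reset my password?"],
    [("Click 'Forgot password' on the login page.",
      "I cannot help with that request under any circumstances.")],
)
print(pairs)
```

Because the judging criteria live in an explicit, inspectable list of principles rather than in the heads of thousands of labelers, this setup can be audited and scaled far more easily than direct human feedback.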

We will examine two main ...