Model Alignment
Understand why supervised fine-tuning alone is insufficient for aligning large language models and explore advanced alignment techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). This lesson explains how these methods incorporate human preferences to make AI outputs safer, more helpful, and more truthful, highlighting their differences, benefits, and challenges in model training.
It’s common in GenAI interviews to be asked why supervised fine-tuning (SFT) isn’t enough and why models require an additional alignment step, such as RLHF or newer methods like DPO. The question checks whether you understand how modern LLMs become not just instruction-following, but actually helpful, safe, and aligned with human preferences.
SFT teaches a model to imitate human-written responses, but imitation doesn’t guarantee the outputs people actually want. Even a well-tuned SFT model may still be unhelpful, incorrect, or unsafe. Alignment methods add direct human preference data—ranking, feedback, or pairwise comparisons—to push the model toward preferred behaviors rather than merely plausible ones. This lesson walks through why that extra step matters and how RLHF and DPO achieve it.
Why is an additional alignment step necessary after SFT?
After supervised fine-tuning (SFT), a model becomes better at following instructions, but it still imitates patterns rather than optimizing for what humans prefer. SFT teaches the model to mimic demonstrations, so an SFT-only model can still be verbose, unhelpful, unsafe, or misaligned with user intent. Plausible answers are not the same as preferred answers, and a response that sounds fine may still fail to match what users want.
SFT captures style, not value judgments. Its dataset cannot cover every tricky, harmful, or ambiguous query, so an SFT-only model often fails in situations it has never seen. Alignment methods such as RLHF and DPO add direct human feedback that teaches the model which of two outputs is better, injects policies such as refusing unsafe requests or protecting sensitive information, and steers the model toward truthful, clear, and helpful behavior.
SFT is trained with a next-token cross-entropy loss, so it optimizes for token-level likelihood rather than sequence-level preference or correctness. Alignment steps reward desired behavior and penalize harmful or incorrect responses. Think of SFT as learning how to talk like a helpful human, and alignment as learning how to be helpful to a human.
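To make this concrete, here is a minimal sketch of the token-level cross-entropy objective SFT optimizes, written in PyTorch; the function and tensor names are illustrative assumptions, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, pad_token_id=0):
    """Next-token cross-entropy, the standard SFT objective.

    logits:     (batch, seq_len, vocab_size) model predictions
    target_ids: (batch, seq_len) token ids of the human demonstration

    Every token is scored independently against the demonstration;
    nothing in this objective judges whether the response as a whole
    is helpful, safe, or truthful.
    """
    # Shift so position t predicts token t + 1.
    logits = logits[:, :-1, :].contiguous()
    targets = target_ids[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=pad_token_id,  # skip padding positions
    )
```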
Empirical evidence supports this distinction. InstructGPT showed that a small SFT+RLHF model outperformed a much larger SFT-only GPT-3 model in human preference evaluations. Similar studies show that RLHF, RLAIF, and related techniques consistently outperform pure supervised instruction tuning. Stopping at SFT means missing a significant improvement in helpfulness and safety.
Though implemented differently, RLHF and DPO both leverage human preferences to fine-tune model behavior. Unlike SFT, which teaches by imitation, these methods train the model to distinguish the responses humans prefer from those they don’t. This helps the model learn deeper traits, such as helpfulness, truthfulness, and safety—qualities that imitation alone cannot guarantee. Alignment steps embed nuanced judgment into the model through a human-guided reward signal, as the sketch below illustrates. That’s why today’s leading chatbots (ChatGPT, Claude, Gemini, etc.) all include post-SFT alignment fine-tuning.
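As an illustration of how pairwise human preferences become a training signal, here is a minimal sketch of the Bradley-Terry-style loss commonly used to train a reward model on chosen-versus-rejected response pairs; the function and argument names are placeholders, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry-style loss over human preference pairs.

    reward_chosen / reward_rejected: (batch,) scalar scores a reward
    model assigns to the preferred and the rejected response for the
    same prompt. Minimizing the loss pushes the preferred response's
    score above the rejected one's, turning human rankings into a
    reward signal the policy can later be optimized against.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: scores for a batch of three preference pairs.
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3, 0.9]),
                                torch.tensor([0.4, 0.8, -0.1]))
```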
Quick answer: Supervised fine-tuning teaches a model to imitate examples, but imitation alone doesn’t guarantee helpful, safe, or truthful behavior. Alignment steps like RLHF or DPO add direct human preferences, showing the model which answers are better, not just plausible. This reduces hallucinations, unsafe outputs, and stylistic drift—problems SFT alone cannot reliably solve.
SFT is essential but not enough for aligning LLMs. It helps with instruction-following and tone, but doesn’t guarantee optimal or safe outputs. Alignment methods like RLHF or DPO fill these gaps, reducing hallucinations and toxic responses, shaping output to human-preferred styles, and aligning behavior with user and societal values. These techniques elevate models from “can do the task” to “does it the way humans want.”
Next, we’ll dive deeper into how these alignment methods work.
What is reinforcement learning from human feedback (RLHF)?
Before we compare RLHF and DPO, let’s make sure we understand what RLHF is and why it’s used. Reinforcement learning from human feedback (RLHF) is a technique that has become standard for aligning LLMs; OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and DeepSeek’s R1 all use some variant of it in their training pipelines. The idea is to use human feedback as a reward signal to fine-tune the model’s behavior via reinforcement learning. The method gained prominence with OpenAI’s InstructGPT paper, which showed that GPT-3 models fine-tuned with human feedback followed instructions much better than models fine-tuned with supervision alone.
At a high level, RLHF is implemented in three steps. The figure below illustrates the RLHF pipeline as applied in InstructGPT:
Let’s go through those steps in a bit more detail:
First, a set of prompts (often real user queries or prompts written by labelers) is collected, and human experts write demonstration responses to show ideal answers. For example, for “Explain the moon landing to a 6-year-old,” a human might write a simple, child-friendly reply. These prompt-response pairs are used to fine-tune the pretrained model (like GPT-3) via supervised learning, teaching it to mimic the demonstrations. The result is the SFT (supervised fine-tuning) model, which better follows instructions—answering politely, explaining clearly, etc.—but is not yet fully aligned with human preferences.
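For intuition, here is a minimal sketch of this SFT step using the Hugging Face transformers library; the model name, prompt, and demonstration are placeholders, and a real pipeline would loop over many prompt-response pairs with an optimizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain the moon landing to a 6-year-old.\n"
demonstration = ("Some brave people built a huge rocket, flew it all "
                 "the way to the moon, and walked on it.")

# One training example: the prompt followed by the human-written answer.
batch = tokenizer(prompt + demonstration, return_tensors="pt")

# With labels equal to the input ids, the model is trained to reproduce
# the demonstration token by token (next-token cross-entropy).
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # in practice: optimizer.step() over many examples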
Next, we gather human ...