
Model Alignment

Learn why alignment methods like RLHF and DPO are essential after SFT to train language models that align with human preferences, not just mimic instructions.

It’s very common in GenAI interviews to be asked why simply doing supervised fine-tuning (SFT) on a large language model isn’t enough and why additional alignment steps like RLHF (reinforcement learning from human feedback) or newer methods like DPO (Direct Preference Optimization) are needed. Interviewers love this question because it probes your understanding of how modern LLMs (like ChatGPT or Claude) are trained to be helpful and safe, not just statistically good at next-word prediction. As an ML engineer, you’re expected to demonstrate that you grasp the limitations of a model fine-tuned only on supervised data and how alignment techniques overcome those limitations to produce models that better satisfy human preferences.

When answering, a strong candidate will explain what SFT does and doesn’t do and then describe why an extra alignment phase is introduced. The interviewer wants you to mention that SFT (often called instruction tuning when done on curated prompts and responses) teaches the model how humans might respond, but it doesn’t guarantee the model’s outputs are what humans want or find acceptable. You should articulate that alignment methods like RLHF or DPO inject direct human feedback into training, which helps the model prefer more helpful, correct, or harmless answers. In other words, you need to show you understand that SFT alone might still be vulnerable to producing untruthful or undesired outputs, and alignment steps fix that by fine-tuning the model with human preferences in mind. Throughout this lesson, we’ll explore these concepts in depth and give you a solid, beginner-friendly, yet technically detailed understanding.

Why is an additional alignment step necessary after SFT?

After a language model has undergone supervised fine-tuning (SFT), typically on a dataset of example prompts and ideal responses, it certainly becomes better at following instructions than the raw pretrained model. So why isn’t that enough? The short answer: a pure SFT model imitates the style of the responses in its fine-tuning data, but it can still misbehave or give suboptimal answers when facing real user inputs. Alignment steps like RLHF or DPO are necessary to address several limitations of SFT:

  • In SFT, the model is trained to mimic demonstrations (human-written answers). It learns to produce plausible answers, but plausibility ≠ preference. For example, the model might give an answer that sounds good but isn’t what a user truly wants. It might be too verbose or not sufficiently helpful, or it might miss the nuance of what makes an answer useful. SFT teaches the model a general style, but it doesn’t explicitly tell the model which of two plausible answers is better from a human perspective. In contrast, alignment methods use direct feedback from humans to indicate which outputs are preferred, not just plausible. This feedback is more fine-grained and value-laden, allowing the model to learn what humans prefer (e.g., truthfulness, clarity, not just looking “okay”).

  • Imagine teaching a model to be helpful but not harmful using only SFT. You would need your supervised dataset to include every possible tricky user request with the exact correct response, which is infeasible. SFT is limited by its dataset: if the data doesn’t cover a certain situation, the model won’t handle it well. For instance, an SFT model might not know how to respond to a blatantly harmful request if it never saw one in training. Alignment via RLHF/DPO lets us incorporate nuanced policies (like “don’t reveal sensitive info” or “refuse certain requests politely”) by having humans actively evaluate model outputs and prefer the ones that follow those policies. This nuanced feedback goes beyond the explicit content of the supervised training examples.

  • Even after SFT, language models can produce hallucinations (confident-sounding but false statements) or toxic content, because SFT mainly optimizes for following patterns in the fine-tuning data. It doesn’t explicitly penalize inaccuracies or unsafe content; at best, it avoids them only to the extent they’re absent from the fine-tuning data. Alignment steps address this by rewarding truthful, safe behavior and penalizing undesirable outputs. In RLHF, for example, the reward model can be trained to give low scores to factually incorrect or toxic answers, and the RL step will then reduce the likelihood of those answers (a minimal sketch of this preference loss follows the list). Essentially, alignment tunes the model to what humans value (factual correctness, helpfulness, harmlessness) rather than just “what looks like the training text.”

  • During pretraining and even after SFT, the model might have learned all sorts of biases or unwanted behaviors (because pretraining data from the internet is largely unfiltered). SFT might not remove those latent behaviors if the supervised data is limited. Alignment steps are explicitly designed to align the model’s behavior with human values or specific guidelines; RLHF was originally proposed as a way to make models “follow instructions” and be safe and reliable, precisely because pretraining and SFT alone don’t enforce those qualities.
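To make the reward-model idea from the list above concrete, here is a minimal sketch of the pairwise preference loss commonly used to train an RLHF reward model (a Bradley-Terry style objective). It assumes PyTorch; the function name and the toy reward scores are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style pairwise loss: push the reward model to score the
    # human-preferred response higher than the rejected one for each prompt.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy reward-model scores for three prompts, each with a (chosen, rejected) pair.
chosen = torch.tensor([1.2, 0.3, 2.0])    # scores of the preferred responses
rejected = torch.tensor([0.4, 0.8, 1.5])  # scores of the dispreferred responses
print(reward_model_loss(chosen, rejected))  # shrinks as chosen outscores rejected
```

The subsequent RL step then optimizes the language model against this learned reward, which is how human preferences end up steering generation.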

SFT is usually trained with a next-token cross-entropy loss. That means it optimizes for token-level likelihood, not sequence-level preference or correctness. Think of SFT as learning how to talk like a helpful human, and alignment (RLHF/DPO) as learning how to be helpful to a human.
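As a rough illustration of that token-level objective, the sketch below computes the SFT loss on toy logits with PyTorch; the vocabulary size, sequence length, and token ids are made up for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 8, 4
logits = torch.randn(seq_len, vocab_size)  # model's per-position logits (toy values)
targets = torch.tensor([3, 1, 7, 2])       # token ids of the "ideal" demonstration

# SFT objective: average negative log-likelihood of each target next token.
sft_loss = F.cross_entropy(logits, targets)
print(sft_loss)
```

Nothing in this loss ever compares two candidate answers, which is exactly the gap that preference-based alignment fills.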

Empirical studies show users strongly prefer models that undergo alignment fine-tuning. OpenAI’s InstructGPT work famously demonstrated that outputs from a 1.3B parameter model trained with SFT and RLHF were preferred by human evaluators over outputs from the much larger 175B GPT-3 model. This highlights how alignment can boost human satisfaction even more than scale. Pure SFT models fall short compared to those enhanced with human feedback. Other studies confirm this: models trained with RLHF or AI-assisted feedback (RLAIF) significantly outperform models trained solely via supervised instruction tuning. Stopping at SFT means missing out on a major quality boost.

Though implemented differently, RLHF and DPO leverage human preferences to fine-tune model behavior. Unlike SFT, which teaches by imitation, these methods train models to distinguish which responses humans prefer. This helps the model learn deeper traits like helpfulness, truthfulness, and safety, qualities that imitation alone can’t ensure. Alignment steps embed nuanced judgment into the model by adding a human-guided reward signal. That’s why today’s leading chatbots (ChatGPT, Claude, Gemini, etc.) all include post-SFT alignment fine-tuning.
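For concreteness, here is a minimal sketch of the DPO objective (Rafailov et al., 2023) operating on precomputed sequence log-probabilities, again assuming PyTorch; the helper name and the toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of a response = beta * (policy log-prob - reference log-prob).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to assign the chosen response a higher implicit reward
    # than the rejected one, without training a separate reward model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy summed log-probs for a batch of two (chosen, rejected) preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -8.0]),
    ref_chosen_logps=torch.tensor([-13.0, -10.0]),
    ref_rejected_logps=torch.tensor([-13.5, -8.5]),
)
print(loss)
```

The frozen reference model (usually the SFT checkpoint) keeps the policy from drifting too far from its fine-tuned behavior, playing the role that the KL penalty plays in RLHF.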

SFT is essential but not enough for aligning LLMs. It helps with instruction-following and tone, but doesn’t guarantee optimal or safe outputs. Alignment methods like RLHF or DPO fill these gaps, reducing hallucinations and toxic responses and shaping outputs toward what humans actually prefer and value.