
Fine-tuning and Alignment

Understand how base language models become aligned assistants through fine-tuning and alignment techniques. Explore supervised fine-tuning, reinforcement learning from human feedback, and newer approaches like Direct Preference Optimization and Constitutional AI. Learn the distinctions between these methods and how they contribute to safe, effective model deployment in production-grade systems.

A base language model is a next-token predictor. It is extraordinarily good at completing text in the style of its training corpus. It is not, by default, a helpful assistant. Getting from “completes text plausibly” to “reliably helpful, honest, and safe” requires a stack of training techniques that every interviewer at an AI company will probe. Fine-tuning and alignment are the bridge between raw capability and deployed product.

SFT, RLHF, DPO, and Constitutional AI are not competing approaches that replaced one another. They are layers in a training pipeline, and modern frontier models use variants of all of them in combination. Understanding how they stack, and why each layer exists, is what separates a strong answer from a surface-level one.

What is the difference between pretraining and fine-tuning?

Pretraining is where the model learns language itself. A massive corpus, typically trillions of tokens from the internet, books, and code, is fed to the model with a simple objective: predict the next token. After pretraining, the model has a rich internal representation of language, facts, and reasoning patterns, but no concept of how to be helpful in a conversation.
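The next-token objective is just cross-entropy between the model's predicted distribution and the token that actually comes next. A minimal sketch in pure Python (the toy logits and four-token vocabulary are illustrative, not from any real model):

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy over positions. At each step, the model's
    logits over the vocabulary are scored against the id of the token
    that actually follows (targets[t] is that next-token id)."""
    total = 0.0
    for step_logits, target_id in zip(logits, targets):
        # log-softmax with the usual max-subtraction for stability
        m = max(step_logits)
        log_norm = math.log(sum(math.exp(x - m) for x in step_logits))
        log_prob = (step_logits[target_id] - m) - log_norm
        total += -log_prob
    return total / len(targets)

# Two positions over a toy 4-token vocabulary.
logits = [
    [2.0, 0.5, 0.1, -1.0],  # model leans toward token 0
    [0.0, 0.0, 3.0, 0.0],   # model leans toward token 2
]
loss = next_token_loss(logits, targets=[0, 2])
```

The loss is low when the model assigns high probability to the true next token and high otherwise; pretraining is nothing more than minimizing this quantity over trillions of tokens.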

Fine-tuning adapts the pretrained model to a specific behavior or task by continuing training on a much smaller, curated dataset. The pretrained weights are not discarded; they are the starting point. This is why fine-tuning is cheap relative to pretraining. A model like Llama 3 70B costs millions of dollars to pretrain. Fine-tuning it for instruction following costs a small fraction of that, because you are nudging an already-capable model rather than building one from scratch.
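The "pretrained weights are the starting point" idea can be shown with a deliberately tiny stand-in: a one-parameter model trained by SGD. This is a hypothetical sketch, not a real language-model pipeline; the datasets and learning rates are made up to illustrate that fine-tuning resumes optimization from learned weights rather than from scratch:

```python
def sgd(w, dataset, lr=0.1, steps=100):
    """Fit y ≈ w*x by minimizing squared error with plain SGD."""
    for _ in range(steps):
        for x, y in dataset:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# "Pretraining": many steps on a large generic dataset (here, y = 2x).
pretrain_data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
pretrained_w = sgd(0.0, pretrain_data)

# "Fine-tuning": far fewer steps on a small curated dataset nudge the
# already-good weight toward slightly different behavior (y = 2.1x).
finetune_data = [(1.0, 2.1)]
finetuned_w = sgd(pretrained_w, finetune_data, lr=0.05, steps=20)
```

Note the asymmetry: pretraining does the expensive work of getting the weight near a good solution, and fine-tuning spends a small fraction of that compute adjusting it, which is exactly why fine-tuning a 70B model costs so much less than pretraining one.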

Supervised fine-tuning (SFT) is the first fine-tuning stage. Human annotators write examples of good model behavior: a prompt paired with a high-quality response. The model is trained to imitate these demonstrations using the standard next-token cross-entropy loss. After SFT, the model follows instructions much better. It ...