The Learning Objective and The Training Loop

Understand the fundamental training technique of next-token prediction and the four steps involved in the training loop.

We have seen how the transformer decoder block works and how these models generate outputs. In our demo, however, the model’s weights were initialized with random, meaningless numbers. It knows nothing.

So, how is a model like this transformed from a blank slate into an expert? The process is a simple, repeatable game, played at an unimaginable scale. In this lesson, we will learn the rules of that game and explore the mechanics of how a model plays a single “turn” to get incrementally smarter. We are not training our own model, but we will understand exactly how the professionals do it.

Self-supervised learning

The entire training process is designed around a single, elegant objective called causal language modeling, which is a formal name for a simple task: predicting the next token.
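Written as a formal objective (the notation below is a standard formulation, assumed here rather than taken from the lesson): for a token sequence $x_1, \dots, x_T$, training adjusts the model’s parameters $\theta$ to maximize the probability of every token given all the tokens before it:

$$\max_{\theta} \; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_1, \dots, x_{t-1})$$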

This objective is powerful because it is self-supervised. The training data itself provides both the questions and the answers. If the model is shown the text “Twinkle, twinkle, little”, the label (the correct next token) is right there in the sequence: “star”. This lets us use the vast, raw text of the internet as both our textbook and our answer key; no human has to manually label anything.
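As a minimal sketch of how those question/answer pairs fall out of raw text (the word-level tokens here are purely illustrative; real models work on subword tokens):

```python
# Toy word-level tokens; real tokenizers produce subword units.
tokens = ["Twinkle", ",", "twinkle", ",", "little", "star"]

# Inputs are the sequence, and labels are the same sequence shifted left by
# one: each position's "correct answer" is simply the token that follows it.
inputs = tokens[:-1]   # ["Twinkle", ",", "twinkle", ",", "little"]
labels = tokens[1:]    # [",", "twinkle", ",", "little", "star"]

for i, label in enumerate(labels):
    print(inputs[: i + 1], "->", label)
```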

To play this game effectively, we need a way to keep track of the score. How do we measure how “good” or “bad” the model’s prediction is on any given turn? This is the job of the loss function.

You can think of loss as a precise mathematical measure of the model’s “surprise.” The specific function used is called cross-entropy loss.

  • If the model predicts “star” with 90% probability and the actual next token is indeed “star”, its surprise is very low. The loss value is small.

  • If the model predicts “fish” with 80% probability, leaving “star” only a sliver of the probability mass, and the actual next token is “star”, its surprise is enormous. The loss value is very high.
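For a single prediction, this “surprise” has a simple closed form: cross-entropy collapses to the negative log of the probability the model assigned to the token that actually came next,

$$\mathcal{L} = -\log p_{\text{model}}(\text{correct token} \mid \text{context}).$$

Plugging in the first bullet gives $-\log(0.9) \approx 0.105$, a small loss. If “star” instead received only, say, 0.05 of the probability mass (an illustrative number), the loss jumps to $-\log(0.05) \approx 3.0$.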

The single, overarching goal of the entire training process is to adjust the model’s billions of weights to minimize the total loss over the entire training dataset. The model wins the game by becoming an expert at being unsurprised by human language.

Why does cross-entropy loss work?

For the specific task of next-token prediction, cross-entropy loss is the standard and universally used loss function. It is the mathematically perfect tool for the job. Why? Because it is specifically designed to measure the “distance” or “divergence” between two probability distributions:

  1. The model’s predicted probability distribution (e.g., “star”: 0.9, “moon”: 0.05, ...).

  2. The “true” probability distribution: a one-hot encoded vector in which the correct token (“star”) has a probability of 1.0 and every other token has a probability of 0.

Cross-entropy loss gives a high penalty when the model assigns a low probability to the correct token. Its entire structure is optimized for this single task of “making the probability of the correct next token as high as possible.” While other loss functions exist in machine learning, for this generative pre-training objective, cross-entropy is the undisputed champion.
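Here is a minimal sketch of that computation in Python, with a toy four-token vocabulary and invented probabilities:

```python
import math

# The model's predicted distribution over a toy vocabulary (invented numbers).
predicted = {"star": 0.9, "moon": 0.05, "fish": 0.03, "car": 0.02}

# The "true" distribution: one-hot, with all mass on the actual next token.
true_dist = {"star": 1.0, "moon": 0.0, "fish": 0.0, "car": 0.0}

# Cross-entropy: H(true, predicted) = -sum_x true(x) * log(predicted(x)).
# Every term with true(x) == 0 vanishes, so the sum collapses to
# -log(predicted["star"]).
loss = -sum(p * math.log(predicted[tok]) for tok, p in true_dist.items() if p > 0)
print(f"loss = {loss:.3f}")  # ~0.105: low surprise, low loss
```

Drop “star” to 0.05 in `predicted` and the loss jumps to roughly 3.0, which is exactly the high penalty described above.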

Preventing cheating with masks

This is where the causal mask from our last section becomes the hero of the story. During inference, its job was to ensure orderly, step-by-step generation. But its true purpose, the reason it was invented, is training.

For the sake of efficiency, the entire text sequence (“Twinkle, twinkle, little star”) is fed into the model at once. But this creates a problem. When the model is at the third position (“little”) and its job is to predict the fourth token, the answer (“star”) is technically already in its input!

If the model could see that future token, it would cheat. It would learn to just copy the next word, not to predict it. The causal mask prevents this. It makes all future tokens invisible to the attention mechanism, forcing the model to make an honest prediction based only on the context it has seen so far. The mask is the rule that makes the game fair and ensures that real learning happens.
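A minimal sketch of how that masking is typically implemented, using PyTorch as an assumed framework and random placeholder scores instead of a real model’s outputs:

```python
import torch

seq_len = 4  # e.g., ["Twinkle", "twinkle", "little", "star"]
scores = torch.randn(seq_len, seq_len)  # raw attention scores; row = query position

# True above the diagonal, i.e., wherever a position would peek at its future.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# A score of -inf becomes an attention weight of zero after softmax, so each
# position can attend only to itself and to earlier tokens.
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)  # row i has nonzero weights only in columns 0..i
```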

What happens behind the scenes

When training a large model, it isn’t given a single sentence or an entire book at once. It ...