The Learning Objective and The Training Loop

Understand the fundamental training technique of next-token prediction and the four steps involved in the training loop.

We have seen how the transformer decoder block works and how these models generate outputs. In our demo, however, the model’s weights were initialized with random, meaningless numbers. It knows nothing.

So, how is a model like this transformed from a blank slate into an expert? The process is a simple, repeatable game, played at an unimaginable scale. In this lesson, we will learn the rules of that game and explore the mechanics of how a model plays a single “turn” to get incrementally smarter. We are not training our own model, but we will understand exactly how the professionals do it.

Self-supervised learning

The entire training process is designed around a single, elegant objective called causal language modeling, which is a formal name for a simple task: predicting the next token.
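Written as a formal objective (the notation below is a standard formulation, assumed here rather than taken from the lesson): for a token sequence $x_1, \dots, x_T$, training adjusts the model’s parameters $\theta$ to maximize the probability of every token given all the tokens before it:

$$\max_{\theta} \; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_1, \dots, x_{t-1})$$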

This objective is powerful because it is self-supervised. The training data itself provides both the questions and the answers. If the model is shown the text “Twinkle, twinkle, little”, the label (the correct next token) is right there in the sequence: “star”. This lets us use the vast, raw text of the internet as both our textbook and our answer key; no human has to manually label anything.
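As a minimal sketch of how those question/answer pairs fall out of raw text (the word-level tokens here are purely illustrative; real models work on subword tokens):

```python
# Toy word-level tokens; real tokenizers produce subword units.
tokens = ["Twinkle", ",", "twinkle", ",", "little", "star"]

# Inputs are the sequence, and labels are the same sequence shifted left by
# one: each position's "correct answer" is simply the token that follows it.
inputs = tokens[:-1]   # ["Twinkle", ",", "twinkle", ",", "little"]
labels = tokens[1:]    # [",", "twinkle", ",", "little", "star"]

for i, label in enumerate(labels):
    print(inputs[: i + 1], "->", label)
```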

To play this game effectively, we need a way to keep track of the score. How do we measure how “good” or “bad” the model’s prediction is on any given turn? This is the job of the loss function.

You can think of loss as a precise mathematical measure of the model’s “surprise.” The specific function used is called cross-entropy loss.

  • If the model predicts “star” with 90% probability and the actual next token is indeed “star”, its surprise is very low. The loss value is small.

  • If the model predicts “fish” with 80% probability, leaving “star” only a sliver of the probability mass, and the actual next token is “star”, its surprise is enormous. The loss value is very high.
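For a single prediction, this “surprise” has a simple closed form: cross-entropy collapses to the negative log of the probability the model assigned to the token that actually came next,

$$\mathcal{L} = -\log p_{\text{model}}(\text{correct token} \mid \text{context}).$$

Plugging in the first bullet gives $-\log(0.9) \approx 0.105$, a small loss. If “star” instead received only, say, 0.05 of the probability mass (an illustrative number), the loss jumps to $-\log(0.05) \approx 3.0$.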

The single, overarching goal of the entire training process is to adjust the model’s billions of weights to minimize the total loss over the entire training dataset. The model wins the game by becoming an expert at being unsurprised by human language.

Why does cross-entropy loss work?

For the specific task of next-token prediction, cross-entropy loss is the standard and universally used loss function. It is the mathematically perfect tool for the job. Why? Because it is specifically designed to measure the “distance” or “divergence” between two probability distributions:

  1. The model’s predicted probability distribution (e.g., “star”: 0.9, “moon”: 0.05, ...).

  2. The “true” probability distribution: a one-hot encoded vector in which the correct token (“star”) has a probability of 1.0 and every other token has a probability of 0.

Cross-entropy loss gives a high penalty when the model assigns a low probability to the correct token. Its entire structure is optimized for this single task of “making the probability of the correct next token as high as possible.” While other loss functions exist in machine learning, for this generative pre-training objective, cross-entropy is the undisputed champion.
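Here is a minimal sketch of that computation in Python, with a toy four-token vocabulary and invented probabilities:

```python
import math

# The model's predicted distribution over a toy vocabulary (invented numbers).
predicted = {"star": 0.9, "moon": 0.05, "fish": 0.03, "car": 0.02}

# The "true" distribution: one-hot, with all mass on the actual next token.
true_dist = {"star": 1.0, "moon": 0.0, "fish": 0.0, "car": 0.0}

# Cross-entropy: H(true, predicted) = -sum_x true(x) * log(predicted(x)).
# Every term with true(x) == 0 vanishes, so the sum collapses to
# -log(predicted["star"]).
loss = -sum(p * math.log(predicted[tok]) for tok, p in true_dist.items() if p > 0)
print(f"loss = {loss:.3f}")  # ~0.105: low surprise, low loss
```

Drop “star” to 0.05 in `predicted` and the loss jumps to roughly 3.0, which is exactly the high penalty described above.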

Preventing cheating with masks

This is where the causal mask from our last section becomes the hero of the story. During inference, its job was to ensure orderly, step-by-step generation. But its true purpose, the reason it was invented, is training.

For the sake of efficiency, the entire text sequence (“Twinkle, twinkle, little star”) is fed into the model at once. But this creates a problem. When the model is at the third position (“little”) and its job is to predict the fourth token, the answer (“star”) is technically already in its input!

If the model could see that future token, it would cheat. It would learn to just copy the next word, not to predict it. The causal mask prevents this. It makes all future tokens invisible to the attention mechanism, forcing the model to make an honest prediction based only on the context it has seen so far. The mask is the rule that makes the game fair and ensures that real learning happens.
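A minimal sketch of how that masking is typically implemented, using PyTorch as an assumed framework and random placeholder scores instead of a real model’s outputs:

```python
import torch

seq_len = 4  # e.g., ["Twinkle", "twinkle", "little", "star"]
scores = torch.randn(seq_len, seq_len)  # raw attention scores; row = query position

# True above the diagonal, i.e., wherever a position would peek at its future.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# A score of -inf becomes an attention weight of zero after softmax, so each
# position can attend only to itself and to earlier tokens.
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)  # row i has nonzero weights only in columns 0..i
```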

What happens behind the scenes

When training a large model, it isn’t given a single sentence or an entire book at once. It ...