
How LLMs Learn (The Training Loop)

Understand how large language models learn by predicting the next token in a sequence and refining their predictions through a repeated training loop. Learn the roles of the loss function, backpropagation, and weight updates in enabling models to improve across massive datasets and many iterations.

Earlier in this course, we introduced a formula that estimates how long it takes to train a large language model. Depending on the number of parameters, the size of the dataset, and the number of training epochs, this process can span hundreds of days on large-scale hardware. While the formula helps quantify the cost of training, it does not explain what is actually happening during that time.

This lesson focuses on the mechanics of learning in large language models. Instead of diving into mathematical derivations, we will take a conceptual view of how an LLM improves over time. The goal is to understand what “learning” means in this context and how the training loop gradually transforms an untrained model into a system capable of generating coherent and useful text.

What does it mean for an LLM to learn?

Large language models do not learn concepts, facts, or rules in the way humans do. They do not store explicit knowledge about the world, nor do they reason symbolically about language. Instead, their learning objective is much simpler and more mechanical.

An LLM learns by repeatedly predicting the next token in a sequence.

A token is a basic unit of text used by the model. Depending on the tokenizer, a token may represent a full word, a word fragment, punctuation, or whitespace. During training, the model is shown a sequence of tokens and asked to predict the next token.

For example, given the sequence:

“The capital of France is”

The correct next token is:

“Paris”

The model makes a prediction, compares it with the actual next token from the training data, and adjusts its parameters to make the correct token more likely next time. This process is repeated across massive datasets containing trillions of tokens.
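To make this concrete, here is a minimal sketch of how a sentence becomes a training example. The vocabulary and token IDs below are invented for illustration; real tokenizers learn sub-word vocabularies with tens of thousands of entries.

```python
# Toy illustration of how text becomes a next-token training example.
# The vocabulary and token IDs are invented; real tokenizers learn
# sub-word vocabularies from data.
vocab = {"The": 0, "capital": 1, "of": 2, "France": 3, "is": 4, "Paris": 5}

text = ["The", "capital", "of", "France", "is", "Paris"]
token_ids = [vocab[t] for t in text]   # [0, 1, 2, 3, 4, 5]

# The model sees the prefix and must predict the token that follows it.
input_ids = token_ids[:-1]             # [0, 1, 2, 3, 4]  "The capital of France is"
target_id = token_ids[-1]              # 5                "Paris"
print(input_ids, "->", target_id)
```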

Let’s look at how models learn through these tokens.

Self-supervised learning through text

This training approach is known as self-supervised learning. Unlike traditional supervised learning, it requires no manually labeled data: the structure of the text itself provides the supervision.

Every sentence in the training corpus naturally contains both:

  • An input (the tokens seen so far).

  • A target (the token that comes next).

For instance, if the training data contains the sentence:

“Large language models learn from data.”

The model can be trained using:

  • Input: “Large language models learn from”

  • Target: “data”

Because the correct answer is already present in the data, no external labeling process is required. This enables training models on extremely large, diverse datasets collected from books, articles, code repositories, and other text sources.
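As a small illustration, the snippet below splits a sentence into (input, target) pairs the way self-supervised training does. Word-level tokens are used here for readability; a real tokenizer would produce sub-word units.

```python
# Each prefix of the sequence becomes an input, and the token that
# follows it becomes the target. One sentence yields many pairs "for free".
tokens = ["Large", "language", "models", "learn", "from", "data", "."]

for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(" ".join(context), "->", target)
# One of the printed pairs: "Large language models learn from -> data"
```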

It is important to note that the model does not “know” what the sentence means. It only learns that certain token sequences are statistically likely to follow others. Over time, as it is exposed to more data, the model becomes increasingly accurate at making these predictions.

The training loop

The learning process of an LLM is structured as a loop that is repeated continuously during training. Each iteration of this loop slightly improves the model’s parameters. Individually, these improvements are small, but at scale, they accumulate into significant capability.

At a high level, the training loop consists of four steps:

  1. The model predicts the next token.

  2. The prediction is evaluated using a loss function.

  3. The error is propagated backward through the model.

  4. The model’s weights are updated to reduce future error.

This loop is executed billions or even trillions of times during training.
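The sketch below shows what one pass through this loop looks like in code, assuming PyTorch. The tiny model (an embedding layer averaged into a linear head) and the random data are invented stand-ins; a real LLM uses stacked transformer blocks and streams batches from a large corpus, but the four steps of the loop are the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, DIM = 1000, 64

class TinyModel(nn.Module):
    """A toy stand-in for an LLM: embeds tokens, averages them, and
    projects back to vocabulary-sized raw scores (logits)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, input_ids):
        context = self.embed(input_ids).mean(dim=1)  # crude summary of the context
        return self.head(context)                    # raw scores for every token

model = TinyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Random stand-ins for real data: 8 contexts of 16 tokens, plus targets.
input_ids = torch.randint(0, VOCAB_SIZE, (8, 16))
targets = torch.randint(0, VOCAB_SIZE, (8,))

for step in range(100):                      # at scale: billions of iterations
    logits = model(input_ids)                # 1. predict the next token
    loss = F.cross_entropy(logits, targets)  # 2. score it with the loss function
    optimizer.zero_grad()
    loss.backward()                          # 3. propagate the error backward
    optimizer.step()                         # 4. update the weights
```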

Next-token prediction training flows from input text through tokenization and model prediction to loss feedback

In the following sections, we will examine each step of this loop in detail, starting with the model’s prediction.

Prediction: Making a guess

During training, the model receives a sequence of tokens as input. Using its current set of parameters, it processes this sequence and produces a prediction for the next token. Importantly, the model does not directly output a single token. Instead, it generates a probability distribution over all tokens in its vocabulary.

For example, given an input sequence, the model might assign:

  • 60% probability to one token.

  • 25% to another.

  • And smaller probabilities to many others.

The token with the highest probability is considered the model’s prediction. Early in training, these predictions are often close to random. As training progresses, the probability mass increasingly shifts toward correct or plausible tokens.

Flow through an LLM from raw text to next-token probability predictions

This prediction step is purely computational. It consists of matrix multiplications, non-linear transformations, and a final normalization step that converts raw scores into probabilities. There is no memory of past predictions and no awareness of meaning—only numerical computation based on the current weights.
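As a sketch of that final normalization step, the snippet below applies a softmax to a handful of invented raw scores (logits), producing a distribution much like the 60%/25% example above.

```python
import torch
import torch.nn.functional as F

# Invented raw scores (logits) for five tokens in a tiny vocabulary.
logits = torch.tensor([2.0, 1.1, 0.3, -0.5, -1.2])

# Softmax normalizes the scores into probabilities that sum to 1.
probs = F.softmax(logits, dim=0)
print(probs)           # roughly [0.58, 0.24, 0.11, 0.05, 0.02]
print(probs.argmax())  # index of the highest-probability token: the prediction
```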

The loss

...