
Pretraining Paradigms

Explore pretraining paradigms that shape foundation models in generative AI. Understand how methods like autoregressive prediction, masked language modeling, and contrastive learning influence both training and model behavior during inference, enabling versatile applications in language, vision, and speech.

Modern foundation models, such as GPT, can initially appear somewhat mysterious. You might hear terms like masked language modeling, autoregressive next-token prediction, or contrastive learning and wonder: do these describe how the model is trained, or how it makes predictions? The truth is that the pretraining task shapes both:

  • The training process (how the model’s parameters are optimized)

  • The final behavior (how the trained model naturally generates text)

We’ll examine both facets, explaining why the pretraining objective is the core of how the model is trained and how that objective determines the model’s capabilities at inference.

How does pretraining define both training and inference?

A great example comes from the GPT series. During pretraining, GPT learns with an autoregressive objective: given all previous tokens, predict the next one. For example, if the input is “The cat sat on the”, the model must guess “mat.” This process is repeated billions of times, comparing predictions to true tokens and adjusting parameters to reduce errors.
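
To make the objective concrete, here is a minimal sketch of one next-token training step in PyTorch. The tiny embedding-plus-linear "model" and the random token batch are stand-ins for a real GPT-style transformer and real text; only the shift-by-one labeling and the cross-entropy loss mirror the actual setup.

```python
# Minimal sketch of one autoregressive pretraining step (illustrative only).
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 64, 8

# A deliberately tiny stand-in for a GPT-style decoder.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake batch of token IDs standing in for real text.
tokens = torch.randint(0, vocab_size, (4, seq_len))

# Shift by one: the model sees tokens[:, :-1] and must predict tokens[:, 1:].
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                                  # (batch, seq_len-1, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss.backward()                                         # adjust parameters to reduce error
optimizer.step()
optimizer.zero_grad()
print(f"next-token prediction loss: {loss.item():.3f}")
```

In real pretraining this step runs over billions of tokens, but the recipe is unchanged: compare each predicted next token to the true one and nudge the parameters to make the right prediction more likely.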

The same autoregressive mechanism drives inference. If you prompt GPT with “Translate this sentence into French: I love cats”, it again predicts the next most likely token. Because it has seen translation patterns in training, the continuation becomes “J’aime les chats.”
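
The sketch below shows that same mechanism at inference time, using the public gpt2 checkpoint from the Hugging Face transformers library as an illustrative model. The manual loop makes the process explicit: predict the most likely next token, append it to the input, and repeat.

```python
# Minimal sketch of autoregressive (greedy) decoding with a GPT-2 checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                            # generate five tokens, one at a time
        logits = model(ids).logits                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)   # append it and feed everything back in

print(tokenizer.decode(ids[0]))
```

Nothing about the model changes between training and generation; the only difference is that at inference the predicted token is appended to the context instead of being compared against a known target.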

In short, GPT learns and generates through the same unified process: predict the next token autoregressively. This simplicity is what ...