Pretraining Paradigms
Explore how pretraining shapes foundation models, covering masked, causal, and contrastive learning techniques.
Modern foundation models such as GPT can seem mysterious at first. You might hear terms like masked language modeling, autoregressive next-token prediction, or contrastive learning, and wonder whether they describe how the model is trained or how it makes predictions. The truth is that the pretraining task shapes both:
The training process (how the model’s parameters are optimized)
The final behavior (how the trained model naturally generates text)
We’ll dissect these two facets, explaining why the pretraining objective is the core of how the model is trained and how it leads to the model’s eventual capabilities at inference.
How does pretraining define both training and inference?
A great example comes from the GPT series. During pretraining, GPT learns with an autoregressive objective: given all previous tokens, predict the next one. For example, if the input is “The cat sat on the”, the model must guess “mat.” This process is repeated billions of times, comparing predictions to true tokens and adjusting parameters to reduce errors.
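In code, this objective is simply next-token cross-entropy on shifted sequences. Below is a minimal sketch using PyTorch with a toy stand-in model; the layer sizes and random token ids are illustrative assumptions, not GPT’s real architecture or tokenizer.

```python
# Minimal sketch of the autoregressive (next-token prediction) objective.
# A tiny stand-in model is used here; real GPT training applies the same
# shifted-target cross-entropy, just with a Transformer and at vast scale.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32          # hypothetical toy sizes
model = nn.Sequential(                 # stand-in for a Transformer decoder
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

# A batch of token ids, standing in for "The cat sat on the mat"
tokens = torch.randint(0, vocab_size, (1, 6))

inputs  = tokens[:, :-1]               # "The cat sat on the"
targets = tokens[:, 1:]                # "cat sat on the mat" (shifted by one)

logits = model(inputs)                 # (batch, seq_len - 1, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),    # prediction at every position
    targets.reshape(-1),               # the true next token at each position
)
loss.backward()                        # gradients nudge parameters to reduce the error
```

Every position in the sequence contributes a prediction error, so a single pass over a document yields many training signals; repeated over billions of tokens, this is the entire pretraining loop.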
The same autoregressive mechanism drives inference. If you prompt GPT with “Translate this sentence into French: I love cats”, it again predicts the next most likely token. Because it has seen translation patterns in training, the continuation becomes the French translation, produced one token at a time.
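Inference reuses the training objective in a loop: score the sequence, pick a likely next token, append it, and repeat. Here is a minimal greedy-decoding sketch; the model and token ids are toy stand-ins (no real tokenizer or GPT checkpoint), following the same shape conventions as the training sketch above.

```python
# Minimal sketch of autoregressive inference: the same next-token mechanism,
# applied one step at a time with a toy stand-in model.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=10):
    tokens = prompt_ids.clone()                           # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(tokens)                            # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy: most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)      # append it and predict again
    return tokens

prompt_ids = torch.randint(0, vocab_size, (1, 8))         # stands in for the encoded prompt
completion = generate(model, prompt_ids)                  # prompt plus generated continuation
```

Real systems typically sample from the predicted distribution (temperature, top-k, nucleus sampling) rather than always taking the argmax, but the loop itself is the same next-token prediction learned during pretraining.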