Pretraining Paradigms

Explore the core pretraining paradigms used in foundation models, including autoregressive next-token prediction, masked language modeling, and contrastive learning. Understand how these methods optimize training and influence model inference, enabling versatile applications across text, vision, and speech domains. Gain insight into the workflow and benefits of large-scale self-supervised pretraining.

Modern foundation models, such as GPT, can seem mysterious at first. You might hear terms like masked language modeling, autoregressive next-token prediction, or contrastive learning and wonder: do these describe how the model is trained or how it makes predictions? The answer is both. The pretraining task shapes:

  • The training process (how the model’s parameters are optimized)

  • The final behavior (how the trained model naturally generates text)

We’ll dissect these two facets, explaining why the pretraining technique is the core of how the model is trained and how that leads to the model’s eventual capabilities at inference.

How does pretraining define both training and inference?

A great example comes from the GPT series. During pretraining, GPT learns with an autoregressive objective: given all previous tokens, predict the next one. For example, if the input is “The cat sat on the”, the model must guess “mat.” This process is repeated billions of times, comparing predictions to true tokens and adjusting parameters to reduce errors.

The same autoregressive mechanism drives inference. If you prompt GPT with “Translate this sentence into French: I love cats”, it again predicts a likely next token, appends it to the input, and repeats. Because it has seen translation patterns in training, the continuation becomes “J’aime les chats.”

In short, GPT learns and generates through the same unified process: predict the next token autoregressively. This simplicity is what makes it powerful and versatile.
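
To make the objective concrete, here is a minimal sketch of how one sentence turns into next-token training examples. The helper name `next_token_pairs` is hypothetical, and splitting on whitespace is an assumption standing in for a real subword tokenizer.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "The cat sat on the mat".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A single six-token sentence already yields five training examples, and the last one is exactly the “The cat sat on the” → “mat” case described above.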

Role of loss functions and optimization during pretraining

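A model like GPT doesn’t know if its guesses are right or wrong: it needs a loss function to measure errors. In next-token prediction, the model outputs a probability for every token in its vocabulary, and the loss (typically cross-entropy) is the negative log of the probability it assigned to the true token. The less probability the model gives the true token, the higher the loss; confident, correct predictions drive the loss toward zero. This provides implicit feedback: the training text itself supplies the “right answer.”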
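
As a rough illustration, the sketch below computes a cross-entropy loss for one next-token prediction. The three-word vocabulary and the raw scores (logits) are made-up values chosen for the example, not outputs of any real model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Cross-entropy: negative log-probability assigned to the true token."""
    return -math.log(softmax(logits)[target_index])


vocab = ["mat", "dog", "moon"]       # toy vocabulary (assumption)
true_token = vocab.index("mat")

confident_right = next_token_loss([4.0, 0.5, 0.1], true_token)
confident_wrong = next_token_loss([0.1, 4.0, 0.5], true_token)

print(f"loss when 'mat' is favored: {confident_right:.3f}")
print(f"loss when 'dog' is favored: {confident_wrong:.3f}")
```

The second loss is much larger than the first, which is the signal the optimizer uses to shift probability toward the true token.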

Backpropagation computes the gradient of the loss with respect to every parameter, and the optimizer (often Adam, an adaptive variant of stochastic gradient descent) uses those gradients to adjust millions or billions of parameters so the loss goes down. Each update is a tiny nudge in the right direction, and repeated over vast datasets, these updates gradually teach the model grammar, facts, and reasoning patterns.
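
The following sketch shows what one Adam update looks like, applied to a toy two-parameter problem rather than a language model. The function name `adam_step`, the quadratic loss, and the learning rate of 0.1 are assumptions made for illustration; the beta and epsilon values are the commonly used defaults.

```python
import math


def adam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter list."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Running averages of the gradient and of the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


# Toy loss: squared distance to a target point (stands in for the real loss).
target = [3.0, -1.0]


def loss(params):
    return sum((p - t) ** 2 for p, t in zip(params, target))


params = [0.0, 0.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
initial_loss = loss(params)

for _ in range(500):
    grads = [2 * (p - t) for p, t in zip(params, target)]  # analytic gradient
    params = adam_step(params, grads, state)

final_loss = loss(params)
print(f"loss went from {initial_loss:.3f} to {final_loss:.6f}")
```

In real pretraining the gradients come from backpropagation through the network, but the update applied to each parameter has this same shape.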

This loop (predict, measure loss, adjust, repeat) is the engine of pretraining. The benefit of this approach is flexibility: by training on huge amounts of diverse text, GPT acquires broad knowledge that it can transfer to many tasks. With only minor fine-tuning, it can adapt to tasks such as translation, summarization, or creative writing far more efficiently than a model trained from scratch.

What are the most used pretraining paradigms in foundation models?

We’ve already seen how GPT’s autoregressive (causal) language modeling shapes both its training and inference. But autoregression is only one paradigm. To fully understand foundation models, let’s explore the main approaches that have shaped their development.

Autoregressive (causal) modeling

In causal language modeling (CLM), a model predicts the next token in a sequence based on all the tokens that came before it. No looking ahead is allowed.
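
In transformer models, the “no looking ahead” rule is commonly enforced with a causal attention mask. The sketch below builds such a mask as a plain nested list; the function name `causal_mask` is hypothetical, and real frameworks use tensors instead.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "on"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can see {visible}")
```

Each position sees itself and everything to its left, so the prediction for the next token never depends on tokens that come later.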

Historically, researchers relied on n-gram models, which handled short contexts but failed with large vocabularies and long dependencies. Transformers, the architecture behind GPT, overcame this by attending over the entire preceding context and by scaling to billions of parameters.

One key strength of autoregressive modeling is its training–inference unity. At training, the model predicts the next token; at inference, it does the same, simply continuing a user’s prompt. This makes GPT highly versatile: by observing diverse text during training, it can perform tasks like translation or summarization without being explicitly trained for them.
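
To see this unity in miniature, the sketch below uses a bigram counting model as a deliberately simple stand-in for a transformer (an assumption made to keep the example short). Training counts which token follows which; generation reuses the same next-token prediction in a loop.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Training: count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens=5):
    """Inference: repeatedly append the most frequent next token (greedy)."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:  # token never seen with a successor in training
            break
        output.append(followers.most_common(1)[0][0])
    return output


corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
model = train_bigram(corpus)

print(" ".join(generate(model, ["the"], max_new_tokens=3)))  # the cat sat on
```

A GPT-style model replaces the count table with a neural network and conditions on the whole prompt rather than just the last token, but the generation loop is the same idea.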

Masked language modeling (MLM)

While autoregressive models condition only on the tokens to the left, masked language modeling (MLM) uses context from both directions. In this setup, a random subset of tokens in a sentence (about 15% in BERT) is masked, and the model must guess the missing pieces using surrounding words.

This approach was popularized by BERT (2018). Inspired by denoising autoencoders, MLM forces the model to reconstruct text where words are hidden. For example:

  • Input: “The quick [MASK] fox jumps.”

  • Prediction: “brown”.
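
A simplified sketch of the masking step might look like this. The function name `mask_tokens` and the seed are hypothetical, and the sketch always substitutes `[MASK]`, whereas BERT sometimes keeps the original token or swaps in a random one.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; return masked inputs and labels.

    labels[i] holds the original token where a mask was applied, else None,
    so the loss is computed only on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "The quick brown fox jumps".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The model receives the masked sequence and is scored only where `labels` is not `None`, which is how it learns to fill gaps from both left and right context.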

Because BERT can see both left and right context, it learns deeply contextual word embeddings. This makes it especially strong in reading comprehension, search, and classification tasks, although it is less suited for generation.

Contrastive learning

Many real-world tasks involve multiple modalities, like images paired with text. Contrastive learning aligns these by bringing related items close in embedding space and pushing unrelated items apart.

A major breakthrough was OpenAI’s CLIP, which trains an image encoder and a text encoder together. If an image is captioned “a cat sitting on a mat,” their embeddings are pulled close; a mismatched caption pushes them apart. This process uses naturally available data (captions, alt-text) instead of manually labeled data.
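
The pull-together, push-apart idea can be sketched as a symmetric contrastive loss over a small batch. The two-dimensional embeddings below are made-up numbers, and the fixed temperature of 0.07 is an assumption for illustration; this is a sketch of the general technique, not CLIP's actual code.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Pair i of images and texts should score higher than every mismatch."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature
             for txt in texts] for img in images]
    # Each image must pick out its caption, and each caption its image.
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_embs, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(image_embs, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is near zero when each image sits closest to its own caption and large when the captions are swapped, so minimizing it pulls matching pairs together.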

Contrastive learning has roots in metric learning and Siamese networks, but scaled transformers and web-scale datasets unlocked its true power. Today, it underpins multimodal models capable of linking vision, text, and more.

Other pretraining paradigms

Beyond causal, masked, and contrastive strategies, speech-based and diffusion-based paradigms stand out:

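  • Speech-based objectives: Models like wav2vec 2.0 and HuBERT mask parts of the audio sequence and learn to predict them, building speech representations from unlabeled recordings that transfer to transcription and recognition. OpenAI’s Whisper, by contrast, is trained on large-scale, weakly labeled audio-transcript pairs rather than with a masked objective.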

  • Diffusion-based training: Models such as DALL·E 2 and Stable Diffusion generate images by progressively denoising random noise, a process well-suited for high-fidelity generation in vision tasks.

These approaches extend beyond text, demonstrating how self-supervised objectives can be adapted to various data types.
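
For the diffusion case, training examples are commonly built by corrupting clean data with noise and asking the model to predict that noise. The sketch below shows only this noising step, assuming the common DDPM-style formulation; the name `add_noise`, the four “pixel” values, and the `alpha_bar` settings are illustrative.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Blend clean data with Gaussian noise.

    alpha_bar near 1 keeps mostly signal; near 0 leaves mostly noise.
    Returns the noisy sample and the noise that was added (the usual
    training target for the denoising network).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(clean, noise)
    ]
    return noisy, noise


rng = random.Random(0)
pixels = [0.2, -0.5, 0.9, 0.0]  # stand-in for image pixel values

slightly_noisy, _ = add_noise(pixels, alpha_bar=0.99, rng=rng)
mostly_noise, target_noise = add_noise(pixels, alpha_bar=0.01, rng=rng)
print(slightly_noisy)
print(mostly_noise)
```

Like next-token prediction, the supervision comes for free: the noise is generated by the training code itself, so no human labels are needed.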

What does the pretraining workflow look like?

Regardless of paradigm, most foundation models follow a similar pipeline:

  1. Data gathering: Collect large, diverse datasets (text, images, audio, code).

  2. Preprocessing: Clean, normalize, and tokenize data into usable chunks.

  3. Model initialization: Start with random weights.

  4. Self-supervised task: Choose the objective, such as next-token prediction (CLM), masked-token prediction (MLM), or multimodal matching (contrastive).

  5. Loss calculation: Compare predictions to actual data; calculate error.

  6. Parameter updates: Use optimizers like Adam to nudge weights toward better predictions.

  7. Repeat at scale: Billions of steps over massive datasets forge strong, generalizable representations.
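
The steps above can be sketched end to end on a toy scale. Everything here is an assumption chosen for brevity: a two-sentence corpus, whitespace tokenization, a one-matrix model that predicts the next token from the previous token only, and plain gradient descent in place of Adam.

```python
import math
import random

# Steps 1-2: gather and preprocess data (toy corpus, whitespace tokens).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
ids = [vocab.index(token) for token in corpus]
V = len(vocab)

# Step 3: initialize the model with small random weights.
# weights[prev][nxt] is the score for token nxt following token prev.
rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Step 5: mean cross-entropy over all (previous, next) pairs."""
    pairs = list(zip(ids, ids[1:]))
    return sum(-math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)


initial_loss = average_loss()
learning_rate = 0.5

for epoch in range(200):                      # Step 7: repeat
    for prev, nxt in zip(ids, ids[1:]):       # Step 4: next-token task
        probs = softmax(weights[prev])
        for j in range(V):
            # Gradient of cross-entropy w.r.t. each score: prob - one_hot.
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            weights[prev][j] -= learning_rate * grad   # Step 6: update

final_loss = average_loss()
sat_row = weights[vocab.index("sat")]
predicted = vocab[sat_row.index(max(sat_row))]
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}; after 'sat' comes {predicted!r}")
```

The loss cannot reach zero here because “the” is followed by four different words in the corpus, a small reminder that the model learns a distribution over continuations rather than a single answer.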

Through large-scale self-supervised pretraining, models learn rich and flexible patterns that apply across text, images, and speech. Massive, diverse datasets allow them to capture subtle nuances, generalize well, and even absorb traits like humor or cultural references without explicit teaching. This broad understanding forms the foundation, while post-training (fine-tuning or adaptation) refines the model for specific tasks such as translation, question answering, or image generation. Together, these stages define the power of today’s foundation models.