
How Do Models Learn?

Explore how foundation models acquire intelligence through pretraining methods including supervised, unsupervised, and self-supervised learning. Understand how these techniques enable AI to recognize patterns, create rich data representations, and perform effectively across language, image, and audio data.

Have you ever wondered how these foundation models become so intelligent in the first place? They aren’t born understanding language or recognizing images, right? Instead, they go through an initial phase called pretraining—AI’s equivalent of foundational education. Let’s dive deep into how this foundational education happens and why it matters.

We’ll briefly introduce the landscape of pretraining methods for modern AI and see how models like GPT rely on large-scale pretraining to understand language. First, let’s step back and explore how to train a foundation model for images, text, audio, or a combination of all three. Think of it like hiring three robot chefs to work in your restaurant kitchen:

  • The first robot attended culinary school, carefully following labeled recipes with step-by-step instructions.

  • The second robot never had formal instruction; instead, it studied countless cookbooks to find common cooking patterns.

  • The third robot had no instructions. It experimented by cooking randomly, tasting the results, and learning what worked best.

These robots perfectly represent AI’s three main pretraining paradigms: supervised learning, unsupervised learning, and self-supervised learning. Let’s understand how exactly these models learn.

What does it mean to train a model?

When we say we’re “training a model,” we mean teaching a computer to recognize patterns from data. A model starts off knowing nothing (random parameters), and as it sees more examples, it refines its internal “brain”—the weights and biases that define its understanding.

Showing an image of a cat
Imagine teaching a child what a cat looks like. You show pictures, correct mistakes, and over time, they learn to recognize cats by spotting patterns like whiskers or tails. Computers learn in a similar way, but with math and data. They start with random guesses, compare them to the correct labels, and adjust their internal weights each time they’re wrong. After thousands or millions of examples, they get very good at recognition.

This is supervised learning, where a model is trained using labeled data. But labeling huge datasets is costly and prone to bias. If a model only sees cats in sunny rooms, it may struggle to recognize one outside at night. To function effectively, datasets must be large, diverse, and representative of real-world scenarios.
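To make the supervised loop concrete, here is a minimal sketch of a one-feature linear classifier trained on labeled examples. The data, feature, and learning rate are all illustrative; real models have millions of weights, but the cycle is the same: guess, compare to the label, adjust when wrong.

```python
# Minimal sketch of supervised learning: a 1-feature linear classifier.
# All data and names here are illustrative.

def train(examples, epochs=100, lr=0.1):
    """Learn a weight w and bias b so that sign(w*x + b) matches each label."""
    w, b = 0.0, 0.0  # the model starts with no knowledge
    for _ in range(epochs):
        for x, label in examples:          # label is +1 or -1
            pred = 1 if w * x + b > 0 else -1
            if pred != label:              # adjust weights only when wrong
                w += lr * label * x
                b += lr * label
    return w, b

# Toy labeled data: "whisker length" -> cat (+1) or not-cat (-1)
data = [(3.0, 1), (2.5, 1), (0.5, -1), (0.8, -1)]
w, b = train(data)
print(all((1 if w * x + b > 0 else -1) == y for x, y in data))  # prints True
```

Notice that the labels do all the teaching here: without the +1/-1 annotations, the update rule has nothing to compare against, which is exactly why labeling cost limits this paradigm.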

What is unsupervised learning?

Most of the world’s data isn’t labeled—think of the billions of images, videos, and posts online. Manually tagging them all would be impossible. Unsupervised learning addresses this by allowing AI to discover patterns in unlabeled data without explicit guidance. Instead of being told “this is a cat” or “this is a dog,” the model groups similar data points together—like clustering faces that look alike or discovering topics in a collection of articles. In short, it’s about uncovering hidden structure in raw data.

Unsupervised learning

Let’s revisit our kid who loves animals, but this time, they’re facing a big pile of photos with no labels at all.

The kid starts grouping pictures based on similarities: animals with whiskers and pointy ears in one pile, long-eared hoppers in another, and floppy-eared barkers in a third. At this point, they don’t know the names of these animals; they’ve just clustered them by appearance. Later, someone points to one pile and says, “Those are cats.” Instantly, the kid now knows what “cats” look like, even though no one labeled the photos beforehand.

This is the essence of unsupervised learning: the model looks at unlabeled data, finds similarities, and clusters them into groups. It is powerful because there is an endless supply of unlabeled data, and it can even uncover hidden or surprising patterns, such as new customer segments in business data.
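The grouping-by-similarity idea can be sketched with k-means, a classic clustering algorithm. This is a minimal one-dimensional version with illustrative data and starting centers; note that no labels appear anywhere, only the points themselves.

```python
# Minimal sketch of unsupervised learning: 1-D k-means clustering.
# No labels are given; points are grouped by similarity alone.
# (Data and initial centers are illustrative.)

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster
        clusters = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        centers = [sum(pts) / len(pts) if pts else centers[c]
                   for c, pts in clusters.items()]
    return centers, clusters

# "Ear length" measurements with no animal names attached
points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
centers, clusters = kmeans(points, centers=[0.0, 6.0])
print(sorted(round(c, 2) for c in centers))  # prints [1.03, 5.03]
```

Two groups emerge on their own, just like the kid's piles of photos. Only afterward could a human point at a cluster and name it, and, as the text notes, the single "ear length" feature means the grouping is only as deep as the feature it measures.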

However, there is a catch: these clusters are often based on surface-level features. Just as the kid might think only floppy-eared animals are dogs (and miss a German shepherd with pointy ears), AI models may learn shallow representations without fully grasping the deeper meaning of the data.

What we really want is for the model to capture the deeper essence of data through rich representations, not just memorize surface details. In our animal example, this means the kid doesn’t rely only on ears or whiskers but also understands facial structure, body proportions, behaviors, and environments. With this deeper knowledge, they can still identify a cat even if it is in a strange pose, partly hidden, or shown in poor lighting.

In AI, rich representations enable models to generalize more effectively, allowing them to recognize patterns in new data they have never encountered before. This leads to more reliable and accurate performance in the real world. To overcome the limits of unsupervised methods, researchers developed self-supervised learning: a breakthrough that pushes models toward learning richer, more meaningful representations.

What is self-supervised learning?

In self-supervised learning, the model doesn’t rely on human-provided labels. Instead, it generates its own labels by hiding parts of the data and challenging itself to predict the missing pieces. Think of a curious child with a book full of animal pictures—cats, dogs, rabbits—but no names attached. Rather than giving up, the child invents clever games to learn:

  • The guessing game: Cover part of a picture and try to guess the animal from the visible clues.

  • The puzzle game: Cut the picture into pieces and reassemble it, noticing how tails, bodies, and faces fit together.

  • The match-and-mix game: Shuffle animal parts and reconstruct the correct animals from the pile.

By playing these games, the child develops a deep sense of what animals look like—the shapes of cat eyes, the structure of dog paws, the proportions of rabbit ears. Even without labels, they build strong internal representations that let them recognize animals instantly when someone finally says, “This is a cat.”

Self-supervised learning works the same way for AI. By predicting hidden aspects of the data, models build rich, nuanced representations that make them powerful at recognizing, classifying, and even generating new data later.

AI models learn through self-supervised learning in much the same way as the child in our analogy—by inventing internal “games” rather than relying on human labels. In practice, these games look like this:

  • Masked language modeling (MLM): Certain words in a sentence are hidden, and the model must guess them from context. For example: “The quick [MASK] fox jumps” → the model predicts brown.

  • Causal language modeling (autoregressive): The model predicts the next word in a sequence. For example: given “The quick brown…” → it predicts fox.

  • Contrastive learning (common in vision models): The model matches related pairs, such as an image and its caption, learning which description belongs to which picture.

By repeatedly solving these tasks, models develop strong internal representations. They move beyond surface clues and instead capture deeper structures like grammar, meaning, visual features, and context. These representations form the backbone of modern generative AI, enabling models not only to recognize patterns but also to create fluent and coherent outputs.
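The causal language modeling game above can be sketched with a toy next-word predictor built from word-pair counts. Real models use neural networks rather than counts, but the self-supervised objective is the same: the raw text itself supplies the "label" (the actual next word), so no human annotation is needed. The tiny corpus below is illustrative.

```python
# Toy sketch of causal language modeling: predict the next word from
# counts of adjacent word pairs seen in a tiny, unlabeled corpus.
from collections import Counter, defaultdict

corpus = "the quick brown fox jumps over the lazy dog . the quick brown fox sleeps ."

# Build next-word counts directly from raw text -- the data labels itself
follows = defaultdict(Counter)
words = corpus.split()
for w, nxt in zip(words, words[1:]):
    follows[w][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation observed after `word`."""
    return follows[word].most_common(1)[0][0]

print(predict_next("brown"))  # prints fox
```

Masked language modeling works analogously: hide a word in the middle of a sentence and score candidates using context from both sides instead of only the left.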

Why is self-supervised learning so powerful?

Self-supervised learning (SSL) has become the driving force behind modern AI because of its unique strengths:

  • No costly labels: Models train directly on raw data like books, images, and audio, without needing expensive human annotations.

  • Highly scalable: Unlabeled data is abundant, so models can continuously improve as more data is added.

  • Rich representations: Unlike simple clustering, SSL tasks push models to learn deep, generalizable features that capture context and nuance.

  • Backbone of foundation models: These rich features make SSL the foundation of today’s general-purpose AI systems, from language models like GPT to multimodal systems that handle text, images, and sound.

Here’s a brief summary of what we discussed:

| Learning Paradigm | Example (Child Analogy) | How the Model Learns | Advantages | Limitations |
| --- | --- | --- | --- | --- |
| Supervised | Is shown labeled pictures: “This is a cat!” | Receives explicit feedback from labels | Very accurate for clearly defined tasks | Expensive labels, hard to scale |
| Unsupervised | Groups unlabeled pictures by similarity | Finds patterns or clusters in the data | Easily scalable, finds unknown patterns | Learns shallow, surface-level patterns |
| Self-Supervised | Plays games by hiding or rearranging data | Creates its own tasks to predict hidden information | Deep, rich understanding without labels | Pretext tasks are complex to design well |

Now that we understand how models can learn from data without explicit labels, let’s explore how this is applied in real-world pretraining tasks.