
Understanding Vision Models: How AI Learns to See


Explore how vision transforms AI models, how they are trained, and what new experiences they enable.


We’ve seen how powerful AI is with language: writing, answering questions, and chatting naturally. But think about your daily life. How much of your experience is just words?

Not much. You rely heavily on vision: recognizing faces, moving around, enjoying a sunset. Vision is central to how you understand the world.

That is where vision models come in. Just like language models handle text, vision models let AI see, interpret, and even generate images. They allow AI to visually understand the world, which is why vision is such a big deal in AI.

Why is vision so important in AI?

Let’s first quickly clarify: What is an image?

An image is simply a grid of pixels—tiny dots, each holding color information. Imagine a huge mosaic made up of thousands of tiny, colored tiles. Humans instantly recognize what the mosaic represents (say, a cat or a sunset). But to a computer, these are just numbers.
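To make the "grid of numbers" idea concrete, here is a minimal sketch in Python: a toy 2×2 image represented as rows of pixels, each pixel an (R, G, B) triple of values from 0 to 255. The values are illustrative only; real photos have thousands of pixels per side.

```python
# A tiny 2x2 "image": a grid of pixels, each an (R, G, B) triple in 0-255.
image = [
    [(255, 0, 0),   (0, 255, 0)],      # row 0: a red pixel, a green pixel
    [(0, 0, 255),   (255, 255, 255)],  # row 1: a blue pixel, a white pixel
]

height = len(image)      # number of rows
width = len(image[0])    # number of pixels per row

print(height, width)     # 2 2
print(image[0][0])       # (255, 0, 0) -- to a computer, just three numbers
```

A human looking at a rendering of this grid sees colors at a glance; the computer only sees the triples of integers. Everything a vision model does starts from grids like this one, just vastly larger.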

Teaching AI to interpret these numerical patterns visually is a game changer. Here’s why:

  • Vision dominates how we navigate and understand our world. If AI is to assist us effectively, it needs vision, too. Imagine a robot that sees obstacles or a medical AI that spots tumors on scans better than humans can.

  • The world is filled with visual data—from your smartphone gallery of pet photos to medical X-rays and satellite images. Processing this huge amount of visual data efficiently can help AI solve important real-world problems.

Vision-based foundation models, trained on billions of images, are at the forefront of this visual revolution. Let’s uncover exactly how these models do their magic.

What is the vision transformer?

We’ve talked about transformers—the powerful models behind language AI that can chat, write essays, and even create stories. Transformers are fantastic at understanding sequences—words in a sentence, for example. But could transformers also learn to understand images? At first glance, it seems tricky. Images aren’t words, after all—they’re pictures!

But imagine for a moment that we could turn images into something that transformers can naturally process—something like ...