How Microsoft’s WHAMM Uses AI to Render Gameplay in Real Time

Microsoft’s WHAMM model renders real-time game frames using AI instead of a graphics engine. Trained on just one week of Quake II footage, it predicts each frame from player input, opening the door to a future of interactive, AI-generated worlds.
10 mins read
May 12, 2025

Traditional game graphics are built through simulation: every shadow is calculated, every polygon is mapped, and every collision is modeled using hardcoded rules.

But what if we skipped the simulation entirely?

What if, instead of calculating every detail through physics and rendering engines, an AI model could predict the game’s next frame, based purely on learned patterns from gameplay?

That’s the promise of a new direction in Generative AI: systems that don’t just create static content like art or text, but generate dynamic, interactive environments. At the cutting edge of that idea is WHAMM, a project from Microsoft Research.

Today, I'll share:

  • What WHAMM is (and why it matters)

  • How WHAMM works under the hood

  • WHAMM vs. WHAM: A performance leap

Let's get started.

What is WHAMM?#

WHAMM stands for World and Human Action MaskGIT Model. At its core, it is a generative transformer trained to render video game frames based entirely on user input, such as moving, turning, or shooting. It does this without relying on a traditional rendering engine.

Conventional games rely on sophisticated rendering engines — like Unreal Engine or Unity — to simulate lighting, motion, texture, and physics, frame by frame in real time.

WHAMM replaces these simulations with a process known as neural inference. Rather than calculating each frame by applying physical rules, it relies on patterns learned from previous gameplay to predict the next visual frame based on the player’s actions.

Imagine controlling a character in a game, and instead of your inputs being processed by a graphics engine, they are interpreted by an AI model. The model effectively thinks, “Based on everything I’ve seen before, here’s what that next frame probably looks like.”

That is what WHAMM does, and it does it in real time.
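To make that concrete, here is a minimal sketch of what such a loop might look like in Python. The names (`play_loop`, `predict_next_frame`, `read_input`, `display`) are hypothetical placeholders, not Microsoft’s actual API; the point is simply that a learned model sits where the rendering engine used to be.

```python
from collections import deque

def play_loop(model, read_input, display, initial_frames, context_len=9):
    """Minimal sketch of an AI-rendered game loop (hypothetical API).

    Instead of a graphics engine, a learned model predicts each new frame
    from a short history of recent frames plus the player's current input.
    """
    context = deque(initial_frames, maxlen=context_len)  # rolling ~0.9 s of frames

    while True:
        action = read_input()        # e.g. {"move": "forward", "turn": -5, "fire": False}
        if action is None:           # convention for "quit" in this sketch
            break

        # No physics, no rasterization: the model predicts the next frame directly.
        frame = model.predict_next_frame(list(context), action)

        display(frame)
        context.append(frame)        # the oldest frame falls out of the window
```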

Let’s take a look. In the clip below, every flick of movement, every explosion, and every interaction is generated live. There is no 3D engine, no hand-coded shaders. It is just pure Generative AI producing the scene as you play.

Why this matters#

Rendering dynamic, high-speed visuals on the fly is incredibly hard. Until now, Generative AI models have mostly been used for non-interactive content: images, videos, or short clips. But video games are fast, interactive, and constantly changing. Creating a convincing, playable game world using only AI is new.

To be playable, a model must not only generate convincing images — it must predict what happens next, over and over again, fast enough to keep up with the player.

So to test WHAMM’s real-time capabilities, Microsoft picked a game that would push the model to its limits.

Why Quake II is the perfect test#

To showcase WHAMM’s real-time rendering capabilities, Microsoft turned to a classic: Quake II.

Released in 1997 by id Software, Quake II is a groundbreaking first-person shooter (FPS). It featured fully 3D environments, fast-paced combat, and fluid player movement, all rendered in real time on the hardware of the late ’90s.

For modern AI researchers, it offers something else: the perfect stress test.

A frame from the Quake II game

Here’s why:

  • High-speed gameplay: Quake II is a twitch shooter. The player constantly moves, jumps, strafes, and reacts to threats. This is unlike generating a still image or a slow-paced RPG (a role-playing game where the action unfolds gradually through turn-based combat, exploration, or dialogue rather than fast reflexes, such as Divinity: Original Sin 2), where visuals change gradually. In a game like Quake II, every frame must reflect fast, precise actions in real time.

  • Constant perspective shifts: The camera isn’t static. It constantly rotates and repositions based on mouse movement, making frame prediction harder than fixed-perspective games.

  • Complex frame-to-frame dynamics: Lighting, motion blur, and weapon effects change dramatically in milliseconds. A generative model has to anticipate and replicate these changes seamlessly.

Even though the textures are low-resolution by today’s standards, the task is still incredibly difficult for a neural network. It’s not about how realistic the graphics are; it’s about how quickly and correctly the AI can generate the next frame.

How WHAMM works#

Let’s break down what WHAMM is doing and how it turns user actions into playable, AI-rendered video.

WHAMM is a transformer-based generative model trained to render real-time video game frames based on live player input. It uses an architecture called MaskGIT (Masked Generative Image Transformer), which was originally developed for fast image generation.

MaskGIT breaks each video frame into small visual patches, or tokens, and then masks some of them out. The model learns to predict the missing patches in parallel, based on what’s happening in the game and what the player is doing.

This parallel prediction allows WHAMM to keep up with gameplay, generating more than 10 frames per second, even during fast camera movement and continuous input.

Player input is sent to the WHAMM model, which uses the MaskGIT architecture to predict the next frame by filling in masked regions. The result is a complete, AI-generated gameplay frame rendered in real time.

Here’s what that means in practice:

Like solving a puzzle#

Every time the player moves, jumps, or fires a weapon, WHAMM takes the next frame, masks out parts of it, and then predicts what those missing regions should look like. It’s like solving a puzzle where the model understands the game world’s rules.

Instead of producing the frame pixel by pixel, WHAMM predicts many visual tokens simultaneously. This makes it fast enough to support real-time play without relying on physics simulation or traditional rendering techniques.

You can think of it as asking the model: “You’ve seen how this game usually looks. Based on what’s happening right now, what should appear next?”
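As a rough illustration of that mask-and-fill step, here is a toy NumPy sketch. The tokenizer and the `predict_masked_tokens` callback (assumed to return a grid of token IDs with the same shape) are stand-ins for WHAMM’s real components; what matters is that every hidden position is filled in one parallel prediction rather than one token at a time.

```python
import numpy as np

def masked_fill_step(token_grid, predict_masked_tokens, mask_ratio=0.5,
                     mask_id=-1, rng=None):
    """Toy sketch of one MaskGIT-style step on a frame's token grid.

    Hide a fraction of the visual tokens, then let the model fill in
    *all* hidden positions in a single parallel prediction.
    """
    rng = rng or np.random.default_rng()

    masked = token_grid.copy()
    hidden = rng.random(token_grid.shape) < mask_ratio  # pick positions to hide
    masked[hidden] = mask_id                            # overwrite with a MASK token

    # One forward pass predicts every hidden token at once (in the real model,
    # this is conditioned on the player's input and the previous frames).
    predictions = predict_masked_tokens(masked)

    filled = token_grid.copy()
    filled[hidden] = predictions[hidden]                # keep known tokens, fill the rest
    return filled
```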

Why this works: The MaskGIT architecture#

WHAMM’s real-time performance depends on MaskGIT, the architecture behind its frame generation. MaskGIT predicts many missing parts of a scene in parallel, unlike older models that generate pixels or tokens one at a time in sequence (a method called autoregressive generation).

In simple terms:

  • Autoregressive means slow, one token at a time

  • MaskGIT means fast, lots of tokens at once

This allows WHAMM to keep up with gameplay, generating over 10 frames per second, even as the scene changes rapidly.
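The difference shows up most clearly in how many sequential model calls each frame needs. The counts below are illustrative assumptions, not WHAMM’s actual figures:

```python
# Illustrative only: the token and pass counts below are assumptions.
tokens_per_frame = 360        # a frame tokenized into a grid of visual patches

# Autoregressive decoding: one sequential model call per token.
autoregressive_steps = tokens_per_frame               # 360 sequential steps per frame

# MaskGIT-style decoding: a few refinement passes, each filling many tokens at once.
maskgit_passes = 8
maskgit_steps = maskgit_passes                        # 8 sequential steps per frame

print(f"{autoregressive_steps / maskgit_steps:.0f}x fewer sequential steps")  # 45x
```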

How was WHAMM trained?#

WHAMM was trained on a curated dataset of one week of recorded Quake II gameplay. From this footage, the model learned to associate player inputs such as movement, turning, and shooting with the visual frames that typically follow.

This training allowed WHAMM to generate new frames based on learned input-output patterns, without requiring hand-coded assets or simulation logic.
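Conceptually, that training data is just a long series of (recent frames + current action) → next frame examples sliced out of the recorded footage. A hedged sketch of how such pairs might be assembled:

```python
def make_training_pairs(frames, actions, context_len=9):
    """Toy sketch: slice recorded gameplay into supervised examples.

    frames[t]  -- the frame shown at step t
    actions[t] -- the player's input recorded at step t
    Each example pairs a short frame history plus the current action with
    the frame that actually followed, which becomes the prediction target.
    """
    examples = []
    for t in range(context_len - 1, len(frames) - 1):
        context = frames[t - context_len + 1 : t + 1]  # the last few frames, up to time t
        action = actions[t]                            # what the player did at time t
        target = frames[t + 1]                         # the frame that came next
        examples.append((context, action, target))
    return examples
```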

From WHAM to WHAMM: What’s the breakthrough?#

Before WHAMM, Microsoft released an earlier version of the model called WHAM-1.6B. It had the same goal: generate game frames based on player input. But it had one major problem: speed.

WHAM-1.6B used an autoregressive approach, generating one image token at a time in sequence. While it could produce realistic results, it was painfully slow, generating maybe one frame per second.

You can imagine how that plays out in a fast-paced shooter like Quake II. If the screen updates once a second, it’s no longer a game; it’s a slideshow.

WHAMM solves that problem by switching to a parallelized generation architecture based on MaskGIT. Instead of generating image tokens individually, WHAMM simultaneously fills in multiple visual regions, allowing it to render frames at interactive speeds (10+ FPS).

And that change makes all the difference.
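In frame-budget terms, the rough arithmetic below (using the approximate frame rates reported for the two models) shows why one feels like a slideshow and the other feels playable:

```python
# Illustrative frame-budget arithmetic based on the approximate frame rates above.
wham_fps = 1
whamm_fps = 10

wham_budget_ms = 1000 / wham_fps      # ~1000 ms per frame: a slideshow
whamm_budget_ms = 1000 / whamm_fps    # ~100 ms per frame: fast enough to feel interactive

print(wham_budget_ms, whamm_budget_ms)  # 1000.0 100.0
```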

WHAM vs. WHAMM

| Feature | WHAM-1.6B | WHAMM |
| --- | --- | --- |
| Generation style | Autoregressive | Masked parallel (MaskGIT-style) |
| FPS (frames/sec) | ~1 | 10+ (real-time capable) |
| Context length | ~1 second | Still ~0.9 seconds, but with faster inference |
| Resolution | 300×180 | 640×360 (2x detail) |

Playing a game rendered by WHAMM#

The WHAMM demo, available in-browser via Microsoft’s Copilot Labs, lets users play a simplified version of Quake II using standard keyboard and mouse controls. What makes this experience unique is how it’s rendered: not by a game engine, but by a Generative AI model predicting every frame in real time.

As you move, WHAMM observes your inputs, such as turning, jumping, and firing, and uses its trained model to generate the next game frame. This happens continuously, with the model updating the display roughly 10 times per second.

Despite being a research prototype, WHAMM offers a surprisingly stable experience. The movement feels responsive, the environments are recognizable, and key visual cues like rooms, corridors, and enemies are mostly preserved from frame to frame.

Of course, there are limitations:

  • The model’s memory is short (around 0.9 seconds), which can cause off-screen objects to disappear or reappear unexpectedly.

  • Textures may be blurry, and enemy details can be imprecise.

  • There is no real-time physics; the model relies entirely on visual prediction.
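That 0.9-second memory corresponds to only a handful of frames at roughly 10 FPS, which is why anything that scrolls out of the window can be forgotten. A toy illustration, assuming a 9-frame rolling buffer:

```python
from collections import deque

fps = 10
context_seconds = 0.9
context_frames = int(fps * context_seconds)   # only ~9 frames of "memory"

memory = deque(maxlen=context_frames)
for frame_id in range(30):                    # simulate ~3 seconds of play
    memory.append(frame_id)

# Anything older than the last ~0.9 s has already been forgotten, which is why
# objects that leave the screen can vanish or reappear unexpectedly.
print(list(memory))                           # [21, 22, 23, 24, 25, 26, 27, 28, 29]
```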

Still, the core achievement is clear. WHAMM shows that interactive, playable environments can be generated using Generative AI instead of traditional simulation-based rendering.

It is not ready to replace graphics engines, but it opens the door to a new approach in which models might eventually render real-time gameplay.

Microsoft released a real-time demo of WHAMM, though the link now appears offline. However, you can still try a public demo directly in your browser based on WHAM, WHAMM’s predecessor.

👉 Try the WHAM demo

Beyond gaming: Where could WHAMM go next?#

While WHAMM’s Quake II demo is a compelling showcase, the implications of this technology extend far beyond gaming.

At its core, WHAMM demonstrates a new paradigm: real-time, input-conditioned visual generation. Instead of hardcoded simulations, it uses learned patterns to predict what the world should look like in response to actions. That capability could be useful in any field where interactive, visual environments are important.

Here are just a few areas where WHAMM-like systems could have a long-term impact:

| Use Case | What WHAMM Could Enable |
| --- | --- |
| Robotics simulation | AI-rendered environments for training agents |
| Virtual filmmaking | Frame-by-frame scene generation with motion control |
| Sports broadcasting | Reconstructed game highlights from skeletal data |
| Surveillance simulation | Predictive modeling of real-world action |

WHAMM is essentially video generation conditioned on user input, which makes it a foundation for interactive AI in entertainment, training, and simulation.

The takeaway isn’t that WHAMM is ready for production use in all these domains. WHAMM points to a future where models don’t just understand images, they generate complex, interactive environments from scratch.

In that future, we might see:

  • AI-generated training simulators without physics engines

  • On-demand visualizations of instructions or goals

  • Entire synthetic worlds for models to explore, learn from, or reason about

WHAMM is still early, but it offers a clear look at where this category of Generative AI is heading.

The end of rendering as we know it?#

WHAMM challenges the entire premise of how we build and play games.

Instead of relying on rules and engines to simulate a world, WHAMM shows us that a model can learn to predict and generate what comes next, fast enough to be playable.

It’s not production-ready. It doesn’t simulate physics. And it sometimes forgets what’s behind a door you just walked through. But it works well enough to be experienced by anyone with a browser in real time.

And that’s a breakthrough.

Traditional game graphics, built on physics engines and rendering code, won’t vanish overnight. But WHAMM shows that, for the first time, models can generate interactive game worlds on their own terms.

WHAMM might not be the final form, but it’s a bold first move toward a future where the graphics pipeline is not engineered, but learned.

Explore the future of games, from engines to AI#

Curious how AI models like WHAMM might reshape how we build and render games? We've got you covered.

Whether you're a game dev, an ML engineer, or just AI-curious, there’s a path for you.


Written By:
Fahim ul Haq