Traditional game graphics are built through simulation: every shadow is calculated, every polygon is mapped, and every collision is modeled using hardcoded rules.
But what if we skipped the simulation entirely?
What if, instead of calculating every detail through physics and rendering engines, an AI model could predict the game’s next frame, based purely on learned patterns from gameplay?
That’s the promise of a new direction in Generative AI: systems that don’t just create static content like art or text, but generate dynamic, interactive environments. At the cutting edge of that idea is WHAMM, a project from Microsoft Research.
Today, I'll share:
What WHAMM is (and why it matters)
How WHAMM works under the hood
WHAMM vs. WHAM: A performance leap
Let's get started.
WHAMM stands for World and Human Action MaskGIT Model. At its core, it is a generative transformer trained to render video game frames based entirely on user input, such as moving, turning, or shooting. It does this without relying on a traditional rendering engine.
Conventional games rely on sophisticated rendering engines — like Unreal Engine or Unity — to simulate lighting, motion, texture, and physics, frame by frame in real time.
WHAMM replaces those simulations with neural inference. Rather than calculating each frame by applying physical rules, it draws on patterns learned from previous gameplay to predict the next visual frame based on the player’s actions.
Imagine controlling a character in a game, and instead of your inputs being processed by a graphics engine, they are interpreted by an AI model. The model effectively thinks, “Based on everything I’ve seen before, here’s what that next frame probably looks like.”
That is what WHAMM does, and it does it in real time.
Let’s take a look. In the clip below, every flick of movement, every explosion, and every interaction is generated live. There is no 3D engine, no hand-coded shaders. It is just pure Generative AI producing the scene as you play.
Rendering dynamic, high-speed visuals on the fly is incredibly hard. Until now, Generative AI models have mostly been used for static content: images, videos, or short clips. But video games are fast, interactive, and constantly changing. Creating a convincing, playable game world using only AI is new.
To be playable, a model must not only generate convincing images — it must predict what happens next, over and over again, fast enough to keep up with the player.
So to test WHAMM’s real-time capabilities, Microsoft picked a game that would push the model to its limits.
To showcase WHAMM’s real-time rendering capabilities, Microsoft turned to a classic: Quake II.
Released in 1997 by id Software, Quake II is a groundbreaking first-person shooter (FPS). It introduced fully 3D environments, fast-paced combat, and fluid player movement, all rendered in real time on the hardware of the late ’90s.
For modern AI researchers, it offers something else: the perfect stress test.
Here’s why:
High-speed gameplay: Quake II is a twitch shooter. The player constantly moves, jumps, strafes, and reacts to threats. This is nothing like generating a still image or a short, pre-rendered clip.
Constant perspective shifts: The camera isn’t static. It constantly rotates and repositions based on mouse movement, making frame prediction harder than fixed-perspective games.
Complex frame-to-frame dynamics: Lighting, motion blur, and weapon effects change dramatically in milliseconds. A generative model has to anticipate and replicate these changes seamlessly.
Even though the textures are low-resolution by today’s standards, the task is still incredibly difficult for a neural network. It’s not about how realistic the graphics are; it’s about how quickly and correctly the AI can generate the next frame.
Let’s break down what WHAMM is doing and how it turns user actions into playable, AI-rendered video.
WHAMM is a transformer-based generative model trained to render real-time video game frames based on live player input. It uses an architecture called MaskGIT (Masked Generative Image Transformer), which was originally developed for fast image generation.
MaskGIT breaks each video frame into small visual patches, or tokens, and then masks some of them out. The model learns to predict the missing patches in parallel, based on what’s happening in the game and what the player is doing.
This parallel prediction allows WHAMM to keep up with gameplay, generating more than 10 frames per second, even during fast camera movement and continuous input.
Here’s what that means in practice:
Every time the player moves, jumps, or fires a weapon, WHAMM treats the upcoming frame as a grid of masked-out visual tokens and predicts what those missing regions should look like. It’s like solving a puzzle using rules of the game world the model has internalized.
Instead of producing the frame pixel by pixel, WHAMM predicts many visual tokens simultaneously. This makes it fast enough to support real-time play without relying on physics simulation or traditional rendering techniques.
You can think of it as asking the model: “You’ve seen how this game usually looks. Based on what’s happening right now, what should appear next?”
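WHAMM’s weights and tokenizer aren’t public, but the MaskGIT-style decoding loop behind this idea can be sketched in a few lines of Python. In the sketch below, `predict_logits` is a hypothetical stand-in for the real transformer (it just returns random scores), and the grid and vocabulary sizes are made up; what matters is the loop: start from a fully masked token grid, guess every position in parallel, keep only the most confident guesses, and repeat for a few steps.

```python
import numpy as np

# Toy sizes; WHAMM's real token grid and codebook are much larger.
GRID_H, GRID_W = 8, 12      # visual tokens per frame (assumed)
VOCAB_SIZE = 512            # size of the visual codebook (assumed)
MASK = -1                   # marker for "not yet predicted"

rng = np.random.default_rng(0)

def predict_logits(tokens, action):
    """Hypothetical stand-in for the transformer.
    A real model would condition on recent frames and the player's action."""
    return rng.normal(size=(GRID_H * GRID_W, VOCAB_SIZE))

def generate_frame(action, steps=4):
    """MaskGIT-style decoding: fill in the most confident tokens each step."""
    tokens = np.full(GRID_H * GRID_W, MASK, dtype=np.int64)
    for step in range(steps):
        logits = predict_logits(tokens, action)
        # Softmax over the codebook gives a guess and a confidence per position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        best = probs.argmax(axis=-1)            # parallel guess for every position
        confidence = probs.max(axis=-1)
        confidence[tokens != MASK] = -np.inf    # never overwrite finished tokens
        # Unmask a growing share of the remaining positions each step.
        n_masked = int((tokens == MASK).sum())
        n_keep = max(1, n_masked // (steps - step))
        keep = np.argsort(confidence)[-n_keep:]
        tokens[keep] = best[keep]
    return tokens.reshape(GRID_H, GRID_W)

frame_tokens = generate_frame(action="move_forward")
print(frame_tokens.shape)  # (8, 12) grid of codebook indices, decoded to pixels downstream
```

Each decoding step is a single forward pass over the whole grid, so a handful of passes replaces hundreds of one-token-at-a-time predictions.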
WHAMM’s real-time performance depends on MaskGIT, the architecture behind its frame generation. MaskGIT predicts many missing parts of a scene in parallel, unlike older models that generate pixels or tokens one at a time in sequence (a method called autoregressive generation).
In simple terms:
Autoregressive: slow, predicting one token at a time
MaskGIT: fast, predicting many tokens at once
This allows WHAMM to keep up with gameplay, generating over 10 frames per second, even as the scene changes rapidly.
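Microsoft hasn’t published WHAMM’s token counts or per-pass latency, so the numbers in this back-of-the-envelope calculation are purely illustrative, but they show why the decoding style alone can move a model from slideshow territory to playable frame rates:

```python
# Illustrative numbers only; WHAMM's real token grid and latency aren't published.
tokens_per_frame = 96        # visual tokens in one frame (assumed)
pass_latency_ms = 10         # time for one transformer forward pass (assumed)
refinement_steps = 4         # parallel decoding passes per frame (assumed)

autoregressive_ms = tokens_per_frame * pass_latency_ms   # one pass per token
maskgit_ms = refinement_steps * pass_latency_ms          # a few passes per frame

print(f"autoregressive: {1000 / autoregressive_ms:.1f} FPS")  # ~1.0 FPS
print(f"maskgit-style:  {1000 / maskgit_ms:.1f} FPS")         # ~25.0 FPS
```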
WHAMM was trained on a curated dataset of one week of recorded Quake II gameplay. From this footage, the model learned to associate player inputs such as movement, turning, and shooting with the visual frames that typically follow.
This training allowed WHAMM to generate new frames based on learned input-output patterns, without requiring hand-coded assets or simulation logic.
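The exact training pipeline hasn’t been published, but conceptually each training example pairs a short window of recent frames and player actions with the frame that actually came next. Here is a minimal sketch of that pairing; the field names are hypothetical, and the window length is an assumption (about 0.9 seconds of context at 10 Hz is roughly 9 steps):

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    context_frames: list    # the last few frames (as images or token grids)
    context_actions: list   # the player inputs recorded alongside them
    target_frame: object    # the frame that actually followed

def make_examples(frames, actions, context_len=9):
    """Slide a fixed-length window over one recorded gameplay session.

    frames[i] is the frame shown at step i; actions[i] is the input
    recorded at that step."""
    examples = []
    for t in range(context_len, len(frames)):
        examples.append(TrainingExample(
            context_frames=frames[t - context_len:t],
            context_actions=actions[t - context_len:t],
            target_frame=frames[t],
        ))
    return examples

# Toy session: 100 recorded steps of (frame, action) pairs.
session_frames = [f"frame_{i}" for i in range(100)]
session_actions = ["move_forward"] * 100
print(len(make_examples(session_frames, session_actions)))  # 91 examples
```

The model is then trained to reproduce `target_frame` given the context, which is exactly the prediction it performs at play time.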
Before WHAMM, Microsoft released an earlier version of the model called WHAM-1.6B. It had the same goal: generate game frames based on player input. But it had one major problem: speed.
WHAM-1.6B used an autoregressive approach, generating one image token at a time in sequence. While it could produce realistic results, it was painfully slow, generating maybe one frame per second.
You can imagine how that plays out in a fast-paced shooter like Quake II. If the screen updates once a second, it’s no longer a game; it’s a slideshow.
WHAMM solves that problem by switching to a parallelized generation architecture based on MaskGIT. Instead of generating image tokens individually, WHAMM simultaneously fills in multiple visual regions, allowing it to render frames at interactive speeds (10+ FPS).
And that change makes all the difference.
| Feature | WHAM-1.6B | WHAMM |
| --- | --- | --- |
| Generation style | Autoregressive | Masked parallel (MaskGIT-style) |
| FPS (frames/sec) | ~1 | 10+ (real-time capable) |
| Context length | ~1 second | ~0.9 seconds (unchanged, but with faster inference) |
| Resolution | 300×180 | 640×360 (roughly 2× per dimension) |
The WHAMM demo, available in-browser via Microsoft’s Copilot Labs, lets users play a simplified version of Quake II using standard keyboard and mouse controls. What makes this experience unique is how it’s rendered: not by a game engine, but by a Generative AI model predicting every frame in real time.
As you move, WHAMM observes your inputs, such as turning, jumping, and firing, and uses its trained model to generate the next game frame. This happens continuously, with the model updating the display roughly 10 times per second.
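The demo’s internals aren’t public, but its overall shape is a fixed-rate loop: read the latest input, ask the model for the next frame, show it, and pace everything to roughly 10 updates per second. The sketch below illustrates that pacing; `StubModel`, `read_input`, and `display` are all hypothetical placeholders:

```python
import time
from collections import deque

TARGET_FPS = 10
FRAME_BUDGET = 1.0 / TARGET_FPS          # ~100 ms per generated frame

class StubModel:
    """Stand-in for the generative model; returns placeholder frames."""
    def predict_next_frame(self, history, action):
        return f"frame after {action} (given {len(history)} context steps)"

def run_demo(model, read_input, display, context_len=9, ticks=30):
    """Fixed-rate loop: condition on recent history, generate, display, repeat."""
    history = deque(maxlen=context_len)  # rolling window of (frame, action) pairs
    for _ in range(ticks):
        tick_start = time.monotonic()
        action = read_input()                              # e.g. "turn_left", "fire"
        frame = model.predict_next_frame(history, action)  # the expensive step
        history.append((frame, action))
        display(frame)
        # Sleep off whatever is left of the ~100 ms budget for this tick.
        elapsed = time.monotonic() - tick_start
        time.sleep(max(0.0, FRAME_BUDGET - elapsed))

run_demo(StubModel(), read_input=lambda: "move_forward", display=print, ticks=3)
```

If `predict_next_frame` ever takes longer than the budget, the frame rate drops, which is why generation speed is the make-or-break constraint for this whole approach.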
Despite being a research prototype, WHAMM offers a surprisingly stable experience. The movement feels responsive, the environments are recognizable, and key visual cues like rooms, corridors, and enemies are mostly preserved from frame to frame.
Of course, there are limitations.
The model’s memory is short (around 0.9 seconds), which can cause off-screen objects to disappear or reappear unexpectedly (see the small sketch after this list).
Textures may be blurry, and enemy details can be imprecise.
There is no real-time physics. The model relies entirely on visual prediction.
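That short memory is easier to picture as a rolling buffer: at roughly 10 frames per second, 0.9 seconds of context is only about 9 frames, and anything older simply falls out. A tiny illustration, with the buffer size derived from those two assumed numbers:

```python
from collections import deque

FPS = 10
CONTEXT_SECONDS = 0.9
context = deque(maxlen=int(FPS * CONTEXT_SECONDS))  # ~9 frames of memory

for t in range(20):
    context.append(f"frame_{t}")

print(list(context))         # only frame_11 .. frame_19 remain
print("frame_5" in context)  # False: anything older has been forgotten
```

Once an object scrolls out of that window, the model has no record it ever existed, so it may redraw the scene without it.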
Still, the core achievement is clear. WHAMM shows that interactive, playable environments can be generated using Generative AI instead of traditional simulation-based rendering.
It is not ready to replace graphics engines, but it opens the door to a new approach in which models might eventually render real-time gameplay.
Microsoft released a real-time demo of WHAMM, though the link now appears to be offline. You can, however, still try a public in-browser demo based on WHAM, the earlier version of the model.
While WHAMM’s Quake II demo is a compelling showcase, the implications of this technology extend far beyond gaming.
At its core, WHAMM demonstrates a new paradigm: real-time, input-conditioned visual generation. Instead of hardcoded simulations, it uses learned patterns to predict what the world should look like in response to actions. That capability could be useful in any field where interactive, visual environments are important.
Here are just a few areas where WHAMM-like systems could have a long-term impact:
| Use Case | What WHAMM Could Enable |
| --- | --- |
| Robotics simulation | AI-rendered environments for training agents |
| Virtual filmmaking | Frame-by-frame scene generation with motion control |
| Sports broadcasting | Reconstructed game highlights from skeletal data |
| Surveillance simulation | Predictive modeling of real-world action |
WHAMM is essentially video generation conditioned on user input, which makes it a foundation for interactive AI in entertainment, training, and simulation.
The takeaway isn’t that WHAMM is ready for production use in all these domains. It’s that WHAMM points to a future where models don’t just understand images; they generate complex, interactive environments from scratch.
In that future, we might see:
AI-generated training simulators without physics engines
On-demand visualizations of instructions or goals
Entire synthetic worlds for models to explore, learn from, or reason about
WHAMM is still early, but it offers a clear look at where this category of Generative AI is heading.
WHAMM challenges the entire premise of how we build and play games.
Instead of relying on rules and engines to simulate a world, WHAMM shows us that a model can learn to predict and generate what comes next, fast enough to be playable.
It’s not production-ready. It doesn’t simulate physics. And it sometimes forgets what’s behind a door you just walked through. But it works well enough to be experienced in real time by anyone with a browser.
And that’s a breakthrough.
Traditional game graphics, built on physics engines and rendering code, won’t vanish overnight. But WHAMM shows that, for the first time, models can generate interactive game worlds on their own terms.
WHAMM might not be the final form, but it’s a bold first move toward a future where the graphics pipeline is not engineered, but learned.
Curious how AI models like WHAMM might reshape how we build and render games? We've got you covered:
Hands-On Game Development with Rust: Build fast, real-time game logic from scratch
Generative AI Resources: Learn how transformers, token prediction, and real-time inference work under the hood
Whether you're a game dev, an ML engineer, or just AI-curious, there’s a path for you.