VLM-Feedback Control and Human-in-the-Loop Interaction
Explore MuLan’s self-correction mechanism powered by a VLM-based feedback loop, and understand how its step-by-step process enables powerful human-AI collaboration.
In our last lesson, we saw how MuLan’s planner and progressive generator work together to build a complex image step-by-step. But what happens if the diffusion model makes a mistake in an early stage? Without a mechanism to catch and correct errors, these mistakes would cascade, ruining the final image.
A painter doesn’t just paint without looking; they constantly step back, critique their own work, and make corrections. To make its process robust, the MuLan system needs an internal “critic” that can do the same. This lesson explores the critic and how its step-by-step process unlocks powerful human-AI collaboration.
VLM-feedback for self-correction
This is the third and final pillar of MuLan’s architecture: a VLM-feedback control loop. After each object is generated, a Vision Language Model (VLM), such as LLaVA-1.5, is used as a critic.
Its job consists of three functions, described below (a code sketch of the full loop follows the list).
Inspect the image: The diffusion model generates the object for the current stage. The VLM then looks at the resulting image.
Compare to the prompt: After each stage, the VLM scores the newly added object against the full original prompt, checking specifically for object presence, correct attributes (color, size), and proper spatial relations.
Provide feedback: If the VLM detects a mismatch between the image and the prompt, the current stage is regenerated before the pipeline moves on. Because each stage is verified as soon as it is produced, an early mistake is corrected at its source instead of cascading into the final image.
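To make this loop concrete, here is a minimal, self-contained Python sketch of per-stage feedback control. Everything in it (`Critique`, `generate_stage`, `vlm_check`, `run_stage_with_feedback`, and the retry budget) is a hypothetical stand-in rather than MuLan's actual API: the diffusion model and the VLM critic are replaced by toy stubs so the control flow is runnable on its own.

```python
import random
from dataclasses import dataclass


@dataclass
class Critique:
    """Hypothetical container for the critic's per-stage checks."""
    object_present: bool
    attributes_ok: bool   # correct attributes, e.g., color and size
    relations_ok: bool    # proper spatial relations

    @property
    def passed(self) -> bool:
        return self.object_present and self.attributes_ok and self.relations_ok


def generate_stage(canvas: list, sub_prompt: str, seed: int) -> list:
    """Stand-in for the diffusion model: add one object to the image so far."""
    return canvas + [f"{sub_prompt} (seed={seed})"]


def vlm_check(image: list, full_prompt: str) -> Critique:
    """Stand-in for the VLM critic (e.g., LLaVA-1.5); here just a coin flip."""
    ok = random.random() > 0.3
    return Critique(object_present=ok, attributes_ok=ok, relations_ok=ok)


def run_stage_with_feedback(canvas, full_prompt, sub_prompt, max_retries=3):
    """Generate one object, let the critic inspect it, regenerate on failure."""
    candidate = canvas
    for seed in range(max_retries):
        candidate = generate_stage(canvas, sub_prompt, seed)  # inspect the image
        if vlm_check(candidate, full_prompt).passed:          # compare to the prompt
            return candidate  # stage accepted: move on to the next object
    return candidate          # retries exhausted: keep the last attempt


full_prompt = "a red cube with a blue sphere to its right"
canvas = []
for sub_prompt in ["a red cube", "a blue sphere to its right"]:
    canvas = run_stage_with_feedback(canvas, full_prompt, sub_prompt)
print(canvas)
```

The point of the sketch is the design choice it encodes: a failed critique re-runs only the current stage (here, simply by changing the seed) rather than restarting the whole image, which is what keeps an early mistake from cascading into later stages.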