MuLan’s Planning and Progressive Generation
Explore the first two stages of the MuLan agent: how a large language model functions as a “global planner” to break down the prompt, and how a diffusion model progressively generates each object.
In our last lesson, we introduced MuLan’s “divide and conquer” strategy. Instead of tackling a complex image generation task all at once, it breaks the problem down into smaller, more manageable pieces. In this lesson, we’ll explore the first pillar of this architecture in detail: how the agent creates its initial plan.
The LLM as a global planner
Let’s return to our analogy of a human painter. A painter doesn’t just start randomly dabbing paint on a canvas. They first create a mental plan or a light sketch, deciding which objects will form the background and which will be in the foreground, and the general order in which they will be painted.
MuLan’s first step is exactly this: LLM planning. At the very beginning of the process, before any image generation happens, an LLM is used to create a global plan. It takes the user’s single, complex prompt and decomposes it into an ordered sequence of objects to be generated.
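To make this concrete, here is a hypothetical example of what such a plan might look like once the LLM has decomposed a prompt. The prompt, the object names, and the ordering below are illustrative assumptions for this lesson, not output taken from the MuLan paper.

```python
# Hypothetical global plan for one complex prompt (illustrative only).
prompt = "a black cat sitting on a wooden table next to a red vase"

# The plan is an ordered list: supporting objects come first, and each
# later object is placed relative to what has already been painted.
global_plan = [
    "a wooden table",  # anchors the scene, painted first
    "a black cat",     # placed on the table
    "a red vase",      # placed next to the cat, painted last
]
```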
Creating an ordered sequence of sub-prompts
Concretely, the global plan is an ordered sequence of simpler sub-prompts, one per object. The later stages of the pipeline will then generate these objects one at a time, in the planned order.
To achieve this, the LLM is assigned the persona of an “excellent painter” and given a specific, rule-based task. The prompt template is as follows.
Global planning prompt:
You are an excellent painter. I will give you some descriptions. Your task is to turn the description into a painting. You only need to list the objects in the description by painting order, from left to right, from down to top....
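Below is a minimal sketch of how this planning step could be wired up in code. The `call_llm` helper, the way the user’s description is appended to the template, and the line-by-line parsing of the response are all assumptions made for illustration; they are not MuLan’s actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-model API you use (assumption, not MuLan's code)."""
    raise NotImplementedError

# The global planning prompt shown above (truncated here for brevity),
# with the user's description appended at the end -- an assumed format.
PLANNING_TEMPLATE = (
    "You are an excellent painter. I will give you some descriptions. "
    "Your task is to turn the description into a painting. ...\n\n"
    "Description: {description}"
)

def plan_objects(description: str) -> list[str]:
    """Ask the LLM to decompose a complex prompt into an ordered list of objects."""
    response = call_llm(PLANNING_TEMPLATE.format(description=description))
    # Assume the model lists one object per line; strip bullets and blank lines.
    return [line.lstrip("-* ").strip() for line in response.splitlines() if line.strip()]

# Hypothetical usage:
# plan_objects("a black cat sitting on a wooden table next to a red vase")
# might return ["a wooden table", "a black cat", "a red vase"]
```

The key design point is simply that planning happens once, up front, and produces an ordered list that the rest of the pipeline consumes object by object.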