Evaluating MuLan: Performance and Design Insights
Explore how MuLan’s multi-stage architecture improves text-to-image generation by decomposing complex prompts into simpler steps. Learn how its performance is rigorously evaluated with human and GPT-4V scoring, and how key design principles such as feedback loops and task decomposition improve accuracy and robustness in AI agent systems.
We’ve explored MuLan’s innovative, multi-step architecture. But how do we prove that this agentic system design is actually more effective than a standard, one-shot approach? To answer this, the researchers needed a rigorous way to evaluate its performance on complex, multi-object prompts.
A benchmark for compositional prompts
To create a fair and challenging test for MuLan, the researchers curated a new dataset of 200 hard prompts. This benchmark wasn’t taken from a single source; it was carefully constructed to probe the specific failure points of modern text-to-image models. The creation process involved three steps, outlined below, with a brief code sketch following the list.
Foundation: They began by collecting complex spatial prompts from an existing benchmark, T2I-CompBench.
Expansion: To broaden the scope, they used ChatGPT to generate hundreds of new prompts with diverse objects, relationships, and attributes.
Curation: Finally, they manually selected the most difficult prompts, keeping those that state-of-the-art models like SDXL consistently fail to render correctly.
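To make these steps concrete, here is a minimal Python sketch of how such a prompt-collection pipeline could be scripted against an OpenAI-style chat API. The helper function, prompt template, model name, and seed examples are illustrative assumptions rather than the researchers’ actual tooling, and the final curation step remains a human judgment call in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def expand_prompts(seed_prompts, n_variants=5):
    """Ask a chat model for new compositional prompts in the style of the seeds."""
    examples = "\n".join(f"- {p}" for p in seed_prompts)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; the paper used ChatGPT
        messages=[{
            "role": "user",
            "content": (
                "Here are prompts that combine multiple objects, spatial relations, "
                f"and attributes:\n{examples}\n"
                f"Write {n_variants} new prompts in the same style, one per line."
            ),
        }],
    )
    text = response.choices[0].message.content
    return [line.lstrip("-• ").strip() for line in text.splitlines() if line.strip()]


# Step 1 (Foundation): seed prompts drawn from an existing benchmark such as T2I-CompBench.
seed_prompts = [
    "a blue bowl on the left of a red cup",
    "a green bench in front of a yellow bicycle",
]

# Step 2 (Expansion): grow the candidate pool with LLM-generated variants.
candidates = seed_prompts + expand_prompts(seed_prompts, n_variants=10)

# Step 3 (Curation): manual in the paper -- each candidate would be rendered with a
# strong baseline (e.g., SDXL) and only prompts the baseline fails on are kept.
hard_prompts = candidates  # placeholder: human inspection decides the final 200 prompts
```

A pipeline like this keeps the expensive part, rendering and inspecting images, for the final filtering pass, which is why the curation step is the one left to humans.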