Evaluating MuLan: Performance and Design Insights
Explore how MuLan's multi-step, agentic system design performs when generating images from complex multi-object prompts. Learn how breaking tasks into steps and employing feedback loops improves accuracy in attribute binding and spatial relationships, supported by both human and GPT-4V evaluations.
We’ve explored MuLan’s innovative, multi-step architecture. But how do we prove that this agentic system design is actually more effective than a standard, one-shot approach? To answer this, the researchers needed a rigorous way to evaluate its performance on complex, multi-object prompts.
A benchmark for compositional prompts
To create a fair and challenging test for MuLan, the researchers curated a new dataset consisting of 200 hard prompts. This benchmark wasn’t taken from a single source; it was carefully constructed to test the specific failure points of modern text-to-image models. The creation process involved several steps outlined below.
Foundation: They began by collecting complex spatial prompts from an existing benchmark, T2I-CompBench.
Expansion: To broaden the scope, they used ChatGPT to generate hundreds of new prompts with diverse objects, relationships, and attributes.
Curation: Finally, they manually selected the most difficult prompts, the ones on which state-of-the-art models like SDXL consistently failed to produce a correct image.
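To make that curation step concrete, here is a minimal sketch of what such a filtering loop could look like in Python. This is illustrative only, not the authors' code: generate_image and prompt_is_satisfied are hypothetical stand-ins for an SDXL generation call and a manual or automated correctness check, and the attempt count is an assumption.

```python
# Illustrative sketch of a hard-prompt curation loop (not the authors' code).
# `generate_image` and `prompt_is_satisfied` are hypothetical stand-ins for an
# SDXL generation call and a manual/automatic correctness check.

from typing import Callable, List


def curate_hard_prompts(
    candidate_prompts: List[str],
    generate_image: Callable[[str], object],
    prompt_is_satisfied: Callable[[object, str], bool],
    attempts_per_prompt: int = 3,   # assumed number of samples per prompt
    target_size: int = 200,         # size of the final benchmark
) -> List[str]:
    """Keep prompts that the baseline model fails on in every attempt."""
    hard_prompts: List[str] = []
    for prompt in candidate_prompts:
        failures = 0
        for _ in range(attempts_per_prompt):
            image = generate_image(prompt)           # e.g., one SDXL sample
            if not prompt_is_satisfied(image, prompt):
                failures += 1
        if failures == attempts_per_prompt:           # consistently wrong
            hard_prompts.append(prompt)
        if len(hard_prompts) >= target_size:
            break
    return hard_prompts
```

Requiring a failure on every attempt filters out prompts that the baseline gets right by chance, so the resulting set contains prompts that are consistently hard rather than occasionally hard.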
The final set of 200 prompts specifically targets the challenging compositional requirements listed below.
Complex spatial relationships (e.g., “on top of,” “to the left of”).
Attribute bindings (e.g., “a red cube,” “a blue cylinder”). ...