
Evaluating MuLan: Performance and Design Insights

Explore MuLan's performance in generating images from complex multi-object prompts using a multi-step, agentic system design. Learn how task decomposition and feedback loops improve accuracy in attribute binding and spatial relationships, as shown by both human and GPT-4V evaluations.

We’ve explored MuLan’s innovative, multi-step architecture. But how do we prove that this agentic system design is actually more effective than a standard, one-shot approach? To answer this, the researchers needed a rigorous way to evaluate its performance on complex, multi-object prompts.

A benchmark for compositional prompts

To create a fair and challenging test for MuLan, the researchers curated a new dataset of 200 hard prompts. This benchmark was not taken from a single source; it was constructed specifically to probe the failure points of modern text-to-image models. The creation process involved the steps outlined below, followed by a code-level sketch of the pipeline.

  • Foundation: They began by collecting complex spatial prompts from an existing benchmark, T2I-CompBench.

  • Expansion: To broaden the scope, they used ChatGPT to generate hundreds of new prompts with diverse objects, relationships, and attributes.

  • Curation: Finally, they manually selected the most difficult prompts that state-of-the-art models like SDXL consistently failed to generate correctly.
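The paper does not publish the curation scripts, but the three steps map naturally onto a small pipeline. The sketch below is purely illustrative: every function name, prompt string, and the filtering logic are assumptions standing in for the manual and LLM-assisted work described above.

```python
# Illustrative sketch of the three-step curation pipeline described above.
# All function names, prompts, and filters are hypothetical placeholders;
# the paper does not release the exact scripts used to build the benchmark.

def collect_spatial_prompts():
    """Step 1 (Foundation): gather complex spatial prompts from T2I-CompBench."""
    # In practice these would be loaded from the T2I-CompBench release.
    return [
        "a red book on top of a blue box",
        "a cat to the left of a wooden chair",
    ]

def expand_with_llm(seed_prompts):
    """Step 2 (Expansion): use an LLM (ChatGPT in the paper) to generate new
    prompts with diverse objects, attributes, and relationships."""
    # Placeholder: a real implementation would call a chat-completion API
    # with the seed prompts as few-shot examples.
    return seed_prompts + ["a green cup to the right of a yellow teapot"]

def is_hard_for_baseline(prompt):
    """Step 3 (Curation): keep only prompts that state-of-the-art models
    such as SDXL consistently fail on (done by manual inspection in the paper)."""
    # Placeholder filter: always keeps the prompt in this sketch.
    return True

def build_benchmark(target_size=200):
    candidates = expand_with_llm(collect_spatial_prompts())
    hard_prompts = [p for p in candidates if is_hard_for_baseline(p)]
    return hard_prompts[:target_size]

if __name__ == "__main__":
    print(build_benchmark())
```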

The evaluation benchmark for MuLan was curated by collecting prompts from multiple sources, including existing benchmarks and LLM-generated examples. These were then filtered to create a final set of 200 challenging prompts.

The final set of 200 prompts specifically targets challenging compositional requirements, listed below; an illustrative annotation sketch follows the list.

  • Complex spatial relationships (e.g., “on top of,” “to the left of”).

  • Attribute bindings (e.g., “a red cube,” “a blue cylinder”). ...
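To make these requirements concrete, the following hypothetical annotation shows how a benchmark-style prompt could be decomposed into its objects, attribute bindings, and spatial relation. The field names and example records are illustrative assumptions, not part of the released benchmark.

```python
# Hypothetical annotations for two benchmark-style prompts, illustrating the
# compositional requirements above. Field names and records are illustrative.
benchmark_examples = [
    {
        "prompt": "a red cube on top of a blue cylinder",
        "objects": ["cube", "cylinder"],
        "attribute_bindings": {"cube": "red", "cylinder": "blue"},
        "spatial_relation": ("cube", "on top of", "cylinder"),
    },
    {
        "prompt": "a cat to the left of a wooden chair",
        "objects": ["cat", "chair"],
        "attribute_bindings": {"chair": "wooden"},
        "spatial_relation": ("cat", "to the left of", "chair"),
    },
]

# A generated image is counted as correct only if it satisfies every
# attribute binding and the spatial relation for its prompt.
for example in benchmark_examples:
    print(example["prompt"], "->", example["spatial_relation"])
```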

Evaluation metrics