Introduction to MuLan and the Multi-Object Generation Challenge
Explore how an agentic system design can solve complex, multi-object image generation challenges by adding a layer of planning, control, and feedback on top of standard text-to-image models.
We'll cover the following...
The problem space: Text-to-image generation
The one-shot process vs. an agentic architecture
In recent years, we’ve seen an explosion in the capabilities of text-to-image (T2I) models. These AI systems can take a simple text prompt and produce visually appealing, high-quality images in a single step. As the underlying models have advanced, so has their ability to handle compositional requests.
However, as agentic system designers, our goal is to think beyond the capabilities of a single model call and consider the architecture of the entire system. The inherent architectural limitation of a “one-shot” generative process is that it lacks fine-grained control, interactivity, and any mechanism for self-correction. When a one-shot model fails to capture all the constraints in a complex prompt, our only option is to rerun the whole prompt and hope for a better result.
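To make the contrast concrete, here is a minimal toy sketch in Python. It does not call a real T2I model and is not a description of any particular system's algorithm. Everything in it is an assumption made for illustration: `ToyModel`, its `capacity` and `flaky_every` parameters, and the `one_shot` and `agentic` helpers are hypothetical names, an "image" is modeled as nothing more than the set of objects that were rendered, and the one-object-per-step plan is just one simple way to decompose a prompt.

```python
from typing import List, Set


class ToyModel:
    """Hypothetical stand-in for a T2I model (assumption, not a real API).

    An "image" is modeled as the set of objects that were rendered.
    The model renders at most `capacity` objects per call, and every
    `flaky_every`-th call returns an empty image to mimic a bad sample.
    """

    def __init__(self, capacity: int = 2, flaky_every: int = 3) -> None:
        self.capacity = capacity
        self.flaky_every = flaky_every
        self.calls = 0

    def generate(self, objects: List[str]) -> Set[str]:
        self.calls += 1
        if self.calls % self.flaky_every == 0:
            return set()  # simulated failed generation
        return set(objects[: self.capacity])  # extra objects are dropped


def one_shot(model: ToyModel, objects: List[str], max_tries: int = 3) -> Set[str]:
    """Send the whole prompt in one call; on failure, rerun everything."""
    target = set(objects)
    best: Set[str] = set()
    for _ in range(max_tries):
        image = model.generate(objects)
        if len(image & target) > len(best & target):
            best = image  # keep the most complete attempt so far
        if target <= best:
            break
    return best


def agentic(model: ToyModel, objects: List[str], max_tries_per_step: int = 3) -> Set[str]:
    """Plan sub-tasks, generate step by step, and verify each step."""
    canvas: Set[str] = set()
    plan = [[obj] for obj in objects]  # planning: one sub-task per object
    for step in plan:
        for _ in range(max_tries_per_step):
            patch = model.generate(step)  # control: one small, focused call
            if set(step) <= patch:  # feedback: check only this step
                canvas |= patch
                break  # step succeeded, move on to the next one
    return canvas
```

With a four-object prompt such as `["dog", "ball", "tree", "kite"]`, the one-shot path in this toy setup never renders more than two objects no matter how often it retries, while the agentic path retries only the single step that failed and ends up with all four. The point of the sketch is the shape of the loop (plan, generate, check, repeat), not the numbers, which follow entirely from the made-up behavior of `ToyModel`.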
As the image above illustrates, even powerful diffusion models can sometimes struggle with prompts that require precise compositional understanding, leading to issues mentioned ...