Developing AI systems that can seamlessly understand and generate content across various modalities — such as text, images, and video — with reasoning capabilities approaching human cognition has been a central goal in the field.
While proprietary models have long showcased this integrated intelligence, their underlying mechanisms remain private.
So of course, we had to take BAGEL for a spin.
Here's what we'll cover:
- The core architecture behind BAGEL, including its Mixture-of-Transformer experts and dual vision modules for understanding and generating images
- How a multi-stage pipeline (alignment, large-scale pretraining, continued training, and supervised fine-tuning) shapes multimodal reasoning
- Key benchmark results that show where BAGEL excels against other open-source and some proprietary models
- Practical takeaways from hands-on testing, highlighting both impressive image outputs and current demo limitations
- How to tap into BAGEL’s open-source checkpoints, code, and public demo for your own experiments
Let's begin!
BAGEL follows the design principle of maximizing the model’s capabilities without introducing artificial limits. It uses a Mixture-of-Transformer experts (MoT) architecture, which means it has specialized parts for different tasks, but they all work together closely. Unlike earlier methods that often created choke points, BAGEL’s design allows information to flow freely, enabling extensive interaction between understanding and generation processes. This open design facilitates efficient scaling of training and data, allowing the model’s full potential to develop without being held back by its structure.
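To make the idea concrete, here is a minimal PyTorch sketch of a single MoT-style layer: self-attention runs over the whole interleaved sequence, and each token is then routed to an understanding or a generation feed-forward expert. The dimensions, the two-expert split at the feed-forward only, and all names are illustrative assumptions, not BAGEL’s actual implementation.

```python
import torch
import torch.nn as nn

class MoTLayerSketch(nn.Module):
    """Illustrative Mixture-of-Transformer-experts layer (not BAGEL's real code)."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One feed-forward expert per role; both see the same attention output.
        self.understanding_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.generation_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, is_generation_token):
        # x: (batch, seq, dim); is_generation_token: (batch, seq) bool mask.
        # Attention spans the full interleaved sequence, so understanding and
        # generation tokens interact without any bottleneck between them.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route each token to the expert that matches its role.
        h = self.norm2(x)
        expert_out = torch.where(
            is_generation_token.unsqueeze(-1),
            self.generation_expert(h),
            self.understanding_expert(h),
        )
        return x + expert_out
```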
BAGEL’s foundation is built upon a powerful, language-focused Qwen2.5 LLM. This model is designed to process information effectively, incorporating advanced techniques to ensure stability and efficiency.
Visual information is processed in two distinct ways to support both understanding and generation tasks (both paths are sketched in code after this list):
For visual understanding: BAGEL employs a ViT encoder. This component acts as an advanced image reader, converting raw image pixels into visual tokens the model can interpret, and it flexibly handles images at their native aspect ratios at resolutions up to 980×980. A two-layer MLP connector aligns these visual tokens with the main language model’s internal representation.
For visual generation: BAGEL utilizes a pretrained VAE model from FLUX. This VAE converts images between pixel space and a compressed latent space. This latent representation is then further processed to match the hidden dimension of the main language model. Importantly, the VAE model’s parameters remain fixed during BAGEL’s training, providing a stable tool for visual creation.
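Here is a minimal sketch of those two projection paths. The hidden sizes, latent dimension, and class names are assumptions chosen for readability, not BAGEL’s real configuration.

```python
import torch.nn as nn

LLM_DIM = 3584  # assumed LLM hidden size; illustrative only

class UnderstandingConnector(nn.Module):
    """Two-layer MLP that maps ViT patch tokens into the LLM's hidden space."""
    def __init__(self, vit_dim=1152, llm_dim=LLM_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vit_tokens):   # (batch, num_patches, vit_dim)
        return self.mlp(vit_tokens)

class GenerationProjector(nn.Module):
    """Maps latents from the frozen FLUX VAE to the LLM hidden size."""
    def __init__(self, latent_dim=64, llm_dim=LLM_DIM):  # latent_dim is assumed
        super().__init__()
        self.proj = nn.Linear(latent_dim, llm_dim)

    def forward(self, vae_latents):  # (batch, num_latent_tokens, latent_dim)
        return self.proj(vae_latents)
```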
All types of tokens, including text, understanding image tokens (ViT), and generation image tokens (VAE), are combined and interleaved according to the input’s modality structure. Before being integrated, both ViT and VAE tokens receive 2D positional encoding. For diffusion-based generation, a timestep embedding is directly added to the initial states of VAE tokens for a cleaner architecture. BAGEL employs the rectified flow method to generate images from visual tokens, aligning with leading techniques in visual generation.
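As a rough illustration of the rectified-flow objective mentioned above (shapes and names are assumptions; this is not BAGEL’s training code): noisy latents are built by linearly interpolating between clean VAE latents and Gaussian noise, and the model learns to predict the constant velocity along that straight path.

```python
import torch

def rectified_flow_targets(clean_latents: torch.Tensor):
    """Build rectified-flow training targets for a batch of VAE latent tokens.

    clean_latents: (batch, num_tokens, latent_dim). Purely illustrative.
    """
    noise = torch.randn_like(clean_latents)
    # One timestep per sample in [0, 1), broadcast over tokens and channels.
    t = torch.rand(clean_latents.shape[0], 1, 1, device=clean_latents.device)
    # Straight-line interpolation between data (t = 0) and noise (t = 1).
    noisy_latents = (1.0 - t) * clean_latents + t * noise
    # The model is trained to predict this constant velocity from (noisy_latents, t).
    velocity_target = noise - clean_latents
    return noisy_latents, t, velocity_target
```

During training, the transformer would receive the noisy latents (with the timestep embedding added, as described above) and be penalized with a mean-squared error against the velocity target.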
BAGEL’s advanced capabilities are built on a meticulously structured, multi-stage training process, leveraging an exceptionally rich and diverse dataset. This includes trillions of tokens from interleaved text, image, video, and web data, carefully filtered and augmented for complex multimodal reasoning.
| Training Stage | Purpose and Key Focus | Tokens Consumed | Max Context Window |
| --- | --- | --- | --- |
| Alignment | Connects visual understanding (ViT) with the language model (LLM) | 4.9 billion | 16k |
| Pretraining (PT) | Large-scale core learning with diverse data and native image resolution | 2.5 trillion | 16k |
| Continued Training (CT) | Increases visual resolution; boosts interleaved data for cross-modal reasoning | 2.6 trillion | 40k |
| Supervised Fine-tuning (SFT) | Refines performance on high-quality, curated datasets | 72.7 billion | 40k |
BAGEL’s training samples generation examples more often than understanding examples. Its data corpus spans text, image-text pairs, and, crucially, interleaved data from video and the web, specially prepared to support complex in-context reasoning, world modeling, and even future-frame prediction.
BAGEL’s performance across various benchmarks demonstrates its significant capabilities. It often surpasses specialized and other unified open-source models and even competes with some proprietary systems.
BAGEL shows strong performance in understanding tasks across diverse public benchmarks, outperforming existing unified models and often specialized understanding models as well. These benchmarks (MME-P, MMBench, MMMU, MM-Vet) comprehensively evaluate multimodal understanding, from basic perception and all-around ability to expert-level reasoning across disciplines and integrated capability verification.
| Model | MME-P | MMBench | MMMU | MM-Vet |
| --- | --- | --- | --- | --- |
| LlamaFusion | 1604 | - | 72.1 | 41.7 |
| Chameleon-7B | - | 35.7 | 28.4 | 8.3 |
| Show-o-1.3B | 1097 | - | 26.7 | - |
| Emu3-8B | 1244 | 58.5 | 31.6 | 37.2 |
| TokenFlow-XL-13B | 1546 | 68.9 | 38.7 | 40.7 |
| Janus-Pro-7B | 1567 | 79.2 | 41.0 | 50.0 |
| MetaQuery-XL-7B | 1685 | 83.5 | 58.6 | 66.6 |
| BLIP3-o-8B | 1683 | 83.5 | 50.6 | 66.6 |
| BAGEL | 1687 | 85.0 | 55.3 | 67.2 |
BAGEL delivers competitive and often superior results in text-to-image generation, surpassing both specialized image generation models and other unified approaches. The categories (single object, two objects, counting, colors, position, color attribute, overall) come from the GenEval benchmark, which measures how accurately a model renders the object counts, colors, positions, and attributes specified in a text prompt, along with an overall score.
| Model | Single Object | Two Objects | Counting | Colors | Position | Color Attribute | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DALL·E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 |
| DALL·E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| Chameleon-7B | - | - | - | - | - | - | 0.39 |
| Show-o-1.3B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
| Emu3-8B | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 |
| TokenFlow-XL-13B | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| MetaQuery-XL-7B | - | - | - | - | - | - | 0.80 |
| BLIP3-o-8B | - | - | - | - | - | - | 0.84 |
| BAGEL | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
BAGEL’s open-source nature is an important contribution, aiming to democratize advanced AI capabilities by making its foundational model publicly available. This commitment includes sharing its code and releasing its trained checkpoints, which allow developers, researchers, and tech professionals globally to inspect, utilize, and build upon a sophisticated multimodal model without proprietary barriers. This accessibility enables users to reproduce results, fine-tune for specific applications, and innovate in new directions, fostering a vibrant ecosystem in multimodal AI.
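If you want to pull the released weights locally for your own experiments, a minimal sketch with the huggingface_hub library looks like this (the repo id below is an assumption; check the project page for the official checkpoint location and the accompanying inference code):

```python
from huggingface_hub import snapshot_download

# Repo id assumed for illustration -- verify it on the BAGEL project page.
local_dir = snapshot_download(repo_id="ByteDance-Seed/BAGEL-7B-MoT")
print(f"Checkpoint files downloaded to: {local_dir}")
```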
A public project page with a demo is available, allowing direct interaction with the model to test its understanding and generation abilities firsthand, proving invaluable for quick evaluation, inspiration, and learning across the community. Here is what the demo looks like.
BAGEL is advertised as being meticulously pretrained on extensive, interleaved video and web data, which equips it with the ability to produce high-fidelity, photorealistic images, dynamic video frames, or complex interleaved image-text content. With such impressive claims, it's time to put BAGEL’s capabilities to the test and examine its real-world performance.
Observation: The image captures the prompt’s mystical tone and color palette, with glowing crystals on a velvet-lined shelf. However, it partially fails in label accuracy — “LYNX” is misspelled as “LYYXX” and “NOVA” is missing or unclear.
Editing, style transfer, navigation, composition, and thinking couldn’t be tested, as BAGEL returned the error: “Apologies, Bagel encountered an error.”
This is how the authors demonstrate the model’s capabilities in the paper.
While BAGEL is a new and powerful model, considerable room remains for improvement. The demo did not consistently function during testing, and text within the generated images was sometimes inaccurate or unclear. Despite these challenges, the commitment to open-source development is a major positive, offering great potential for collaborative enhancement and wider accessibility within the AI community.
Curious to learn more about image generation models? You can start by exploring these exciting courses: