The DALL·E Model

Learn about DALL·E, another task-agnostic transformer model that can process images and text.


DALL·E, like CLIP, is a task-agnostic foundation model. CLIP processes text-image pairs. DALL·E handles the text and image tokens differently: its input is a single stream of 1,280 tokens, of which up to 256 are text tokens and 1,024 are image tokens.
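To make the token budget concrete, here is a minimal Python sketch of how a text sequence and an image sequence could be joined into one 1,280-token stream. It is an illustration only, not DALL·E's actual implementation: the function name `build_stream`, the padding ID `PAD_ID`, and the toy token values are all hypothetical, and the sketch ignores details such as the separate text and image vocabularies.

```python
TEXT_LEN = 256                        # maximum number of text tokens
IMAGE_GRID = 32                       # image tokens form a 32x32 grid
IMAGE_LEN = IMAGE_GRID * IMAGE_GRID   # 1,024 image tokens
PAD_ID = 0                            # hypothetical padding ID, assumed for this sketch


def build_stream(text_tokens, image_tokens, pad_id=PAD_ID):
    """Pad the text to 256 tokens, then append the 1,024 image tokens."""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError(f"text has more than {TEXT_LEN} tokens")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError(f"image must have exactly {IMAGE_LEN} tokens")
    # Fill the unused text positions so the image always starts at index 256.
    padded_text = list(text_tokens) + [pad_id] * (TEXT_LEN - len(text_tokens))
    return padded_text + list(image_tokens)


# Toy inputs: three made-up text token IDs and 1,024 made-up image token IDs.
text_tokens = [5, 17, 42]
image_tokens = list(range(IMAGE_LEN))

stream = build_stream(text_tokens, image_tokens)
print(len(stream))  # 1280
```

Padding the text to a fixed length keeps every image token at the same position in the stream, which is one simple way to get a fixed-size input of 256 + 1,024 = 1,280 tokens.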

DALL·E was named after Salvador Dalí and Pixar’s WALL-E. When using DALL·E, we enter a text prompt, and the model produces an image. However, DALL·E must first learn how to generate images from text.

DALL·E is a 12-billion-parameter version of GPT-3.

This transformer is trained on a dataset of text-image pairs to generate images from text descriptions.

The basic architecture of DALL·E

Unlike CLIP, DALL·E concatenates up to 256 BPE-encoded text tokens with 32×32 = 1,024 image tokens, as shown in the figure below:
