Text-Guided Image Generation Products

Image-to-text generation

Transformer models have become the dominant approach for image captioning, largely due to their ability to model global context and long-range dependencies while generating coherent text. Recent state-of-the-art models use encoder-decoder transformers: the encoder processes the image features into an intermediate representation, and the decoder auto-regressively generates the caption text, token-by-token, attending to relevant parts of the encoded image context at each generation step through cross-attention layers.
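The sketch below shows this encoder-decoder flow in practice using Hugging Face's transformers library. The nlpconnect/vit-gpt2-image-captioning checkpoint (a ViT encoder paired with a GPT-2 decoder) and the photo.jpg path are assumptions for illustration; the text itself does not name a specific model or file.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Assumed checkpoint: a ViT encoder + GPT-2 decoder trained for captioning.
checkpoint = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
processor = ViTImageProcessor.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Encoder input: the image is preprocessed into pixel values.
image = Image.open("photo.jpg").convert("RGB")  # hypothetical file path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Decoder output: the caption is generated auto-regressively, one token at a
# time, with cross-attention to the encoded image representation at each step.
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```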

Decoder-only language models such as GPT-2 and GPT-3 have been adapted for caption generation; their causal self-attention masking prevents the model from attending to future token positions during training. Vision transformers, as seen previously, are also commonly used as the encoder architecture in this scheme.
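To make the masking concrete, here is a minimal PyTorch sketch of a causal self-attention mask. Positions above the diagonal are set to negative infinity so that, after the softmax, each token assigns zero attention weight to tokens that come after it; the random scores stand in for real query-key products.

```python
import torch

T = 5  # sequence length

# Causal mask: position i may attend only to positions <= i.
# Entries above the diagonal are -inf, everything else is 0.
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)                    # stand-in attention scores (query x key)
weights = torch.softmax(scores + mask, dim=-1)

print(weights)  # upper triangle is zero: no attention to future tokens
```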

The following figure shows a high-level block diagram of commonly used image-to-text model architectures:
