The Machine Learning Pipeline for Image Caption Generation

Learn to create the pipeline for image caption generation.

Here, we’ll look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The image caption generation framework consists of two main components:

A pretrained vision transformer model to produce an image representation.
A text-based decoder model that can decode the image representation into a series of token IDs. The decoder relies on a text tokenizer to convert tokens to token IDs and vice versa.
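
To make the tokenizer's role concrete, here is a minimal sketch of converting a caption to token IDs and back. The `ToyTokenizer` class, its whitespace splitting, and the special tokens (`[PAD]`, `[UNK]`, `[START]`, `[END]`) are assumptions made for illustration; they are not the tokenizer used in this course, which will likely work on subword units.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustrative only)."""

    def __init__(self, vocab):
        # Reserve the first IDs for special tokens, then add the vocabulary.
        specials = ["[PAD]", "[UNK]", "[START]", "[END]"]
        self.id_to_token = specials + [t for t in vocab if t not in specials]
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        """Convert a caption string to a list of token IDs."""
        unk = self.token_to_id["[UNK]"]
        tokens = ["[START]"] + text.lower().split() + ["[END]"]
        return [self.token_to_id.get(tok, unk) for tok in tokens]

    def decode(self, ids):
        """Convert token IDs back to a string, dropping special tokens."""
        specials = {"[PAD]", "[START]", "[END]"}
        tokens = [self.id_to_token[i] for i in ids]
        return " ".join(t for t in tokens if t not in specials)


tokenizer = ToyTokenizer(["a", "dog", "runs", "on", "grass"])
ids = tokenizer.encode("A dog runs on grass")
caption = tokenizer.decode(ids)
print(ids)
print(caption)
```

Words missing from the vocabulary map to the `[UNK]` ID in this sketch, which is one common way to handle unseen tokens.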

Although transformer models were initially used for text-based NLP problems, they have since been applied well beyond text, to areas such as image and audio data.

Here, we’ll be using one transformer model that can process image data and another that can process text data.
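
The way the two models fit together can be sketched with plain Python stand-ins. Everything here is hypothetical: `encode_image` takes the place of the pretrained ViT, `next_token_id` takes the place of the text decoder, and generating one token at a time until an end marker appears is an assumed decoding strategy. The sketch only shows the control flow, not a real model.

```python
START_ID, END_ID = 0, 1
ID_TO_TOKEN = {0: "[START]", 1: "[END]", 2: "a", 3: "dog", 4: "runs"}


def encode_image(image):
    """Stand-in for the pretrained ViT: maps an image to a feature vector.

    Here the "representation" is just the mean of each pixel row.
    """
    return [sum(row) / len(row) for row in image]


def next_token_id(image_repr, generated_ids):
    """Stand-in for the text decoder: picks the next token ID.

    A real decoder would attend to image_repr and the tokens so far; this
    toy version replays a fixed caption so the control flow can be run.
    """
    script = [2, 3, 4, END_ID]
    step = len(generated_ids) - 1  # tokens produced after [START]
    return script[min(step, len(script) - 1)]


def generate_caption(image, max_len=10):
    """Encode the image once, then decode token IDs one at a time."""
    image_repr = encode_image(image)
    ids = [START_ID]
    while len(ids) < max_len:
        token_id = next_token_id(image_repr, ids)
        if token_id == END_ID:
            break
        ids.append(token_id)
    # Convert IDs back to tokens, skipping the start marker.
    return " ".join(ID_TO_TOKEN[i] for i in ids[1:])


image = [[0.0, 0.5], [1.0, 0.5]]  # a made-up 2x2 grayscale "image"
caption = generate_caption(image)
print(caption)
```

Note that the image is encoded only once, while the decoder stand-in is called repeatedly, each time seeing the tokens generated so far.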

Vision transformer (ViT)

First, let’s look at the transformer that generates the encoded vector representations of images. We’ll use a pretrained vision transformer (ViT) for this. This model has been trained on the ImageNet dataset we discussed earlier. Let’s look at the architecture of this model.

Originally, the ViT was proposed in the paper ...
