Summary: Image Captioning with Transformers

Review what we've learned in this chapter.

Image captioning model

In this chapter, we focused on a very interesting task: generating captions for given images. Our image-captioning model was one of the most complex models in this course and included the following components:

  • A vision transformer model that produces an image representation

  • A text-based transformer decoder

Analysis of the dataset

Before building the model, we analyzed our dataset to understand various characteristics, such as image sizes and vocabulary size. Then, we learned how to use a tokenizer to tokenize caption strings, and we used this knowledge to build a TensorFlow data pipeline.
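As a reminder of how those pieces fit together, here is a minimal sketch of tokenizing captions and feeding them through a tf.data pipeline. The example captions, vocabulary size, and sequence length are placeholder values, not the chapter's actual settings.

```python
import tensorflow as tf

# Example captions; the real dataset supplies (image, caption) pairs.
captions = ["a dog runs on the beach", "two people ride bicycles"]

# Tokenizer that maps caption strings to integer token IDs.
# max_tokens and output_sequence_length are illustrative values.
tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=5000, output_sequence_length=12
)
tokenizer.adapt(captions)

# tf.data pipeline: batch the caption strings, then tokenize each batch.
ds = (
    tf.data.Dataset.from_tensor_slices(captions)
    .batch(2)
    .map(tokenizer, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

for batch in ds:
    print(batch.shape)  # (2, 12): one row of token IDs per caption
```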

Components of the model

We discussed each component in detail. The vision transformer (ViT) takes in an image and produces a hidden representation of that image. Specifically, the ViT breaks an image into a sequence of 16 × 16 pixel patches. It then treats each patch as a token, embedding it (along with positional information) and passing it through the transformer to produce a representation of each patch. It also prepends a [CLS] token to the sequence, whose output serves as a holistic representation of the image.
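The patching step can be illustrated in a few lines of TensorFlow. The 224 × 224 input size below is an assumption chosen to match the standard ViT setup; the chapter's exact image size may differ.

```python
import tensorflow as tf

# Dummy image batch of shape (batch, height, width, channels).
image = tf.random.uniform((1, 224, 224, 3))

# Extract non-overlapping 16 x 16 patches.
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)

# Flatten the spatial grid into a sequence: 14 x 14 = 196 patches,
# each containing 16 * 16 * 3 = 768 raw pixel values.
patches = tf.reshape(patches, (1, -1, 16 * 16 * 3))
print(patches.shape)  # (1, 196, 768)

# Each patch is then linearly projected to the model dimension, a learnable
# [CLS] embedding is prepended, and positional embeddings are added.
```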

Next, the text decoder takes in the image representation along with caption tokens as inputs. The decoder's objective is to predict the next token at each time step. We were able to reach a BLEU-4 score of just above 0.10 on the validation dataset.
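In other words, during training the caption is shifted by one position so that each input token's target is the token that follows it. A small sketch of this setup, using made-up token IDs and a hypothetical decoder call, is shown below.

```python
import tensorflow as tf

# Made-up token IDs: [START], three word tokens, [END], padding.
caption_ids = tf.constant([[2, 15, 87, 9, 3, 0]])

decoder_inputs = caption_ids[:, :-1]   # what the decoder sees at each step
decoder_targets = caption_ids[:, 1:]   # the next token it must predict

# Hypothetical decoder producing per-step vocabulary logits:
# logits = decoder([image_representation, decoder_inputs])
# loss = tf.keras.losses.sparse_categorical_crossentropy(
#     decoder_targets, logits, from_logits=True
# )
```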

Evaluating the model

Then, we discussed several metrics (BLEU, ROUGE, METEOR, and CIDEr) that we can use to quantitatively evaluate the generated captions, and we saw that the BLEU-4 score increased over time as the model was trained. Additionally, we visually inspected the generated captions and saw that our machine learning pipeline progressively got better at captioning images.
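As a rough illustration of what BLEU-4 measures, here is how a single candidate caption could be scored against a reference using NLTK. The captions are invented, and this is not the exact evaluation code used in the chapter.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One reference caption and one generated (candidate) caption, tokenized.
reference = [["a", "dog", "runs", "on", "the", "beach"]]
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]

# Equal weights over 1- to 4-grams give the BLEU-4 score; smoothing avoids
# zero scores when a higher-order n-gram has no match.
score = sentence_bleu(
    reference,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```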

Next, we evaluated our model on the test dataset and confirmed that, as expected, it shows similar performance on test data. Finally, we learned how to use the trained model to generate captions for unseen images.
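Caption generation for an unseen image typically follows a simple autoregressive loop: feed the image representation and the tokens generated so far, pick the next token, and stop at the end token. The sketch below assumes hypothetical `vit_encoder` and `decoder` components and uses greedy selection; it is not the chapter's exact inference code.

```python
import tensorflow as tf

def generate_caption(image, vit_encoder, decoder, start_id, end_id, max_len=20):
    # Encode the image once; shape (1, num_patches + 1, d_model).
    image_features = vit_encoder(image[tf.newaxis, ...])
    tokens = [start_id]
    for _ in range(max_len):
        inp = tf.constant([tokens])
        logits = decoder([image_features, inp])   # (1, len(tokens), vocab_size)
        next_id = int(tf.argmax(logits[0, -1]))   # greedy pick of the next token
        if next_id == end_id:
            break
        tokens.append(next_id)
    # Map the IDs back to words using the tokenizer's vocabulary.
    return tokens
```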
