Introduction: Image Captioning with Transformers

Get an overview of image captioning with transformer models.

Transformer models changed the playing field for many NLP problems. They have redefined the state of the art by a significant margin compared to the previous leaders: RNN-based models. We have already studied transformers and understand what makes them tick. Transformers have access to the whole sequence of items (e.g., a sequence of tokens) at once, as opposed to RNN-based models, which look at one item at a time. This makes transformers well suited for sequential problems. Following their success in NLP, researchers have also applied transformers to computer vision problems. Here, we’ll learn how to use transformers to solve a multimodal problem involving both images and text: image captioning.
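To make the contrast between the two model families concrete, here is a minimal sketch in plain Python. It is a toy illustration, not a real transformer or RNN: the function names (`self_attention`, `rnn_states`), the `decay` parameter, and the example vectors are all assumptions made for this sketch, and the attention step uses the raw item vectors as queries, keys, and values instead of learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(items):
    """Toy self-attention: every output mixes ALL items in one step.

    Assumption: queries, keys, and values are the raw item vectors;
    a real transformer would apply learned projections first.
    """
    scale = math.sqrt(len(items[0]))
    outputs, weights = [], []
    for query in items:
        # Score the query against every item in the sequence, past and future.
        w = softmax([dot(query, key) / scale for key in items])
        weights.append(w)
        # Weighted average of all item vectors.
        outputs.append(
            [sum(wi * v[d] for wi, v in zip(w, items)) for d in range(len(query))]
        )
    return outputs, weights

def rnn_states(items, decay=0.5):
    """Toy recurrent pass: the state at step t has only seen items 0..t."""
    state = [0.0] * len(items[0])
    states = []
    for item in items:
        # Each step combines the previous state with the current item only.
        state = [math.tanh(decay * s + x) for s, x in zip(state, item)]
        states.append(state)
    return states

# Hypothetical 3-item sequence of 2-dimensional vectors.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = self_attention(sequence)
recurrent = rnn_states(sequence)
```

In this sketch, changing the last item of `sequence` changes the first attention output, because the first position attends to every item. The first recurrent state stays the same, because it was computed before the last item was seen.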

Applications of image captioning

Automated image captioning, or image annotation, has a wide variety of applications. One of the most prominent is image retrieval in search engines: automatically generated captions can be used to retrieve all the images belonging to a certain class (for example, cats) in response to a user’s request. Another application is in social media, where an image uploaded by a user is automatically captioned so that the user can either refine the generated caption or post it as is.
