Getting to Know the Data

Learn about the image datasets that we'll use to train the model.

Let’s first look at the data we’re working with, both directly and indirectly. There are two datasets we’ll rely on: the ILSVRC ImageNet dataset and the MS-COCO dataset.

We won’t engage the first dataset (ImageNet) directly, but it’s essential for caption learning. It contains images and their respective class labels (for example, cat, dog, and car). We’ll use a vision transformer that has already been pretrained on this dataset, so we don’t have to download the data and train a model from scratch. The second dataset, MS-COCO, contains images and their respective captions. We’ll learn from it directly by mapping each image to a fixed-size feature vector with the vision transformer and then mapping that vector to the corresponding caption with a text-based transformer.
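To make the MS-COCO side of this concrete, here is a minimal sketch of how caption annotations can be paired with the image files they describe. It assumes the standard COCO captions JSON layout; the file paths and the build_caption_pairs helper are illustrative assumptions, not part of the course code.

```python
import json
from pathlib import Path

def build_caption_pairs(annotation_file, image_dir):
    """Pair each MS-COCO caption with the path of the image it describes.

    Assumes the standard COCO captions JSON layout: an "images" list with
    {"id", "file_name"} entries and an "annotations" list with
    {"image_id", "caption"} entries.
    """
    with open(annotation_file) as f:
        data = json.load(f)

    # Map image IDs to their file paths.
    id_to_path = {
        img["id"]: Path(image_dir) / img["file_name"] for img in data["images"]
    }

    # Each image typically has several captions; keep one pair per caption.
    pairs = [
        (id_to_path[ann["image_id"]], ann["caption"])
        for ann in data["annotations"]
        if ann["image_id"] in id_to_path
    ]
    return pairs

# Example usage (paths are placeholders):
# pairs = build_caption_pairs("annotations/captions_train2017.json", "train2017")
# print(pairs[0])
```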

ILSVRC ImageNet dataset

ImageNet is a large image dataset that contains around one million images, each paired with a label from one of 1,000 categories. The dataset covers a broad range of everyday objects and includes most of the objects found in the images we want to generate captions for. The figure below shows some of the classes available in the ImageNet dataset:

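Because a pretrained encoder is all we need from ImageNet, we never download the dataset ourselves; we only load a model that was trained on it. The sketch below uses a Keras application pretrained on ImageNet (InceptionResNetV2) as a stand-in feature extractor; the course pipeline uses a vision transformer instead, and the input size and pooling choices here are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Load an ImageNet-pretrained encoder with the classification head removed,
# so it outputs a fixed-size feature vector instead of 1,000 class scores.
# (InceptionResNetV2 is a stand-in; the course uses a vision transformer,
# but the idea of reusing ImageNet-pretrained weights is the same.)
encoder = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", pooling="avg"
)

def image_to_feature_vector(image_path):
    """Map a single image to a fixed-size feature vector."""
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.inception_resnet_v2.preprocess_input(x)
    return encoder.predict(x, verbose=0)[0]  # shape: (1536,)

# Example usage (path is a placeholder):
# features = image_to_feature_vector("train2017/000000000009.jpg")
# print(features.shape)  # (1536,)
```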