Introduction to Pretext Tasks

Learn about self-supervised learning and pretext tasks.

Pretext tasks

Using pretext tasks is one of the earliest and most popular ways to train a model using self-supervision. Generally, a pretext task is defined beforehand, and pseudo labels $P$ are generated based on the attributes found in the data. For example, if the pretext task involves predicting the rotation angle of an image, the pseudo label $P$ will correspond to the rotation angle. The model is then optimized by minimizing the error between its predictions and the pseudo labels $P$.
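To make the idea concrete, here is a minimal numpy sketch of pseudo-label generation for the rotation pretext task mentioned above. The function name `make_rotation_sample` and the restriction to four right-angle rotations are illustrative assumptions, not part of any particular library:

```python
import numpy as np

# Illustrative sketch: four possible rotations; the pseudo label P is the
# index of the chosen angle (0 -> 0°, 1 -> 90°, 2 -> 180°, 3 -> 270°).
ANGLES = [0, 90, 180, 270]

def make_rotation_sample(image, rng):
    """Rotate `image` (an H x W array) by a random multiple of 90° and
    return (rotated_image, pseudo_label)."""
    label = int(rng.integers(len(ANGLES)))   # pseudo label P, no human needed
    rotated = np.rot90(image, k=label)       # k counter-clockwise quarter turns
    return rotated, label

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)          # stand-in for a real image
rotated, p = make_rotation_sample(image, rng)
```

The key point is that the label comes for free from the transformation we applied, so no human annotation is required.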

Formally, given a source unlabeled dataset $D_{source} = \{ X_i \}_{i=1}^N$ (where $i$ indexes images and $N$ is the total number of images), we generate a pseudo label $P_i$ for each training input $X_i$. A neural network $f(\cdot)$ is trained on the pseudo-labeled samples to minimize the following loss:

$$\min_{f} \; \mathbb{E}_{X_i \sim D_{source}} \left[ \mathcal{L}_{CE}\left( P_i, f(X_i) \right) \right]$$

Here, $\mathbb{E}[\cdot]$ is the expectation of a random variable and $\mathcal{L}_{CE}$ is the standard cross-entropy loss function. The cross-entropy function $\mathcal{L}_{CE}$ is defined as:

$$\mathcal{L}_{CE}(y, \hat{y}) = -\sum_{c \in \mathcal{C}} y_c \log \hat{y}_c$$

Here, $\mathcal{C}$ is the set of classes ($c$ indexes classes), $\hat{y}$ is the model's prediction of class probabilities ($\hat{y}_c$ is the predicted probability of the $c^{th}$ class), and $y$ is the actual one-hot ground-truth label ($y_c = 1$ if $c$ is the ground-truth class, else $0$).
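The cross-entropy definition above is short enough to compute directly. A small numpy sketch, using a hypothetical 4-class example where the ground truth is class 2:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L_CE(y, y_hat) = -sum_c y_c * log(y_hat_c), with y one-hot."""
    return -np.sum(y * np.log(y_hat))

# Example: 4 classes, ground truth is class 2 (one-hot y).
y = np.array([0.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # predicted class probabilities
loss = cross_entropy(y, y_hat)          # only the true class term survives:
                                        # loss = -log(0.6) ≈ 0.511
```

Because $y$ is one-hot, only the term for the ground-truth class contributes, so the loss is simply the negative log of the probability the model assigned to the correct class.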

Once the model is trained, it can be used for downstream tasks such as classification, segmentation, and detection. The figure below shows the self-supervised pretext task training.

In summary, during self-supervised learning:

  • Pretext tasks are defined and pseudo labels are generated for each training sample.

  • The model is trained to predict these pseudo labels, given the input.

  • Once trained, features of the trained model are transferred to the downstream task, where only a small amount of labeled data is available.
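The transfer step in the last bullet can be sketched with a toy numpy example. Everything here is an illustrative assumption (a random frozen "backbone" standing in for the pre-trained network, a tiny synthetic labeled set): only the small linear head is trained on the downstream data, while the backbone's weights stay fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(8, 4))     # "pretrained" weights, kept frozen

def backbone(x):
    """Frozen feature extractor standing in for the pre-trained model."""
    return np.tanh(x @ W_backbone)

# Tiny labeled downstream set (binary classification, 20 samples).
X = rng.normal(size=(20, 8))
y = (X[:, 0] > 0).astype(int)

feats = backbone(X)                      # backbone outputs, never updated
w, b = np.zeros(4), 0.0                  # the only trainable parameters

def logistic_loss(w, b):
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_before = logistic_loss(w, b)
for _ in range(200):                     # gradient descent on the head only
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    w -= 0.5 * feats.T @ (p - y) / len(y)
    b -= 0.5 * (p - y).mean()
loss_after = logistic_loss(w, b)
```

In practice the backbone is a deep network pre-trained on the pretext task, and the head may also be fine-tuned jointly; the sketch only shows the division of roles.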

Here are some examples of popular pretext tasks used in self-supervised learning literature.

  • Relative positioning involves predicting the relative spatial arrangement between two image patches.

  • Solving jigsaw puzzles involves predicting the permutation of a shuffled image.

  • Image rotation involves predicting the rotation angle of the image.
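As with rotation, the jigsaw pretext task's pseudo label is the index of the transformation applied. A minimal numpy sketch, assuming a 2x2 patch grid (real jigsaw formulations typically use a 3x3 grid and a fixed subset of permutations; the names here are illustrative):

```python
import numpy as np
from itertools import permutations

# All orderings of 4 patches; the pseudo label is the permutation's index.
PERMS = list(permutations(range(4)))     # 24 possible orderings

def make_jigsaw_sample(image, rng):
    """Split `image` into a 2x2 patch grid, shuffle the patches with a random
    permutation, and return (shuffled_patches, pseudo_label)."""
    h, w = image.shape
    patches = [image[:h // 2, :w // 2], image[:h // 2, w // 2:],
               image[h // 2:, :w // 2], image[h // 2:, w // 2:]]
    label = int(rng.integers(len(PERMS)))        # pseudo label
    shuffled = [patches[i] for i in PERMS[label]]
    return shuffled, label

rng = np.random.default_rng(1)
image = np.arange(36).reshape(6, 6)              # stand-in for a real image
shuffled, p = make_jigsaw_sample(image, rng)
```

The model sees only the shuffled patches and must predict which of the possible permutations produced them.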

Designing pretext tasks

When training neural networks with pretext task-based self-supervised learning objectives, we assume that the distribution or nature of the actual transfer task is similar to that of the pretext task we're solving. This, in turn, leads to the assumption that solving the pretext task well will also help solve the transfer task well.

However, there is often a significant mismatch between what is being solved in the pretext task and what the transfer task needs to achieve. Hence, pretext task-based pre-training is not always suitable for every downstream task.

To validate this, we can run an experiment to determine which layer's features of the neural network yield the best performance on a transfer task (using a single linear classifier on top of each layer). If the last-layer representations are not the best, we can say that the pretext task is not well-aligned with the downstream transfer task and might not be the right task to solve.

To understand this, let's look at an example. First, we train a ResNet in a self-supervised manner to solve jigsaw puzzles, i.e., the network is asked to predict the permutation of a shuffled image. Then, we plot the mean Average Precision (mAP) on the y-axis when each layer's representations of the ResNet are transferred to the PASCAL Visual Object Classes (PASCAL VOC) dataset. As shown in the figure below, the last-layer representations of the ResNet become so specialized for the jigsaw problem that they don't generalize well to the downstream classification task on PASCAL VOC.

Linear classifier evaluation of a pre-trained jigsaw model on PASCAL VOC classes
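The layer-wise linear probing procedure behind such a plot can be sketched with a toy numpy setup. The three-layer random "network" and the logistic-regression probe below are illustrative stand-ins for a real pre-trained ResNet and a real linear evaluation protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 8)) for _ in range(3)]   # frozen "pretrained" layers

def layer_activations(x):
    """Run inputs through the frozen network, collecting each layer's output."""
    acts = []
    for W in Ws:
        x = np.tanh(x @ W)
        acts.append(x)
    return acts                                    # one feature matrix per layer

def probe_score(feats, y, steps=200, lr=0.5):
    """Train a logistic-regression probe on frozen `feats`; return train accuracy."""
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(feats @ w + b)))
        w -= lr * feats.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return ((feats @ w + b > 0).astype(int) == y).mean()

X = rng.normal(size=(40, 8))                       # toy downstream inputs
y = (X.sum(axis=1) > 0).astype(int)                # toy downstream labels
scores = [probe_score(f, y) for f in layer_activations(X)]
```

Comparing `scores` across layers mirrors the figure above: if an intermediate layer probes better than the last layer, the final representations have over-specialized to the pretext task.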

Therefore, we must carefully choose our self-supervised pre-training tasks to align well with the downstream transfer tasks.