Introduction to Self-Supervised Learning

What is self-supervised learning?

Self-supervised learning methods are a class of machine learning algorithms that learn rich neural network representations without relying on human-annotated labels. These algorithms derive supervisory signals, or pseudo labels, from the structure of the unlabeled data itself and train the network to predict an unobserved or hidden property of the input.

For example, in computer vision, one can rotate an image by a known angle and ask the neural network to predict that rotation angle. In this example, we didn’t use human-annotated labels to train the neural network. Instead, we defined our own pseudo labels (i.e., the angle of rotation of an image), which serve as supervisory signals. Once these pseudo labels are created, we can use standard supervised losses (e.g., cross-entropy) to train the neural network.
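The rotation example above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming images are small nested lists and rotations are restricted to multiples of 90 degrees; the names `rotate90` and `make_rotation_batch` are hypothetical helpers, not part of any library.

```python
import random

def rotate90(image, k):
    """Rotate a 2D image (a list of rows) counter-clockwise by k * 90 degrees."""
    rotated = [list(row) for row in image]  # copy so the input is left untouched
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated

def make_rotation_batch(images, rng):
    """Return rotated inputs and pseudo labels k in {0, 1, 2, 3} (angle = k * 90)."""
    inputs, pseudo_labels = [], []
    for image in images:
        k = rng.randrange(4)               # the hidden property the network must predict
        inputs.append(rotate90(image, k))
        pseudo_labels.append(k)            # created from the data, no human annotation
    return inputs, pseudo_labels

# Two toy 2x2 "images" (assumed data, for illustration only).
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
inputs, pseudo_labels = make_rotation_batch(images, random.Random(0))
```

Each pair `(inputs[i], pseudo_labels[i])` can then be fed to an ordinary 4-way classifier trained with cross-entropy, exactly as if the labels had been annotated by hand.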

One might confuse self-supervised learning with unsupervised learning (a better-known term). Both frameworks assume that no training labels are available, but the term unsupervised learning is poorly defined and often misleading, since it suggests learning with no supervision at all. Self-supervised learning, on the other hand, is not unsupervised, because it uses supervisory signals derived from the structure of the data. This difference is illustrated in the figure below.

Difference between supervised, unsupervised, and self-supervised learning

Taxonomy of self-supervised learning

Self-supervised learning (SSL) algorithms fall into four categories based on their objective functions, and this course covers all four. Algorithms from different categories can also be combined to build stronger methods, as we'll see later in the course.

Taxonomy of self-supervised learning

Self-supervised learning framework

Self-supervised learning aims to learn a neural network $f = h \circ g$ (here, $g$ is the feature extraction backbone and $h$ is the final classification layer) on an unlabeled source dataset $D_{source} = \{ X_i \}_{i=1}^N$ ($i$ indexes images and $N$ is the total number of images) such that its representations, $g(.)$, can be transferred to a target downstream task with the help of a small labeled target dataset $D_{target} = \{(X_i, Y_i)\}_{i=1}^M$ ($M$ is the total number of labeled images). Here, $M < N$.

The self-supervised learning framework consists of two steps: pre-training and transfer learning.

Pre-training step

The pre-training step involves training a neural network $f = h \circ g$ (here, $g$ is the feature extraction backbone and $h$ is the final classification layer) on an unlabeled source dataset $D_{source} = \{ X_i \}_{i=1}^N$ by minimizing a self-supervised learning loss $\mathcal{L}_{SSL}$.

As discussed in the previous lesson, the self-supervised learning objective helps the neural network learn semantically rich representations by extracting supervisory signals from the structure of the data itself. Mathematically, this step can be written as:

$$f^* = \arg\min_{f} \; \mathcal{L}_{SSL}(f; D_{source})$$

Here, $f^* = h^* \circ g^*$ is the trained neural network. This is shown in the figure below.

Pre-training stage in self-supervised learning
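The pre-training step can be sketched with the rotation-prediction task from earlier. The code below is a toy illustration under stated assumptions, not a realistic training setup: the backbone $g$ is a single linear layer with ReLU, the head $h$ is a 4-way linear classifier, $\mathcal{L}_{SSL}$ is cross-entropy on the rotation pseudo labels, and all names (`pretrain_rotation`, `D_source`, and so on) are hypothetical.

```python
import math
import random

def rotate90(image, k):
    """Rotate a 2D image (list of rows) counter-clockwise by k * 90 degrees."""
    rotated = [list(row) for row in image]
    for _ in range(k % 4):
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated

def flatten(image):
    return [float(v) for row in image for v in row]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pretrain_rotation(images, hidden=8, lr=0.1, epochs=100, seed=0):
    """Train f = h o g to predict rotation pseudo labels; return g, h, and the loss history."""
    rng = random.Random(seed)
    dim = len(flatten(images[0]))
    G = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]  # backbone g
    H = [[0.0] * hidden for _ in range(4)]                                     # head h
    # Pseudo-labeled data built from the unlabeled images: every image at every rotation.
    data = [(flatten(rotate90(img, k)), k) for img in images for k in range(4)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, k in data:
            a = matvec(G, x)                      # backbone pre-activations
            z = [max(0.0, v) for v in a]          # features g(x)
            p = softmax(matvec(H, z))             # predicted rotation probabilities
            total += -math.log(p[k])              # cross-entropy on the pseudo label
            d_logits = [p[c] - (1.0 if c == k else 0.0) for c in range(4)]
            d_z = [sum(d_logits[c] * H[c][j] for c in range(4)) for j in range(hidden)]
            for c in range(4):                    # gradient step on the head
                for j in range(hidden):
                    H[c][j] -= lr * d_logits[c] * z[j]
            for j in range(hidden):               # gradient step on the backbone
                if a[j] > 0:                      # ReLU passes gradient only where active
                    for i in range(dim):
                        G[j][i] -= lr * d_z[j] * x[i]
        history.append(total / len(data))
    return G, H, history

# Two unlabeled toy images with a bright top-left corner (assumed data).
D_source = [[[1.0, 0.0], [0.0, 0.0]], [[0.9, 0.1], [0.0, 0.2]]]
G_star, H_star, history = pretrain_rotation(D_source)
```

After pre-training, only the backbone weights `G_star` are carried over to the downstream task; the rotation head `H_star` has served its purpose and is typically discarded.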

Transfer learning

Once the network is trained, its feature representations can be transferred to a downstream task using a small labeled target dataset $D_{target} = \{(X_i, Y_i)\}_{i=1}^M$ ($i$ indexes images and $M$ is the total number of labeled images). Two standard ways to achieve this are linear classification and fine-tuning.

Linear classifier

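Keeping the feature extractor $g^*(.)$ fixed, we can learn a small linear classifier $c(.)$ on top of it by minimizing the cross-entropy loss over the target dataset $D_{target}$:

$$c^* = \arg\min_{c} \; \mathbb{E}_{(X, Y) \sim D_{target}} \left[ \mathcal{L}_{CE}\big(c(g^*(X)), Y\big) \right]$$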

Here, $c^*$ is the trained linear classifier, $\mathbb{E}[.]$ is the expectation of a random variable, and $\mathcal{L}_{CE}$ is the standard cross-entropy loss function. The cross-entropy function $\mathcal{L}_{CE}$ is defined as:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{c \in \mathcal{C}} y_c \log \hat{y}_c$$

Here, $\mathcal{C}$ is the set of classes ($c$ indexes classes), $\hat{y}$ is the model's predicted class probabilities ($\hat{y}_c$ is the predicted probability of the $c^{th}$ class), and $y$ is the one-hot ground truth label ($y_c = 1$ if $c$ is the ground truth class, else $0$). This is shown in the figure below.
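As a small worked example of the cross-entropy loss described above, the following sketch computes it for a single prediction. The logits and the helper names `softmax` and `cross_entropy` are illustrative assumptions; a small `eps` guards against `log(0)`.

```python
import math

def softmax(logits):
    """Turn raw scores into class probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy: minus the sum over classes of y_c * log(y_hat_c), for one-hot y."""
    return -sum(y_c * math.log(max(p_c, eps)) for p_c, y_c in zip(y_hat, y))

y_hat = softmax([2.0, 1.0, 0.1])   # predicted probabilities over three classes
y = [1, 0, 0]                      # one-hot label: class 0 is the ground truth
loss = cross_entropy(y_hat, y)     # roughly 0.417, i.e. -log(y_hat[0])
```

Because `y` is one-hot, only the term for the ground truth class survives, so the loss shrinks toward zero as the predicted probability of that class approaches 1.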

Linear classifier evaluation
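Linear classifier evaluation can be sketched as follows. This is a toy illustration, assuming a hand-written frozen backbone (`G_star`, a linear map followed by ReLU) in place of a real pre-trained network, a four-sample labeled target set, and plain stochastic gradient descent; every name here is hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def backbone(G, x):
    """Frozen feature extractor: linear map + ReLU (a stand-in for a pre-trained g*)."""
    return [max(0.0, a) for a in matvec(G, x)]

def predict(G, C, x):
    probs = softmax(matvec(C, backbone(G, x)))
    return probs.index(max(probs))

def mean_ce(G, C, data):
    """Average cross-entropy of the classifier C on top of backbone G."""
    return sum(-math.log(softmax(matvec(C, backbone(G, x)))[y]) for x, y in data) / len(data)

def train_linear_probe(G, data, num_classes, lr=0.5, epochs=200):
    """Learn only the linear classifier C; the backbone G is read but never written."""
    C = [[0.0] * len(G) for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in data:
            z = backbone(G, x)                           # features from the frozen backbone
            p = softmax(matvec(C, z))
            for k in range(num_classes):
                err = p[k] - (1.0 if k == y else 0.0)    # gradient of the loss w.r.t. logit k
                for j in range(len(z)):
                    C[k][j] -= lr * err * z[j]           # update the classifier only
    return C

# Assumed "pre-trained" backbone weights and a tiny labeled target dataset.
G_star = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
D_target = [([1.0, 0.2], 0), ([0.8, -0.1], 0), ([-1.0, 0.1], 1), ([-0.7, -0.3], 1)]
C_star = train_linear_probe(G_star, D_target, num_classes=2)
```

Since the backbone never changes, the accuracy of `C_star` is a direct measure of how linearly separable the pre-trained features already are, which is why this protocol is commonly used to compare representations.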

Fine-tuning

Fine-tuning uses the weights of a trained neural network as initialization and optimizes them further (for only a few epochs and with a small learning rate) on a target downstream task, which usually has few labeled samples. This is unlike regular training, where we train the neural network from scratch on a large number of data points.

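Using the weights of the feature extractor $g^*(.)$ as a good initialization point, we optimize the whole network (the backbone, initialized to $g^*$, and the classifier $c$) to minimize the cross-entropy loss over the target dataset $D_{target}$:

$$g^{**}, c^* = \arg\min_{g,\, c} \; \mathbb{E}_{(X, Y) \sim D_{target}} \left[ \mathcal{L}_{CE}\big(c(g(X)), Y\big) \right], \quad \text{with } g \text{ initialized to } g^*$$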

Here, $c^*$ is the trained linear classifier, $g^{**}$ is the optimized feature backbone for the downstream task, $\mathbb{E}[.]$ is the expectation of a random variable, and $\mathcal{L}_{CE}$ is the standard cross-entropy loss function. This is shown in the figure below.

Fine-tuning evaluation
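Fine-tuning differs from the linear classifier setup only in that the backbone is updated too. The sketch below illustrates this under the same toy assumptions (a linear-plus-ReLU backbone standing in for a pre-trained network, hypothetical names, a four-sample target set), using a small learning rate and few epochs as described above.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward(G, C, x):
    a = matvec(G, x)                        # backbone pre-activations
    z = [max(0.0, v) for v in a]            # backbone features g(x)
    return a, z, softmax(matvec(C, z))      # class probabilities c(g(x))

def mean_ce(G, C, data):
    return sum(-math.log(forward(G, C, x)[2][y]) for x, y in data) / len(data)

def fine_tune(G_init, data, num_classes, lr=0.05, epochs=20):
    """Optimize backbone and classifier jointly, starting the backbone from G_init."""
    G = [row[:] for row in G_init]                       # copy: initialize from pre-trained weights
    C = [[0.0] * len(G) for _ in range(num_classes)]     # fresh classifier
    for _ in range(epochs):
        for x, y in data:
            a, z, p = forward(G, C, x)
            d_logits = [p[k] - (1.0 if k == y else 0.0) for k in range(num_classes)]
            # Backpropagate through the classifier and the ReLU into the backbone.
            d_z = [sum(d_logits[k] * C[k][j] for k in range(num_classes)) for j in range(len(G))]
            d_a = [d_z[j] if a[j] > 0 else 0.0 for j in range(len(G))]
            for k in range(num_classes):
                for j in range(len(G)):
                    C[k][j] -= lr * d_logits[k] * z[j]   # classifier update
            for j in range(len(G)):
                for i in range(len(x)):
                    G[j][i] -= lr * d_a[j] * x[i]        # backbone update (absent in linear probing)
    return G, C

# Assumed "pre-trained" backbone weights and a tiny labeled target dataset.
G_star = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
D_target = [([1.0, 0.2], 0), ([0.8, -0.1], 0), ([-1.0, 0.1], 1), ([-0.7, -0.3], 1)]
G_ft, C_ft = fine_tune(G_star, D_target, num_classes=2)
loss_before = mean_ce(G_star, [[0.0] * 4 for _ in range(2)], D_target)
loss_after = mean_ce(G_ft, C_ft, D_target)
```

The small learning rate and short schedule keep `G_ft` close to the pre-trained weights, so the network adapts to the target task without discarding what was learned during pre-training.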