# Introduction to Self-Supervised Learning

Learn about self-supervised learning, its mathematical framework, and its taxonomy.

## What is self-supervised learning?

Self-supervised learning methods are a class of machine learning algorithms that learn rich neural network representations without relying on labels. These algorithms leverage the supervisory signals or pseudo labels from the structure of the unlabeled data and predict any unobserved or hidden property of the input.

For example, in computer vision, one can rotate an image by a certain degree and ask the neural network to predict the rotation angle of the picture. In this example, we didn’t use human-annotated labels to train the neural network. Instead, we defined our pseudo labels (i.e., the angle of rotation of an image), which serve as supervisory signals. After these supervisory signals or pseudo labels are created, we can use our standard supervised losses (e.g., cross-entropy) to train the neural network.
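The rotation pretext task above can be sketched in a few lines. Here is a minimal NumPy illustration (the helper `make_rotation_batch` is our own name, not from any library): each image yields four copies rotated by multiples of 90°, and the rotation index itself serves as the pseudo label.

```python
import numpy as np

def make_rotation_batch(images):
    """Create a pseudo-labeled batch for the rotation pretext task.

    For each input image of shape (H, W, C), produce four copies rotated
    by 0, 90, 180, and 270 degrees, labeled 0..3. No human labels needed.
    """
    rotated, labels = [], []
    for img in images:
        for k in range(4):                 # k quarter-turns
            rotated.append(np.rot90(img, k))
            labels.append(k)               # pseudo label = rotation index
    return np.stack(rotated), np.array(labels)

# Toy batch of two 8x8 RGB "images"
imgs = np.random.rand(2, 8, 8, 3)
x, y = make_rotation_batch(imgs)
print(x.shape, y)  # (8, 8, 8, 3) [0 1 2 3 0 1 2 3]
```

A standard classifier trained on `(x, y)` with a cross-entropy loss is then learning from the data's own structure rather than from human annotations.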

One might confuse self-supervised learning with unsupervised learning (the better-known term). Although both frameworks assume that training labels are absent, the term *unsupervised* is poorly defined and often misleading, since it suggests learning without any supervision at all. Self-supervised learning, in contrast, is not truly unsupervised: it derives supervisory signals from the structure of the data itself. This difference is illustrated in the figure below.

Difference between supervised, unsupervised, and self-supervised learning

## Taxonomy of self-supervised learning

Self-supervised learning (SSL) algorithms are classified into four categories based on their objective functions, and we'll study all four in this course. These classes can also be combined to build stronger algorithms, as we'll see later in the course.

Taxonomy of self-supervised learning

## Self-supervised learning framework

Self-supervised learning aims to learn a neural network $f = h \circ g$ (where $g$ is the feature extraction backbone and $h$ is the final classification layer) on an unlabeled source dataset $D_{source} = \{ X_i \}_{i=1}^N$ (where $i$ indexes images and $N$ is the total number of images) such that its representations $g(\cdot)$ can be transferred to a downstream target task with the help of a small labeled target dataset $D_{target} = \{ (X_i, Y_i) \}_{i=1}^M$ (where $M$ is the total number of labeled images). Here, $M < N$.

The self-supervised learning framework consists of two steps: pre-training and transfer learning.

### Pre-training step

The pre-training step involves training a neural network $f=h \ \circ \ g$ (here, $g$ is the feature extraction backbone and $h$ is the final classification layer) on an unlabeled source dataset $D_{source} = \{ X_i \}_{i=1}^N$ by minimizing a self-supervised learning loss $\mathcal{L}_{SSL}$.

As discussed in the previous lesson, the self-supervised learning objective helps the neural network learn rich semantic representations by extracting supervisory signals from the structure of the data itself. Mathematically, this step can be written as:

$$f^* = \underset{f}{\arg\min} \ \mathbb{E}_{X \sim D_{source}} \left[ \mathcal{L}_{SSL}(f, X) \right]$$

Here, $f^* = h^* \circ g^*$ is the trained neural network. This is shown in the figure below.

Pre-training stage in self-supervised learning
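To make the pre-training step concrete, here is a deliberately tiny NumPy sketch under toy assumptions: $g$ and $h$ are single linear layers (a real SSL setup uses a deep backbone), and the pseudo labels are random stand-ins for a 4-way pretext task such as rotation prediction. Gradient descent on the cross-entropy against the pseudo labels plays the role of minimizing $\mathcal{L}_{SSL}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-ins: g and h are single linear layers; the pseudo labels
# imitate a 4-way pretext task (e.g., rotation index).
X = rng.normal(size=(32, 16))              # unlabeled source data, flattened
pseudo_y = rng.integers(0, 4, size=32)     # pretext pseudo labels
onehot = np.eye(4)[pseudo_y]

W_g = rng.normal(scale=0.1, size=(16, 8))  # backbone g
W_h = rng.normal(scale=0.1, size=(8, 4))   # pretext head h

lr, losses = 0.1, []
for _ in range(200):
    feats = X @ W_g                        # g(X)
    probs = softmax(feats @ W_h)           # h(g(X))
    # L_SSL here = cross-entropy against the pseudo labels
    losses.append(-np.mean(np.log(probs[np.arange(len(X)), pseudo_y])))
    grad_logits = (probs - onehot) / len(X)
    grad_W_h = feats.T @ grad_logits
    grad_W_g = X.T @ (grad_logits @ W_h.T)
    W_h -= lr * grad_W_h
    W_g -= lr * grad_W_g
```

After training, `W_g` plays the role of $g^*$: it is kept, while the pretext head `W_h` is typically discarded before transfer.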

### Transfer learning

Once the network is trained, its feature representations can be transferred to a downstream task using a small labeled target dataset $D_{target}=\{(X_i, Y_i)\}_{i=1}^M$ (where $i$ indexes images and $M$ is the total number of labeled images). Two standard ways to achieve this are linear classifiers and fine-tuning.

#### Linear classifier

Keeping the feature extractor $g^*(\cdot)$ fixed, we can learn a small linear classifier $c(\cdot)$ to minimize the cross-entropy loss over the target dataset $D_{target}$:

$$c^* = \underset{c}{\arg\min} \ \mathbb{E}_{(X, Y) \sim D_{target}} \left[ \mathcal{L}_{CE}(c(g^*(X)), Y) \right]$$

Here, $c^*$ is the trained linear classifier, $\mathbb{E}[\cdot]$ is the expectation of a random variable, and $\mathcal{L}_{CE}$ is the standard cross-entropy loss function, defined as:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{c \in \mathcal{C}} y_c \log \hat{y}_c$$

Here, $\mathcal{C}$ is the set of classes ($c$ indexes classes), $\hat{y}$ is the model's predicted class probabilities ($\hat{y}_c$ is the predicted probability of the $c^{th}$ class), and $y$ is the one-hot ground-truth label ($y_c = 1$ if $c$ is the ground-truth class and $0$ otherwise). This is shown in the figure below.

Linear classifier evaluation
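A minimal NumPy sketch of linear-probe evaluation, under toy assumptions of our own (a fixed random projection standing in for the pre-trained backbone $g^*$, and synthetic binary labels that are linearly decodable from its features): only the classifier's parameters are updated, and the per-example objective is the cross-entropy of the predicted probability of the true class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the frozen pre-trained backbone g*: a fixed (scaled) random
# projection. In practice this would be a deep network's feature extractor.
W_g = rng.normal(scale=16 ** -0.5, size=(16, 8))
def g_star(X):
    return X @ W_g          # frozen: W_g is never updated below

# Small labeled target set; the labels are linearly decodable from g*(X),
# so a linear probe can recover them.
X = rng.normal(size=(64, 16))
y = (g_star(X) @ rng.normal(size=(8,)) > 0).astype(float)

# Train only the linear classifier c on top of the frozen features.
w, b = np.zeros(8), 0.0
feats, lr = g_star(X), 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(feats @ w + b)))  # predicted P(y = 1)
    # Binary cross-entropy: -log p when y = 1, -log(1 - p) when y = 0;
    # its gradient w.r.t. the logits is simply (p - y).
    grad = (p - y) / len(X)
    w -= lr * feats.T @ grad
    b -= lr * grad.sum()

acc = (((feats @ w + b) > 0).astype(float) == y).mean()
```

High probe accuracy here means the frozen features already encode the label information; this is exactly why linear probing is used as a measure of representation quality.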

#### Fine-tuning

Fine-tuning uses the weights of a trained neural network as an initialization and optimizes them further (typically for a few epochs with a small learning rate) on a target downstream task that usually has few labeled samples. This is unlike regular training, where we train the neural network from scratch on a large number of data points.

By using the weights of the feature extractor $g^*(\cdot)$ as a good initialization point, we optimize the whole network (both $g^*$ and $c$) to minimize the cross-entropy loss over the target dataset $D_{target}$:

$$g^{**}, c^* = \underset{g, c}{\arg\min} \ \mathbb{E}_{(X, Y) \sim D_{target}} \left[ \mathcal{L}_{CE}(c(g(X)), Y) \right]$$

Here, $c^*$ is the trained linear classifier, $g^{**}$ is the feature backbone optimized for the downstream task (with $g$ initialized at $g^*$), $\mathbb{E}[\cdot]$ is the expectation of a random variable, and $\mathcal{L}_{CE}$ is the standard cross-entropy loss function. This is shown in the figure below.

Fine-tuning evaluation
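A toy NumPy sketch of fine-tuning, again with stand-ins of our own rather than a real training pipeline: pre-trained backbone weights serve as the initialization, and both the backbone and a fresh linear head are updated, with a much smaller learning rate on the backbone, as is common practice.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend W_g came from self-supervised pre-training: fine-tuning starts
# from these weights instead of a random initialization.
W_g = rng.normal(scale=16 ** -0.5, size=(16, 8))
w_c = np.zeros(8)                       # fresh linear classifier c

# Downstream labels are close to (but not exactly) what the pre-trained
# features encode, so a little backbone adaptation helps.
X = rng.normal(size=(64, 16))
true_dir = W_g @ rng.normal(size=(8,)) + 0.2 * rng.normal(size=(16,))
y = (X @ true_dir > 0).astype(float)

lr_head, lr_backbone = 0.5, 0.05        # much smaller lr for the backbone
for _ in range(300):
    feats = X @ W_g
    p = 1 / (1 + np.exp(-(feats @ w_c)))   # predicted P(y = 1)
    grad = (p - y) / len(X)                # d(binary CE)/d(logits)
    grad_w_c = feats.T @ grad
    grad_W_g = np.outer(X.T @ grad, w_c)   # chain rule through feats = X @ W_g
    w_c -= lr_head * grad_w_c              # update the head c
    W_g -= lr_backbone * grad_W_g          # update the backbone too

acc = ((X @ W_g @ w_c > 0).astype(float) == y).mean()
```

The only structural difference from the linear probe above is that the backbone weights receive gradient updates as well; the small backbone learning rate keeps the pre-trained initialization from being destroyed.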