
Introduction to Similarity Maximization

Explore similarity maximization techniques in self-supervised learning to obtain robust and semantically meaningful feature representations. This lesson helps you understand challenges like trivial solutions and introduces key approaches such as contrastive learning, clustering, and distillation to improve downstream task performance.

We previously discussed why pretext task-based self-supervised pre-training is not suitable for every downstream task: there is often a mismatch between what the pretext task solves and what the transfer task needs. In this chapter, we will learn about similarity maximization, a popular and widely used self-supervised paradigm that addresses this limitation of pretext task-based self-supervised learning.

What do we want from pre-trained features?

Fundamentally, after the pre-training step, we want the trained features to satisfy two important properties:

  • Capture semantics: We want them to represent how images relate to each other, such as whether a pair of images are similar and to what extent.

  • Robustness: We want them to be robust or invariant to “nuisance factors” like noise, data augmentation, occlusions, etc.
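These two properties can be probed directly: given any encoder, we can check how much its features move when a nuisance perturbation is applied to the input. The snippet below is a minimal sketch; the random linear map standing in for the encoder is an illustrative assumption, not a real pre-trained network.

```python
import torch
import torch.nn.functional as F

# Stand-in encoder for illustration only -- in practice this would be a
# pre-trained network f(.). (This random linear map is an assumption.)
encoder = torch.nn.Linear(16, 4)

x = torch.randn(1, 16)                # an "image" (flattened)
x_aug = x + 0.1 * torch.randn(1, 16)  # the same image under a nuisance perturbation

with torch.no_grad():
    sim = F.cosine_similarity(encoder(x), encoder(x_aug), dim=-1)

# For a robust, semantically meaningful encoder, sim should stay close to 1
# for augmented versions of the same image.
print(sim.item())
```

A randomly initialized encoder gives no such guarantee; the point of pre-training is precisely to make this similarity high for augmented views of the same image while keeping unrelated images apart.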

A trivial solution

Given a neural network, f(.), we want to learn features that are robust to data augmentation, that is, f(T_1(X_i)) = f(T_2(X_i)). Here, T_j(.) is a data augmentation strategy, and X_i is an input image.

Trivial solution to learning invariant representations

As shown in the figure above, one trivial solution to train such a network can be:

  1. Take an image, X_i, and apply two different data augmentations, T_1 and T_2, to it.

  2. Feed T_1(X_i) and T_2(X_i) through f.

  3. Compute the similarity between the features f(T_1(X_i)) and f(T_2(X_i)) (e.g., cosine similarity).

  4. Get the gradients and back-propagate to maximize the similarity.

However, if we look carefully, the training above can cause the network to learn constant representations for all inputs (i.e., f(X_i) = constant). The representations will thus collapse and become unusable for downstream recognition tasks. This phenomenon can be seen in the code snippet below.

Python 3.8
import torch
import torch.nn as nn

# Two-layer MLP: 5-dimensional input -> 2-dimensional feature vector
model = nn.Sequential(nn.Linear(5, 3),
                      nn.ReLU(),
                      nn.Linear(3, 2),
                      nn.ReLU())

# Dummy input
x = torch.FloatTensor([[1, 0, 0, 1, 1],
                       [0, 0, 1, 1, 1],
                       [1, 1, 1, 1, 1],
                       [1, 0, 1, 0, 1],
                       [1, 0, 0, 0, 1],
                       [0, 0, 1, 0, 0]])

t1_x = x + 0.3 * torch.randn(6, 5)  # augmented version 1
t2_x = x + 0.3 * torch.randn(6, 5)  # augmented version 2

optimizer = torch.optim.Adam(model.parameters())  # optimizer
criterion = nn.CosineSimilarity(dim=-1)           # cosine similarity objective

for epoch in range(10):
    optimizer.zero_grad()
    f1, f2 = model(t1_x), model(t2_x)  # features of the two augmented views
    loss = -criterion(f1, f2).mean()   # maximize cosine similarity
    loss.backward()
    optimizer.step()                   # update model parameters
    with torch.no_grad():
        f = model(x)
        mean = f.mean(0)
        var = f.var(0)
        print("Mean of features at epoch {}\n".format(epoch), mean.numpy())
        print("Variance of features at epoch {}\n".format(epoch), var.numpy())
        print("=====================")
  1. We define a two-layer MLP, model, that takes an input of size 5 and outputs a two-dimensional feature vector.

  2. We define a dummy input, x, and create two augmented versions, t1_x and t2_x, by adding random Gaussian noise to it.

  3. We define the optimizer over the model parameters and the cosine similarity objective, criterion.

  4. We optimize the model for 10 epochs by maximizing the cosine similarity between the features (f1 and f2) of t1_x and t2_x.

  5. After each epoch, we print the mean and variance of the features, f = model(x).

The code above outputs the mean and variance of the features, f, at each epoch. As can be seen, the variance of features converges to zero, indicating that the neural network has learned constant representations for all inputs.

Similarity maximization

To avoid trivial solutions, we need to modify the similarity-maximization objective so that the neural network can learn to align semantically similar images and, at the same time, distinguish between unrelated or dissimilar images. As shown in the figure below, such a class of objectives is known as similarity maximization and can be classified into three types:

  • Contrastive learning: This aims to bring representations of similar images closer and dissimilar images apart.

  • Clustering: This aims to cluster the feature representations such that similar image features lie in the same clusters and those of dissimilar images lie in different clusters.

  • Distillation: This uses asymmetry in architectures (i.e., the teacher and student networks use different architectures) and learning rules (i.e., the teacher and student networks update their parameters with different rules) to avoid trivial solutions.
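To make the contrastive idea concrete, below is a minimal InfoNCE-style sketch; the function name, batch size, and temperature are illustrative choices, not the exact loss of any specific method. Each image's two augmented views form a positive pair, and all other images in the batch act as negatives, so collapsing to a constant representation is penalized rather than rewarded.

```python
import torch
import torch.nn.functional as F

def info_nce(f1, f2, temperature=0.1):
    # f1, f2: (N, D) features of two augmented views of the same N images.
    z1, z2 = F.normalize(f1, dim=-1), F.normalize(f2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (N, N) pairwise similarities
    # Diagonal entries are positives (same image); off-diagonals are negatives.
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

f1, f2 = torch.randn(8, 4), torch.randn(8, 4)
loss = info_nce(f1, f2)
print(loss.item())
```

Note that for constant (collapsed) features every entry of logits is identical, so the loss sits at its uniform-guessing value, log N, rather than going to zero: the negatives are what rule out the trivial solution.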

Taxonomy of self-supervised learning via similarity maximization