Introduction to Similarity Maximization
Explore similarity maximization techniques in self-supervised learning to obtain robust and semantically meaningful feature representations. This lesson helps you understand challenges like trivial solutions and introduces key approaches such as contrastive learning, clustering, and distillation to improve downstream task performance.
We previously discussed why pretext task-based self-supervised pre-training is not always suitable for downstream tasks: there is a mismatch between what the pretext task solves and what the transfer task needs. In this chapter, we will learn about similarity maximization, a popular and widely used self-supervised paradigm that addresses the limitations of pretext task-based self-supervised learning.
What do we want from pre-trained features?
Fundamentally, after the pre-training step, we want the trained features to satisfy two important properties:
Capture semantics: We want them to represent how images relate to each other, such as whether a pair of images are similar and to what extent.
Robustness: We want them to be robust or invariant to “nuisance factors” like noise, data augmentation, occlusions, etc.
A trivial solution
Suppose we are given a neural network, f, that maps an input image to a feature vector.
As shown in the figure above, one trivial solution to train such a network can be:
Take an image, x, and apply two different data augmentations, t1 and t2, to it.
Feed the two augmented views, t1(x) and t2(x), through f.
Compute the similarity between the resulting features, f1 and f2 (e.g., cosine similarity).
Get the gradients and back-propagate to maximize the similarity.
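For reference, the cosine similarity between two feature vectors f1 and f2 is their dot product divided by the product of their norms, sim(f1, f2) = (f1 · f2) / (‖f1‖ ‖f2‖); it equals 1 when the two vectors point in the same direction and -1 when they point in opposite directions.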
However, if we look carefully, the training above can cause the network to learn constant representations for all inputs (i.e., it maps every input to the same feature vector), because constant outputs trivially maximize the similarity between any pair of features. The following toy experiment demonstrates this collapse.
Lines 5–8: We define a two-layer MLP network, model, that takes an input and outputs a two-dimensional feature vector.
Lines 10–16: We define a dummy input, x, and create two augmented versions, t1_x and t2_x, by adding random Gaussian noise to it in lines 17 and 18.
Lines 20–21: We define the optimizer over the model parameters and the cosine similarity objective, criterion.
Lines 23–28: We optimize the model for a number of epochs by maximizing the cosine similarity between the features (f1 and f2) of t1_x and t2_x.
Lines 30–36: We print the mean and variance of the features, f = model(x), after each epoch.
The code above outputs the mean and variance of the features, f, at each epoch. As can be seen, the variance of features converges to zero, indicating that the neural network has learned constant representations for all inputs.
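For reference, here is a minimal, self-contained sketch of the same experiment (assuming PyTorch; the architecture size, noise scale, learning rate, and number of epochs are illustrative, and its line numbers will not match the walkthrough above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two-layer MLP that outputs a two-dimensional feature vector
model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))

# Dummy input and two augmented views created by adding random Gaussian noise
x = torch.randn(4, 10)
t1_x = x + 0.1 * torch.randn_like(x)
t2_x = x + 0.1 * torch.randn_like(x)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CosineSimilarity(dim=1)

for epoch in range(50):
    f1, f2 = model(t1_x), model(t2_x)
    loss = -criterion(f1, f2).mean()  # maximize similarity = minimize its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track how the features of the clean input evolve
    with torch.no_grad():
        f = model(x)
        print(f"epoch {epoch:3d}  mean {f.mean():.4f}  var {f.var():.4f}")
```

Running a sketch like this, the printed feature variance typically shrinks toward zero as training proceeds, which is exactly the representation collapse described above.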
Similarity maximization
To avoid trivial solutions, we need to modify the similarity-maximization objective so that the neural network can learn to align semantically similar images and, at the same time, distinguish between unrelated or dissimilar images. As shown in the figure below, such a class of objectives is known as similarity maximization and can be classified into three types:
Contrastive learning: This aims to pull representations of similar images closer together and push representations of dissimilar images apart (see the sketch after this list).
Clustering: This aims to cluster the feature representations such that similar image features lie in the same clusters and those of dissimilar images lie in different clusters.
Distillation: This uses asymmetry in architectures (i.e., the teacher and student networks use different architectures) and in learning rules (i.e., the teacher and student networks update their parameters with different learning algorithms) to avoid trivial solutions.
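To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style contrastive loss (assuming PyTorch; the function name, batch shapes, and temperature value are illustrative and not taken from this lesson):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired views.

    z1, z2: (N, D) feature batches; row i of z1 and row i of z2 come from
    two augmentations of the same image (a positive pair), and every other
    row in the batch acts as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # Pairwise cosine similarities between every view-1 and view-2 feature
    logits = z1 @ z2.t() / temperature          # shape (N, N)
    # The matching index is the positive; all other columns are negatives
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Usage: features from two augmented views of the same batch of images
z1 = torch.randn(8, 128)
z2 = torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```

Because every other image in the batch serves as a negative, the network can no longer satisfy this objective with constant outputs, which is how contrastive learning sidesteps the trivial solution discussed earlier.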