
Introduction to Masked Image Modeling

Explore masked image modeling to understand how self-supervised learning predicts missing parts of images. Learn about masking strategies, encoder design using vision transformers, multi-head self-attention, and how these components work together for image reconstruction and representation learning.

What is masked image modeling?

Inspired by masked language modeling in NLP, masked image modeling (MIM) is a self-supervised learning paradigm in which a portion of the input image is masked and the model learns to predict the masked signal from the visible portion. This approach provides results competitive with approaches like contrastive learning.

Masked image modeling

Applying masked image modeling can be challenging for several reasons:

  • Pixels close to each other are highly correlated. As a result, a masked region can often be reconstructed well enough simply by duplicating nearby pixels. This leads to trivial solutions and inefficient learning.

  • Signals at the pixel level are very raw and contain mostly low-level information.

  • Signals in image data are also continuous, unlike text data, where tokens are discrete.

Thus, the masking strategy and prediction target in masked image modeling must be designed carefully so that the model cannot exploit local pixel correlations and collapse to trivial solutions.

The framework of masked image modeling

Masked image modeling aims to predict the original signals from a masked input. As illustrated below, the framework involves the following components:

  • Masking strategy: The masking strategy determines which area of the image to mask and how the masking is performed. Usually, masking is done at the image patch level rather than the pixel level. We can use various strategies for image masking, like square masking, random patch masking, etc. (a minimal code sketch of random patch masking appears at the end of this section). The masked image is used as the input to the neural network.

  • Encoder: This component is a neural network that takes a masked image as input and extracts useful latent representations for predicting the original signals in the masked areas. Generally, transformer models like the Vision Transformer and the Swin Transformer (discussed subsequently) are used as encoder architectures.

  • Prediction head: Given the encoder features as input, this component reconstructs the original signals in the masked regions of the input.

  • Prediction target: This component calculates the loss function on the prediction head's output. The loss can be a cross-entropy classification loss or an $L_1$/$L_2$ pixel-regression loss. Pixel regression means we predict the pixel values of the masked regions of the input image.
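For example, an $L_2$ pixel-regression loss computed only over the masked regions takes the form:

$$\mathcal{L} = \left\| y - \hat{y} \right\|_2^2$$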

Here, $\hat{y}$ is the model prediction and $y$ is the target.
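To make the masking strategy concrete, here is a minimal sketch of random patch masking in PyTorch. The helper random_patch_mask and its default mask_ratio are hypothetical choices for illustration, not taken from any specific MIM implementation:

import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.6):
    # images: (B, C, H, W) tensor; returns the masked images and a boolean
    # patch mask of shape (B, P), where True marks a masked patch
    B, C, H, W = images.shape
    rows, cols = H // patch_size, W // patch_size
    num_patches = rows * cols
    num_masked = int(mask_ratio * num_patches)
    # choose `num_masked` patch indices per image uniformly at random
    idx = torch.rand(B, num_patches).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    # zero out the pixels of the selected patches
    masked = images.clone()
    for b in range(B):
        for p in idx[b].tolist():
            r, c = divmod(p, cols)
            masked[b, :, r * patch_size:(r + 1) * patch_size,
                   c * patch_size:(c + 1) * patch_size] = 0.0
    return masked, mask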

Overview of vision transformers

Most approaches in masked image modeling use masking strategies that operate at the image patch level. Instead of masking an image at each pixel, they mask $N \times N$-sized patches. For this reason, these approaches prefer vision transformers (which also operate at the patch level) over convolutional neural networks as their primary network architectures. So, here is an overview of vision transformers to better understand the concepts of masked image modeling.

Patch embeddings

The first step is to represent the input image $X_i \in [0,1]^{H \times W \times C}$ (height $H$, width $W$, and channels $C$) as a sequence of $N \times N$-sized non-overlapping patches $X_i^{\text{patched}} = [X_i^1, X_i^2, \dots, X_i^P]$. Here, $P = \frac{H \times W}{N^2}$ is the total number of patches, and $X_i^p \in [0,1]^{N \times N \times C}$ represents the $p^{\text{th}}$ patch.

Image patches

The next step is to project each of these patches ($\in [0,1]^{N \times N \times C}$) in the sequence into $d$-dimensional patch embeddings. In other words:
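$$e_i^p = \text{PatchEmbed}\left(X_i^p\right), \qquad p = 1, 2, \dots, P$$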

Here, $\text{PatchEmbed}(\cdot)$ linearly projects a flattened patch $X_i^p \in [0,1]^{N^2 C}$ (after flattening, the shape of the patch changes from $N \times N \times C$ to $N^2 C$) to its patch embedding $e_i^p \in \R^d$.

[CLS] token

After obtaining the patch embeddings $\mathcal{E}_i \in \R^{P \times d}$, we add a special classification ([CLS]) token $e^{\text{[CLS]}} \in \R^d$ to the patch embedding sequence. The [CLS] token aims to capture and summarize the information present in all patch embeddings in a single $d$-dimensional representation. This happens in the multi-head self-attention blocks (discussed later). The final representation of the [CLS] token (i.e., after the multi-head self-attention blocks) is passed through a linear layer for classification. The initial value of this special token is a parameter of the model that needs to be learned. The input $\mathcal{E}_i \in \R^{(P+1) \times d}$ is now:
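$$\mathcal{E}_i = \left[\, e^{\text{[CLS]}},\; e_i^1,\; e_i^2,\; \dots,\; e_i^P \,\right]$$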

Note: The size of the input $\mathcal{E}_i$ increases by one (i.e., $(P+1) \times d$) after adding the [CLS] token $e^{\text{[CLS]}}$.

Positional embeddings

Next, a positional encoding $\mathcal{S} = [s^0, s^1, \dots, s^P]$ ($s^0$ is for the [CLS] token and $s^i \in \R^d$) is added to the input sequence, $\mathcal{E}_i$, which allows the model to understand where each patch, $X_i^p$, is placed in the original image. The positional embedding, $\mathcal{S} \in \R^{(P+1) \times d}$, is also a learnable parameter updated along with the [CLS] token during training. The final input sequence is written as:
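$$\mathcal{E}_i = \mathcal{E}_i + \mathcal{S} = \left[\, e^{\text{[CLS]}} + s^0,\; e_i^1 + s^1,\; \dots,\; e_i^P + s^P \,\right]$$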

Here, $+$ represents vector addition.

Self-attention

Self-attention heads are the building blocks of vision transformers. Self-attention allows a global lookup: the model can use the information present in the embeddings at all other positions of the input sequence. Through this global lookup, a self-attention-based model can focus on the whole image at once and summarize it better. This is not possible in convolutional models, as their convolution operations only act on a local part of the input image at a time.

A self-attention head first projects the input sequence, $\mathcal{E}_i \in \R^{(P+1) \times d}$, into query $\mathcal{E}^q_i$, key $\mathcal{E}^k_i$, and value $\mathcal{E}^v_i$ vector sequences as follows:
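$$\mathcal{E}_i^q = \mathcal{E}_i W_q, \qquad \mathcal{E}_i^k = \mathcal{E}_i W_k, \qquad \mathcal{E}_i^v = \mathcal{E}_i W_v$$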

Here, the matrices $W_q, W_k, W_v \in \R^{d \times l}$ are learnable network parameters. Note that since $W_q \in \R^{d \times l}$ and $\mathcal{E}_i \in \R^{(P+1) \times d}$, their matrix product $\mathcal{E}_i^q = \mathcal{E}_i W_q$ will be in $\R^{(P+1) \times l}$. The same applies to $\mathcal{E}_i^k$ and $\mathcal{E}_i^v$.

Note: Think of these query, key, and value vectors as intermediate outputs used in the computation of the final output.

Next, we compute self-attention scores $\mathcal{A}_i \in [0,1]^{(P+1) \times (P+1)}$ and multiply them with the value vectors $\mathcal{E}^v_i$ to return a sequence of self-attended outputs $\mathcal{E}^o_i \in \R^{(P+1) \times l}$ as follows (using the standard scaled dot-product formulation):
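$$\mathcal{A}_i = \text{softmax}\!\left(\frac{\mathcal{E}_i^q \left(\mathcal{E}_i^k\right)^T}{\sqrt{l}}\right), \qquad \mathcal{E}_i^o = \mathcal{A}_i \, \mathcal{E}_i^v$$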

Note that since both $\mathcal{E}_i^k, \mathcal{E}_i^q \in \R^{(P+1) \times l}$, the matrix product $\mathcal{E}^q_i (\mathcal{E}^k_i)^T$ will be of shape $(P+1) \times (P+1)$. The $(m,n)^{\text{th}}$ entry $\mathcal{A}_i^{(m,n)}$ of the attention score matrix $\mathcal{A}_i$ denotes the similarity between the $m^{\text{th}}$ query vector $\mathcal{E}^q_i[m]$ and the $n^{\text{th}}$ key vector $\mathcal{E}^k_i[n]$.

Multi-head self-attention (MSA)

A multi-head self-attention (MSA) layer comprises more than one self-attention head. The outputs of all these self-attention heads are concatenated together into a single output of the MSA layer. Generally, we use $\frac{d}{l}$ self-attention heads so that the output sequence (after concatenation) is also in $\R^{(P+1) \times d}$ (each self-attention head produces outputs in $\R^{(P+1) \times l}$, and concatenating $d/l$ such sequences gives outputs in $\R^{(P+1) \times d}$). The figure below gives a high-level idea of an MSA layer.
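In equation form, writing $\mathcal{E}_i^{o,h}$ for the output of the $h^{\text{th}}$ self-attention head (a notation introduced here for clarity) and $[\cdot\,;\,\cdot]$ for concatenation along the feature dimension:

$$\text{MSA}(\mathcal{E}_i) = \left[\, \mathcal{E}_i^{o,1} \,;\, \mathcal{E}_i^{o,2} \,;\, \dots \,;\, \mathcal{E}_i^{o,d/l} \,\right] \in \R^{(P+1) \times d}$$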

A vision transformer consists of several such MSA layers stacked one after another, each followed by a multi-layer perceptron. For classification, we take the final embedding corresponding to the [CLS] token in the output sequence.

The code snippet below implements the vision transformer. We import the implementation of the MSA layers (Block) from the timm (PyTorch Image Models) library.

Python 3.8
import math
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block
from PIL import Image
import torchvision.transforms.functional as T
import torchvision
from utils import extract_image_patches


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) * (img_size // patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)


class VisionTransformer(nn.Module):
    """ Vision Transformer """

    def __init__(self, img_size=[224], patch_size=16,
                 in_chans=3, num_classes=0,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.num_features = self.embed_dim = embed_dim
        print("Shape of image:", img_size[0])
        self.patch_embed = PatchEmbed(
            img_size=img_size[0], patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches
        print("Total patches :", num_patches)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(p=0)
        print("Shape of positional embeddings: ", self.pos_embed.shape)
        self.blocks = nn.ModuleList([
            Block(dim=embed_dim, num_heads=num_heads)
            for i in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        B, nc, w, h = x.shape
        # patch linear embedding
        x = self.patch_embed(x)
        print("Shape of patch embeddings:", x.shape)
        # add the [CLS] token to the embed patch tokens
        print("Shape of [CLS] token :", self.cls_token.shape)
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        print("Shape after adding [CLS] token :", x.shape)
        # add positional encoding to each token
        x = x + self.pos_embed
        x = self.pos_drop(x)
        for blk in self.blocks:  # pass through MSA layers
            x = blk(x)
        print("Shape after MSA Layers :", x.shape)
        x = self.norm(x)  # use layer norm
        return x[:, 0]  # return [CLS] representation


model = VisionTransformer(patch_size=16, embed_dim=192, depth=12, num_heads=3)
image = T.to_tensor(Image.open("cat.jpg").resize((224, 224)))[None, :, :, :]
torchvision.utils.save_image(image, "./output/image.png", normalize=True)
patches = extract_image_patches(image, kernel=16, stride=16)
print(patches.shape)
torchvision.utils.save_image(patches[0], "./output/patches.png", normalize=True, nrow=14)
f = model(image)
print("Shape of [CLS] token feature: ", f.shape)
  1. Lines 10–24: We define the PatchEmbed class that takes img_size ($H = W$), patch_size ($N$), and the network embedding size embed_dim ($d$) as input in its __init__ call and calculates the total number of patches self.num_patches ($P$).

  2. Line 20: We define the self.proj layer as a convolutional layer with kernel_size and stride set to patch_size. With this kernel size and stride, the convolutional kernel operates on non-overlapping image patches of size patch_size and projects the image batch of shape $B \times C \times H \times W$ into a feature volume of shape $B \times d \times P^{1/2} \times P^{1/2}$ ($B$ is the batch size). This volume is further flattened and reshaped in the forward() function to give patch embeddings of size $B \times P \times d$.

  3. Line 26: We define the VisionTransformer class that takes img_size, patch_size, embed_dim, depth ($T$), and num_heads ($\frac{d}{l}$) as inputs in its __init__ call.

  4. Lines 34–35: We create an instance, self.patch_embed, of the class PatchEmbed.

  5. Lines 40–41: We define the [CLS] token, self.cls_token ($e^{\text{[CLS]}}$), and the positional encoding self.pos_embed ($\mathcal{S}$) as learnable parameters.

  6. Line 42: We define a dropout layer, self.pos_drop.

  7. Lines 45–47: We define depth ($T$) MSA blocks in self.blocks. Each MSA block is composed of num_heads self-attention heads.

  8. Line 49: We define the LayerNorm layer, self.norm.

  9. Line 51: We implement the forward call that takes an input image, x, as input and converts it to patch embeddings (of size $B \times P \times d$) using the self.patch_embed layer in line 54.

  10. Lines 59–60: We prepend self.cls_token to the patch embedding sequence. The input, x, is now of shape $B \times (P+1) \times d$.

  11. Lines 64–67: We add self.pos_embed and apply the self.pos_drop layer to the input sequence, x.

  12. Lines 69–70: We pass the input sequence to the multi-head self-attention blocks self.blocks. The output of self.blocks is then passed through the LayerNorm self.norm layer.

  13. Line 74: We return the final representation corresponding to the [CLS] token.

The code outputs the original image, the patched image, and the shape of the network embeddings.

Note: The code aims to test the implementation of the vision transformer.