Simple Masked Image Modeling

Explore the fundamentals of Simple Masked Image Modeling (SimMIM), a method that predicts the raw pixel values of masked image patches. Learn how to generate random patch-level masks, encode masked images with a Vision Transformer, and reconstruct images using a lightweight prediction head. This lesson guides you through implementing SimMIM's masking, encoding, and training process for self-supervised learning on unlabelled image datasets.

We'll cover the following...

Masking strategy
Encoder
Prediction head

Masking strategy

SimMIM uses a patch-aligned random masking strategy: masking is applied randomly at the patch level, so each patch is either fully visible or fully masked. By default, the algorithm uses a patch size of $N \times N = 32 \times 32$ pixels. Given an image $X_i$, we generate a random mask $M_i \in \{0,1\}^{H\times W}$, where $H$ and $W$ are the height and width of $X_i$. Here, $0$ indicates that the pixel/patch is masked and $1$ ...
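The masking strategy described above can be sketched in a few lines of Python. This is a minimal illustration and not the reference SimMIM implementation: the function name `generate_patch_mask`, the `mask_ratio` parameter and its value of `0.6`, and the `seed` argument are assumptions introduced for the example. It uses only the standard library so that it runs anywhere; a real training pipeline would more likely build the mask with an array library. The sketch follows the convention in the text, where $0$ marks a masked pixel.

```python
import random


def generate_patch_mask(height, width, patch_size=32, mask_ratio=0.6, seed=None):
    """Return an H x W mask as a list of rows (0 = masked pixel, 1 = visible pixel).

    mask_ratio (fraction of patches to mask) and seed are illustrative
    assumptions, not values taken from the lesson text.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")

    rng = random.Random(seed)
    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = round(mask_ratio * num_patches)

    # Choose which patches to mask; the decision is made once per patch.
    masked = set(rng.sample(range(num_patches), num_masked))
    patch_mask = [
        [0 if r * cols + c in masked else 1 for c in range(cols)]
        for r in range(rows)
    ]

    # Expand each patch-level entry to a patch_size x patch_size block of pixels,
    # so every patch ends up either fully masked or fully visible.
    mask = []
    for patch_row in patch_mask:
        pixel_row = [value for value in patch_row for _ in range(patch_size)]
        mask.extend(list(pixel_row) for _ in range(patch_size))
    return mask


# Usage: a 192 x 192 image split into a 6 x 6 grid of 32 x 32 patches.
mask = generate_patch_mask(192, 192, patch_size=32, mask_ratio=0.6, seed=0)
masked_pixels = sum(row.count(0) for row in mask)
print(f"masked fraction: {masked_pixels / (192 * 192):.3f}")
```

Sampling patch indices without replacement fixes the number of masked patches exactly, instead of flipping an independent coin per patch, which would make the masked fraction vary from image to image.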
