Simple Masked Image Modeling
Explore the fundamentals of Simple Masked Image Modeling (SimMIM) to predict raw pixel values of masked image patches. Understand how to generate random patch-level masks, encode masked images with a Vision Transformer, and reconstruct images using a lightweight prediction head. This lesson guides you through implementing SimMIM's masking, encoding, and training process to improve self-supervised learning on unlabelled image datasets.
Simple Masked Image Modeling (SimMIM) is a simple masked modeling framework that predicts the raw pixel values of randomly masked input image patches using a lightweight linear layer and a
Masking strategy
SimMIM uses a patch-aligned random masking strategy: masking is applied at the patch level, so each patch is either fully visible or fully masked. By default, the algorithm uses a
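To make patch-aligned random masking concrete, here is a minimal sketch in plain Python. The function names (`random_patch_mask`, `expand_to_pixels`) and the values for image size, mask patch size, and mask ratio are assumptions chosen for illustration; they are not taken from this lesson and should not be read as the algorithm's defaults.

```python
import math
import random


def random_patch_mask(img_size, mask_patch_size, mask_ratio, seed=None):
    """Return a square grid of 0/1 flags, one per patch (1 = masked).

    All arguments are illustrative assumptions, not values from the lesson.
    """
    if img_size % mask_patch_size != 0:
        raise ValueError("img_size must be divisible by mask_patch_size")
    grid = img_size // mask_patch_size           # patches per side
    num_patches = grid * grid
    num_masked = math.ceil(num_patches * mask_ratio)
    rng = random.Random(seed)
    # Sample distinct patch indices uniformly at random.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [1 if r * grid + c in masked else 0 for c in range(grid)]
        for r in range(grid)
    ]


def expand_to_pixels(patch_mask, patch_size):
    """Repeat each patch flag over a patch_size x patch_size pixel block."""
    pixel_mask = []
    for row in patch_mask:
        pixel_row = [flag for flag in row for _ in range(patch_size)]
        pixel_mask.extend(list(pixel_row) for _ in range(patch_size))
    return pixel_mask


# Hypothetical settings for demonstration only.
patch_mask = random_patch_mask(img_size=192, mask_patch_size=32,
                               mask_ratio=0.6, seed=0)
pixel_mask = expand_to_pixels(patch_mask, patch_size=32)
for row in patch_mask:
    print(row)
```

Because the mask is drawn on the patch grid and only then expanded to pixels, every pixel in a patch shares the same flag, which is what makes each patch either fully visible or fully masked.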