Masked Siamese Networks: Masking Strategy and Encoder
Explore the concepts behind Masked Siamese Networks and their masking strategy in self-supervised learning. Understand how to generate masked and unmasked views of images and use dual vision transformer encoders. Learn how to update target encoder parameters and derive meaningful feature representations without reconstructing image pixels.
Masked Siamese Networks (MSNs) are a self-supervised framework that combines similarity maximization with masked image modeling. As shown in the figure below, an MSN generates two augmented views of an image, one masked and one left unchanged. The objective is for the network (a vision transformer) to produce similar representations for both views.
MSNs don’t explicitly reconstruct the image pixels from the masked input. Instead, they fold this mask-denoising step into the feature representations themselves by making the representations of the masked and unmasked views similar.
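To make this concrete, here is a minimal sketch of such an objective in PyTorch. The names (`anchor_encoder`, `target_encoder`, `prototypes`) are illustrative assumptions rather than the authors' code, and the full MSN method additionally sharpens the target assignments and adds a mean-entropy regularizer, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def msn_loss(anchor_encoder, target_encoder, masked_view, unmasked_view,
             prototypes, temperature=0.1):
    """Simplified MSN-style objective: match the masked view's prototype
    assignments to those of the unmasked view (illustrative sketch)."""
    # Anchor branch: encode the masked view (gradients flow here).
    z_anchor = F.normalize(anchor_encoder(masked_view), dim=-1)

    # Target branch: encode the unmasked view without gradients; the
    # target encoder is updated by an exponential moving average instead.
    with torch.no_grad():
        z_target = F.normalize(target_encoder(unmasked_view), dim=-1)

    # Soft assignments of each representation to a set of learnable prototypes.
    p_anchor = F.softmax(z_anchor @ prototypes.T / temperature, dim=-1)
    p_target = F.softmax(z_target @ prototypes.T / temperature, dim=-1)

    # Cross-entropy pulls the anchor's assignment toward the target's,
    # so the masked view must be "denoised" in representation space
    # rather than in pixel space.
    return -(p_target * torch.log(p_anchor + 1e-8)).sum(dim=-1).mean()
```

Note that no decoder or pixel-level loss appears anywhere: the denoising happens entirely in the space of prototype assignments.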
Inputs and masking strategy
The first step in training is to generate two kinds of input views: the anchor view and the target view. Given an image, random data augmentations produce the two views; the anchor view is then split into patches and a random subset of those patches is masked, while the target view is left intact.
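As a rough illustration, the random masking step on the anchor view might look like the sketch below, where `patch_tokens` is the sequence of ViT patch embeddings and the mask ratio is an assumed hyperparameter (MSN also explores a focal masking variant not shown here).

```python
import torch

def random_mask(patch_tokens, mask_ratio=0.7):
    """Keep a random subset of patch tokens and drop the rest.

    patch_tokens: (batch, num_patches, dim) ViT patch embeddings.
    mask_ratio: fraction of patches to discard (illustrative value).
    """
    B, N, D = patch_tokens.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    # Per-image random permutation of patch indices; keep the first num_keep.
    noise = torch.rand(B, N, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]
    return patch_tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
```

Only the surviving tokens are fed to the anchor encoder, so masking both corrupts the input and shortens the sequence the anchor branch must process.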