Masked Siamese Networks: Masking Strategy and Encoder

Learn how to implement the masking strategy and encoder layer of Masked Siamese Networks (MSNs).

We'll cover the following

Masked Siamese Networks (MSNs) is a self-supervised framework that leverages similarity maximization and masked image modeling concepts. As shown in the figure below, MSNs generate two augmented views of an image, where one is masked and the other is unchanged. The objective of MSNs is to learn similar network representations (a vision transformer) for both views.

MSNs don’t explicitly reconstruct the image pixels from the masked input. Instead, they incorporate this mask-denoising step in their feature representations itself by making representations of masked and unmasked image views similar.

Get hands-on with 1200+ tech skills courses.