Distillation: The BYOL Algorithm

Learn about self-supervised learning via distillation and get an overview of the BYOL algorithm.


Distillation as similarity maximization

Distillation, in general, refers to transferring knowledge from a fixed (usually large) model, known as the teacher $f^{\text{teacher}}(.)$, to a smaller one, known as the student $f^{\text{student}}(.)$.
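Before turning to BYOL, a minimal sketch may help make this setup concrete. The snippet below, assuming PyTorch, shows the classic distillation objective; the function name, tensor names, and temperature value are illustrative, not taken from the original. The student is trained to match the fixed teacher's softened output distribution.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Train f_student to match f_teacher's softened outputs.

    Both inputs are raw logits of shape (batch, num_classes);
    the teacher's logits are computed with gradients disabled.
    """
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between the two softened distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2
```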

Distillation methods can also be seen as similarity maximization–based methods. Just like contrastive learning and clustering, distillation aims to prevent trivial solutions to $f(X) = f(\text{augment}(X))$. It does so by solving $f^{\text{student}}(X) = f^{\text{teacher}}(\text{augment}(X))$ ...
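To make this similarity-maximization view concrete, here is a hedged sketch of one BYOL-style training step, assuming PyTorch; student, teacher, predictor, augment, and optimizer are placeholder names, not fixed by the original text. The student network is trained to predict the teacher's representation of a different augmentation of the same input, while the teacher receives no gradients and is instead updated as an exponential moving average (EMA) of the student.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.996):
    # The teacher is never trained by backprop; it slowly tracks the student.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def byol_step(student, teacher, predictor, x, augment, optimizer):
    v1, v2 = augment(x), augment(x)      # two random views of the same batch
    p1 = predictor(student(v1))          # student path: encode, then predict
    with torch.no_grad():
        z2 = teacher(v2)                 # teacher path: fixed for this step
    # Negative cosine similarity: push f_student(X) toward f_teacher(augment(X)).
    loss = 2 - 2 * F.cosine_similarity(p1, z2, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)         # teacher := EMA of student
    return loss.item()
```

In the full algorithm, the teacher is initialized as a copy of the student, and the loss is symmetrized by also predicting the first view from the second. The predictor head and the stop-gradient/EMA teacher are what keep $f^{\text{student}}(X) = f^{\text{teacher}}(\text{augment}(X))$ from collapsing to a constant output.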