BYOL Training

Learn how to implement BYOL, a widely used distillation algorithm.

Student and teacher architectures

The student and teacher networks in BYOL share the same backbone architecture. However, the student network has an additional MLP prediction head, $p(\cdot)$, which makes the overall student-teacher architecture asymmetric. In other words, $f^{\text{teacher}} = g \circ h$ and $f^{\text{student}} = p \circ g \circ h$, where $h(\cdot)$ is the feature extractor, $g(\cdot)$ is the linear projection layer, and $p(\cdot)$ is the prediction head.
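The two compositions can be sketched in code. The example below is a minimal illustration that assumes PyTorch; the tiny backbone, the class names, and all layer sizes are hypothetical placeholders rather than the lesson's actual implementation, and how the teacher's weights are initialized or updated is deliberately left out.

```python
import torch
from torch import nn


def make_backbone(in_dim: int, feat_dim: int) -> nn.Module:
    # Hypothetical stand-in for the feature extractor h(.).
    # A real setup would typically use a convolutional backbone here.
    return nn.Sequential(nn.Flatten(), nn.Linear(in_dim, feat_dim), nn.ReLU())


class TeacherNet(nn.Module):
    """f_teacher = g ∘ h: feature extractor followed by a linear projection."""

    def __init__(self, in_dim: int, feat_dim: int, proj_dim: int):
        super().__init__()
        self.h = make_backbone(in_dim, feat_dim)  # feature extractor h(.)
        self.g = nn.Linear(feat_dim, proj_dim)    # linear projection layer g(.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.g(self.h(x))


class StudentNet(nn.Module):
    """f_student = p ∘ g ∘ h: same h and g architecture plus an MLP head p(.)."""

    def __init__(self, in_dim: int, feat_dim: int, proj_dim: int, hidden_dim: int):
        super().__init__()
        self.h = make_backbone(in_dim, feat_dim)  # same architecture as the teacher's h
        self.g = nn.Linear(feat_dim, proj_dim)    # same architecture as the teacher's g
        # MLP prediction head p(.); it exists only on the student side.
        self.p = nn.Sequential(
            nn.Linear(proj_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.p(self.g(self.h(x)))


# Assumed toy sizes: 3x8x8 inputs flattened to 192 values.
teacher = TeacherNet(in_dim=192, feat_dim=64, proj_dim=32)
student = StudentNet(in_dim=192, feat_dim=64, proj_dim=32, hidden_dim=128)

x = torch.randn(4, 3, 8, 8)
teacher_out = teacher(x)
student_out = student(x)
```

Both networks map a batch to vectors of the same size, so their outputs can be compared directly; the only structural difference is the extra `p` module on the student.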
