
Distillation of Embedding and Prediction Layer

Explore how to perform knowledge distillation in BERT models by transferring knowledge from the teacher network to the student network in both the embedding and prediction layers. Understand the loss functions used for embedding-layer, transformer-layer, and prediction-layer distillation to effectively train compact BERT variants such as TinyBERT.

Embedding layer distillation

In embedding layer distillation, we transfer knowledge from the embedding layer of the teacher to the embedding layer of the student. Let E^S denote the embedding of the student and E^T denote the embedding of the teacher. We then train the network to perform embedding layer distillation by minimizing the mean squared error between the student embedding E^S and the teacher embedding E^T.
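
Below is a minimal PyTorch-style sketch of this embedding-layer loss. It follows the TinyBERT formulation, in which the student's embedding size is smaller than the teacher's, so a learnable projection matrix W_e maps the student embedding into the teacher's space before the mean squared error is computed. The class name, the dimensions (312 for the student, 768 for the teacher), and the random tensors are illustrative assumptions, not taken from the original text.

```python
import torch
import torch.nn as nn

class EmbeddingDistillationLoss(nn.Module):
    """MSE between student and teacher embedding-layer outputs.

    A learnable projection (W_e) maps the smaller student embedding into the
    teacher's embedding space before the comparison, as done in TinyBERT.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.projection = nn.Linear(student_dim, teacher_dim, bias=False)  # W_e
        self.mse = nn.MSELoss()

    def forward(self, student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
        # student_emb: [batch, seq_len, student_dim]
        # teacher_emb: [batch, seq_len, teacher_dim]
        return self.mse(self.projection(student_emb), teacher_emb)

# Illustrative usage with stand-in tensors (batch of 8, sequence length 128)
loss_fn = EmbeddingDistillationLoss(student_dim=312, teacher_dim=768)
student_emb = torch.randn(8, 128, 312)   # would come from the student's embedding layer
teacher_emb = torch.randn(8, 128, 768)   # would come from the teacher's embedding layer
embedding_loss = loss_fn(student_emb, teacher_emb)
```

In practice this loss term is added to the transformer-layer and prediction-layer distillation losses during training, rather than being minimized on its own.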