Distillation of Embedding and Prediction Layer
Learn how to perform knowledge distillation in BERT models by transferring knowledge from the teacher network to the student network at both the embedding and prediction layers. Understand the loss functions used for embedding layer, transformer layer, and prediction layer distillation to effectively train compact BERT variants such as TinyBERT.
Embedding layer distillation
In embedding layer distillation, we transfer knowledge from the embedding layer of the teacher to the embedding layer of the student. Let $E^S$ denote the embedding of the student and $E^T$ denote the embedding of the teacher. The embedding layer distillation loss is then the mean squared error (MSE) between the student and teacher embeddings:

$L_{\text{embd}} = \text{MSE}(E^S W_e, E^T)$

Here, $W_e$ is a learnable projection matrix that maps the student's embedding into the same space as the teacher's embedding, which is necessary because the student's embedding dimension is typically smaller than the teacher's.
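To make this concrete, here is a minimal PyTorch sketch of the embedding layer distillation loss. The dimensions and the names `student_emb`, `teacher_emb`, and `W_e` are illustrative assumptions, not part of the original text; in practice, the embeddings would come from the student and teacher models.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumption): a TinyBERT-style student (d=312)
# distilled from a BERT-base-style teacher (d=768).
student_dim, teacher_dim = 312, 768
batch_size, seq_len = 8, 128

# Learnable projection W_e that maps student embeddings into the
# teacher's embedding space so the MSE compares matching shapes.
W_e = nn.Linear(student_dim, teacher_dim, bias=False)
mse = nn.MSELoss()

# Hypothetical embedding outputs standing in for the real models.
student_emb = torch.randn(batch_size, seq_len, student_dim)
teacher_emb = torch.randn(batch_size, seq_len, teacher_dim)

# Embedding layer distillation loss: L_embd = MSE(E^S W_e, E^T).
# The teacher's output is detached since only the student is trained.
loss_embd = mse(W_e(student_emb), teacher_emb.detach())
loss_embd.backward()  # gradients flow to the student side and to W_e
print(loss_embd.item())
```

During training, this loss is minimized alongside the other distillation losses, so the student's embedding layer learns to mimic the teacher's.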