Distillation of Embedding and Prediction Layer
Learn how to perform knowledge distillation in BERT models by transferring knowledge from the teacher network to the student network at both the embedding and prediction layers. Understand the loss functions used for embedding layer, transformer layer, and prediction layer distillation to effectively train compact BERT variants such as TinyBERT.
Embedding layer distillation
In embedding layer distillation, we transfer knowledge from the embedding layer of the teacher to the embedding layer of the student. Let $E^S$ denote the embedding of the student and $E^T$ denote the embedding of the teacher. The embedding layer distillation loss is then the mean squared error (MSE) between the student and teacher embeddings:

$L_{\text{embd}} = \text{MSE}(E^S W_e, E^T)$

Here, $W_e$ is a learnable projection matrix that maps the student's embedding into the same space as the teacher's embedding, which is necessary because the student's embedding dimension is typically smaller than the teacher's.
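To make this concrete, here is a minimal PyTorch sketch of the embedding layer distillation loss. The dimensions and the names `student_emb`, `teacher_emb`, and `W_e` are illustrative assumptions, not part of the original text; in practice, the embeddings would come from the student and teacher models.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumption): a TinyBERT-style student (d=312)
# distilled from a BERT-base-style teacher (d=768).
student_dim, teacher_dim = 312, 768
batch_size, seq_len = 8, 128

# Learnable projection W_e that maps student embeddings into the
# teacher's embedding space so the MSE compares matching shapes.
W_e = nn.Linear(student_dim, teacher_dim, bias=False)
mse = nn.MSELoss()

# Hypothetical embedding outputs standing in for the real models.
student_emb = torch.randn(batch_size, seq_len, student_dim)
teacher_emb = torch.randn(batch_size, seq_len, teacher_dim)

# Embedding layer distillation loss: L_embd = MSE(E^S W_e, E^T).
# The teacher's output is detached since only the student is trained.
loss_embd = mse(W_e(student_emb), teacher_emb.detach())
loss_embd.backward()  # gradients flow to the student side and to W_e
print(loss_embd.item())
```

During training, this loss is minimized alongside the other distillation losses, so the student's embedding layer learns to mimic the teacher's.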