Distillation Techniques for Pre-training and Fine-tuning

Learn about performing distillation in the pre-training and fine-tuning stages.

In TinyBERT, we use a two-stage learning framework consisting of the following stages:

  • General distillation

  • Task-specific distillation

This two-stage learning framework enables distillation in both the pre-training and fine-tuning stages. Let's take a look at how each stage works in detail.
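Before going into each stage, here is a minimal sketch of the overall flow, assuming a hypothetical `run_distillation` helper that stands in for the actual distillation loop (sketched later in this lesson); the function and argument names are illustrative only:

```python
# A minimal outline of the two-stage learning framework. `run_distillation`
# is a hypothetical placeholder for the full distillation loop.

def run_distillation(teacher, student, data):
    # Placeholder: transfer knowledge from `teacher` to `student` using `data`.
    return student


def tinybert_two_stage(bert_base, tinybert, general_corpus,
                       fine_tuned_teacher, task_dataset):
    # Stage 1 - general distillation: distill the pre-trained BERT-base
    # teacher into the student on the general corpus, producing the
    # general TinyBERT.
    general_tinybert = run_distillation(bert_base, tinybert, general_corpus)

    # Stage 2 - task-specific distillation: distill a teacher fine-tuned on
    # the downstream task, starting from the general TinyBERT.
    task_tinybert = run_distillation(fine_tuned_teacher, general_tinybert,
                                     task_dataset)
    return task_tinybert
```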

General distillation

General distillation is basically the pre-training step. Here, we use the large pre-trained BERT (BERT-base) as the teacher and transfer its knowledge to the small student BERT (TinyBERT) by performing distillation. Note that we apply distillation at all the layers.
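To make "distillation at all the layers" concrete, the following is a minimal PyTorch sketch of a layer-wise loss, assuming we already have the per-layer hidden states and attention matrices of both models for a batch. `W_h` (a learnable projection from the student's smaller hidden size to the teacher's) and `layer_map` (which pairs each student layer with a teacher layer, for example every k-th teacher layer) are names introduced here for illustration:

```python
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(teacher_hidden, student_hidden,
                                teacher_attn, student_attn,
                                W_h, layer_map):
    """Sum of MSE losses over the aligned (student layer, teacher layer) pairs.

    Assumes the student and teacher use the same number of attention heads,
    so their attention matrices have matching shapes.
    """
    loss = torch.tensor(0.0, device=W_h.device)
    for s_layer, t_layer in enumerate(layer_map):
        # Hidden-state loss: project the student's (smaller) hidden states
        # up to the teacher's hidden size before comparing them.
        loss = loss + F.mse_loss(student_hidden[s_layer] @ W_h,
                                 teacher_hidden[t_layer])
        # Attention loss: match the attention matrices of the paired layers.
        loss = loss + F.mse_loss(student_attn[s_layer],
                                 teacher_attn[t_layer])
    return loss
```

The embedding layer can be distilled in the same way, with its own projection matrix mapping the student's embedding size to the teacher's.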

We know that the teacher BERT-base model is pre-trained on a general dataset (Wikipedia and the Toronto BookCorpus). So, while performing distillation, that is, while transferring knowledge from the teacher (BERT-base) to the student (TinyBERT), we use the same general dataset.
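Putting these pieces together, a general-distillation loop over the general corpus might look like the sketch below. It assumes Hugging Face-style BERT models that can return hidden states and attentions, a `general_loader` yielding tokenized batches of the Wikipedia/BookCorpus text, and the `layerwise_distillation_loss`, `W_h` (a `torch.nn.Parameter`), and `layer_map` from the previous sketch; these names are illustrative rather than TinyBERT's actual training code:

```python
import torch

def general_distillation(teacher, student, general_loader, W_h, layer_map,
                         epochs=3, lr=5e-5):
    teacher.eval()  # the teacher is frozen; only the student (and W_h) learn
    optimizer = torch.optim.AdamW(list(student.parameters()) + [W_h], lr=lr)

    for _ in range(epochs):
        for batch in general_loader:        # batches of general-corpus text
            with torch.no_grad():
                t_out = teacher(**batch, output_hidden_states=True,
                                output_attentions=True)
            s_out = student(**batch, output_hidden_states=True,
                            output_attentions=True)

            # Drop hidden_states[0] (the embedding output) so hidden states
            # and attentions line up layer by layer.
            loss = layerwise_distillation_loss(
                t_out.hidden_states[1:], s_out.hidden_states[1:],
                t_out.attentions, s_out.attentions,
                W_h, layer_map)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return student  # the pre-trained student, i.e., the general TinyBERT
```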

After distillation, our student BERT will contain the knowledge transferred from the teacher, and we can call this pre-trained student BERT the general TinyBERT.
