Training the Student BERT (DistilBERT)
Explore the process of training student BERT models like DistilBERT by applying knowledge distillation techniques. Understand the combined use of distillation loss, masked language modeling loss, and cosine embedding loss. Learn how these methods create a lighter yet accurate BERT variant suitable for faster inference and edge deployment.
We can train the student BERT with the same data we used to pre-train the teacher BERT (BERT-base). Since BERT-base is pre-trained on English Wikipedia and the Toronto BookCorpus, we use these same corpora to train the student BERT (the small BERT).
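As a minimal sketch, the two corpora could be loaded and combined with the Hugging Face datasets library; the dataset identifiers and the Wikipedia snapshot config below are assumptions and may need adjusting to what is available in your environment:

```python
from datasets import load_dataset, concatenate_datasets

# Assumed Hub dataset identifiers: "wikipedia" (English snapshot config) and
# "bookcorpus"; adjust these to the corpora available in your setup.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora share the same schema,
# then concatenate them into a single pre-training corpus.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
corpus = concatenate_datasets([wiki, books])
```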
Using training strategies from the RoBERTa model
We'll borrow a few training strategies from the RoBERTa model. Following RoBERTa, we train the student BERT only on the masked language modeling task. During masked language modeling, we use dynamic masking, so a new mask pattern is sampled each time a sequence is fed to the model, and we use a large batch size at every iteration.
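A minimal sketch of dynamic masking, assuming the Hugging Face transformers data collator (the checkpoint name and the 15% masking rate are the usual defaults, not values fixed by this chapter):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The collator samples the [MASK] positions on the fly for every batch, so the
# same sentence can receive a different mask on each pass -- dynamic masking.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # standard 15% masking probability
)
```

A large effective batch size can then be obtained by combining the per-device batch size with gradient accumulation in the training loop.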
Computing distillation loss
As shown in the following figure, we take a ...
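A minimal sketch of the distillation (soft-target) loss, assuming the standard temperature-scaled softmax formulation; the temperature value here is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both output distributions with the temperature, then measure how
    # far the student's distribution is from the teacher's (KL divergence,
    # which matches cross-entropy over soft targets up to a constant).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

During training, this term is combined with the masked language modeling loss and a cosine embedding loss between the teacher's and student's hidden states; the weights on the three terms are hyperparameters.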