Training the Student BERT (DistilBERT)
Explore the process of training student BERT models like DistilBERT by applying knowledge distillation techniques. Understand the combined use of distillation loss, masked language modeling loss, and cosine embedding loss. Learn how these methods create a lighter yet accurate BERT variant suitable for faster inference and edge deployment.
We can train the student BERT with the same data we used to pre-train the teacher BERT (BERT-base). Since BERT-base is pre-trained on English Wikipedia and the Toronto BookCorpus, we use these same corpora to train the student BERT (the small BERT).
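As a minimal sketch, the two corpora could be loaded and combined with the Hugging Face datasets library; the dataset identifiers and the Wikipedia snapshot config below are assumptions and may need adjusting to what is available in your environment:

```python
from datasets import load_dataset, concatenate_datasets

# Assumed Hub dataset identifiers: "wikipedia" (English snapshot config) and
# "bookcorpus"; adjust these to the corpora available in your setup.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora share the same schema,
# then concatenate them into a single pre-training corpus.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
corpus = concatenate_datasets([wiki, books])
```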
Using training strategies from the RoBERTa model
We'll borrow a few training strategies from the RoBERTa model. Following RoBERTa, we train the student BERT only on the masked language modeling task. During masked language modeling, we use dynamic masking, so a new mask pattern is sampled each time a sequence is fed to the model, and we use a large batch size at every iteration.
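A minimal sketch of dynamic masking, assuming the Hugging Face transformers data collator (the checkpoint name and the 15% masking rate are the usual defaults, not values fixed by this chapter):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The collator samples the [MASK] positions on the fly for every batch, so the
# same sentence can receive a different mask on each pass -- dynamic masking.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # standard 15% masking probability
)
```

A large effective batch size can then be obtained by combining the per-device batch size with gradient accumulation in the training loop.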
Computing distillation loss
As shown in the following figure, we take a ...
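A minimal sketch of the distillation (soft-target) loss, assuming the standard temperature-scaled softmax formulation; the temperature value here is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both output distributions with the temperature, then measure how
    # far the student's distribution is from the teacher's (KL divergence,
    # which matches cross-entropy over soft targets up to a constant).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

During training, this term is combined with the masked language modeling loss and a cosine embedding loss between the teacher's and student's hidden states; the weights on the three terms are hyperparameters.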