Training the Student BERT (DistilBERT)

Learn how to train the student BERT in DistilBERT and how DistilBERT differs from the BERT-base model.

We can train the student BERT with the same dataset used for pre-training the teacher BERT (BERT-base). Since BERT-base is pre-trained on English Wikipedia and the Toronto BookCorpus dataset, we use the same data to train the student BERT (the smaller BERT).
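As a rough sketch of how such a corpus could be assembled, we might pull both datasets from the Hugging Face hub and merge them on the raw text column. The dataset identifiers and the Wikipedia dump name below are assumptions based on the public hub, not part of the original DistilBERT setup:

from datasets import load_dataset, concatenate_datasets

# Assumed hub identifiers; substitute whatever copies of the corpora you have.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
pretraining_corpus = concatenate_datasets([wiki, books])
print(pretraining_corpus)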

Using training strategies from the RoBERTa model

We'll borrow a few training strategies from the RoBERTa model. Following RoBERTa, we train the student BERT only on the masked language modeling task, and during masked language modeling we use dynamic masking, meaning the masking pattern is sampled anew every time a batch is built rather than fixed once during preprocessing. We also use a large batch size on every iteration.
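The sketch below illustrates dynamic masking with the Hugging Face transformers library: the masked language modeling collator draws a fresh random mask each time it builds a batch, so the same sentence is masked differently on every pass. The 15% masking probability is the standard BERT/RoBERTa value; the large batch size would simply be set in the training loop and is not shown here.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The collator applies MLM masking on the fly; 0.15 is the usual probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("Paris is a beautiful city", return_tensors="pt")
features = [{"input_ids": encoded["input_ids"][0]}]

# Because the mask is resampled per batch, the same sentence gets a different
# masking pattern on each call -- this is dynamic masking.
for _ in range(3):
    batch = collator(features)
    print(tokenizer.decode(batch["input_ids"][0]))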

Computing distillation loss

As shown in the following figure, we take a masked sentence and feed it as input to both the teacher BERT (the pre-trained BERT-base) and the student BERT, and each returns a probability distribution over the vocabulary as output. Next, we compute the distillation loss as the cross-entropy between the teacher's soft targets and the student's soft predictions.
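As a concrete illustration, here is a minimal PyTorch sketch of this soft cross-entropy. The temperature value and the random logits are placeholders for the example; in practice, the logits would come from the teacher BERT and the student BERT at the masked positions.

import torch
import torch.nn.functional as F

T = 2.0  # softmax temperature (illustrative value, not the original setting)

batch_size, vocab_size = 8, 30522
teacher_logits = torch.randn(batch_size, vocab_size)  # stand-in for teacher output
student_logits = torch.randn(batch_size, vocab_size)  # stand-in for student output

# Soft targets: teacher probabilities softened by the temperature.
soft_targets = F.softmax(teacher_logits / T, dim=-1)
# Soft predictions: student log-probabilities with the same temperature.
soft_predictions = F.log_softmax(student_logits / T, dim=-1)

# Cross-entropy between the soft targets and the soft predictions.
distillation_loss = -(soft_targets * soft_predictions).sum(dim=-1).mean()
print(distillation_loss)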
