...

Training the Student Network

Learn how to transfer the knowledge from the teacher to the student network.

We'll cover the following...

The distillation loss
Difference between the soft target and hard target
Difference between soft prediction and hard prediction
The student loss
Computing student loss
Computing distillation loss
Final loss function

So how do we transfer the dark knowledge from the teacher to the student? In other words, how is the student network trained so that it acquires the teacher's knowledge?

Note: Only the teacher network is pre-trained; the student network is not. The teacher network is pre-trained with softmax temperature.

As shown in the following figure, we feed the input sentence to both the teacher and the student networks and get a probability distribution as output from each. Because the teacher network is pre-trained, the probability distribution it returns serves as our target. The output of the teacher network is called a soft target, and the prediction made by the student network is called a soft prediction.
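To make the two distributions concrete, here is a minimal sketch of a softmax with temperature applied to teacher and student logits. The logit values, the three-class vocabulary, and the temperature `T = 2.0` are hypothetical and chosen only for illustration. The sketch also assumes the student applies the same temperature as the teacher when it produces its soft prediction.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into a probability distribution softened by temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits from each network for the same input sentence.
teacher_logits = [5.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 1.0]
T = 2.0  # assumed temperature, not a value taken from the lesson

soft_target = softmax_with_temperature(teacher_logits, T)      # teacher output
soft_prediction = softmax_with_temperature(student_logits, T)  # student output

print("soft target:    ", soft_target)
print("soft prediction:", soft_prediction)
```

A temperature above 1 flattens the distribution, so the teacher's soft target gives more probability to the less likely classes than a plain softmax would. That extra information about the non-top classes is what the student is meant to learn from.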

The distillation loss