Okay, so how do we transfer the dark knowledge from the teacher to the student? How is the student network trained, and how does it acquire knowledge from the teacher?

Note: Only the teacher network is pre-trained; the student network is not. The teacher network is pre-trained with softmax temperature.

As shown in the following figure, we feed the input sentence to both the teacher and the student networks, and each returns a probability distribution as output. Because the teacher network is pre-trained, the probability distribution it returns serves as our target. The output of the teacher network is called a soft target, and the prediction made by the student network is called a soft prediction.
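To make the two terms concrete, here is a minimal sketch of how a soft target and a soft prediction could be computed. It is an illustration only: the logit values, the temperature `T = 2.0`, and the helper name `softmax_with_temperature` are hypothetical, and the sketch assumes the student applies the same temperature as the teacher.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one masked position over a toy 3-word vocabulary.
teacher_logits = [4.0, 1.0, 0.5]   # from the pre-trained teacher
student_logits = [2.5, 1.5, 0.2]   # from the untrained student

T = 2.0  # assumed softmax temperature, shared by both networks

soft_target = softmax_with_temperature(teacher_logits, T)      # teacher output
soft_prediction = softmax_with_temperature(student_logits, T)  # student output

print("soft target:    ", [round(p, 3) for p in soft_target])
print("soft prediction:", [round(p, 3) for p in soft_prediction])
```

Both lists sum to 1, and raising `T` spreads probability mass away from the top-scoring word, which is what exposes the teacher's relative preferences among the remaining words.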

