Knowledge Distillation

Learn about knowledge distillation and dark knowledge in detail with an example.

Knowledge distillation is a model compression technique in which a small model is trained to reproduce the behavior of a large pre-trained model. It is also referred to as teacher-student learning, where the large pre-trained model is the teacher and the small model is the student. Let's understand how knowledge distillation works with an example.
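The teacher-student setup above can be sketched in a few lines of code. This is a minimal, illustrative sketch (not the full training procedure): the logit values are hypothetical, and we use plain KL divergence to measure how far the student's predicted distribution is from the teacher's, which is the quantity the student is trained to minimize.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL divergence between the teacher's distribution p and the student's q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits for a single prediction step over a 5-word vocabulary.
teacher_logits = [3.0, 1.0, 0.2, -0.5, -1.0]
student_logits = [2.5, 1.4, 0.1, -0.3, -0.9]

loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
print(f"distillation loss: {loss:.4f}")
```

During training, this divergence is computed for every training example and minimized by updating the student's weights, pulling the student's output distribution toward the teacher's.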

Example: Predicting the next word in a sentence

Suppose we have pre-trained a large model to predict the next word in a sentence; we call this pre-trained model the teacher network. If we feed a sentence to the teacher and ask it to predict the next word, it returns a probability distribution over all the words in the vocabulary, where each probability indicates how likely that word is to be the next word, as shown in the following figure. Note that for simplicity and better understanding, we'll assume our vocabulary contains only five words:
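To make the figure concrete, the following sketch computes such a distribution for a hypothetical five-word vocabulary. The words and logit values are invented for illustration; a real teacher network would produce the logits itself.

```python
import math

# Hypothetical 5-word vocabulary and teacher logits (illustrative values).
vocab = ["sun", "moon", "sky", "rain", "wind"]
logits = [4.0, 1.5, 0.5, -0.5, -1.5]

def softmax(scores):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

The word with the highest logit receives the highest probability, but every other word still gets a small, nonzero probability, and it is this full distribution that the teacher passes on to the student.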
