Summary: BERT Variants - Based on Knowledge Distillation

Let’s summarize what we have learned so far.

Key highlights

Summarized below are the main highlights of what we've learned in this chapter.

  • We started off by learning what knowledge distillation is and how it works.

  • We learned that knowledge distillation is a model compression technique in which a small model is trained to reproduce the behavior of a large pre-trained model. It is also referred to as teacher-student learning, where the large pre-trained model is the teacher and the small model is the student (a minimal sketch of the distillation loss follows this list).

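To make the teacher-student idea concrete, here is a minimal sketch of a soft-target distillation loss in PyTorch. It is not the exact recipe used by any particular BERT variant; the helper name `distillation_loss`, the temperature value, and the toy logits below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student is trained to match the teacher's
    softened probability distribution (temperature > 1 smooths the logits)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 examples over a 10-class output.
teacher_logits = torch.randn(4, 10)                       # stands in for the frozen teacher's outputs
student_logits = torch.randn(4, 10, requires_grad=True)   # stands in for the small student's outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                           # gradients flow only into the student
print(loss.item())
```

In practice, this soft-target loss is usually combined with the ordinary hard-label loss on the student, so the student learns both from the ground-truth labels and from the teacher's richer output distribution.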