Model Compression
Explore how knowledge distillation compresses large AI models into smaller, more efficient ones by training a student model to mimic a teacher model’s behavior. Understand the theory, techniques, and practical scenarios where distillation outperforms pruning or quantization, enabling deployment on limited hardware without significant accuracy loss.
Knowledge distillation is a common topic in GenAI interviews because modern models have become enormous (the largest Llama 4 variant reportedly reaches 2 trillion parameters), while real-world deployments demand smaller, efficient models. Interviewers use this question to assess whether you understand how to compress these large models into practical ones without compromising performance too much, and why distillation is a key strategy for achieving this.
They want to see that you understand what knowledge distillation is, why it’s useful, and how it relates to model efficiency, deployment constraints, and accuracy preservation. They’re also looking for awareness of other compression techniques—like pruning and quantization—and whether you can distinguish distillation from them.
A notable example is Llama 4: Meta reportedly trained a massive 2-trillion-parameter "teacher" model (Llama 4 Behemoth) but only released smaller versions, such as Llama 4 Scout (109B parameters) and Llama 4 Maverick (400B parameters). These smaller models were produced by distilling knowledge from the giant model into deployable students. This illustrates the real motivation behind distillation: it transfers the capabilities of a huge model into a lightweight version that more people can actually use.
What is knowledge distillation?
Knowledge distillation is a model compression technique in which a large, high-capacity model (the "teacher") transfers its knowledge to a smaller "student" model. The idea is to train the student to mimic the teacher's behavior, typically by matching the teacher's output distributions rather than just the hard labels. Done well, the student achieves nearly the same performance as the teacher while using far fewer parameters and far less computation. In other words, the complex knowledge captured by a huge model (or even an ensemble of models) is "distilled" into a single lightweight model that can be practically deployed under real-world constraints.
Another way to think about it is as a mentor and an apprentice: the teacher (mentor) is an expert with a wealth of knowledge but is perhaps too slow or costly to use in practice; the student (apprentice) is less powerful, but if properly trained by the mentor, it can handle the same tasks nearly as well while remaining fast and cheap enough for everyday use.
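To make the "mimic the teacher" idea concrete, here is a minimal sketch of the classic distillation objective in PyTorch. The framework choice, the distillation_loss name, and the hyperparameter values are illustrative assumptions, not something specified in this lesson: the student is trained on a weighted mix of the usual cross-entropy on hard labels and a KL-divergence term that pulls its temperature-softened predictions toward the teacher's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a softened teacher-matching term.

    student_logits, teacher_logits: (batch, num_classes) tensors
    labels: (batch,) ground-truth class indices
    temperature: softens both distributions so the student can learn
        from the teacher's relative confidences ("dark knowledge")
    alpha: weight on the distillation (soft-target) term
    """
    # Soft targets: the teacher's temperature-scaled probabilities.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1.0 - alpha) * ce_loss

# Example usage with random tensors (batch of 4, 10 classes):
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In this sketch, the temperature controls how much of the teacher's "almost right" answers the student sees (higher values expose more of the teacher's full distribution), while alpha balances imitating the teacher against fitting the ground-truth labels.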