Model Compression
Explore how knowledge distillation compresses large AI models into smaller, more efficient ones by training a student model to mimic a teacher model’s behavior. Understand the theory, techniques, and practical scenarios where distillation outperforms pruning or quantization, enabling deployment on limited hardware without significant accuracy loss.
Knowledge distillation is a common topic in GenAI interviews because modern models have become enormous—Llama 4 reportedly reaches 2 trillion parameters—while real-world deployments demand smaller, efficient models. Interviewers use this question to assess whether you understand how to compress these large models into practical ones without compromising performance too much, and why distillation is a key strategy for achieving this.
They want to see that you understand what knowledge distillation is, why it’s useful, and how it relates to model efficiency, deployment constraints, and accuracy preservation. They’re also looking for awareness of other compression techniques—like pruning and quantization—and whether you can distinguish distillation from them.
A notable example is Llama 4: Meta reportedly trained a massive 2-trillion-parameter “teacher” model but only released smaller versions, such as Llama 4 Scout (109B) and Maverick (400B). These smaller models were produced by distilling knowledge from the giant model into deployable students. This illustrates the real motivation behind distillation: it transfers the capabilities of a huge model into a lightweight version that more people can actually use.
What is knowledge distillation?
Knowledge distillation is a model compression technique in which a large, high-capacity model (the “teacher”) transfers its knowledge to a smaller “student” model. The student is trained to mimic the teacher’s behavior, and by doing so it can approach the teacher’s performance while using far fewer parameters and far less computation. More broadly, distillation transfers the knowledge of a large, unwieldy model (or even an ensemble of models) into a single compact model that can be practically deployed under real-world constraints: the complex knowledge encapsulated by the huge model is “distilled” into a lightweight one.
Another way to think about it is as a mentor and an apprentice: the teacher (mentor) is an expert with a wealth of knowledge but may be too slow or costly to use in practice; the student (apprentice) is less powerful but, if properly trained by the mentor, can learn to perform the task almost as well. The term “distillation” is used by analogy with distilling a spirit or essence: we extract the essence of the large model’s knowledge and bottle it into a smaller model.
Quick answer: Knowledge distillation trains a smaller model (the student) to mimic a large, high-performing model (the teacher) by learning from the teacher’s soft output probabilities or internal representations. The result is a compact model that retains much of the teacher’s performance but is far cheaper to run.
A key thing to remember is that knowledge distillation doesn’t copy a teacher model’s weights into a smaller student model—instead, the student is trained from scratch to mimic the teacher’s outputs. This often leads to better generalization than training the small model directly on raw data because the teacher’s “soft targets” provide richer learning signals than hard labels. In short, knowledge distillation is learning a small model from a large model—a form of model compression through imitation. ...
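To make the idea of soft targets concrete, here is a minimal NumPy sketch of the classic distillation loss: a weighted sum of (a) KL divergence between the teacher’s and student’s temperature-softened output distributions and (b) ordinary cross-entropy against the hard label. The function names, the toy logits, and the specific `temperature` and `alpha` values are illustrative assumptions, not a production recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative confidence across the wrong classes ("dark knowledge").
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Combine the soft-target loss (mimic the teacher) with the
    hard-label loss (fit the ground truth). alpha weights the two."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable as the temperature changes.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    soft_loss = (temperature ** 2) * kl
    # Standard cross-entropy against the true class, at temperature 1.
    hard_loss = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example: the teacher is confident in class 0 but assigns some
# probability to class 2, and that ranking is part of what the student learns.
teacher_logits = np.array([5.0, 1.0, 2.5])
student_logits = np.array([2.0, 0.5, 1.0])
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)
```

When the student’s logits match the teacher’s exactly, the KL term vanishes and only the hard-label cross-entropy remains, which is a quick sanity check on the implementation.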