Making Distilled Models from DeepSeek-R1
Learn what distillation is and explore the well-known distilled models built from DeepSeek-R1.
Imagine you’re compressing a large, detailed book into a concise guide. The original book contains deep insights, examples, and background information, but you need a version that captures the key ideas in a more efficient form. By carefully summarizing, you retain the most important lessons and core concepts, making the guide easier to use while still preserving the essence of the original. This process of transferring knowledge into a more compact and efficient form reflects the idea of distillation.
But what is distillation?
Model distillation
Now, let’s talk AI. Much like summarizing that book, distillation in AI refers to a method where we train a smaller model (called the student) to mimic a larger, more powerful model (the teacher). The goal? To retain as much intelligence as possible while making the smaller model fast and efficient. This matters because running massive models on everyday devices would be like trying to fit a Ferrari engine into a bicycle frame.
Educative byte: Not all small language models (SLMs) are distilled models. Some SLMs, like Mistral 7B or Gemma 2B, are small by design.
How model distillation works
Below is a step-by-step process of how model distillation works in general:
Step 1: Train the big teacher model. We start with a large AI model, trained on tons of data. This model is accurate and powerful but slow and computationally expensive to use.
Step 2: Generate soft labels. Instead of just using traditional hard labels (like “dog” or “cat”), the teacher model provides probability distributions. For example, instead of saying “this is definitely a cat,” it might say:
🐱 Cat: 85%
🦁 Lion: 10%
🐶 Dog: 5%
These “soft” probabilities contain hidden knowledge, such as relationships between classes (e.g., cats and lions are somewhat related).
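To make this concrete, here is a minimal PyTorch sketch of how soft labels can be produced from a teacher's raw outputs (logits). The logit values and the temperature of 2.0 are made-up illustration values, not numbers from any real model:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image over three classes: [cat, lion, dog].
teacher_logits = torch.tensor([[4.0, 1.9, 1.2]])

# A temperature above 1 "softens" the distribution so the less likely classes
# (lion, dog) keep visible probability mass instead of being rounded away.
temperature = 2.0
soft_labels = F.softmax(teacher_logits / temperature, dim=-1)

print(soft_labels)  # roughly [[0.63, 0.22, 0.15]]: cat is most likely, lion is still plausible
```

The higher the temperature, the flatter the distribution, and the more of the teacher's "hidden knowledge" about class relationships is exposed to the student.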
Step 3: Train the student model. A smaller, more efficient model is trained using both the original data and the teacher’s soft labels. This helps the student learn patterns in a more nuanced way rather than just memorizing the dataset.
Step 4: Match the student’s predictions to the teacher’s. The student model is optimized to mimic the teacher’s probability outputs as closely as possible. It doesn’t just learn the right answers; it learns how the teacher thinks. ...
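Below is a minimal sketch of steps 3 and 4 in PyTorch: the student is trained against both the ground-truth hard labels and the teacher's softened outputs. The loss weighting `alpha`, the temperature, and the toy tensors are illustrative assumptions, not the recipe used for any particular distilled model:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a 'match the teacher' loss."""
    # Standard cross-entropy against the ground-truth labels (step 3).
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # KL divergence between the softened student and teacher distributions (step 4).
    # Multiplying by T^2 keeps its gradient scale comparable to the hard loss.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy batch: 2 examples, 3 classes (cat, lion, dog).
student_logits = torch.randn(2, 3, requires_grad=True)
teacher_logits = torch.randn(2, 3)          # frozen teacher outputs
hard_labels = torch.tensor([0, 2])          # ground truth: cat, dog

loss = distillation_loss(student_logits, teacher_logits, hard_labels)
loss.backward()  # gradients update only the student
```

Minimizing the KL term pushes the student's probability distribution toward the teacher's, which is exactly what "learning how the teacher thinks" means in practice.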