Transfer Learning, Fine-Tuning, and Model Compression
Explore methods to deploy large machine learning models efficiently by mastering transfer learning strategies, parameter-efficient fine-tuning, knowledge distillation, structured pruning, and neural architecture search. Understand how to balance model accuracy with latency, memory, and hardware constraints to design production-ready ML systems.
Suppose you are given a 300-million-parameter vision transformer that achieves strong accuracy on product recognition. The retailer wants to run it on mobile devices with a memory budget under 50 MB and inference latency below 10 ms. Your interviewer asks: “How do you get this model into production?” The question tests whether you can reason beyond accuracy and account for deployment constraints. After loss functions and calibration, the next production challenge is deployment efficiency: serving a high-quality model under memory, latency, and device constraints.
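To make the size of the gap concrete, here is a rough back-of-the-envelope sketch. Only the 300M parameter count and the 50 MB budget come from the scenario; the candidate precisions (fp32, fp16, int8) and the helper names are illustrative assumptions, and the estimate counts weights only, ignoring activations and runtime overhead.

```python
# Weight-only footprint estimate for the scenario above.
# Assumptions (not from the scenario): the precision options and decimal MB.

PARAMS = 300_000_000   # from the scenario
BUDGET_MB = 50         # from the scenario
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}  # assumed precisions


def weight_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Size of the weights alone, in decimal megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def max_params_within_budget(budget_mb: float, bytes_per_param: int) -> int:
    """Largest parameter count whose weights alone fit in the budget."""
    return int(budget_mb * 1e6 // bytes_per_param)


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        size = weight_size_mb(PARAMS, nbytes)
        print(f"{name}: {size:.0f} MB ({size / BUDGET_MB:.0f}x the budget)")
    print("max int8 params in budget:", max_params_within_budget(BUDGET_MB, 1))
```

Under these assumptions, even one byte per parameter leaves the weights at six times the budget, so lower precision alone cannot close the gap and the parameter count itself has to shrink.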
This lesson covers the core toolkit for bridging the gap between model quality and deployment constraints: when to fine-tune a foundation model versus train from scratch, how parameter-efficient methods such as LoRA adapt large models without copying all model weights, how knowledge distillation trains a smaller student model to approximate a larger teacher model, how structured pruning removes less useful components to fit edge-hardware constraints, and how neural architecture search helps find compact architectures under hardware constraints. The lesson ends with a decision tree that sequences these techniques into a cost-ordered compression pipeline. In ML system design interviews, these topics distinguish Staff+ candidates because they demonstrate awareness of serving cost, latency, and deployment constraints beyond model accuracy.
Transfer learning: Fine-tune or train from scratch
Transfer learning reuses learned representations from a source task or domain and applies them to a different target task. Instead of training a model from random initialization, you start from weights that already encode useful patterns, such as edge detectors in early vision layers or syntactic structure in language model embeddings.
The decision of whether to fine-tune or train from scratch depends on two axes: how similar the target task is to the source task, and how much labeled data you have for the target. These two factors create four distinct scenarios, each with a different recommended strategy.
High similarity, small data: Freeze the base layers and fine-tune only the classification head. For example, adapting an ImageNet-trained backbone for medical imaging with only 5,000 labeled images works well because low-level visual features transfer directly.
High similarity, large data: Fine-tune the entire network but apply layer-wise learning rate decay, using smaller learning rates for early layers that encode general features and higher rates for task-specific layers.
Low similarity, small data: This is the most dangerous zone. Fine-tuning risks catastrophic forgetting, a phenomenon where aggressive weight updates overwrite the general-purpose representations learned during pre-training, degrading performance on both the original and target tasks. Consider PEFT methods or training from scratch.
Low similarity, large data: Train from scratch or use pre-trained weights only for initialization. Forcing transfer from an unrelated domain wastes compute and can produce negative transfer, a situation where using a pre-trained model actually hurts target-task performance compared to training from random initialization.
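The freezing and layer-wise learning rate decay described in the two high-similarity scenarios can be sketched as a per-layer schedule. This is a minimal, framework-free illustration: the function name, the geometric decay rule, and the example values are assumptions chosen for clarity, not a prescribed recipe. In practice the resulting rates would typically be passed to an optimizer as per-parameter-group settings.

```python
from typing import List


def layerwise_learning_rates(
    num_layers: int,
    top_lr: float,
    decay: float,
    num_frozen: int = 0,
) -> List[float]:
    """Return one learning rate per layer; index 0 is the earliest layer.

    The last (most task-specific) layer gets top_lr. Each earlier layer is
    scaled down by `decay`, so general-purpose early layers move the least.
    The first `num_frozen` layers get a rate of 0.0, which models freezing.
    All names and the geometric rule are illustrative assumptions.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    if not 0 <= num_frozen <= num_layers:
        raise ValueError("num_frozen must be between 0 and num_layers")

    rates = [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
    for i in range(num_frozen):
        rates[i] = 0.0  # frozen layers receive no updates
    return rates


if __name__ == "__main__":
    # Large-data case: fine-tune everything, early layers more gently.
    print(layerwise_learning_rates(num_layers=4, top_lr=1e-3, decay=0.5))
    # Small-data case: freeze all but the final layer (the head).
    print(layerwise_learning_rates(num_layers=4, top_lr=1e-3, decay=1.0, num_frozen=3))
```

Setting `num_frozen` to all layers except the last approximates the "freeze the base, fine-tune the head" strategy, while `num_frozen=0` with `decay` below 1 approximates full fine-tuning with layer-wise decay.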
Google’s approach illustrates this well. BERT is fine-tuned for search ranking, where the task aligns closely with its pre-training objective, but highly specialized verticals with proprietary feature spaces often use task-specific models trained from scratch.
The following table summarizes these four scenarios with their associated risks and real-world examples:
Transfer Learning Strategy Selection Matrix
Scenario | Recommended Strategy | Primary Risk | Real-World Example |
High Similarity + Small Data | Freeze base and fine-tune head | Overfitting if too many layers unfrozen | Medical imaging with ImageNet backbone |
High Similarity + Large Data | Full fine-tuning with layer-wise LR decay | Catastrophic forgetting if learning rate too high | BERT fine-tuned for search ranking |
Low Similarity + Small Data | PEFT or train from scratch | Negative transfer if forced | Satellite imagery with NLP backbone |
Low Similarity + Large Data | Train from scratch or pre-trained init only | Wasted compute if transfer is negative | Custom fraud model on proprietary features |
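The matrix above can also be written as a small lookup, which is a handy way to state your quadrant explicitly. This is only a sketch: the function and constant names are hypothetical, and it takes booleans because the lesson does not define numeric thresholds for "high similarity" or "large data".

```python
from typing import Dict, Tuple

# Keys are (high_similarity, large_data); values are the recommended
# strategies copied from the matrix. Names here are hypothetical.
STRATEGY_MATRIX: Dict[Tuple[bool, bool], str] = {
    (True, False): "Freeze base and fine-tune head",
    (True, True): "Full fine-tuning with layer-wise LR decay",
    (False, False): "PEFT or train from scratch",
    (False, True): "Train from scratch or pre-trained init only",
}


def recommend_strategy(high_similarity: bool, large_data: bool) -> str:
    """Look up the recommended strategy for one quadrant of the matrix."""
    return STRATEGY_MATRIX[(high_similarity, large_data)]


if __name__ == "__main__":
    # Example: a task close to the source domain with few labels.
    print(recommend_strategy(high_similarity=True, large_data=False))
```

Deciding what counts as "high" or "large" is left to the caller on purpose, since that judgment depends on the domain and the model.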
Attention: Many candidates default to “just fine-tune a pre-trained model” without checking task similarity. In an interview, explicitly stating which quadrant your problem falls into demonstrates production maturity.
With the transfer learning decision framework established, the next question becomes: when you do fine-tune, how do you do it efficiently at scale?
LoRA and parameter-efficient fine-tuning
The multi-tenant serving problem
Full fine-tuning of a large language model with billions of parameters updates every weight, so training must hold gradients and optimizer states for the whole model, and each task ends up with its own complete set of weights. If you serve ten different fine-tuned variants, you need ten full model copies in memory. This makes multi-tenant serving prohibitively expensive.
Parameter-efficient fine-tuning (PEFT) solves this by freezing most pre-trained weights and updating only a small number of additional or selected parameters. The base model stays shared across tasks, and only lightweight task-specific modules differ.
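A rough comparison shows why sharing the base model matters. Only the count of ten variants comes from the text above; the 7-billion-parameter base, the 10-million-parameter task module, and 2 bytes per parameter are hypothetical values chosen for illustration, and the estimate covers weights only.

```python
def full_copies_gb(base_params: int, num_tasks: int, bytes_per_param: int) -> float:
    """Weight memory when every task keeps its own fully fine-tuned copy."""
    return num_tasks * base_params * bytes_per_param / 1e9


def shared_base_gb(
    base_params: int, module_params: int, num_tasks: int, bytes_per_param: int
) -> float:
    """Weight memory with one frozen base plus a small module per task."""
    return (base_params + num_tasks * module_params) * bytes_per_param / 1e9


if __name__ == "__main__":
    BASE_PARAMS = 7_000_000_000   # assumed base model size
    MODULE_PARAMS = 10_000_000    # assumed per-task module size
    NUM_TASKS = 10                # ten variants, as in the text
    BYTES = 2                     # assumed 16-bit weights

    print("full copies :", full_copies_gb(BASE_PARAMS, NUM_TASKS, BYTES), "GB")
    print("shared base :", shared_base_gb(BASE_PARAMS, MODULE_PARAMS, NUM_TASKS, BYTES), "GB")
```

With these assumed numbers, ten full copies need about 140 GB of weights while the shared-base layout needs about 14.2 GB, and each additional task adds only the size of its small module.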
How LoRA works
LoRA (Low-Rank Adaptation) is a PEFT technique that decomposes weight updates into two small matrices instead of modifying the full weight matrix, reducing ...