Transfer Learning, Fine-Tuning, and Model Compression

Explore methods for deploying large machine learning models efficiently: transfer learning strategies, parameter-efficient fine-tuning, knowledge distillation, structured pruning, and neural architecture search. Understand how to balance model accuracy against latency, memory, and hardware constraints to design production-ready ML systems.

We'll cover the following...

Transfer learning: Fine-tune or train from scratch
LoRA and parameter-efficient fine-tuning
- The multi-tenant serving problem
- How LoRA works
  - Serving advantages
Knowledge distillation for model compression
- The teacher-student framework
  - Temperature scaling and soft targets
Pruning and NAS for edge deployment
- Structured pruning
- Neural architecture search
Summary

Suppose you are given a 300-million-parameter vision transformer that achieves strong accuracy on product recognition. The retailer wants to run it on mobile devices with a memory budget under 50 MB and inference latency below 10 ms. Your interviewer asks: “How do you get this model into production?” The question tests whether you can reason beyond accuracy and account for deployment constraints. After loss functions and calibration, the next production challenge is deployment efficiency: serving a high-quality model under memory, latency, and device constraints.
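To make the size of the gap concrete, here is a rough back-of-the-envelope sketch. It assumes memory is dominated by the weights (activations and runtime overhead are ignored) and uses the standard storage widths of 4, 2, and 1 bytes per parameter; the helper names are illustrative only.

```python
# Back-of-the-envelope weight-memory estimate for the scenario above.
# Assumption: only model weights are counted; activations and runtime
# overhead are ignored, so real footprints will be somewhat larger.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}  # standard storage widths


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Return weight storage in MB (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def max_params_for_budget(budget_mb: float, dtype: str) -> int:
    """Largest parameter count whose weights fit in budget_mb."""
    return int(budget_mb * 1e6 // BYTES_PER_PARAM[dtype])


NUM_PARAMS = 300_000_000  # from the scenario
BUDGET_MB = 50            # from the scenario

for dtype in BYTES_PER_PARAM:
    size = weight_megabytes(NUM_PARAMS, dtype)
    fit = max_params_for_budget(BUDGET_MB, dtype)
    print(f"{dtype}: {size:.0f} MB as-is, {size / BUDGET_MB:.0f}x over budget, "
          f"at most {fit:,} params would fit")
```

Even at one byte per parameter the weights alone come to 300 MB, six times the budget, so shrinking the numeric precision is not enough on its own; the parameter count itself has to come down.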

This lesson covers the core toolkit for bridging the gap between model quality and deployment constraints. It covers when to fine-tune a foundation model versus train from scratch, how parameter-efficient methods such as LoRA adapt large models without copying all model weights, how knowledge distillation trains a smaller student model to approximate a larger teacher model, how structured pruning removes less useful components to fit edge-hardware constraints, and how neural architecture search helps find compact architectures under hardware constraints. The lesson ends with a decision tree that sequences these techniques into a cost-ordered compression pipeline. These are Staff+ differentiators in ML system design interviews because they demonstrate awareness of serving cost, latency, and deployment constraints beyond just model accuracy.

Transfer learning: Fine-tune or train from scratch

Transfer learning reuses learned representations from a source task or domain and applies them to a different target task. Instead of training a model from random initialization, you start from weights that already encode useful patterns, such as edge detectors in early vision layers or syntactic structure in language model embeddings.

The decision of whether to fine-tune or train from scratch depends on two axes: how similar the target task is to the source task, and how much labeled data you have for the target. These two factors create four distinct scenarios, each with a different recommended strategy.

High similarity, small data: Freeze the base layers and fine-tune only the classification head. For example, adapting an ImageNet-trained backbone for medical imaging with only 5,000 labeled images works well because low-level visual features transfer directly.
High similarity, large data: Fine-tune the entire network but apply layer-wise learning rate decay, using smaller learning rates for early layers that encode general features and higher rates for task-specific layers.
Low similarity, small data: This is the most dangerous zone. Fine-tuning risks catastrophic forgetting, a phenomenon where aggressive weight updates overwrite the general-purpose representations learned during pre-training, degrading performance on both the original and target tasks. Consider PEFT methods or training from scratch.
Low similarity, large data: Train from scratch or use pre-trained weights only for initialization. Forcing transfer from an unrelated domain wastes compute and can produce negative transfer, a situation where using a pre-trained model actually hurts target-task performance compared to training from random initialization.
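Layer-wise learning rate decay from the second scenario can be sketched as a simple geometric schedule. This is a minimal illustration, not a prescribed recipe: the function name, the decay factor of 0.5, and the base rate of 1e-3 are all assumptions chosen for readability.

```python
def layerwise_learning_rates(num_layers: int, base_lr: float, decay: float) -> list[float]:
    """Return one learning rate per layer, index 0 = earliest layer.

    The last (most task-specific) layer gets base_lr; each earlier
    layer's rate is multiplied by `decay` one more time.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Hypothetical values: 4 layers, base rate 1e-3, decay 0.5.
rates = layerwise_learning_rates(num_layers=4, base_lr=1e-3, decay=0.5)
for i, lr in enumerate(rates):
    print(f"layer {i}: lr = {lr:.6f}")
```

In a training framework, each rate would typically be attached to its layer's parameters as a separate optimizer parameter group. Freezing the base, as in the first scenario, is the limiting case where early layers receive no updates at all.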

Google’s use of BERT illustrates this well. BERT is fine-tuned for search ranking, where queries and documents are natural-language text close to what the model saw during pre-training, whereas highly specialized verticals with proprietary feature spaces often use task-specific models trained from scratch.

The following table summarizes these four scenarios with their associated risks and real-world examples:

Transfer Learning Strategy Selection Matrix

| Scenario | Recommended Strategy | Primary Risk | Real-World Example |
|---|---|---|---|
| High Similarity + Small Data | Freeze base and fine-tune head | Overfitting if too many layers unfrozen | Medical imaging with ImageNet backbone |
| High Similarity + Large Data | Full fine-tuning with layer-wise LR decay | Catastrophic forgetting if learning rate too high | BERT fine-tuned for search ranking |
| Low Similarity + Small Data | PEFT or train from scratch | Negative transfer if forced | Satellite imagery with NLP backbone |
| Low Similarity + Large Data | Train from scratch or pre-trained init only | Wasted compute if transfer is negative | Custom fraud model on proprietary features |
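The matrix can also be read as a lookup on the two axes. The sketch below is only an illustration: the function name is hypothetical, and the source gives no numeric cutoff for "large" data, so the caller must supply one (the 100,000 used in the usage line is a placeholder, not a recommendation).

```python
def choose_strategy(high_similarity: bool, num_labeled: int,
                    large_data_threshold: int) -> str:
    """Map the two axes of the matrix to its recommended strategy.

    `large_data_threshold` is caller-supplied: the lesson does not
    define a numeric boundary between small and large data.
    """
    large_data = num_labeled >= large_data_threshold
    matrix = {
        (True, False): "Freeze base and fine-tune head",
        (True, True): "Full fine-tuning with layer-wise LR decay",
        (False, False): "PEFT or train from scratch",
        (False, True): "Train from scratch or pre-trained init only",
    }
    return matrix[(high_similarity, large_data)]


# The medical-imaging example: similar task, 5,000 labeled images.
# The threshold of 100_000 is a placeholder assumption.
print(choose_strategy(True, 5_000, large_data_threshold=100_000))
```

Judging task similarity is the hard part and stays a human call here; the code only makes explicit that the recommendation changes when either axis flips.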

Attention: Many candidates default to “just fine-tune a pre-trained model” without checking task similarity. In an interview, explicitly stating which quadrant your problem falls into demonstrates production maturity.

With the transfer learning decision framework established, the next question becomes: when you do fine-tune, how do you do it efficiently at scale?

LoRA and parameter-efficient fine-tuning

The multi-tenant serving problem

Full fine-tuning of a large language model with billions of parameters updates every weight. Training must therefore hold gradients and optimizer states for all parameters, and each task ends up with its own complete copy of the model weights. If you serve ten different fine-tuned variants, you need ten full model copies in memory. This makes multi-tenant serving prohibitively expensive.

Parameter-efficient fine-tuning (PEFT) solves this by freezing most pre-trained weights and updating only a small number of additional or selected parameters. The base model stays shared across tasks, and only lightweight task-specific modules differ.

How LoRA works

LoRA (Low-Rank Adaptation) is a PEFT technique that decomposes weight updates into two small matrices instead of modifying the full weight matrix, reducing ...
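As a rough sketch of the two-matrix idea, the commonly published LoRA formulation keeps a weight matrix W of shape d x k frozen and learns an update B·A, with B of shape d x r and A of shape r x k, scaled by alpha / r. The toy sizes, the alpha value, and the helper names below are assumptions for illustration; plain Python lists are used to keep the example dependency-free.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for a d x k weight matrix."""
    return d * k, r * (d + k)


def lora_effective_weight(w, b, a, alpha: float):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
d, k, r = 6, 4, 2  # toy sizes, chosen only for illustration
w = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, random init
b = [[0.0] * r for _ in range(d)]                                  # trainable, zero init

w_eff = lora_effective_weight(w, b, a, alpha=4)  # alpha=4 is a hypothetical value
print("unchanged at init:", w_eff == w)
print("params (full, lora):", lora_param_counts(4096, 4096, 8))
```

Because B starts at zero, the adapted layer is identical to the base layer before any training. For a 4096 x 4096 matrix with rank 8, the two small matrices hold 65,536 parameters against 16,777,216 for a full update, a 256-fold reduction for that layer.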
