LoRA: Low-Rank Adaptation
Explore how Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models by training only small matrices while keeping the main model frozen. Learn how LoRA reduces memory requirements and mitigates catastrophic forgetting, and pick up practical strategies for choosing hyperparameters for task-specific adaptation.
When you fine-tune a large language model, you are asking every single parameter to shift in response to your new data. For a 7-billion-parameter model, that means 7 billion floating-point values must receive gradient updates, be stored alongside optimizer states, and be checkpointed to disk. The compute bill adds up fast, and the risk of overwriting useful pre-trained knowledge is real. But what if the weight changes needed for your task actually occupy a tiny fraction of that enormous parameter space? This is the insight that makes Low-Rank Adaptation, or LoRA, one of the most practical breakthroughs in efficient fine-tuning. It lets you train a sliver of new parameters while keeping the original model completely frozen, cutting trainable parameters by orders of magnitude without sacrificing performance.
Why full fine-tuning hits a wall
The previous lesson established that fine-tuning is justified when you need persistent behavioral change in a model. But full fine-tuning carries steep costs that scale directly with model size.
Consider what happens during a standard fine-tuning run on a 7B-parameter model. Every parameter receives a gradient, and the Adam optimizer maintains two additional state variables (the first and second moment estimates) for each parameter. In mixed-precision training, this creates a memory footprint roughly four times the model size. For a 7B model stored in 16-bit precision, that translates to approximately 56 GB of GPU memory just for weights and optimizer states, before you even account for activations and batch data.
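The arithmetic above can be sketched as a quick back-of-envelope calculation. This sketch assumes weights, gradients, and both Adam moment estimates are all kept in 16-bit precision; real mixed-precision setups often keep optimizer states (and a master copy of the weights) in 32-bit, which pushes the number higher still.

```python
# Back-of-envelope GPU memory for full fine-tuning of a 7B model,
# assuming every tensor is stored in 16-bit precision (2 bytes).
PARAMS = 7e9
BYTES_16BIT = 2

weights = PARAMS * BYTES_16BIT          # model weights (14 GB)
grads = PARAMS * BYTES_16BIT            # one gradient per weight (14 GB)
adam_states = 2 * PARAMS * BYTES_16BIT  # first + second moments (28 GB)

total_gb = (weights + grads + adam_states) / 1e9
print(f"{total_gb:.0f} GB")  # 56 GB, before activations and batch data
```

Note that this excludes activation memory, which grows with batch size and sequence length on top of the fixed cost computed here.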
Attention: Many practitioners underestimate optimizer memory. Adam effectively triples the storage cost of every trainable parameter because it keeps running averages of both the gradient and the squared gradient.
Beyond memory, full fine-tuning risks catastrophic forgetting: when every parameter is free to move, aggressive updates on a narrow task dataset can overwrite the broad capabilities the model acquired during pre-training.
The key insight that motivates LoRA comes from research by Aghajanyan et al. (2020), which demonstrated that pre-trained language models have a low intrinsic dimension: the weight changes needed to adapt to a new task can be captured in a subspace far smaller than the full parameter space. If the update is effectively low-rank, there is no need to train a full-size update matrix at all.
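The savings can be made concrete with a quick parameter count for a single weight matrix. The shapes below are illustrative (a 4096-by-4096 projection, as found in many 7B-scale attention layers, with rank 8):

```python
# Trainable parameters: full update vs. LoRA factorization
# for a single d x k weight matrix (hypothetical shapes).
d, k, r = 4096, 4096, 8   # matrix dimensions and LoRA rank

full_update = d * k        # every entry of the update matrix
lora_update = r * (d + k)  # B (d x r) plus A (r x k)

print(full_update)               # 16,777,216 trainable values
print(lora_update)               # 65,536 trainable values
print(full_update // lora_update)  # 256x fewer
```

At rank 8, the factorized update trains 256 times fewer parameters than the full matrix, and the ratio improves further as the matrix grows.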
The following diagram contrasts the two approaches at a glance.
With this visual intuition in place, the next section walks through the linear algebra that makes this decomposition work.
The math behind low-rank decomposition
Standard forward pass and the LoRA modification
In a standard transformer layer, a weight matrix $W \in \mathbb{R}^{d \times k}$ maps an input activation $x$ to an output $h = Wx$. Full fine-tuning replaces $W$ with $W + \Delta W$, where the update $\Delta W$ has the same $d \times k$ shape as the original matrix.
LoRA’s core idea is to never learn the full $\Delta W$ directly. Instead, it freezes $W$ and parameterizes the update as the product of two much smaller matrices, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. The forward pass becomes $h = Wx + \frac{\alpha}{r}BAx$, where $\alpha$ is a scaling hyperparameter. At initialization, $A$ is drawn from a random Gaussian and $B$ is set to zero, so $\Delta W = 0$ and training starts from the unmodified pre-trained model.
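As a concrete illustration, here is a minimal NumPy sketch of a LoRA-style linear layer. The variable names and shapes are illustrative; in practice the frozen weight comes from the pre-trained checkpoint and A, B would be updated by an optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r, alpha = 64, 32, 4, 8
W = rng.normal(size=(d, k))         # frozen pre-trained weight (never updated)
A = rng.normal(size=(r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init

def lora_forward(x):
    # h = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# Because B starts at zero, the adapted layer initially reproduces
# the frozen model's output exactly.
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialization of B is what makes LoRA safe to attach to a working model: the adapter contributes nothing until training nudges B away from zero.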
...