LoRA: Low-Rank Adaptation

Explore how Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models by training only small matrices while keeping the main model frozen. Learn how LoRA reduces memory requirements, mitigates catastrophic forgetting, and offers practical strategies to choose hyperparameters for task-specific adaptation.

When you fine-tune a large language model, you are asking every single parameter to shift in response to your new data. For a 7-billion-parameter model, that means 7 billion floating-point values must receive gradient updates, be stored alongside optimizer states, and be checkpointed to disk. The compute bill adds up fast, and the risk of overwriting useful pre-trained knowledge is real. But what if the weight changes needed for your task actually occupy a tiny fraction of that enormous parameter space? This is the insight that makes Low-Rank Adaptation, or LoRA, one of the most practical breakthroughs in efficient fine-tuning. It lets you train a sliver of new parameters while keeping the original model completely frozen, cutting trainable parameters by orders of magnitude without sacrificing performance.

Why full fine-tuning hits a wall

The previous lesson established that fine-tuning is justified when you need persistent behavioral change in a model. But full fine-tuning carries steep costs that scale directly with model size.

Consider what happens during a standard fine-tuning run on a 7B-parameter model. Every parameter receives a gradient, and the Adam optimizer maintains two additional state variables (the first and second moment estimates) for each parameter. In mixed-precision training, this creates a memory footprint roughly four times the model size. For a 7B model stored in 16-bit precision, that translates to approximately 56 GB of GPU memory just for weights and optimizer states, before you even account for activations and batch data.

Attention: Many practitioners underestimate optimizer memory. Adam effectively triples the storage cost of every trainable parameter because it keeps running averages of both the gradient and the squared gradient.
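
To make this arithmetic concrete, here is a minimal back-of-the-envelope sketch in plain Python. The breakdown (weights, gradients, and two Adam moments, each at an assumed 2 bytes per value) is one illustrative way to reach the rough 4x figure quoted above; real frameworks often keep optimizer states in FP32, which pushes the total even higher.

```python
# Rough memory estimate for full fine-tuning with Adam.
# Assumption (illustrative): weights, gradients, and both Adam moment
# estimates are each stored in 16-bit precision (2 bytes per value).

def full_finetune_memory_gb(num_params: float, bytes_per_value: int = 2) -> float:
    weights   = num_params * bytes_per_value  # the model itself
    gradients = num_params * bytes_per_value  # one gradient per parameter
    adam_m    = num_params * bytes_per_value  # first-moment estimates
    adam_v    = num_params * bytes_per_value  # second-moment estimates
    return (weights + gradients + adam_m + adam_v) / 1e9

print(f"{full_finetune_memory_gb(7e9):.0f} GB")  # ~56 GB for a 7B model
```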

Beyond memory, full fine-tuning risks catastrophic forgetting: the phenomenon where a neural network loses previously learned capabilities when all its weights are updated on a narrow new dataset. If you fine-tune a general-purpose LLM on legal documents, it may become excellent at legal reasoning but lose its ability to write Python code.

The key insight that motivates LoRA comes from research by Aghajanyan et al. (2020), which demonstrated that pre-trained language models have a low intrinsic dimensionality: the effective number of dimensions needed to describe the weight changes for a downstream task is far smaller than the total number of parameters. In plain terms, the useful updates during fine-tuning live in a small subspace of the full weight matrix. This opens the door to a dramatically cheaper update strategy.

The following diagram contrasts the two approaches at a glance.

Full fine-tuning vs LoRA: comparing parameter efficiency with low-rank adaptation

With this visual intuition in place, the next section walks through the linear algebra that makes this decomposition work.

The math behind low-rank decomposition

Standard forward pass and the LoRA modification

In a standard transformer layer, a weight matrix $W \in \mathbb{R}^{d \times d}$ transforms an input $x$ through the forward pass $h = Wx$. During full fine-tuning, $W$ is updated to $W + \Delta W$, where $\Delta W$ captures everything the model learns from the new data.
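
As a minimal sketch of the two quantities involved (illustrative PyTorch; the hidden size d = 4096 is an assumed value typical of 7B-scale models, not taken from the text):

```python
import torch

d = 4096  # assumed hidden size, typical of 7B-scale models

W = torch.randn(d, d)  # pre-trained weight matrix
x = torch.randn(d)     # input activation

h = W @ x  # standard forward pass: h = Wx

# Full fine-tuning learns a dense update with the same shape as W,
# so every one of its d * d entries is a trainable parameter.
delta_W = torch.zeros(d, d, requires_grad=True)
h_updated = (W + delta_W) @ x

print(delta_W.numel())  # 16,777,216 trainable values for this one matrix
```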

LoRA’s core idea is to never learn the full $\Delta W$ directly. Instead, it constrains the update to a low-rank factorization:

...