LoRA: Low-Rank Adaptation
Explore how Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models by training only small matrices while keeping the main model frozen. Learn how LoRA reduces memory requirements and mitigates catastrophic forgetting, and pick up practical strategies for choosing hyperparameters for task-specific adaptation.
When you fine-tune a large language model, you are asking every single parameter to shift in response to your new data. For a 7-billion-parameter model, that means 7 billion floating-point values must receive gradient updates, be stored alongside optimizer states, and be checkpointed to disk. The compute bill adds up fast, and the risk of overwriting useful pre-trained knowledge is real. But what if the weight changes needed for your task actually occupy a tiny fraction of that enormous parameter space? This is the insight that makes Low-Rank Adaptation, or LoRA, one of the most practical breakthroughs in efficient fine-tuning. It lets you train a sliver of new parameters while keeping the original model completely frozen, cutting trainable parameters by orders of magnitude without sacrificing performance.
Why full fine-tuning hits a wall
The previous lesson established that fine-tuning is justified when you need persistent behavioral change in a model. But full fine-tuning carries steep costs that scale directly with model size.
Consider what happens during a standard fine-tuning run on a 7B-parameter model. Every parameter receives a gradient, and the Adam optimizer maintains two additional state variables (the first and second moment estimates) for each parameter. In mixed-precision training, this creates a memory footprint roughly four times the model size. For a 7B model stored in 16-bit precision, that translates to approximately 56 GB of GPU memory just for weights and optimizer states, before you even account for activations and batch data.
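The arithmetic above can be sketched as a quick back-of-envelope calculation. This sketch assumes weights, gradients, and both Adam moment estimates are all kept in 16-bit precision; real mixed-precision setups often keep optimizer states (and a master copy of the weights) in 32-bit, which pushes the number higher still.

```python
# Back-of-envelope GPU memory for full fine-tuning of a 7B model,
# assuming every tensor is stored in 16-bit precision (2 bytes).
PARAMS = 7e9
BYTES_16BIT = 2

weights = PARAMS * BYTES_16BIT          # model weights (14 GB)
grads = PARAMS * BYTES_16BIT            # one gradient per weight (14 GB)
adam_states = 2 * PARAMS * BYTES_16BIT  # first + second moments (28 GB)

total_gb = (weights + grads + adam_states) / 1e9
print(f"{total_gb:.0f} GB")  # 56 GB, before activations and batch data
```

Note that this excludes activation memory, which grows with batch size and sequence length on top of the fixed cost computed here.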
Attention: Many practitioners underestimate optimizer memory. Adam effectively triples the storage cost of every trainable parameter because it keeps running averages of both the gradient and the squared gradient.
Beyond memory, full fine-tuning risks catastrophic forgetting: when every parameter is free to move, aggressive updates on a narrow task dataset can overwrite the broad capabilities the model acquired during pre-training.
The key insight that motivates LoRA comes from research by Aghajanyan et al. (2020), which demonstrated that pre-trained language models have a low intrinsic dimension: the weight changes needed to adapt to a new task can be captured in a subspace far smaller than the full parameter space. If the update is effectively low-rank, there is no need to train a full-size update matrix at all.
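The savings can be made concrete with a quick parameter count for a single weight matrix. The shapes below are illustrative (a 4096-by-4096 projection, as found in many 7B-scale attention layers, with rank 8):

```python
# Trainable parameters: full update vs. LoRA factorization
# for a single d x k weight matrix (hypothetical shapes).
d, k, r = 4096, 4096, 8   # matrix dimensions and LoRA rank

full_update = d * k        # every entry of the update matrix
lora_update = r * (d + k)  # B (d x r) plus A (r x k)

print(full_update)               # 16,777,216 trainable values
print(lora_update)               # 65,536 trainable values
print(full_update // lora_update)  # 256x fewer
```

At rank 8, the factorized update trains 256 times fewer parameters than the full matrix, and the ratio improves further as the matrix grows.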
The following diagram contrasts the two approaches at a glance.
With this visual intuition in place, the next section walks through the linear algebra that makes this decomposition work.
The math behind low-rank decomposition
Standard forward pass and the LoRA modification
In a standard transformer layer, a weight matrix $W \in \mathbb{R}^{d \times k}$ maps an input activation $x$ to an output $h = Wx$. Full fine-tuning replaces $W$ with $W + \Delta W$, where the update $\Delta W$ has the same $d \times k$ shape as the original matrix.
LoRA’s core idea is to never learn the full $\Delta W$ directly. Instead, it freezes $W$ and parameterizes the update as the product of two much smaller matrices, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. The forward pass becomes $h = Wx + \frac{\alpha}{r}BAx$, where $\alpha$ is a scaling hyperparameter. At initialization, $A$ is drawn from a random Gaussian and $B$ is set to zero, so $\Delta W = 0$ and training starts from the unmodified pre-trained model.
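As a concrete illustration, here is a minimal NumPy sketch of a LoRA-style linear layer. The variable names and shapes are illustrative; in practice the frozen weight comes from the pre-trained checkpoint and A, B would be updated by an optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r, alpha = 64, 32, 4, 8
W = rng.normal(size=(d, k))         # frozen pre-trained weight (never updated)
A = rng.normal(size=(r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init

def lora_forward(x):
    # h = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# Because B starts at zero, the adapted layer initially reproduces
# the frozen model's output exactly.
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialization of B is what makes LoRA safe to attach to a working model: the adapter contributes nothing until training nudges B away from zero.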
...