
Fine-Tuning

Understand how Low-Rank Adaptation (LoRA) fine-tunes large models by training small low-rank matrices while keeping original weights frozen. Discover how QLoRA enhances this by applying quantization for extreme memory efficiency. Learn pros, cons, and comparisons with other fine-tuning methods to prepare for AI engineering interviews.

Questions about Low-Rank Adaptation (LoRA) are increasingly common in GenAI interviews because it’s one of the most important parameter-efficient fine-tuning (PEFT) techniques to emerge in recent years. Interviewers ask about LoRA to see whether you understand how modern teams adapt huge LLMs—like GPT, Claude, or Llama—without updating billions of parameters or requiring massive compute.

LoRA matters because traditional full fine-tuning is expensive: it uses enormous GPU memory, trains slowly, and forces you to store an entire model copy for every task. LoRA solves these pain points by freezing the base model and injecting small, low-rank matrices that capture task-specific updates at a tiny fraction of the cost.
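
To make that cost difference concrete, here is a rough back-of-the-envelope sketch in Python. The model size, rank, and layer counts below are illustrative assumptions, not figures from any particular deployment:

```python
# Back-of-the-envelope storage comparison (illustrative numbers only).
base_params = 7_000_000_000          # assumed base model size (~7B parameters)
bytes_per_param = 2                  # fp16 storage

# Full fine-tuning: every task needs its own complete copy of the weights.
full_copy_gb = base_params * bytes_per_param / 1e9

# LoRA: only the small low-rank factors are stored per task.
# Assume rank r = 8 on the query/value projections of 32 layers, hidden size 4096.
r, d, k = 8, 4096, 4096
adapted_matrices = 32 * 2
lora_params = adapted_matrices * r * (d + k)
lora_copy_mb = lora_params * bytes_per_param / 1e6

print(f"Full fine-tuned copy per task: ~{full_copy_gb:.0f} GB")   # ~14 GB
print(f"LoRA adapter per task:         ~{lora_copy_mb:.0f} MB")   # ~8 MB
```

Even with generous assumptions, a per-task LoRA adapter is orders of magnitude smaller than a full fine-tuned copy of the weights.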

In an interview, you’re expected not just to define LoRA but to explain why it exists, how its low-rank mechanism works, and how it compares with other PEFT methods such as adapters, prompt tuning, and extensions like QLoRA. These questions reveal whether you truly understand the trade-offs involved and can communicate complex ideas clearly.

By the end of this lesson, you’ll have a solid, technically grounded understanding of LoRA and be prepared to answer interview questions about when and why it’s used.

What is LoRA?

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large models by adding small, trainable components to the original model, rather than modifying all the original model’s parameters.

Think of a huge pretrained model as a complex machine with billions of knobs (parameters) set in just the right way to perform general language tasks. Now, if you want this machine to perform a new task (say, be good at legal text Q&A), traditional fine-tuning would try to adjust all those billions of knobs—a very costly and delicate process.

LoRA takes a clever shortcut: it leaves the original knobs frozen in place and attaches a few new tiny knobs (small matrices) that can be tuned to achieve the desired adjustment.

Analogy: Imagine you have a giant painting (the pretrained model) that is mostly perfect, but you want to slightly change its style. Instead of repainting the whole thing (updating every pixel), you lay a thin, transparent overlay on it and paint only on that overlay to achieve the desired effect. The original painting stays intact, and your changes are confined to the overlay. In LoRA, that overlay is realized as a low-rank update added to the model’s weights, which is much cheaper to train.

From a technical perspective, LoRA injects trainable low-rank matrices into the model’s existing layers (often the weight matrices of the transformer’s attention and feedforward networks).
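
In practice, teams rarely wire these matrices in by hand. The sketch below shows one common way to do it with Hugging Face’s peft library; the model name, target module names, and hyperparameter values are illustrative choices, not requirements:

```python
# Minimal sketch: attaching LoRA adapters with Hugging Face's peft library.
# The model name, target modules, and hyperparameters are illustrative choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)  # base weights stay frozen
model.print_trainable_parameters()               # only the LoRA matrices are trainable
```

With a configuration like this, the printed summary typically shows that well under 1% of the model’s parameters are trainable, which is exactly the point of the technique.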

Here’s the breakdown: Suppose the pretrained model has a weight matrix $W_0$ in some layer (for example, the matrix that projects the hidden state in a transformer). This matrix might be huge (dimensions like $d \times k$). LoRA proposes that when fine-tuning for a new task, the change to this weight—call it $\Delta W$—doesn’t need to be full-size; instead, it can be approximated by a low-rank decomposition. In math terms, LoRA assumes

$$\Delta W \approx W_A W_B$$

Where:

  • $W_A \in \mathbb{R}^{d \times r}$

  • $W_B \in \mathbb{R}^{r \times k}$

This holds for some small rank $r$. Because $r$ is chosen to be much smaller than $d$ or $k$, the product $W_A W_B$ involves far fewer trainable parameters than a full-rank $\Delta W$: roughly $r(d + k)$ values instead of $d \times k$.
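
To tie the notation back to code, here is a minimal from-scratch sketch of a LoRA-augmented linear layer in PyTorch. The class name, initialization choices, and the $d = k = 4096$, $r = 8$ values are illustrative, and this is a simplified sketch rather than the peft implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight W_0 plus a trainable low-rank update W_A @ W_B (illustrative sketch)."""

    def __init__(self, d: int, k: int, r: int, alpha: float = 16.0):
        super().__init__()
        # Frozen pretrained weight W_0 (d x k); in a real setup this is copied from the base model.
        self.W_0 = nn.Parameter(torch.randn(d, k) * 0.02, requires_grad=False)
        # Trainable low-rank factors: W_A (d x r) and W_B (r x k).
        # W_A starts at zero so the update W_A @ W_B is a no-op before training.
        self.W_A = nn.Parameter(torch.zeros(d, r))
        self.W_B = nn.Parameter(torch.randn(r, k) * 0.02)
        self.scale = alpha / r  # common LoRA scaling of the update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W_0 + scale * x (W_A W_B); gradients flow only into W_A and W_B.
        return x @ self.W_0 + self.scale * (x @ self.W_A) @ self.W_B


layer = LoRALinear(d=4096, k=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)          # 65,536 = r * (d + k) trainable values
print(4096 * 4096)        # 16,777,216 values in a full-rank update of the same layer
```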