LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
We live in an era of massive AI models. Think Llama or Stable Diffusion - models trained on vast amounts of data, possessing incredible general capabilities. But often, we want to adapt these powerhouses for specific needs: making a language model better at writing legal documents, generating medical reports, or even just mimicking a particular artistic style for image generation.
The traditional way to do this is called full fine-tuning. This involves taking the entire pre-trained model and continuing its training process using your specific dataset.
The Problem
While effective, full fine-tuning has significant drawbacks:
- Massive Computational Cost - Training all the weights of a huge model requires powerful GPUs (often multiple) and significant time. This is often beyond the reach of individuals or smaller organizations.
- Huge Memory Requirements - Loading the model and calculating gradients for billions of parameters demands enormous amounts of memory (VRAM).
- Storage Nightmare - If you fine-tune a 100GB model for 10 different tasks, you end up with 10 separate models, potentially consuming 1 TB of storage! Each fine-tuned version is essentially a full copy with slightly altered weights.
- Slow Task Switching - Switching between these different fine-tuned versions means unloading one massive model from memory and loading another - a slow and cumbersome process.
Researchers needed a smarter way. Could we adapt these models without retraining everything?
The paper “LoRA: Low-Rank Adaptation of Large Language Models” by Hu et al. (2021) answered this question…
The Core Idea
Researchers hypothesized that when you adapt a large pre-trained model to a specific task, the change in its weights has a low "intrinsic rank" - you don't need to drastically alter every weight independently. They drew inspiration from the mathematical fact that a low-rank matrix can be written as the product of two much smaller matrices, which takes far fewer numbers to store and train.
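To make that idea concrete, here is a small, hypothetical NumPy sketch (the dimensions and rank are illustrative, not taken from the paper): it builds a roughly low-rank matrix, truncates its SVD to rank r, and shows that the product of two thin matrices reproduces it closely with far fewer parameters.

```python
import numpy as np

# Hypothetical illustration: approximate a large matrix with the
# product of two much smaller matrices via truncated SVD.
d, k, r = 1024, 1024, 8          # original dimensions and target rank

rng = np.random.default_rng(0)
# Build a matrix that is approximately low-rank, plus a little noise.
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, k)) \
    + 0.01 * rng.standard_normal((d, k))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * S[:r]             # d x r (columns scaled by singular values)
A = Vt[:r, :]                    # r x k

rel_error = np.linalg.norm(W - B @ A) / np.linalg.norm(W)
print(f"relative error of rank-{r} approximation: {rel_error:.4f}")
# B and A together hold d*r + r*k numbers instead of d*k,
# roughly 64x fewer parameters for these dimensions.
```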
Instead of directly modifying the original weights (let’s call the original weight matrix W₀), LoRA does the following:
- Freezes the Original Model - All the original weights (W₀) in the pre-trained model are kept frozen. They are not trained or updated during the fine-tuning process. This saves a ton of computation and memory.
- Injects Tiny Trainable Matrices - For specific layers in the original model (often the attention layers), LoRA introduces two small, trainable matrices; let's call them A and B. The rank r determines their size: if W₀ is a d × k matrix, then B is d × r and A is r × k, and r (typically a small number such as 4 or 8) is tiny compared to d and k.
- Trains Only the Small Matrices - During fine-tuning, only these small matrices A and B are trained on the new, task-specific data. The gradients are calculated only for these, reducing the computational load.
- Combines On-the-Fly - The output of the LoRA layer is calculated by adding the output of the original frozen layer (W₀ * input) to the output generated by the small matrices ((B * A) * input). So, the effective weight becomes W = W₀ + BA.
Think of it like this: W₀ is the huge, expert knowledge base. BA is a small, learned “adjustment” or “correction” specific to your new task.
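Putting the four steps together, here is a minimal, hypothetical PyTorch sketch of a LoRA-wrapped linear layer (the class name LoRALinear and the hyperparameters are illustrative; this is not the paper's reference implementation): the pre-trained weight is frozen, only the small A and B matrices receive gradients, and the forward pass adds the low-rank correction to the frozen output.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a frozen pre-trained linear layer plus a trainable
    low-rank update, so the effective weight is W = W0 + B @ A."""

    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.frozen = pretrained
        for p in self.frozen.parameters():    # step 1: freeze W0 (and its bias)
            p.requires_grad = False

        d_out, d_in = pretrained.weight.shape
        # Step 2: inject the two tiny matrices. As in the paper, A starts
        # random and B starts at zero, so BA = 0 and the adapted model
        # begins exactly where the pre-trained one left off.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d_out x r
        self.scaling = alpha / r   # scaling factor used in the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 4: frozen output plus low-rank correction, combined on the fly.
        return self.frozen(x) + (x @ self.A.T @ self.B.T) * self.scaling


# Step 3: only A and B require gradients, so the optimizer sees just them.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")  # 2 * 768 * 8 = 12288
```

Note that after training, the product BA can be merged into W₀ once, so the adapted model runs with no extra inference latency, and switching tasks only means swapping the tiny A and B matrices rather than reloading a whole model.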
Why Does This “Low-Rank” Thing Work?