my notes, polished with claude

Sources

  1. LoRA paper explained
  2. Intuition behind low rank matrices
  3. LoRA explained + evaluation methods
  4. QLoRA paper explained
  5. tbr - 5 types of LoRA

Fine-tuning means training a pre-trained network on new data to improve its performance on a specific task.

Two types: full fine-tuning and PEFT (parameter-efficient fine-tuning). PEFT covers various methods - LoRA, QLoRA, etc.

Problems with full fine-tuning

  1. You update every weight in the network. Computationally out of reach for the average user on LLMs the size of GPT.
  2. Checkpoints are expensive to store - you save the entire model to disk per checkpoint. Add optimizer state and it gets worse.
  3. Multiple fine-tuned models means reloading all weights every time you switch between them. Slow. For example: one model fine-tuned for SQL queries, another for JavaScript - swapping between them means swapping the entire weight set.

PEFT

As models get larger, full fine-tuning becomes infeasible on consumer hardware. Storing and deploying independently fine-tuned models also gets expensive fast - each one is the same size as the original pretrained model. PEFT addresses both.

Full fine-tuning also risks catastrophic forgetting - the model loses general capabilities as it overfits to the new data. PEFT mitigates this because the pretrained weights stay frozen.

With PEFT, the small trained weights sit on top of the frozen pretrained LLM. Same base model, multiple tasks, just swap the small weights.
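A minimal sketch of the idea, with made-up tiny dimensions and rank r = 1 (the "sql"/"js" task names just echo the example above). The base weight W is frozen; each task contributes only a small pair of matrices (A, B), and swapping tasks means swapping that pair, never W. Pure-Python matmul keeps it dependency-free:

```python
# Swapping LoRA-style adapters over one frozen base weight matrix.
# Hypothetical 2x2 dimensions for illustration only.

def matmul(X, Y):
    # Naive matrix multiply: rows of X against columns of Y.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen pretrained weight W - never updated during fine-tuning.
W = [[1.0, 0.0], [0.0, 1.0]]

# Two task-specific adapters, each a pair (A, B) with rank r = 1.
# A is d x r, B is r x k; only these small matrices were "trained".
adapters = {
    "sql": ([[1.0], [0.0]], [[0.5, 0.0]]),
    "js":  ([[0.0], [1.0]], [[0.0, -0.5]]),
}

def forward(x, task):
    A, B = adapters[task]
    # Effective weight is W + A @ B; switching tasks touches only A and B.
    return matmul(x, add(W, matmul(A, B)))

x = [[2.0, 2.0]]
print(forward(x, "sql"))  # same base W, shifted by the SQL adapter
print(forward(x, "js"))   # same base W, shifted by the JS adapter
```

Switching from the SQL model to the JS model here moves 4 numbers instead of the whole weight set - that is the entire point of point 3 in the problems list above.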

Benefits of PEFT over full fine-tuning

Less to train and store. If d = 1000 and k = 5000, the original weight matrix W has d×k = 5,000,000 parameters. With rank r = 5, the two low-rank matrices A (d×r) and B (r×k) have (1000×5) + (5×5000) = 30,000 parameters - about 0.6% of the original.
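The arithmetic above, spelled out (same d, k, r as in the note):

```python
# Parameter-count comparison: full weight matrix vs. low-rank adaptation.
d, k, r = 1000, 5000, 5

full_params = d * k            # original W: d x k
lora_params = d * r + r * k    # A: d x r, plus B: r x k

print(full_params)                         # 5000000
print(lora_params)                         # 30000
print(f"{lora_params / full_params:.1%}")  # 0.6%
```

Note how the ratio, r×(d+k)/(d×k), shrinks as the base matrix grows: the bigger the model, the bigger the savings for a fixed rank.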