I spent two weeks running 12 experiments trying to replace a 32B model with a fine-tuned 1B model. I didn't fully succeed, but I wanted to log my experiments and the things I learned along the way. Here are my notes, polished by Claude because they were a real mess.


The Problem I Was Trying to Solve

At work, we had a pipeline that scored how well a person's name and address matched their legal record. The scoring was complex - partial matches, transliteration variants, missing fields - so we used Qwen-2.5-32B to do it.

It worked well. But it was slow.

Every request had to go through a 32B model. Latency was a real bottleneck in production. The obvious question: can a much smaller model learn to do this just as well, if I fine-tune it on enough examples?

The task: take two records as input, output a 7-field JSON score:

{
  "personal_name": 0.8,
  "father_name": 1.0,
  "house_number": 1.0,
  "locality": 0.9,
  "village_city": 0.5,
  "bonus_district": 1.0,
  "pincode": 1.0
}

Simple enough structure. I started with Llama-3.2-1B-Instruct.
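Since the model's only job is to emit that exact 7-field JSON, the first thing I needed was a strict check on its output. This is a minimal sketch of such a validator - the field names come from the example above, but the function itself (`parse_score`) is my own illustration, not the production code:

```python
import json

# The seven score fields from the example output above.
SCORE_FIELDS = {
    "personal_name", "father_name", "house_number",
    "locality", "village_city", "bonus_district", "pincode",
}

def parse_score(raw: str) -> dict:
    """Parse a model response and check it is a valid 7-field score."""
    scores = json.loads(raw)
    missing = SCORE_FIELDS - scores.keys()
    extra = scores.keys() - SCORE_FIELDS
    if missing or extra:
        raise ValueError(f"bad fields: missing={missing}, extra={extra}")
    for field, value in scores.items():
        # Each score must be a number in [0, 1].
        if not (isinstance(value, (int, float)) and 0.0 <= value <= 1.0):
            raise ValueError(f"{field} score {value!r} not in [0, 1]")
    return scores
```

A check like this is also handy later as a cheap automatic metric: the fraction of generations that even parse tells you whether the model has learned the format at all.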


What Fine-Tuning Actually Is (And Why I Didn't Do Full Fine-Tuning)

The basic idea

A pre-trained LLM has already learned language, reasoning, and world knowledge from hundreds of billions of tokens. Fine-tuning means taking that model and continuing to train it on your specific task - your domain, your format, your scoring logic.

Why not full fine-tuning?

Update every weight in the model - that's the obvious approach.

The problem: Llama-1B has ~1 billion parameters, each stored in float32 at 4 bytes. Just loading the model takes 4GB. Training it means also storing gradients (another 4GB) and optimizer states like Adam's momentum and variance (another 8GB+). You're at 16GB+ before a single training step.

For a 7B model the weights alone are ~28GB. For a 32B model, forget it - you'd need multiple A100s just to load it. And god I am so GPU poor. (If you're a founder reading this, I accept donations hihi.)
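The arithmetic above is just parameters times bytes, so it's easy to sketch as a back-of-the-envelope helper (this counts only weights, gradients, and Adam states - activations and batch overhead come on top):

```python
def finetune_memory_gb(n_params: float) -> tuple[float, float]:
    """Rough (weights_gb, training_gb) for full fine-tuning in float32 with Adam.

    weights:  n_params * 4 bytes
    training: weights + gradients (1x) + Adam momentum and variance (2x) = 4x weights
    """
    bytes_per_param = 4  # float32
    weights_gb = n_params * bytes_per_param / 1e9
    training_gb = weights_gb * 4
    return weights_gb, training_gb

for size, n in [("1B", 1e9), ("7B", 7e9), ("32B", 32e9)]:
    w, t = finetune_memory_gb(n)
    print(f"{size}: {w:.0f} GB to load, {t:.0f} GB to train")
```

For the 1B model this gives 4GB to load and 16GB to train, matching the numbers above; the 32B model needs ~128GB just for the weights.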

There's also catastrophic forgetting: train all the weights aggressively on new data and the model can forget its general capabilities. The scoring model gets good at JSON output but loses the ability to reason about edge cases it wasn't shown.

LoRA - train a fraction of the parameters

LoRA (Low-Rank Adaptation) is the way to go.
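The core trick: freeze the pre-trained weight matrix W and learn only a low-rank update B·A on top of it. Here's a minimal NumPy sketch of a single LoRA-adapted linear layer - dimensions and class name are illustrative, not Llama's actual shapes or any library's API:

```python
import numpy as np

class LoRALinear:
    """y = W x + (B @ A) x, with W frozen and only A, B trained."""

    def __init__(self, d_in: int, d_out: int, r: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, r))                   # trainable up-projection;
                                                        # zero-init so the update starts at 0
    def __call__(self, x):
        return self.W @ x + self.B @ (self.A @ x)

    def trainable_fraction(self) -> float:
        """Trainable LoRA params as a fraction of the full matrix."""
        return (self.A.size + self.B.size) / self.W.size

layer = LoRALinear(d_in=2048, d_out=2048, r=16)
print(f"trainable: {layer.trainable_fraction():.2%} of the full matrix")
```

With a square 2048x2048 matrix and rank 16, the trainable fraction is 2·16/2048 ≈ 1.56% - that's the "fraction of the parameters" in the heading. Because B starts at zero, the adapted layer initially computes exactly what the frozen model did, and training only nudges it away from there.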