AdamW is essentially the "bug-fixed" version of Adam. It is the variant used to train virtually all modern Large Language Models, including Llama 3 and, reportedly, GPT-4 and Claude.
When training a massive model, you want it to learn general patterns (e.g., "how grammar works") rather than memorising specific training examples (e.g., "The specific phone number in document #402").
If the weights in your matrices get too large, the model becomes "over-excited" and starts fitting to noise.
Weight Decay is a regularization technique that forces the model to keep its weights small and simple.
For years, researchers thought L2 Regularization (adding a penalty to the loss function) and Weight Decay (subtracting a bit from the weights directly) were mathematically identical. And they are, but only for standard SGD (Stochastic Gradient Descent).
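This equivalence is easy to check numerically. The sketch below (illustrative values, not from any real training run) takes one SGD step both ways: folding the L2 penalty into the gradient, versus shrinking the weights directly.

```python
import numpy as np

lr, lam = 0.1, 0.01                 # learning rate and regularization strength
w = np.array([3.0, -2.0])           # toy weights
grad = np.array([0.5, 0.5])         # toy loss gradient

# L2 regularization: add lam * w to the gradient, then take a plain SGD step
w_l2 = w - lr * (grad + lam * w)

# Weight decay: take the plain SGD step, then shrink the weights directly
w_wd = w - lr * grad - lr * lam * w

# For vanilla SGD the two updates are algebraically identical
assert np.allclose(w_l2, w_wd)
```

The equivalence holds because SGD applies the same learning rate to every component of the gradient, so the penalty term passes through unscaled. The moment the optimizer starts rescaling gradients per parameter, that neat algebra breaks.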
When researchers applied L2 Regularization to Adam, they discovered a subtle problem:
Adam scales learning rates adaptively, per parameter. If you add the regularization penalty to the gradient (as L2 regularization does), Adam inadvertently scales the penalty too.
This meant the "trimming" was uneven: weights with large gradient histories got almost no decay at all, leading to models that didn't generalize as well as they should.
The fix AdamW introduces: decouple the weight decay from the adaptive gradient steps, and apply it directly to the weights instead.
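The difference can be sketched in a few lines. Below is a minimal, illustrative single step of "Adam + L2" next to AdamW (no bias-correction shortcuts; hyperparameter values are made up). The key line is the last one in each function: Adam+L2 pushes the penalty through the adaptive denominator, AdamW subtracts it directly.

```python
import numpy as np

lr, beta1, beta2, eps, lam = 0.01, 0.9, 0.999, 1e-8, 0.1

def adam_l2_step(w, grad, m, v, t):
    # L2: the penalty is folded into the gradient, so Adam's adaptive
    # denominator sqrt(v_hat) rescales the penalty along with everything else
    g = grad + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t):
    # AdamW: the adaptive step uses the raw gradient only; the decay term
    # lr * lam * w is applied directly to the weights, untouched by sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * w, m, v
```

Try both on a weight with a large gradient: under Adam+L2 the penalty is drowned out by the adaptive scaling, while AdamW shrinks the weight by the full lr * lam * w regardless of the gradient's size, which is exactly the "decoupling" the name refers to.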