AdamW is essentially the "bug-fixed" version of Adam. It is the variant used to train virtually all modern Large Language Models, including Llama 3 and, reportedly, GPT-4 and Claude.
When training a massive model, you want it to learn general patterns (e.g., "how grammar works") rather than memorising specific training examples (e.g., "The specific phone number in document #402").
If the weights in your matrices get too large, the model becomes "over-excited" and starts fitting to noise.
Weight Decay is a regularization technique that forces the model to keep its weights small and simple.
For years, researchers thought L2 Regularization (adding a penalty to the loss function) and Weight Decay (subtracting a bit from the weights directly) were mathematically identical. And they are, but only for standard SGD (Stochastic Gradient Descent).
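This equivalence is easy to check numerically. The sketch below (illustrative values, not from any real training run) takes one SGD step both ways: folding the L2 penalty into the gradient, versus shrinking the weights directly.

```python
import numpy as np

lr, lam = 0.1, 0.01                 # learning rate and regularization strength
w = np.array([3.0, -2.0])           # toy weights
grad = np.array([0.5, 0.5])         # toy loss gradient

# L2 regularization: add lam * w to the gradient, then take a plain SGD step
w_l2 = w - lr * (grad + lam * w)

# Weight decay: take the plain SGD step, then shrink the weights directly
w_wd = w - lr * grad - lr * lam * w

# For vanilla SGD the two updates are algebraically identical
assert np.allclose(w_l2, w_wd)
```

The equivalence holds because SGD applies the same learning rate to every component of the gradient, so the penalty term passes through unscaled. The moment the optimizer starts rescaling gradients per parameter, that neat algebra breaks.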
When researchers applied L2 Regularization to Adam, they discovered a subtle problem:
Adam scales learning rates adaptively, per parameter. If you add the regularization penalty to the gradient (as L2 regularization does), Adam inadvertently scales the penalty too.
This meant the "trimming" was uneven: weights with large gradient histories got almost no decay at all, leading to models that didn't generalize as well as they should.
The fix AdamW introduces: decouple the weight decay from the adaptive gradient steps, and apply it directly to the weights instead.
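The difference can be sketched in a few lines. Below is a minimal, illustrative single step of "Adam + L2" next to AdamW (no bias-correction shortcuts; hyperparameter values are made up). The key line is the last one in each function: Adam+L2 pushes the penalty through the adaptive denominator, AdamW subtracts it directly.

```python
import numpy as np

lr, beta1, beta2, eps, lam = 0.01, 0.9, 0.999, 1e-8, 0.1

def adam_l2_step(w, grad, m, v, t):
    # L2: the penalty is folded into the gradient, so Adam's adaptive
    # denominator sqrt(v_hat) rescales the penalty along with everything else
    g = grad + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t):
    # AdamW: the adaptive step uses the raw gradient only; the decay term
    # lr * lam * w is applied directly to the weights, untouched by sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * w, m, v
```

Try both on a weight with a large gradient: under Adam+L2 the penalty is drowned out by the adaptive scaling, while AdamW shrinks the weight by the full lr * lam * w regardless of the gradient's size, which is exactly the "decoupling" the name refers to.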