Adam (short for Adaptive Moment Estimation) is the optimization algorithm used during the training phase to update the model's weights (like W_q, W_k, W_v).

Adam is "adaptive" because it treats every single weight in your model differently.

While a traditional optimizer like SGD uses one "global" speed (learning rate) for the entire model, Adam creates a personalized speed for every single parameter.

It does this by tracking two specific pieces of "history" for every weight:

1. The First Moment: Directional Memory

This is basically "Momentum." Adam keeps a moving average of the gradients, m_t, which forms the numerator of the update.
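As a minimal sketch of this moving average (using the conventional default beta1 = 0.9 and a made-up gradient sequence):

```python
# First moment: exponential moving average of the gradients ("momentum").
# beta1 = 0.9 is the conventional default; m starts at 0.
beta1 = 0.9
m = 0.0
for grad in [0.5, 0.4, 0.6]:  # illustrative gradient values, not from the text
    m = beta1 * m + (1 - beta1) * grad
# m now blends the recent gradient history, smoothing out noisy directions.
```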

2. The Second Moment: The Scaling Factor

This is the "Adaptive" part. Adam keeps a moving average of the squared gradients, v_t, which forms the denominator of the update.
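The second moment is tracked the same way, just over squared gradients (beta2 = 0.999 is the conventional default; gradient values here are illustrative):

```python
# Second moment: exponential moving average of the squared gradients.
# Weights that keep seeing large gradients accumulate a large v,
# which will shrink their effective step size.
beta2 = 0.999
v = 0.0
for grad in [0.5, 0.4, 0.6]:  # illustrative gradient values, not from the text
    v = beta2 * v + (1 - beta2) * grad ** 2
```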

3. How they work together (The "Step" Math)

To update a weight, Adam uses this simplified logic:

new_weight = old_weight − learning_rate × m_t / (√v_t + ε)

The first moment m_t steers the direction, the second moment v_t scales the step: a weight with consistently large gradients gets a smaller effective learning rate, and vice versa. The tiny constant ε just prevents division by zero.
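Putting the two moments together, a single-weight Adam step can be sketched as below. This is an illustrative implementation, not production code; it also includes the bias-correction terms (m_hat, v_hat) from the original Adam algorithm, which the simplified description above omits:

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: direction memory
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: scaling factor
    m_hat = m / (1 - beta1 ** t)             # bias correction for the
    v_hat = v / (1 - beta2 ** t)             # zero-initialized averages
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical usage: one step (t=1) from w=1.0 with gradient 0.5.
w, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
```

Note that the division by √v_hat happens per weight, which is exactly what gives every parameter its own personalized step size.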