Adam (short for Adaptive Moment Estimation) is the optimization algorithm used during training to update those weights (like W_q, W_k, W_v).
Adam is "adaptive" because it treats every single weight in your model differently.
While a traditional optimizer like SGD applies one "global" learning rate to the entire model, Adam maintains a personalized effective step size for every single parameter.
It does this by tracking two specific pieces of "history" for every weight:
The first is "Momentum." Adam calculates a moving average of the gradients, m_t (the numerator).
The second is the "Adaptive" part. Adam calculates a moving average of the squared gradients, v_t (the denominator).
The Logic: This acts as a measure of how "volatile" or "extreme" a weight's updates have been.
The Benefit: If the gradients for a specific weight are huge and swinging wildly, v_t becomes large. Because Adam divides the update by the square root of v_t, a large v_t shrinks the step size, automatically slowing that volatile weight down; a weight with small, stable gradients gets a relatively larger step.
To update a weight, Adam uses this simplified logic: new_weight = old_weight - learning_rate * m_t / (sqrt(v_t) + epsilon).
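The update logic above can be sketched as a small function for a single weight. This is a minimal illustration, not a production implementation; the hyperparameter names (lr, beta1, beta2, eps) and their defaults follow the common Adam conventions, and the bias-correction step (dividing by 1 - beta^t) is included because both moving averages start at zero.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight w, given its gradient at step t (t >= 1)."""
    # Momentum: moving average of the gradients (m_t, the numerator).
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive part: moving average of the squared gradients (v_t, the denominator).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages are initialized at zero, so early
    # steps are scaled up to compensate.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Personalized step: a large v_hat (volatile gradients) shrinks the update.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

One useful consequence to notice: with a constant gradient, m_hat / sqrt(v_hat) is approximately 1 regardless of the gradient's magnitude, so each step moves the weight by roughly the learning rate itself.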
