Adam (short for Adaptive Moment Estimation) is the optimization algorithm used during training to update those weights (like W_q, W_k, W_v).
Adam is "adaptive" because it treats every single weight in your model differently.
While a traditional optimizer like SGD applies one "global" learning rate to the entire model, Adam maintains a personalized effective step size for every single parameter.
It does this by tracking two specific pieces of "history" for every weight:
The first is "Momentum." Adam calculates a moving average of the gradients, m_t (the numerator).
The second is the "Adaptive" part. Adam calculates a moving average of the squared gradients, v_t (the denominator).
The Logic: This acts as a measure of how "volatile" or "extreme" a weight's updates have been.
The Benefit: If the gradients for a specific weight are huge and swinging wildly, v_t becomes large. Because Adam divides the update by the square root of v_t, a large v_t shrinks the step size, automatically slowing that volatile weight down; a weight with small, stable gradients gets a relatively larger step.
To update a weight, Adam uses this simplified logic: new_weight = old_weight - learning_rate * m_t / (sqrt(v_t) + epsilon).
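The update logic above can be sketched as a small function for a single weight. This is a minimal illustration, not a production implementation; the hyperparameter names (lr, beta1, beta2, eps) and their defaults follow the common Adam conventions, and the bias-correction step (dividing by 1 - beta^t) is included because both moving averages start at zero.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight w, given its gradient at step t (t >= 1)."""
    # Momentum: moving average of the gradients (m_t, the numerator).
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive part: moving average of the squared gradients (v_t, the denominator).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages are initialized at zero, so early
    # steps are scaled up to compensate.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Personalized step: a large v_hat (volatile gradients) shrinks the update.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

One useful consequence to notice: with a constant gradient, m_hat / sqrt(v_hat) is approximately 1 regardless of the gradient's magnitude, so each step moves the weight by roughly the learning rate itself.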
