Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tur, Hao Peng
A recent NeurIPS 2025 review went viral for its now infamous question: “Who is Adam?” It amused the AI community, which generally assumes that anyone qualified to review for NeurIPS would recognize Adam as the optimization algorithm, one that, along with its successor AdamW, underlies the training of most modern AI models. But the sheer reflex to assume everyone knows Adam prompts a deeper question: are we using Adam because it is the best tool for the job, or because it has become the one we’ve all quietly agreed not to question? You see, science often advances when the status quo is challenged; and in this blog, that is precisely what we intend to do with AdamW in RLVR.
Update: the full paper is now available on arXiv: https://arxiv.org/abs/2602.07729.
<aside> 💡
Contrary to the well-established wisdom that SGD performs poorly in training transformers [1, 2, 3], our findings indicate that plain SGD can perform on par with AdamW in RLVR.

</aside>
We begin by briefly reviewing two of the most commonly used optimizers for training neural networks — SGD and AdamW — highlighting their update rules and the auxiliary state each of them maintains.[1]
Notation: $\theta_t$ denotes the model parameters at iteration $t$, $g_t$ is the gradient vector, $\eta$ is the learning rate, $v_t$ represents the cumulative velocity (in SGD+Momentum) or the second-moment estimate (in AdamW), $m_t$ is the first-moment estimate, $\mu$ is the momentum coefficient, $\beta_1$ and $\beta_2$ are the exponential decay rates for the moments, $\epsilon$ is a small constant for numerical stability, and $\lambda$ is the weight decay coefficient.
This method tracks no additional state beyond the current parameters and gradient, resulting in minimal memory overhead but potentially noisy, unstable updates, especially in high-dimensional settings.
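The plain SGD update $\theta_{t+1} = \theta_t - \eta \, g_t$ can be sketched in a few lines. This is an illustrative NumPy toy (the function name and values are our own, not from the blog's training code):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Plain SGD: theta_{t+1} = theta_t - eta * g_t.

    No auxiliary state is tracked -- only parameters and the current gradient.
    """
    return theta - lr * grad

# Toy example: two parameters, one update.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
theta = sgd_step(theta, grad)  # -> [0.95, -1.95]
```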
SGD with momentum introduces an auxiliary velocity term. Here, the optimizer keeps track of a single extra state variable $v_t$, effectively averaging gradients over time and smoothing the update direction.
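A minimal sketch of the momentum variant, using the notation above ($v_{t+1} = \mu v_t + g_t$, then $\theta_{t+1} = \theta_t - \eta \, v_{t+1}$); the function name and numbers are illustrative:

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    """SGD + momentum: v_{t+1} = mu * v_t + g_t; theta_{t+1} = theta_t - eta * v_{t+1}.

    One extra state tensor (the velocity v) is carried across steps.
    """
    v = mu * v + grad
    theta = theta - lr * v
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros(2)          # velocity starts at zero
grad = np.array([0.5, -0.5])
# With v_0 = 0, the first step coincides with plain SGD.
theta, v = sgd_momentum_step(theta, v, grad)
```

Past gradients keep contributing through $v$, which is what smooths the update direction over time.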
AdamW maintains both a first-moment estimate $m_t$ and a second-moment estimate $v_t$, along with bias corrections for each. This significantly stabilizes optimization in the highly non-convex landscapes typical of LLM training. Note that AdamW needs to track two more states ($m$ and $v$) compared to SGD without momentum. In exchange, it stabilizes the optimization trajectory by reducing gradient variance.
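Putting the pieces together, a single AdamW step with bias correction and decoupled weight decay can be sketched as follows (again a NumPy toy under the notation above, not the blog's actual training loop):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: two moment estimates plus decoupled weight decay.

    m, v are carried across steps; t is the 1-indexed iteration counter
    used for bias correction.
    """
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2       # second-moment estimate
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    # Decoupled weight decay: decay is applied to theta directly,
    # not folded into the gradient (the "W" in AdamW).
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
grad = np.array([0.5, -0.5])
theta, m, v = adamw_step(theta, m, v, grad, t=1)
```

Note that two full-size state tensors (`m` and `v`) persist between steps, which is exactly the extra memory cost discussed below.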
<aside> 💡
In summary, SGD tracks no auxiliary state (or one, with momentum) beyond the gradients themselves, whereas AdamW maintains two moment estimates — i.e., SGD requires less GPU memory during training.
</aside>
The findings of this blog build upon those of Mukherjee et al., who show that RLVR updates only a small subnetwork in LLMs. We urge the reader to go through their findings to understand the motivation for our work here.
Additionally, the blog called “LoRA without regrets” also served as motivating prior work. In that blog, John Schulman et al. find that LoRA works reasonably well under the right design choices. However, they also show that a rank-1 LoRA consistently underperforms full fine-tuning. As we will show, SGD is more parameter-efficient than a rank-1 LoRA, while performing on par with full fine-tuning with AdamW.