Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tur, Hao Peng
A recent NeurIPS 2025 review went viral for its now infamous question: “Who is Adam?” It amused the AI community, which generally assumes that anyone qualified to review for NeurIPS would recognize Adam as the optimization algorithm, one that, along with its successor AdamW, underlies the training of most modern AI models. But the sheer reflex to assume everyone knows Adam prompts a deeper question: are we using Adam because it is the best tool for the job, or because it has become the one we’ve all quietly agreed not to question? You see, science often advances when the status quo is challenged; and in this blog, that is precisely what we intend to do with AdamW in RLVR.
<aside> 💡
Contrary to the well-established wisdom that SGD performs poorly in training transformers [1, 2, 3], our findings indicate that vanilla SGD, without momentum, performs on par with AdamW in RLVR.

</aside>
We begin by briefly reviewing two of the most commonly used optimizers for training neural networks, SGD and AdamW, highlighting their update rules and the state each of them maintains.[1]
Notation: $\theta_t$ denotes the model parameters at iteration $t$, $g_t$ is the gradient vector, $\eta$ is the learning rate, $v_t$ represents the cumulative velocity (in SGD with momentum) or the second-moment estimate (in AdamW), $m_t$ is the first-moment estimate, $\mu$ is the momentum coefficient, $\beta_1$ and $\beta_2$ are the exponential decay rates for the moments, $\epsilon$ is a small constant for numerical stability, and $\lambda$ is the weight decay coefficient.
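In this notation, vanilla SGD steps directly along the stochastic gradient:

$$
\theta_{t+1} = \theta_t - \eta\, g_t
$$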
This method tracks no additional state beyond the current parameters and gradient, resulting in minimal memory overhead but potentially noisy, unstable updates, especially in high-dimensional settings.
SGD with momentum introduces an auxiliary velocity term. Here, the optimizer keeps track of a single extra state variable $v$, effectively averaging gradients over time and smoothing the update direction.
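In one common formulation (the variant PyTorch implements, with no dampening), the update reads:

$$
v_{t+1} = \mu\, v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}
$$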
AdamW maintains both a first-moment estimate $m_t$ and a second-moment estimate $v_t$, along with bias corrections for each. This significantly stabilizes optimization in the highly non-convex landscapes typical of LLM training. Note that AdamW needs to track two more states ($m$ and $v$) compared to SGD without momentum; in exchange, it stabilizes the optimization trajectory by reducing the variance of the updates.
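Concretely, the AdamW update with decoupled weight decay is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_{t+1} &= \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right)
\end{aligned}
$$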
<aside> 💡
In summary, SGD tracks no auxiliary state beyond the gradients themselves (or one, with momentum), whereas AdamW maintains two moment estimates; as a result, SGD requires less GPU memory during training.
</aside>
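To make the memory comparison concrete, here is a minimal PyTorch sketch that counts the auxiliary state each optimizer allocates for a toy model (the helper `optimizer_state_elems` is our own illustrative name, not a library function):

```python
import torch

# Toy model; the ratios below hold at any parameter count.
model = torch.nn.Linear(1024, 1024)
n_params = sum(p.numel() for p in model.parameters())

def optimizer_state_elems(opt):
    """Run one step so the optimizer populates its state, then count elements."""
    opt.zero_grad()
    model(torch.randn(8, 1024)).sum().backward()
    opt.step()
    return sum(t.numel() for state in opt.state.values()
               for t in state.values() if torch.is_tensor(t))

for name, opt in [
    ("SGD",          torch.optim.SGD(model.parameters(), lr=1e-2)),
    ("SGD+momentum", torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)),
    ("AdamW",        torch.optim.AdamW(model.parameters(), lr=1e-3)),
]:
    # Prints roughly 0.00x, 1.00x, and 2.00x the parameter count.
    print(f"{name}: {optimizer_state_elems(opt) / n_params:.2f}x params")
```

For a multi-billion-parameter model, those one to two extra full-size copies of the parameters account for a substantial share of GPU memory.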
The findings of this blog build upon the work of Mukherjee et al., which shows that RLVR updates a small subnetwork in LLMs. We would urge the reader to go through their findings to understand the motivation of our work here.
Additionally, the blog “LoRA Without Regret” served as motivating prior work. In it, John Schulman et al. found that LoRA works reasonably well under the right design choices. However, they also show that a rank-1 LoRA always underperforms full fine-tuning. As we will show, SGD is more parameter-efficient than a rank-1 LoRA, while performing on par with full fine-tuning under AdamW.
Conventional wisdom tells us that vanilla SGD does not perform well compared to adaptive optimizers like AdamW when training transformer-based models, owing to the complexity of the loss landscape. However, in our prior work, we showed that RLVR updates only a small fraction of model parameters, which further suggests that RLVR is performed over a simpler optimization landscape where vanilla SGD may prove sufficient.