Authors: Devvrit Khatri*, Lovish Madaan*, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
<aside> 💡
tl;dr
As Reinforcement Learning (RL) becomes central to post-training Large Language Models (LLMs), key questions remain open: how does performance scale with RL compute, and which design choices improve compute efficiency versus lift the final performance ceiling?
We introduce a predictive framework and a practical recipe, ScaleRL, for scaling RL compute efficiently and reliably. Our framework enables studying the scaling properties of RL algorithms, letting us identify which components improve compute efficiency and which lift the final asymptotic performance.
We ran over 400,000 GPU-hours of controlled ablations (so that you don’t have to), systematically testing factors such as loss type, off-policy method, precision fixes, data curriculum, and normalization. We combine the best-performing design choices into a stable and predictable RL recipe, ScaleRL, and demonstrate its effectiveness by predictably scaling RL compute up to 100k GPU-hours.
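To make the ablation space concrete, here is a minimal sketch of how one such recipe configuration might be represented; the field names and values are illustrative placeholders, not the actual settings tested or adopted in ScaleRL.

```python
# Minimal sketch of the design axes swept in the ablations; every field name and
# value below is an illustrative placeholder, not an actual ScaleRL setting.
from dataclasses import dataclass


@dataclass
class RLRecipeConfig:
    loss_type: str = "ppo_clip"             # which RL loss variant is used
    off_policy_setup: str = "async"         # how far generation may run ahead of training
    logits_precision: str = "fp32"          # numerical-precision fix applied to logits
    data_curriculum: str = "keep_all"       # which prompts are kept as training proceeds
    advantage_normalization: str = "batch"  # how advantages are normalized


# One point in the ablation grid; each such run is trained and then fit with a
# performance-compute curve so recipes can be compared on equal footing.
config = RLRecipeConfig(loss_type="clipped_importance_sampling")
print(config)
```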
This work provides both a scientific framework to understand RL scaling and a practical recipe to make RL training as predictable as pre-training.
</aside>
Pre-training benefits from mature scaling laws (e.g., iso-FLOP curves) that give a bird’s-eye view of training dynamics. Given a fixed compute budget and data constraints, practitioners can pick a model scale and training length that stay within budget while maximizing performance.
Such mature and established scaling laws for RL do not exist (yet). We take a step toward closing this gap with a framework that fits performance–compute curves and exposes two key, interpretable metrics for each algorithm: asymptotic pass rate (the ceiling) and compute efficiency (how quickly you approach it).
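Concretely, each training run can be summarized by fitting a sigmoidal pass-rate-versus-compute curve. The sketch below assumes the form $R(C) = R_0 + (A - R_0)\,/\,\big(1 + (C_{\text{mid}}/C)^{B}\big)$ (see Figure 1); the data points, starting reward, initial guesses, and bounds are all illustrative.

```python
# Sketch: fit a sigmoidal pass-rate-vs-compute curve to RL training runs and
# extrapolate to a larger budget. All numbers below are illustrative.
import numpy as np
from scipy.optimize import curve_fit

R0 = 0.25  # mean reward of the starting policy, measured before RL training


def sigmoidal_fit(C, A, B, C_mid):
    """Predicted pass rate after spending C GPU-hours of RL compute."""
    # A: asymptotic pass rate (the ceiling); B: steepness (compute efficiency);
    # C_mid: compute at which half of the total gain (A - R0) is reached.
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)


# (GPU-hours, mean pass rate) pairs from small-compute runs of one recipe.
compute = np.array([100.0, 300.0, 1_000.0, 3_000.0, 10_000.0, 30_000.0])
pass_rate = np.array([0.28, 0.34, 0.45, 0.55, 0.61, 0.64])

(A, B, C_mid), _ = curve_fit(
    sigmoidal_fit, compute, pass_rate,
    p0=[0.7, 1.0, 2_000.0],                     # rough initial guesses
    bounds=([R0, 0.1, 10.0], [1.0, 5.0, 1e6]),  # keep parameters in a sane range
)
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f} GPU-hours")
print(f"Predicted pass rate at 100k GPU-hours: {sigmoidal_fit(1e5, A, B, C_mid):.3f}")
```

Fitting $A$ and $B$ separately is what lets us tell apart a change that merely reaches the same ceiling faster from one that raises the ceiling itself.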
Scaling RL predictably has several benefits: it lets us compare algorithms using small-compute runs, extrapolate how each will behave at larger scales, and distinguish changes that improve compute efficiency from those that raise the asymptotic ceiling.
Using such comparisons, we form ScaleRL, our recipe that scales predictably up to 100k GPU-hours.

Figure 1: Interpreting the sigmoidal fit equation. We provide an example fit illustrating the roles of the parameters $A$, $B$, and $C_{\text{mid}}$. $C_{\text{mid}}$ determines the compute point at which half of the total gain is achieved; smaller values correspond to a faster ascent toward the asymptote. $B$ controls the curve’s steepness, with larger values indicating greater compute efficiency. $A$ represents the asymptotic performance reached at large compute scales, and $R_0$ represents the initial mean reward of the starting policy.
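For reference, one sigmoidal form consistent with this parameterization (a reconstruction based on the caption above; the exact expression in the figure may differ) is

$$
R_C \;=\; R_0 \;+\; \frac{A - R_0}{1 + \left(C_{\text{mid}} / C\right)^{B}},
$$

where $R_C$ is the expected pass rate after spending compute $C$: at $C = C_{\text{mid}}$, half of the total gain $A - R_0$ has been realized; larger $B$ yields a steeper ascent; and $R_C \to A$ as $C \to \infty$, while $R_C \to R_0$ as $C \to 0$.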