Authors: Devvrit Khatri*, Lovish Madaan*, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
<aside> 💡
tl;dr
As Reinforcement Learning (RL) becomes central to post-training Large Language Models (LLMs), key questions remain open: how does performance scale with RL compute, and which design choices improve compute efficiency versus lift the final performance ceiling?
We introduce a predictive framework and a practical recipe, ScaleRL, for scaling RL compute efficiently and reliably. Our framework enables studying the scaling properties of RL algorithms, letting us identify which components improve compute efficiency and which lift the final asymptotic performance.
We ran over 400,000 GPU-hours of controlled ablations (so that you don’t have to), systematically testing factors such as loss type, off-policy method, precision fixes, data curriculum, and normalization. We combine the best-performing design choices into a stable and predictable RL recipe, ScaleRL, and demonstrate its effectiveness by predictably scaling RL compute up to 100k GPU-hours.
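To make the ablation space concrete, here is a minimal sketch of how one such recipe configuration might be represented; the field names and values are illustrative placeholders, not the actual settings tested or adopted in ScaleRL.

```python
# Minimal sketch of the design axes swept in the ablations; every field name and
# value below is an illustrative placeholder, not an actual ScaleRL setting.
from dataclasses import dataclass


@dataclass
class RLRecipeConfig:
    loss_type: str = "ppo_clip"             # which RL loss variant is used
    off_policy_setup: str = "async"         # how far generation may run ahead of training
    logits_precision: str = "fp32"          # numerical-precision fix applied to logits
    data_curriculum: str = "keep_all"       # which prompts are kept as training proceeds
    advantage_normalization: str = "batch"  # how advantages are normalized


# One point in the ablation grid; each such run is trained and then fit with a
# performance-compute curve so recipes can be compared on equal footing.
config = RLRecipeConfig(loss_type="clipped_importance_sampling")
print(config)
```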
This work provides both a scientific framework to understand RL scaling and a practical recipe to make RL training as predictable as pre-training.
</aside>
Pre-training benefits from mature scaling laws (e.g., iso-FLOP curves) that give a bird’s-eye view of training dynamics. Given a fixed compute budget and data constraints, practitioners can pick a model scale and training length that stay within budget while maximizing performance.
Such mature and established scaling laws for RL do not exist (yet). We take a step toward closing this gap with a framework that fits performance–compute curves and exposes two key, interpretable metrics for each algorithm: asymptotic pass rate (the ceiling) and compute efficiency (how quickly you approach it).
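Concretely, each training run can be summarized by fitting a sigmoidal pass-rate-versus-compute curve. The sketch below assumes the form $R(C) = R_0 + (A - R_0)\,/\,\big(1 + (C_{\text{mid}}/C)^{B}\big)$ (see Figure 1); the data points, starting reward, initial guesses, and bounds are all illustrative.

```python
# Sketch: fit a sigmoidal pass-rate-vs-compute curve to RL training runs and
# extrapolate to a larger budget. All numbers below are illustrative.
import numpy as np
from scipy.optimize import curve_fit

R0 = 0.25  # mean reward of the starting policy, measured before RL training


def sigmoidal_fit(C, A, B, C_mid):
    """Predicted pass rate after spending C GPU-hours of RL compute."""
    # A: asymptotic pass rate (the ceiling); B: steepness (compute efficiency);
    # C_mid: compute at which half of the total gain (A - R0) is reached.
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)


# (GPU-hours, mean pass rate) pairs from small-compute runs of one recipe.
compute = np.array([100.0, 300.0, 1_000.0, 3_000.0, 10_000.0, 30_000.0])
pass_rate = np.array([0.28, 0.34, 0.45, 0.55, 0.61, 0.64])

(A, B, C_mid), _ = curve_fit(
    sigmoidal_fit, compute, pass_rate,
    p0=[0.7, 1.0, 2_000.0],                     # rough initial guesses
    bounds=([R0, 0.1, 10.0], [1.0, 5.0, 1e6]),  # keep parameters in a sane range
)
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f} GPU-hours")
print(f"Predicted pass rate at 100k GPU-hours: {sigmoidal_fit(1e5, A, B, C_mid):.3f}")
```

Fitting $A$ and $B$ separately is what lets us tell apart a change that merely reaches the same ceiling faster from one that raises the ceiling itself.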
Scaling RL predictably has several benefits: it lets us compare algorithms using small-compute runs, extrapolate how each will behave at larger scales, and distinguish changes that improve compute efficiency from those that raise the asymptotic ceiling.
Using such comparisons, we form ScaleRL, our recipe that scales predictably up to 100k GPU-hours.

Figure 1: Interpreting the sigmoidal fit equation. We provide an example fit illustrating the roles of the parameters $A$, $B$, and $C_{\text{mid}}$. $C_{\text{mid}}$ determines the compute point at which half of the total gain is achieved; smaller values correspond to a faster ascent toward the asymptote. $B$ controls the curve’s steepness, with larger values indicating greater compute efficiency. $A$ represents the asymptotic performance reached at large compute scales, and $R_0$ represents the initial mean reward of the starting policy.
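For reference, one sigmoidal form consistent with this parameterization (a reconstruction based on the caption above; the exact expression in the figure may differ) is

$$
R_C \;=\; R_0 \;+\; \frac{A - R_0}{1 + \left(C_{\text{mid}} / C\right)^{B}},
$$

where $R_C$ is the expected pass rate after spending compute $C$: at $C = C_{\text{mid}}$, half of the total gain $A - R_0$ has been realized; larger $B$ yields a steeper ascent; and $R_C \to A$ as $C \to \infty$, while $R_C \to R_0$ as $C \to 0$.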