Yuxiao Qu$^{1,*}$, Amrith Setlur$^{1,*}$, Virginia Smith$^1$, Ruslan Salakhutdinov$^1$, Aviral Kumar$^1$
$^*$Equal Contribution, $^1$Carnegie Mellon University
In 2025 alone, we went from the first release of the DeepSeek-R1 technical report, to the first open-source replications of reinforcement learning (RL) training with long chains of thought, to skepticism that RL merely “sharpens” whatever the pre-trained model already knows, to the realization that “RL actually works”. A natural question that follows is whether we can continue to scale compute and expect RL to keep improving. Unfortunately, for current RL methods, the answer is no: empirical results show that RL training often plateaus without maximizing reward or solving several “hard” problems, even those in the training dataset. In principle, a scalable training recipe should lead to continued progress on the training data as more compute is used, but these plateaus show that this does not occur. Although such saturation has not prevented models from achieving good performance on current evaluation benchmarks, it raises serious concerns about whether existing RL methods can continue to scale to increasingly harder test scenarios.
In this blog, we aim to give some perspective on the question of scaling RL compute on hard problems. Current RL recipes run into a fundamental challenge of exploration on hard problems. By exploration, we mean the approach required to discover at least one correct solution for a given problem so that the RL algorithm can now learn from this trace. The dominant exploration strategy today is fully on-policy, meaning that the model samples many rollouts itself during RL training. However, on many difficult prompts, this strategy fails to produce even a single correct rollout at any point during RL training, which means that no useful learning signal is ever obtained. The inability to train on hard training problems then brings into question the model’s generalization on similar test problems.
This post focuses on addressing this very obstacle: on-policy RL cannot learn from prompts for which the model’s generated traces receive zero reward. We first describe how classical exploration methods, such as exploration bonuses (see this survey), are insufficient in the LLM setting and often lead to optimization pathologies. We then show how a more “proactive” approach to exploration, based on conditioning on privileged offline data, can overcome this exploration bottleneck and enable RL to scale more effectively on hard problems.
Broadly speaking, irrespective of any explicit method to induce exploration (like a reward bonus), on-policy RL training for LLMs operates in three regimes. The first regime is when RL sharpens, meaning that RL simply increases the likelihood of correct trajectories the pre-trained model already samples with high probability. This is the regime where we see pure prompt tuning outperforming RL. But this means RL is simply making a likely-correct trace even more likely, not discovering solutions for problems it could never sample a correct solution for. Some of our own earlier work showed that RL can be moved out of this regime into the second regime, where RL discovers new solutions by chaining useful skills (like verification, summarization, etc.) present in the pre-trained model, a combination that is not as likely to be sampled as a single trace before running RL. This is the regime where RL typically amplifies self-verifications and response length grows over training. The success of exploration in this regime depends on the right base model and appropriate design choices (e.g., curricula with appropriate mixtures of data and token budgets) during training.

Figure 1. Three regimes of exploration. Current RL models can explore via: (1) Sharpening: simply increasing likelihood on traces the model can already sample with high probability; (2) Chaining: chaining asymmetric skills in the base model (e.g., verification-generation gap, abstraction-generation gap); (3) Guided: using offline guidance to discover solutions to very hard problems that no amount of sampling or chaining can solve. We will operate in the guided regime.
To go beyond sharpening or chaining done on-policy, we can look back at the classical deep RL literature for ideas on incentivizing exploration. Many exploration methods are retrospective in nature. They encourage the policy to explore randomly, identify novel behavior, and then reward the policy for producing more of that novelty. A typical instantiation of this type of exploration method is to provide a reward bonus for attaining high entropy over states or actions, or a modification of the objective that implicitly incentivizes diversity, such as optimizing pass@k scores rather than direct rewards in the LLM setting. In this section, we benchmark representative bonus-based exploration methods when running on-policy RL on hard problems, starting with our base model Qwen3-4B-Instruct, a capable instruction-following model.
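To make the pass@k objective mentioned above concrete: given $n$ sampled rollouts of which $c$ are correct, the standard unbiased estimator of pass@k is $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal sketch (the function name is ours, for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled rollouts, c of which are correct."""
    if n - c < k:
        # Every size-k subset of the n rollouts must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem where none of 128 samples succeed (the "hard" set below) has an estimated pass@k of exactly 0 for any k.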
Experiment setup. We first curate a set of hard math reasoning problems from the DAPO, OmniMath (levels 5-8), and AceReason datasets on which the base model fails to produce any successful rollout with large parallel sampling ($k=128$) and under a large length budget (32k tokens). We then run RL training augmented with: 1) a token-level entropy bonus, and 2) following DAPO, a more generous importance-ratio clipping term in a PPO-style policy gradient update, allowing the LLM to update more aggressively on rare, off-policy rollouts. These two approaches are popular and representative (retrospective) exploration methods for RL training of LLMs today. Other notions of novelty or dynamics prediction error do not transfer naturally from deep RL to LLMs, because LLMs for math present a single-step, bandit learning problem.
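To illustrate the two modifications above, here is a minimal numpy sketch of a PPO-style token-level surrogate loss with an entropy bonus and an asymmetric ("clip-higher", DAPO-style) importance-ratio clip. The function name and default coefficients are illustrative, not our exact training code:

```python
import numpy as np

def pg_loss_with_exploration(logp_new, logp_old, adv, token_entropy,
                             eps_low=0.2, eps_high=0.28, ent_coef=0.001):
    """PPO-style token-level loss with two exploration modifications:
    (1) an entropy bonus weighted by ent_coef, and
    (2) an asymmetric clip (eps_high > eps_low) that lets the policy push
        up low-probability tokens more aggressively than a symmetric clip.
    All inputs are 1-D arrays over tokens."""
    ratio = np.exp(logp_new - logp_old)                  # importance ratio
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * adv, clipped * adv)   # pessimistic PPO objective
    # Minimize the negative surrogate; subtracting the entropy term rewards
    # higher token-level entropy (the exploration bonus).
    return -surrogate.mean() - ent_coef * token_entropy.mean()
```

With $\epsilon_{\text{high}} > \epsilon_{\text{low}}$, a rare token whose probability rises sharply (ratio well above 1) keeps contributing gradient up to $1+\epsilon_{\text{high}}$ instead of being cut off at $1+\epsilon_{\text{low}}$.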

Figure 2. Left: Evolution of the fraction of solvable problems (measured via pass@8 at a 16k output length). Right: average token-level entropy statistics over the course of RL training. Observe that all of these representative classical exploration methods make a similarly small number of problems solvable, while creating optimization pathologies in the sense that entropy blows up. We also notice large sensitivity to the clip threshold $\epsilon_{\text{high}}$ in our runs.
Empirical findings. Observe in Figure 2 that incorporating an entropy bonus or utilizing a larger clip ratio ($\epsilon_{\text{high}}$) both drive the average token-level entropy of the trained model to substantially large values. An alternative is to run on-policy training with no entropy bonus at all (shown by the light green line). All of these approaches end up solving a similar number of problems, with no clear signs of improved solvability of the harder problems, i.e., no signs of “improved” exploration. The addition of these bonuses simply makes optimization pathological.