A theoretical + empirical investigation arguing that the standard approach of scaling environment parallelism in on-policy RL breaks in delayed-reward settings — and that longer rollout horizons must come first. Formalizes when and why this regime transition occurs, directly challenging recent scaling results (Mayor et al., 2025). Paper in draft; theory complete, experiments in progress.
The dominant scaling heuristic in on-policy RL is straightforward: add more parallel environments to collect more data per update. Recent work (Mayor et al., 2025) supports this, showing that scaling parallelism reduces gradient noise and improves stability.
I argue this heuristic has a blind spot. It implicitly assumes that each rollout — no matter how short — contains a usable learning signal. In delayed-reward tasks, this assumption fails: when the rollout window is shorter than the task's credit horizon, rewards fall outside the window, and the policy update degrades into extrapolation from an immature value function. More parallelism means more data, but not better data.
Under a fixed transition budget per update (|B| = N × L), you can either run many short rollouts (high parallelism N) or fewer long rollouts (longer horizon L). The right choice depends on why reward is scarce:
| Regime | Bottleneck | What helps |
|---|---|---|
| Low density (rare but immediate rewards) | Encountering reward events | More parallelism — independent attempts increase coverage |
| High depth (delayed rewards, long credit horizons) | Propagating credit across time | Longer rollouts — temporal coverage must come before variance reduction |
The regime transition is sharp: once L exceeds the task's effective delay scale, learning quality improves rapidly, and parallelism gains become secondary.
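A minimal sketch of the budget trade-off in Python (not code from the paper); the budget of 8192 transitions and the delay scale of 256 steps are illustrative assumptions:

```python
# Illustrative only: enumerate (N, L) splits of a fixed transition budget
# |B| = N * L and mark which splits give the rollout horizon L enough
# temporal coverage to reach an assumed effective delay scale.

BUDGET = 8192        # transitions collected per update (assumed value)
DELAY_SCALE = 256    # effective credit horizon of the task (assumed value)

def budget_splits(budget: int) -> list[tuple[int, int]]:
    """Return all (N, L) pairs with N * L == budget."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]

for n_envs, horizon in budget_splits(BUDGET):
    covered = horizon >= DELAY_SCALE
    print(f"N={n_envs:5d}  L={horizon:5d}  reaches delay scale: {covered}")
```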
**Fixed-budget scaling framework.** Formalizes how environment parallelism (N) and rollout horizon (L) serve fundamentally different roles under a fixed transition budget: variance reduction vs. temporal coverage.
**Boundary-bootstrapping analysis.** Shows how short rollouts force advantage estimates to depend on value predictions beyond the rollout boundary. Early in training, when the critic is inaccurate, this creates a compounding error that parallelism cannot fix.
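A sketch of the mechanism using a standard truncated GAE computation (my illustration, not the paper's code); the boundary error of 0.5 and the horizon of 64 steps are assumed values:

```python
import numpy as np

def truncated_gae(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Standard truncated GAE over a length-L rollout, bootstrapping from V(s_L)."""
    values_ext = np.append(values, bootstrap_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Illustrative setup: the task's only reward arrives well past step L = 64,
# so the rollout window contains no reward at all.
L = 64
rewards = np.zeros(L)
values = np.zeros(L)        # assume the critic is exact inside the window
boundary_error = 0.5        # hypothetical error in V(s_L), beyond the boundary

adv = truncated_gae(rewards, values, bootstrap_value=boundary_error)
# With a perfect boundary value every advantage would be zero; here each
# advantage is a rescaled copy of the boundary error, and averaging over
# more parallel environments with the same critic does not remove it.
print(adv[-1], adv[0])   # ~0.495 at the boundary, decayed toward the start
```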
**Active fraction metric.** Introduces a diagnostic measuring what fraction of timesteps in a rollout carry a nontrivial learning signal, separating "data volume" from "data usefulness."
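A minimal sketch of such a diagnostic; treating "nontrivial signal" as advantage magnitude above a small threshold is my assumption, not the paper's definition:

```python
import numpy as np

def active_fraction(advantages: np.ndarray, threshold: float = 1e-3) -> float:
    """Fraction of timesteps whose advantage magnitude exceeds a small threshold.

    advantages: array of shape (N, L), one row per parallel environment.
    """
    return float(np.mean(np.abs(advantages) > threshold))

# Usage: a batch where most timesteps carry essentially zero advantage.
rng = np.random.default_rng(0)
adv = np.zeros((128, 64))                     # N = 128 envs, L = 64 steps
adv[:, -4:] = rng.normal(size=(128, 4))       # only the last few steps carry signal
print(active_fraction(adv))                   # ~0.06: lots of data, little of it useful
```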
**Depth vs. density taxonomy.** Formally distinguishes sparse rewards (low density, where parallelism helps) from delayed rewards (high depth, where horizon must come first), and shows that the two require opposite scaling strategies.
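A toy comparison of the two regimes under the simplest reward models I could assume (a per-step reward probability for density, a hard delay for depth); none of the constants come from the paper:

```python
import numpy as np

def p_any_reward_density(n_envs: int, horizon: int, p_step: float = 1e-3) -> float:
    """Low density: rare rewards that can occur at any step.
    The chance of seeing at least one reward in the batch depends only on
    the total transition count N * L, so trading L for N costs nothing here."""
    return 1.0 - (1.0 - p_step) ** (n_envs * horizon)

def p_any_reward_depth(n_envs: int, horizon: int, delay: int = 256) -> float:
    """High depth: reward only arrives after a fixed delay.
    No amount of parallelism helps until the horizon covers the delay."""
    return 1.0 if horizon >= delay else 0.0

budget = 8192  # fixed transition budget |B| = N * L (assumed value)
for n_envs in (16, 64, 256):
    horizon = budget // n_envs
    print(n_envs, horizon,
          round(p_any_reward_density(n_envs, horizon), 3),
          p_any_reward_depth(n_envs, horizon))
# Density coverage stays ~constant across splits; depth coverage collapses
# to zero as soon as the horizon drops below the delay.
```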
**Gradient-level signal quality.** Argues that update quality should be assessed via gradient SNR and directional coherence rather than reward curves, and proves a coherence lemma showing that increasing N can shrink the effective update when parallel gradients disagree.
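A sketch of the two diagnostics, assuming per-environment gradients stacked into an (N, D) matrix; these SNR and coherence definitions are standard choices and may not match the paper's exact formulation:

```python
import numpy as np

def gradient_snr(grads: np.ndarray) -> float:
    """Ratio of the squared mean-gradient norm to the average per-component variance."""
    mean_grad = grads.mean(axis=0)
    noise = grads.var(axis=0).mean()
    return float(mean_grad @ mean_grad / (noise + 1e-12))

def directional_coherence(grads: np.ndarray) -> float:
    """Mean cosine similarity between each per-environment gradient and the batch mean."""
    mean_grad = grads.mean(axis=0)
    mean_unit = mean_grad / (np.linalg.norm(mean_grad) + 1e-12)
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    return float((unit @ mean_unit).mean())

# Usage: two batches with similar per-gradient magnitudes but different agreement.
rng = np.random.default_rng(0)
aligned = np.ones((64, 10)) + 0.1 * rng.normal(size=(64, 10))
disagreeing = rng.choice([-1.0, 1.0], size=(64, 1)) * np.ones((64, 10))

for name, g in [("aligned", aligned), ("disagreeing", disagreeing)]:
    print(name, round(gradient_snr(g), 2), round(directional_coherence(g), 2))
# When parallel gradients disagree, the averaged update shrinks even though each
# individual gradient is large: adding environments does not enlarge the effective step.
```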