Not Just a Single Algorithm, but a Synergistic System of Environment Modeling, Learning Signals, Asynchronous Data Flows, Policy Optimization, and Infrastructure

Zhiyuan Hu, March 2026

Chinese Version (中文版)

Over the past year, the most significant signal from technical reports released by major foundation model companies is not the emergence of yet another PPO/GRPO variant. Instead, it is the shift of truly effective agentic RL from single-turn text optimization to systematic policy learning within environments characterized by long contexts, tool usage, partial observability, and asynchronous execution. Kimi K1.5 [1] brought long-context RL, partial rollout reuse, and mirror-descent-style policy optimization to the forefront; Kimi K2 [2] and K2.5 [3] further publicly disclosed critical ideas such as agentic data synthesis, multimodal RL, token-level clipping, GRM rubrics, Toggle, and PARL / Agent Swarm. MiniMax highlighted another fundamental reality: when the duration distribution of agent rollouts stretches from seconds to minutes or hours, the training bottleneck is no longer just loss design. It becomes a trilemma among throughput, stability, and agent flexibility, together with the question of how asynchronous execution, scheduling policies, context management, and efficiency-oriented optimization objectives jointly shape the learning loop. GLM, in turn, emphasizes phased RL: Reasoning RL, Agentic RL, and General RL are not mixed into a single run, but advanced through a sequential pipeline, leveraging asynchronous RL infrastructure and cross-stage distillation to balance long-horizon agent learning with capability retention. As a result, the central challenge in Agentic RL is no longer just how to update parameters. It is how to keep generating usable learning signals in real agent environments and how to turn online interaction trajectories into stable policy improvement.

Agentic RL training should not be viewed as a string of isolated modules, but understood around three invariants. First, the model's exploration capability must be protected. Second, the training loop must keep producing usable, non-degenerate update signals to prevent training batches where the advantage collapses. Third, the distribution shifts among rollout, parameter updating, and actual deployment must be controlled, because long trajectories, asynchronous execution, and tool environments naturally introduce staleness, off-policy drift, and train-serving mismatch. Related research such as GEM [4], ReMax [5], Knapsack RL [6], and RL-ADA [7], as well as the technical routes of Kimi, MiniMax, and GLM, can in fact all be unified under these three objectives.

I. Why Agentic RL Differs from Traditional RLHF / RLVR

The target of Agentic RL training is no longer a single-turn text mapping that outputs an answer given a prompt, but a policy interacting within an environment. This policy must handle state updates, tool calls, external observations, context management, sub-task delegation, termination conditions, and cost/latency/safety constraints. In other words, agentic RL is closer to policy learning with long time horizons, partial observability, and structured action spaces than to post-hoc re-ranking of text continuation probabilities.

This directly introduces four changes in training dynamics. First, the state is no longer determined solely by user input; it consists of the historical trajectory, tool returns, environment feedback, memory summaries, and the current context. Second, the action is no longer just the next token; it could be selecting which tool to use, what parameters to fill, whether to compress the context, or whether to dispatch sub-tasks in parallel. Third, the reward becomes more delayed, sparse, and compound: it must evaluate not only the correctness of the outcome but also whether the process is accurate, efficient, economical in tokens, and effective within training-time budgets. Fourth, rollout times are highly uneven, making synchronous training prohibitively expensive while asynchronous training introduces distribution shifts. Therefore, the essence of agentic RL is not slapping GRPO/PPO onto longer outputs, but integrating the environment, reward, sampling, scheduling, caching, optimizer, and evaluation into a single closed loop.
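
To make the contrast concrete, the sketch below shows what a single agentic rollout looks like when the state is the whole interaction history and the reward arrives only at termination. It is a minimal illustration: `AgentState`, `policy.act`, `env.step`, and `env.score` are hypothetical interfaces, not the API of any system cited above.

```python
# Minimal sketch of the agentic rollout loop described above (illustrative only; `AgentState`,
# `policy.act`, `env.step`, and `env.score` are hypothetical interfaces, not a specific framework).
from dataclasses import dataclass, field

@dataclass
class AgentState:
    history: list = field(default_factory=list)  # prior turns, tool returns, environment feedback
    memory_summary: str = ""                     # compressed context / memory
    step: int = 0

def rollout(policy, env, prompt, max_steps=50):
    """One trajectory: the state is the whole interaction history; reward arrives only at the end."""
    state = AgentState(history=[{"role": "user", "content": prompt}])
    trajectory = []
    for _ in range(max_steps):
        # The action is structured: a tool call, a context-compression step,
        # a dispatched sub-task, or a final answer.
        action = policy.act(state)
        observation, done = env.step(action)     # tool output / environment observation
        trajectory.append((state.step, action, observation))
        state.history.append({"role": "tool", "content": observation})
        state.step += 1
        if done:
            break
    # The reward is delayed, sparse, and compound: outcome correctness plus process/cost terms.
    reward = env.score(trajectory)
    return trajectory, reward
```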

II. Understanding the Three Invariants of Agentic RL

If we view Agentic RL as a policy learning system that continuously interacts, samples, and updates in real environments, then the most crucial factor is no longer which RL algorithm is used at a specific step, but whether the training loop can sustain three underlying conditions over the long term. By invariants, I do not mean quantities that are mathematically constant; I mean conditions that naturally tend to drift and must be constantly pulled back into a learnable, optimizable range throughout training. More precisely, the first two are lower bounds that must not be breached: the policy exploration space cannot collapse, and learning signals cannot degenerate. The third is an upper bound that must not be exceeded: the distribution shifts among rollouts, updates, and deployment cannot spiral out of control.

1) The First Invariant: The Policy Exploration Space Cannot Collapse Prematurely

The first invariant does not mean the output should just be more random, nor that token entropy must remain high. It means that, given a state, the model retains a set of behavior trajectories that are distinguishable, semantically distinct, and actually feasible. For Agentic RL, this exploration space is not just different phrasing, but different task decomposition strategies, tool call sequences, memory read/write strategies, context organization methods, stopping conditions, and self-correction paths. This space tends to shrink because training naturally pushes probability mass toward a few currently dominant patterns. As long as the training objective primarily rewards a specific kind of behavior (shorter, closer to a standard operating procedure, or easier for a verifier to recognize), the model will gradually marginalize other paths that could also succeed. In agent scenarios, this compression is more severe than in single-turn QA because tool interfaces, scaffolds, context templates, and termination logic inherently bias toward certain fixed workflows. Maintaining this invariant is critical because it dictates whether subsequent RL has any real search space left. The value of RL is not repeatedly boosting the probability of a known best answer, but enabling the model to continuously discover previously unamplified high-return behaviors through interaction. If the exploration space collapses prematurely, subsequent sampling is mostly superficial perturbation around the same routine: the reward spread shrinks, new learning directions dwindle, and training may appear to continue while in fact only performing local perturbations within a shrunken space.
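
One practical proxy for this invariant, sketched below under my own assumptions rather than taken from the cited reports, is to track per-prompt behavioral diversity and reward spread across rollouts: if the number of distinct tool-call sequences and the reward spread both keep shrinking, the exploration space is likely collapsing.

```python
# A rough monitor for the first invariant (my own illustration, not from the cited reports):
# per prompt, count how many distinct tool-call sequences the rollouts still cover and how wide
# the reward spread remains; both shrinking together is a warning sign of exploration collapse.
from collections import defaultdict
import statistics

def exploration_stats(rollouts):
    """rollouts: list of dicts with keys 'prompt_id', 'tool_calls' (list of tool names), 'reward'."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt_id"]].append(r)
    stats = {}
    for pid, group in by_prompt.items():
        distinct_paths = len({tuple(r["tool_calls"]) for r in group})
        rewards = [r["reward"] for r in group]
        spread = statistics.pstdev(rewards) if len(rewards) > 1 else 0.0
        stats[pid] = {"distinct_tool_paths": distinct_paths, "reward_spread": spread}
    return stats
```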

2) The Second Invariant: Learning Signals Must Remain Non-Degenerate

Even if the model retains multiple feasible paths, they are not guaranteed to be learned. Parameter updates rely not merely on the existence of alternative possibilities, but on whether the differences between trajectories can stably translate into non-zero, directionally clear, and reasonably scaled gradients. Thus, the second invariant requires the training system to continuously generate non-degenerate learning signals: different rollouts must be comparable and distinguishable, and these comparisons must ultimately manifest as parameter updates. This invariant is vulnerable because the reward structure of Agentic RL naturally induces signal collapse. Real-world tasks often feature delayed rewards, sparse outcomes, and lengthy processes, culminating in binary success/failure labels, coarse-grained rubrics, or a few high-level quality scores. Consequently, a single batch of samples easily degrades into two scenarios: simple tasks are entirely correct, while difficult tasks are entirely wrong. The former indicates local saturation; the latter shows the model has not entered a learnable region. Yet, for gradients, both yield the same result: insufficient intra-group variance, vanishing advantages, and degenerate update directions. Furthermore, long trajectories stretch credit assignment, partial observability obscures the reasons for success/failure, and tool/verifier noise further pollutes comparisons. As a result, the system superficially collects massive interaction data but practically produces unlearnable samples.
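
The degenerate cases described above are easy to see in a group-normalized advantage computation of the kind popularized by GRPO. The sketch below is illustrative: when every rollout in a group receives the same reward, whether all correct or all wrong, the intra-group variance is zero and every advantage collapses to zero.

```python
# Group-normalized (GRPO-style) advantages degenerate when a group has zero reward variance:
# all-correct and all-wrong groups both produce zero advantage for every trajectory.
import statistics

def group_advantages(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:                       # every rollout scored the same (all 1s or all 0s)
        return [0.0] * len(rewards)     # the group contributes no update direction
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1, 1, 1, 1]))   # saturated task  -> [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0, 0, 0, 0]))   # too-hard task   -> [0.0, 0.0, 0.0, 0.0]
print(group_advantages([1, 0, 1, 0]))   # learnable group -> non-zero contrasts
```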

Maintaining non-degenerate learning signals dictates whether training is genuinely pushing capability boundaries or merely burning compute. Many RL failures occur not because the model is weak or data is scarce, but because the system cannot stably answer a fundamental question: near the model's current capability frontier, which behaviors are more worthy of amplification than others? Without answering this, advantages collapse, gradients approach zero, and training appears busy but stagnant. The quality of a learning signal depends not on the number of reward terms, but on the learnability of the comparisons. A reward can be complex, but if it fails to stably distinguish slightly better from slightly worse trajectories near the model's boundary, it still yields degenerate gradients. Conversely, a seemingly simple feedback mechanism that consistently exposes valid trajectory differences serves as a high-quality learning signal. Therefore, what the second invariant truly demands is not constant reward magnitude, but constant comparability and the ability to support stable updates. Agentic RL needs not more scores, but more behavior contrasts that the optimizer can actually exploit.
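
One simple remedy consistent with this view, sketched here as an assumption rather than a description of any specific system, is to separate groups by whether their rollouts still disagree, train only on the groups that do, and reschedule or retire the rest so compute stays near the capability frontier.

```python
# Keep only groups whose rollouts still disagree; route saturated prompts out of the pool and
# reschedule too-hard ones for later (a sketch of one possible policy, not any specific system's).
def split_learnable(groups, eps=1e-6):
    """groups: dict mapping prompt_id -> list of rewards for that prompt's rollouts."""
    learnable, saturated, too_hard = {}, [], []
    for pid, rewards in groups.items():
        if max(rewards) - min(rewards) < eps:            # zero-variance group: no usable contrast
            (saturated if rewards[0] > 0 else too_hard).append(pid)
        else:
            learnable[pid] = rewards
    return learnable, saturated, too_hard   # train on `learnable`; reschedule or retire the rest
```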

3) The Third Invariant: The Shift Among Rollout, Policy-Update, and Serving-Time Distributions Must Be Controllable

Controlling these distribution shifts is critical because it determines whether training gains transfer to actual execution. If the shift is too large, a typical distortion occurs: the model appears to learn well on the samples seen by the learner, but these improvements do not stably show up in deployment-time tool calls, context management, or long-term interaction. They might even amplify into performance degradation upon release due to off-policy bias, misaligned interfaces, or state representation mismatches. For long-trajectory agents, every minor early shift accumulates along subsequent state transitions, eventually pushing the policy toward directions that look reasonable in training but are inexecutable in real environments. The distribution shift in Agentic RL is not just caused by external environment changes; it is largely manufactured by the system itself: asynchronous execution, partial rollout reuse, scheduling and caching policies, and train-serving mismatches between the rollout engine and the learner all inject shift. These seemingly infrastructure-level choices directly alter what the learner is actually optimizing. Therefore, the third invariant is not purely an algorithmic correction issue, but a system-level consistency problem.
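
A minimal way to operationalize this invariant, again a sketch under assumed thresholds rather than a recipe from the cited reports, is to record the policy version and per-token log-probabilities at rollout time, then drop or down-weight trajectories whose version lag or drift from the current policy exceeds a budget.

```python
# Gate each trajectory on staleness and drift before it reaches the learner (illustrative;
# the version-lag rule and thresholds are assumptions, not taken from the cited reports).
def usable(traj, current_version, max_version_lag=4, max_mean_abs_logratio=0.5):
    """traj carries 'policy_version', 'logprobs_behavior' (rollout-time), and 'logprobs_current'."""
    lag = current_version - traj["policy_version"]        # parameter updates since this rollout
    if lag > max_version_lag:
        return False
    # mean per-token |log-ratio| between the current policy and the behavior policy that sampled it
    log_ratios = [new - old for new, old in
                  zip(traj["logprobs_current"], traj["logprobs_behavior"])]
    drift = sum(abs(x) for x in log_ratios) / max(len(log_ratios), 1)
    return drift <= max_mean_abs_logratio
```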

4) Why Understand These Three Invariants Together

These invariants are not isolated elements but three coupled boundaries of the same training system. The first determines whether the policy space is wide enough; the second dictates whether differences within that space translate into valid gradients; the third ensures those gradients affect the correct distribution. With exploration but no signal, training becomes high-noise trial-and-error. With signal but no exploration, training quickly collapses into narrow local optima. With both exploration and signal but uncontrolled distribution shift, the learned behavior may not be what is needed during deployment. Moreover, there is inherent tension among them. Stronger exploration typically widens the behavior distribution, thins out comparisons, and makes distribution shifts harder to control. Over-pursuing stable updates easily flattens the exploration space. Making the verifier overly strict in order to produce sharper learning signals can collapse the model into a few reward-hacking modes. Therefore, the true challenge of Agentic RL is not just minimizing a specific loss, but constantly keeping exploration, signal, and distribution within the same learnable interval inside a continuously changing, asynchronous system that interacts with external environments.
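
If the three monitored quantities from the earlier sketches are available, the "learnable interval" can be expressed as a single guardrail check. The thresholds and metric names below are purely illustrative assumptions.

```python
# Tying the three invariants together as one guardrail check (illustrative thresholds only;
# the metric names echo the earlier sketches and are assumptions, not any system's API).
def invariants_ok(metrics,
                  min_distinct_paths=2,    # exploration: rollouts per prompt still cover distinct behaviors
                  min_adv_std=0.05,        # signal: group advantages have not collapsed to zero
                  max_policy_drift=0.5):   # distribution: rollout-vs-learner drift stays bounded
    return (metrics["median_distinct_tool_paths"] >= min_distinct_paths
            and metrics["advantage_std"] >= min_adv_std
            and metrics["mean_abs_logratio"] <= max_policy_drift)
```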

III. The Eight Aspects of Agentic RL