https://arxiv.org/abs/2505.24298
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
IIIS, Tsinghua University; Ant Research; HKUST
Introduction
- Background: Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks.
- Problem: Most existing large-scale RL systems for LLMs are synchronous, alternating between generation and training in a batch setting where all rollouts in a training batch are generated by the same (latest) model.
- This stabilizes RL training but suffers from severe system-level inefficiency.
- Generation must wait for the longest output in the batch to complete before the model can be updated, resulting in GPU underutilization.
- Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size.
- Motivation: This paper presents AReaL, a fully asynchronous RL system that completely decouples generation from training.
- The goal of AReaL is to achieve this decoupling without hurting final performance.
- Rollout workers continuously generate new outputs without waiting, leading to high GPU utilization.
- Meanwhile, the trainer workers in AReaL run parallel model updates whenever a training batch is obtained from the rollout workers.
- Once the model is updated, the new weights are synchronized to each rollout worker (see the sketch after this list).
- Challenge and Method:
- In such an asynchronous design, each training batch of AReaL may contain samples generated by different model versions.
- Therefore, AReaL incorporates a modified PPO objective that can leverage samples generated by much older model versions without any performance drop.
- AReaL also incorporates several system-level optimizations, including interruptible rollout workers, dynamic batching for variable-length outputs, and a parallel reward service.
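To make the decoupled design concrete, here is a minimal Python sketch of the producer/consumer structure described above. It is not AReaL's actual API: the function names, the `queue`-based replay buffer, and the placeholder generation/reward logic are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not AReaL's API): rollout workers continuously push
# trajectories into a shared buffer, while a trainer consumes fixed-size batches,
# updates the model, and publishes a new weight version for the rollout workers.
import queue
import threading

TRAIN_BATCH_SIZE = 4

replay_buffer = queue.Queue()        # trajectories produced by rollout workers
latest_weights = {"version": 0}      # stand-in for the trainer's published parameters
weights_lock = threading.Lock()

def rollout_worker(worker_id, prompts):
    """Generate continuously; never wait for the trainer."""
    for prompt in prompts:
        with weights_lock:
            version = latest_weights["version"]      # weights currently loaded
        response = f"response-to-{prompt}"           # placeholder for LLM decoding
        reward = float(len(response) % 2)            # placeholder for the reward service
        replay_buffer.put({"prompt": prompt, "response": response,
                           "reward": reward, "version": version})

def trainer():
    """Accumulate a training batch from the buffer, update, publish new weights."""
    step = 0
    while True:
        batch = [replay_buffer.get() for _ in range(TRAIN_BATCH_SIZE)]
        # ... a PPO update on `batch` would go here ...
        step += 1
        with weights_lock:
            latest_weights["version"] = step         # rollout workers pick this up

workers = [threading.Thread(target=rollout_worker,
                            args=(i, [f"p{i}-{j}" for j in range(8)]))
           for i in range(2)]
threading.Thread(target=trainer, daemon=True).start()
for w in workers:
    w.start()
for w in workers:
    w.join()
```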

Method
System overview

- Interruptible Rollout Worker: Handles two types of requests:
- The `generate` request generates responses given prompts.
- The `upload_weights` request interrupts all ongoing generations and loads the parameters of the new version. Upon interruption, the rollout workers discard KV caches computed with the old weights and re-compute them using the new weights (see the sketch after this list).
- Reward Service: Evaluates generated responses and returns rewards (e.g., by verifying the correctness of a final answer).
- Trainer Workers: Continuously sample from the replay buffer, accumulating data until reaching the configured training batch size, then perform PPO updates.
- Rollout Controller:
- Reads data from the dataset and invokes the rollout worker’s `generate` request.
- The received response is then sent to the reward service to obtain the reward.
- The trajectory, along with the reward, is stored in the replay buffer, waiting to be consumed by the trainer workers.
- Calls the rollout workers’ `upload_weights` request to update the model weights.
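As a rough illustration of the interruption mechanism described above, the following sketch (class and method names assumed for illustration, not AReaL's implementation) shows how an `upload_weights` call can interrupt an in-flight `generate` request: the KV cache built under the old weights is discarded and re-computed for the prompt plus the tokens generated so far, after which decoding continues under the new weights.

```python
# Sketch of an interruptible rollout worker; strings stand in for real KV-cache
# entries and decoded tokens.
class InterruptibleRolloutWorker:
    def __init__(self):
        self.weights_version = 0
        self.pending_version = None      # set by upload_weights, consumed mid-generation

    def upload_weights(self, new_version):
        """Interrupt ongoing generations and request a switch to the new parameters."""
        self.pending_version = new_version

    def generate(self, prompt, max_new_tokens=8):
        generated = []
        kv_cache = [f"kv(v{self.weights_version})"]  # prompt prefill under current weights
        while len(generated) < max_new_tokens:
            if self.pending_version is not None:
                # Interruption: discard the KV cache built under the old weights and
                # re-prefill the prompt plus already-generated tokens under new weights.
                self.weights_version = self.pending_version
                self.pending_version = None
                kv_cache = [f"kv(v{self.weights_version})"] * (1 + len(generated))
            generated.append(f"tok{len(generated)}@v{self.weights_version}")
            kv_cache.append(f"kv(v{self.weights_version})")
        return {"prompt": prompt, "tokens": generated,
                "final_version": self.weights_version}

worker = InterruptibleRolloutWorker()
worker.upload_weights(new_version=1)     # a weight update arrives during rollout
print(worker.generate("prompt-0"))       # decoding proceeds under version 1 after the interruption
```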

Staleness-aware training
- Introduce a hyperparameter η representing the maximum permitted staleness in each training batch (i.e., the version gap between the policy that generated a sample and the current policy) for staleness-aware training; a sketch of a staleness filter and a decoupled-PPO-style loss follows below.
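The paper's exact modified objective is not reproduced in these notes. The sketch below shows one decoupled-PPO-style formulation consistent with the description above, in which PPO clipping is centered on a recent "proximal" policy and an importance weight corrects from the (possibly stale) behavior policy that generated the data, plus a staleness filter that enforces the cap η. The tensor names, the clip range `eps`, and the filter logic are assumptions for illustration.

```python
# Sketch only: a staleness filter plus a decoupled-PPO-style token loss.
import torch

def staleness_filter(samples, current_version, eta):
    """Keep only samples whose generating policy lags the current policy by at most eta."""
    return [s for s in samples if current_version - s["version"] <= eta]

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """
    logp_new:   log pi_theta(a|s) under the policy being optimized
    logp_prox:  log pi_prox(a|s) under a recent "proximal" policy (clipping center)
    logp_behav: log pi_behav(a|s) under the possibly stale policy that generated the data
    adv:        advantage estimates
    """
    # Off-policy correction from the behavior policy to the proximal policy.
    behav_to_prox = torch.exp(logp_prox - logp_behav).detach()
    # Standard PPO clipping, but centered on the proximal policy rather than
    # the (stale) behavior policy.
    ratio = torch.exp(logp_new - logp_prox)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -(behav_to_prox * torch.minimum(unclipped, clipped)).mean()

# Toy usage: random tensors stand in for per-token log-probabilities.
samples = [{"version": v} for v in (7, 9, 10)]
fresh = staleness_filter(samples, current_version=10, eta=2)   # drops the version-7 sample
T = 6
logp_new = torch.randn(T, requires_grad=True)
loss = decoupled_ppo_loss(logp_new, torch.randn(T), torch.randn(T), torch.randn(T))
loss.backward()
```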