https://arxiv.org/abs/2505.24298
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
IIIS, Tsinghua University; Ant Research; HKUST
Introduction
- Background: Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks.
- Problem: Most existing large-scale RL systems for LLMs are synchronous, alternating between generation and training in a batch setting where all rollouts in a training batch are generated by the same (latest) model.
- This stabilizes RL training but suffers from severe system-level inefficiency.
- Generation must wait for the longest output in the batch to complete before the model can be updated, resulting in GPU underutilization.
- Synchronous systems distribute generation across all devices, reducing the per-GPU decoding batch size.
- Motivation: This paper presents AReaL, a fully asynchronous RL system that completely decouples generation from training.
- The goal of AReaL is to achieve this decoupling without hurting final performance.
- Rollout workers continuously generate new outputs without waiting, leading to high GPU utilization.
- Meanwhile, the trainer workers in AReaL run parallel model updates whenever a training batch is obtained from the rollout workers.
- Once the model is updated, the new weights are synchronized to each rollout worker (see the sketch after this list).
- Challenge and Method:
- In such an asynchronous design, each training batch of AReaL may contain samples generated by different model versions.
- Therefore, AReaL incorporates a modified PPO objective that can leverage samples generated by much older model versions without any performance drop.
- AReaL also incorporates several system-level optimizations, including interruptible rollout workers, dynamic batching for variable-length outputs, and a parallel reward service.
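To make the decoupled design concrete, here is a minimal Python sketch of the producer/consumer structure described above. It is not AReaL's actual API: the function names, the `queue`-based replay buffer, and the placeholder generation/reward logic are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not AReaL's API): rollout workers continuously push
# trajectories into a shared buffer, while a trainer consumes fixed-size batches,
# updates the model, and publishes a new weight version for the rollout workers.
import queue
import threading

TRAIN_BATCH_SIZE = 4

replay_buffer = queue.Queue()        # trajectories produced by rollout workers
latest_weights = {"version": 0}      # stand-in for the trainer's published parameters
weights_lock = threading.Lock()

def rollout_worker(worker_id, prompts):
    """Generate continuously; never wait for the trainer."""
    for prompt in prompts:
        with weights_lock:
            version = latest_weights["version"]      # weights currently loaded
        response = f"response-to-{prompt}"           # placeholder for LLM decoding
        reward = float(len(response) % 2)            # placeholder for the reward service
        replay_buffer.put({"prompt": prompt, "response": response,
                           "reward": reward, "version": version})

def trainer():
    """Accumulate a training batch from the buffer, update, publish new weights."""
    step = 0
    while True:
        batch = [replay_buffer.get() for _ in range(TRAIN_BATCH_SIZE)]
        # ... a PPO update on `batch` would go here ...
        step += 1
        with weights_lock:
            latest_weights["version"] = step         # rollout workers pick this up

workers = [threading.Thread(target=rollout_worker,
                            args=(i, [f"p{i}-{j}" for j in range(8)]))
           for i in range(2)]
threading.Thread(target=trainer, daemon=True).start()
for w in workers:
    w.start()
for w in workers:
    w.join()
```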

Method
System overview

- Interruptible Rollout Worker: Handles two types of requests:
- The `generate` request generates responses given prompts.
- The `upload_weights` request interrupts all ongoing generations and loads the parameters of the new version. Upon interruption, the rollout workers discard KV caches computed with the old weights and re-compute them using the new weights (see the sketch after this list).
- Reward Service: Evaluates generated responses and returns rewards (e.g., by verifying the correctness of a final answer).
- Trainer Workers: Continuously sample from the replay buffer, accumulating data until reaching the configured training batch size, then perform PPO updates.
- Rollout Controller:
- Reads data from the dataset and invokes the rollout worker’s `generate` request.
- The received response is then sent to the reward service to obtain the reward.
- The trajectory, along with the reward, is stored in the replay buffer, waiting to be consumed by the trainer workers.
- Calls the rollout workers’ `upload_weights` request to update the model weights.
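As a rough illustration of the interruption mechanism described above, the following sketch (class and method names assumed for illustration, not AReaL's implementation) shows how an `upload_weights` call can interrupt an in-flight `generate` request: the KV cache built under the old weights is discarded and re-computed for the prompt plus the tokens generated so far, after which decoding continues under the new weights.

```python
# Sketch of an interruptible rollout worker; strings stand in for real KV-cache
# entries and decoded tokens.
class InterruptibleRolloutWorker:
    def __init__(self):
        self.weights_version = 0
        self.pending_version = None      # set by upload_weights, consumed mid-generation

    def upload_weights(self, new_version):
        """Interrupt ongoing generations and request a switch to the new parameters."""
        self.pending_version = new_version

    def generate(self, prompt, max_new_tokens=8):
        generated = []
        kv_cache = [f"kv(v{self.weights_version})"]  # prompt prefill under current weights
        while len(generated) < max_new_tokens:
            if self.pending_version is not None:
                # Interruption: discard the KV cache built under the old weights and
                # re-prefill the prompt plus already-generated tokens under new weights.
                self.weights_version = self.pending_version
                self.pending_version = None
                kv_cache = [f"kv(v{self.weights_version})"] * (1 + len(generated))
            generated.append(f"tok{len(generated)}@v{self.weights_version}")
            kv_cache.append(f"kv(v{self.weights_version})")
        return {"prompt": prompt, "tokens": generated,
                "final_version": self.weights_version}

worker = InterruptibleRolloutWorker()
worker.upload_weights(new_version=1)     # a weight update arrives during rollout
print(worker.generate("prompt-0"))       # decoding proceeds under version 1 after the interruption
```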

Staleness-aware training
- Introduce a hyperparameter η representing the maximum permitted staleness in each training batch (i.e., the version gap between the policy that generated a sample and the current policy) for staleness-aware training; a sketch of a staleness filter and a decoupled-PPO-style loss follows below.
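The paper's exact modified objective is not reproduced in these notes. The sketch below shows one decoupled-PPO-style formulation consistent with the description above, in which PPO clipping is centered on a recent "proximal" policy and an importance weight corrects from the (possibly stale) behavior policy that generated the data, plus a staleness filter that enforces the cap η. The tensor names, the clip range `eps`, and the filter logic are assumptions for illustration.

```python
# Sketch only: a staleness filter plus a decoupled-PPO-style token loss.
import torch

def staleness_filter(samples, current_version, eta):
    """Keep only samples whose generating policy lags the current policy by at most eta."""
    return [s for s in samples if current_version - s["version"] <= eta]

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """
    logp_new:   log pi_theta(a|s) under the policy being optimized
    logp_prox:  log pi_prox(a|s) under a recent "proximal" policy (clipping center)
    logp_behav: log pi_behav(a|s) under the possibly stale policy that generated the data
    adv:        advantage estimates
    """
    # Off-policy correction from the behavior policy to the proximal policy.
    behav_to_prox = torch.exp(logp_prox - logp_behav).detach()
    # Standard PPO clipping, but centered on the proximal policy rather than
    # the (stale) behavior policy.
    ratio = torch.exp(logp_new - logp_prox)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -(behav_to_prox * torch.minimum(unclipped, clipped)).mean()

# Toy usage: random tensors stand in for per-token log-probabilities.
samples = [{"version": v} for v in (7, 9, 10)]
fresh = staleness_filter(samples, current_version=10, eta=2)   # drops the version-7 sample
T = 6
logp_new = torch.randn(T, requires_grad=True)
loss = decoupled_ppo_loss(logp_new, torch.randn(T), torch.randn(T), torch.randn(T))
loss.backward()
```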