Chengxing Xie, Zilin Zhu, Haoran Wang, Yao Wei, Zhenyu Hou
<aside> 🔥
GitHub: https://github.com/THUDM/slime
Zhihu Blog (in Chinese): https://zhuanlan.zhihu.com/p/1919107858110316886
Most existing RL training frameworks are designed for tasks involving pure reasoning and limited interaction, such as math or competitive programming. Extending these frameworks to train LLM agents often requires substantial engineering effort, as researchers must manually integrate agent-specific tools and logic into the training pipeline, as done in Search-R1 and ToRL. This raises a critical question: How can we design an RL training framework that supports training agents across a wide range of agent frameworks with minimal integration effort?
To address this, we introduce slime—an RL training framework designed to flexibly integrate with diverse agent frameworks and support fully asynchronous RL training out of the box.
Using slime, we demonstrate seamless integration with the OpenHands framework to train a coding agent on the SWE-Bench task. Furthermore, we show that slime’s fully asynchronous training achieves significant speed improvements over conventional synchronous RL approaches.
</aside>
The recent success of RL on reasoning tasks has sparked growing interest in extending RL to more complex agentic tasks. Frameworks like Search-R1 and ToRL attempt to train LLM agents on such tasks by manually embedding tools and logic into existing RL backends (e.g., VERL). However, for tasks in SWE-Bench or MLE-Bench, it is more practical to reuse dedicated agent frameworks that are already tailored to these complex tasks. This not only reduces redundant engineering effort but also leverages well-tested tools and environments.
Despite this appeal, integrating agent frameworks into RL pipelines introduces several non-trivial challenges.
To address these limitations, we present slime, a general-purpose RL training framework built to integrate seamlessly with diverse agent frameworks. It enables a fully asynchronous training paradigm, decoupling trajectory generation from model updates to improve scalability, resource efficiency, and training throughput.
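To make this decoupling concrete, here is a minimal sketch of the asynchronous pattern in plain Python. It is not slime’s actual API; the names (`trajectory_buffer`, `rollout_loop`, `training_loop`) are hypothetical, and it only illustrates how separating generation from updates lets slow, long-horizon agent rollouts overlap with optimizer steps instead of alternating with them in lockstep.

```python
# Conceptual sketch only (hypothetical names, not slime's actual API):
# trajectory generation and model updates run as independent loops that
# communicate solely through a bounded buffer.
import queue
import threading

trajectory_buffer = queue.Queue(maxsize=1024)  # the only coupling point


def rollout_loop(generate_trajectory, stop_event):
    """Keep producing trajectories; never wait for a training step to finish."""
    while not stop_event.is_set():
        traj = generate_trajectory()       # e.g., one full agent episode
        trajectory_buffer.put(traj)        # hand off and immediately continue


def training_loop(train_step, batch_size, num_steps):
    """Consume finished trajectories as they arrive and update the model."""
    for _ in range(num_steps):
        batch = [trajectory_buffer.get() for _ in range(batch_size)]
        train_step(batch)                  # one optimizer update
        # In a real system, updated weights would be pushed back to the
        # rollout engine here, e.g., every step or periodically.


# Example wiring: run both loops concurrently.
# stop = threading.Event()
# threading.Thread(target=rollout_loop, args=(my_generator, stop), daemon=True).start()
# training_loop(my_train_step, batch_size=32, num_steps=1000)
# stop.set()
```

In a fully synchronous setup, each optimizer step waits for the slowest rollout in the batch, which is especially costly for long-horizon agent tasks; keeping the two loops decoupled keeps both the rollout and training resources busy.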
Figure 1: Overall Design of the slime RL Training Framework.
To effectively integrate diverse agent frameworks into the RL training process, we design slime around three core components.
As illustrated in Figure 1, the system consists of three parts:
- **RL Training System**: manages model parameter updates through its Training and Rollout Engines, exposes an API endpoint for communication, and maintains a Training Buffer that stores only finalized, ready-to-train data.
- **External Agent Framework**: any framework capable of producing complete trajectories.
- **Rollout Buffer**: collects all trajectories generated by the agent framework(s), performs post-processing and filtering, and forwards valid data to the Training Buffer.
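The data flow between these components can be sketched as follows. This is an illustrative sketch rather than slime’s real interfaces: `Trajectory`, `RolloutBuffer`, and `TrainingBuffer` are hypothetical names, and the filtering rule is a placeholder.

```python
# Hypothetical sketch of the data flow described above; names are
# illustrative, not slime's actual API.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    finished: bool            # whether the agent episode terminated cleanly


@dataclass
class TrainingBuffer:
    """Stores only finalized, ready-to-train samples consumed by the training engine."""
    ready: list[Trajectory] = field(default_factory=list)

    def extend(self, trajs: list[Trajectory]) -> None:
        self.ready.extend(trajs)


@dataclass
class RolloutBuffer:
    """Collects raw trajectories from external agent frameworks."""
    pending: list[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Any agent framework that can produce a complete trajectory can call this.
        self.pending.append(traj)

    def flush(self, training_buffer: TrainingBuffer) -> None:
        # Post-process and filter, then forward only valid data for training.
        valid = [t for t in self.pending if t.finished and t.response]
        training_buffer.extend(valid)
        self.pending.clear()
```

Because the Rollout Buffer is the only point of contact with the outside world, any framework that can emit complete trajectories (OpenHands, in our SWE-Bench experiments) can in principle be plugged in without touching the training side.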
In the following sections, we detail the core designs of slime and explain how they address the key challenges discussed earlier.