Chengxing Xie, Zilin Zhu, Haoran Wang, Yao Wei, Zhenyu Hou
<aside> 🔥
GitHub: https://github.com/THUDM/slime
Zhihu Blog (in Chinese): https://zhuanlan.zhihu.com/p/1919107858110316886
Most existing RL training frameworks are designed for tasks involving pure reasoning and limited interaction, such as math or competitive programming. Extending these frameworks to train LLM agents often requires substantial engineering effort, as researchers must manually integrate agent-specific tools and logic into the training pipeline, as done in Search-R1 and ToRL. This raises a critical question: How can we design an RL training framework that supports training agents across a wide range of agent frameworks with minimal integration effort?
To address this, we introduce slime—an RL training framework designed to flexibly integrate with diverse agent frameworks and support fully asynchronous RL training out of the box.
Using slime, we demonstrate seamless integration with the OpenHands framework to train a coding agent on the SWE-Bench task. Furthermore, we show that slime’s fully asynchronous training achieves significant speed improvements over conventional synchronous RL approaches.
</aside>
The recent success of RL on reasoning tasks has sparked growing interest in extending RL to more complex agentic tasks. Frameworks like Search-R1 and ToRL attempt to train LLM agents on such tasks by manually embedding tools and logic into existing RL backends (e.g., VERL). However, for tasks in SWE-Bench or MLE-Bench, it is more practical to reuse dedicated agent frameworks that are already tailored to these complex tasks. This not only reduces redundant engineering effort but also leverages well-tested tools and environments.
Despite this appeal, integrating agent frameworks into RL pipelines introduces several non-trivial challenges.
To address these limitations, we present slime, a general-purpose RL training framework built to integrate seamlessly with diverse agent frameworks. It enables a fully asynchronous training paradigm, decoupling trajectory generation from model updates to improve scalability, resource efficiency, and training throughput.
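To make this decoupling concrete, here is a minimal sketch of the asynchronous pattern in plain Python. It is not slime’s actual API; the names (`trajectory_buffer`, `rollout_loop`, `training_loop`) are hypothetical, and it only illustrates how separating generation from updates lets slow, long-horizon agent rollouts overlap with optimizer steps instead of alternating with them in lockstep.

```python
# Conceptual sketch only (hypothetical names, not slime's actual API):
# trajectory generation and model updates run as independent loops that
# communicate solely through a bounded buffer.
import queue
import threading

trajectory_buffer = queue.Queue(maxsize=1024)  # the only coupling point


def rollout_loop(generate_trajectory, stop_event):
    """Keep producing trajectories; never wait for a training step to finish."""
    while not stop_event.is_set():
        traj = generate_trajectory()       # e.g., one full agent episode
        trajectory_buffer.put(traj)        # hand off and immediately continue


def training_loop(train_step, batch_size, num_steps):
    """Consume finished trajectories as they arrive and update the model."""
    for _ in range(num_steps):
        batch = [trajectory_buffer.get() for _ in range(batch_size)]
        train_step(batch)                  # one optimizer update
        # In a real system, updated weights would be pushed back to the
        # rollout engine here, e.g., every step or periodically.


# Example wiring: run both loops concurrently.
# stop = threading.Event()
# threading.Thread(target=rollout_loop, args=(my_generator, stop), daemon=True).start()
# training_loop(my_train_step, batch_size=32, num_steps=1000)
# stop.set()
```

In a fully synchronous setup, each optimizer step waits for the slowest rollout in the batch, which is especially costly for long-horizon agent tasks; keeping the two loops decoupled keeps both the rollout and training resources busy.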
Figure 1: Overall Design of the slime RL Training Framework.
To effectively integrate diverse agent frameworks into the RL training process, we design slime around three core components.
As illustrated in Figure 1, the system consists of three parts:
- **RL Training System**: manages model parameter updates through its Training and Rollout Engines, exposes an API endpoint for communication, and maintains a Training Buffer that stores only finalized, ready-to-train data.
- **External Agent Framework**: any framework capable of producing complete trajectories.
- **Rollout Buffer**: collects all trajectories generated by the agent framework(s), performs post-processing and filtering, and forwards valid data to the Training Buffer.
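The data flow between these components can be sketched as follows. This is an illustrative sketch rather than slime’s real interfaces: `Trajectory`, `RolloutBuffer`, and `TrainingBuffer` are hypothetical names, and the filtering rule is a placeholder.

```python
# Hypothetical sketch of the data flow described above; names are
# illustrative, not slime's actual API.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    finished: bool            # whether the agent episode terminated cleanly


@dataclass
class TrainingBuffer:
    """Stores only finalized, ready-to-train samples consumed by the training engine."""
    ready: list[Trajectory] = field(default_factory=list)

    def extend(self, trajs: list[Trajectory]) -> None:
        self.ready.extend(trajs)


@dataclass
class RolloutBuffer:
    """Collects raw trajectories from external agent frameworks."""
    pending: list[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Any agent framework that can produce a complete trajectory can call this.
        self.pending.append(traj)

    def flush(self, training_buffer: TrainingBuffer) -> None:
        # Post-process and filter, then forward only valid data for training.
        valid = [t for t in self.pending if t.finished and t.response]
        training_buffer.extend(valid)
        self.pending.clear()
```

Because the Rollout Buffer is the only point of contact with the outside world, any framework that can emit complete trajectories (OpenHands, in our SWE-Bench experiments) can in principle be plugged in without touching the training side.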
In the following sections, we detail the core designs of slime and explain how they address the key challenges discussed earlier.