verl ReTool recipe: improving LLM math ability with multi-turn rollout and a code sandbox
We have successfully replicated ReTool, a state-of-the-art reinforcement learning (RL) framework for tool-augmented reasoning in Large Language Models (LLMs), using the verl framework. This replication not only validates ReTool's strong performance on structured problem solving (e.g., mathematical reasoning), but also demonstrates verl's robustness as an RL training framework and exercises the new features verl developed for agentic RL scenarios.
Link to the ReTool paper: https://arxiv.org/pdf/2504.11536
ReTool's workflow is divided into two key phases:
Cold Start and Supervised Fine-Tuning (SFT)
The data-generation pipeline builds a high-quality dataset of code-augmented reasoning trajectories, and supervised fine-tuning teaches the model basic tool calls (e.g., code execution) and how to analyze their execution results. A sketch of what one such trajectory might look like is shown below.
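For illustration, a single cold-start SFT sample might look like the following: a trajectory interleaving natural-language reasoning, a code block, and the sandbox's execution feedback. The `<code>`/`<interpreter>` tags follow the trajectory format described in the ReTool paper; the dict field names are assumptions made for this sketch.

```python
# One cold-start SFT sample (illustrative sketch; field names are assumptions).
# The response interleaves natural-language reasoning, a <code> block, and the
# sandbox feedback wrapped in <interpreter> tags, per ReTool's trajectory format.
sft_sample = {
    "prompt": "What is the sum of all primes below 100?",
    "response": (
        "Let me compute this with code.\n"
        "<code>\n"
        "print(sum(n for n in range(2, 100)\n"
        "          if all(n % d for d in range(2, int(n**0.5) + 1))))\n"
        "</code>\n"
        "<interpreter>\n"
        "1060\n"
        "</interpreter>\n"
        "So the sum of all primes below 100 is 1060."
    ),
}
```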
Dynamic Interaction and Policy Optimization (RL)
Within verl's RL framework, the model dynamically inserts code blocks during inference and interacts with a sandbox environment in real time, producing hybrid trajectories that interleave natural-language reasoning with code snippets. When a code termination marker is detected, the code is sent to the sandbox for asynchronous execution, and the execution result (successful output or error message) is fed back to the model to guide subsequent reasoning. This "think-execute-feedback" cycle, combined with a reward based on final-answer accuracy, lets the model optimize its tool-calling strategy on its own, improving both reasoning efficiency and computational accuracy.
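A minimal sketch of this cycle, assuming an LLM client with a `generate` coroutine and a sandbox with an `execute` coroutine; those names, the stop marker, and the tags are assumptions for illustration, not verl's actual API.

```python
import re

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)

async def retool_rollout(llm, sandbox, prompt, max_turns=8):
    """Illustrative think-execute-feedback loop (names are assumptions)."""
    trajectory = prompt
    for _ in range(max_turns):
        # Generate until the code termination marker (or end of sequence).
        # Stop strings are typically excluded from the returned completion.
        completion = await llm.generate(trajectory, stop=["</code>"])
        trajectory += completion
        match = CODE_BLOCK.search(completion + "</code>")
        if match is None:
            break  # no code block: the model has produced its final answer
        # Execute asynchronously so the GPU is not blocked on the sandbox.
        result = await sandbox.execute(match.group(1))
        trajectory += f"</code>\n<interpreter>\n{result}\n</interpreter>\n"
    return trajectory
```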
Since the agent interacts with the environment through tool calls, the GPU would otherwise sit idle while waiting for tool-call results. To avoid this, we adopt a coroutine-based asynchronous mechanism that executes each sample's inference request asynchronously, improving training throughput. To support asynchronous rollout at the request level, the inference engine (server) and the agent (client) are architecturally separated into a server-based system:
*Server-based architecture diagram, from the verl documentation: https://verl.readthedocs.io/en/latest/advance/agent_loop.html*
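To make the coroutine-based mechanism concrete, the sketch below runs one rollout coroutine per sample with asyncio, reusing the illustrative `retool_rollout` above. While one sample awaits its sandbox result, the event loop keeps generating for the other samples, so the inference server stays busy.

```python
import asyncio

async def rollout_batch(llm, sandbox, prompts):
    # One coroutine per sample; awaiting a sandbox result for one sample
    # yields control so generation for the other samples can proceed.
    tasks = [retool_rollout(llm, sandbox, p) for p in prompts]
    return await asyncio.gather(*tasks)
```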
See the table below for the components of the verl server-based architecture and their roles; refer to the architecture diagram above for the workflow.
| Component | Role |
| --- | --- |
| AgentLoop | Client; implements the agent's logic |
| AsyncLLMServerManager | Inference gateway; provides the `generate` interface for AgentLoop |
| AsyncSglangServer / AsyncvLLMServer | Server; each instance is connected to one DP group of the inference engine |
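A hedged sketch of how these pieces fit together on the client side: the agent loop only talks to AsyncLLMServerManager's `generate` interface, and the manager routes each request to an AsyncSglangServer or AsyncvLLMServer instance. The class and method shapes here are assumptions based on the table above, not verl's exact signatures.

```python
class MathToolAgentLoop:
    """Client-side agent loop (illustrative sketch, not verl's exact API)."""

    def __init__(self, server_manager, sandbox):
        # AsyncLLMServerManager: the client only needs its generate() interface.
        self.server_manager = server_manager
        self.sandbox = sandbox

    async def run(self, prompt):
        # Delegate to the think-execute-feedback loop sketched earlier; each
        # generate() call is routed by the manager to a server instance that
        # fronts one DP group of the inference engine.
        return await retool_rollout(self.server_manager, self.sandbox, prompt)
```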