verl ReTool recipe: improving LLM math ability with multi-turn rollout and a code sandbox
We have successfully replicated ReTool, a state-of-the-art reinforcement learning (RL) framework for tool-augmented reasoning in Large Language Models (LLMs), using the verl framework. This replication not only validates ReTool's strong performance on structured problem solving (e.g., mathematical reasoning), but also demonstrates verl's robustness as an RL training framework and exercises the new features verl developed for agentic RL scenarios.
Link to the ReTool paper: https://arxiv.org/pdf/2504.11536
ReTool's workflow is divided into two key phases:
Cold Start and Supervised Fine-Tuning (SFT)
The data-generation pipeline builds a high-quality dataset of code-augmented reasoning trajectories, and supervised fine-tuning teaches the model basic tool calls (e.g., code execution) and how to analyze their execution results. A sketch of what one such trajectory might look like is shown below.
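For illustration, a single cold-start SFT sample might look like the following: a trajectory interleaving natural-language reasoning, a code block, and the sandbox's execution feedback. The `<code>`/`<interpreter>` tags follow the trajectory format described in the ReTool paper; the dict field names are assumptions made for this sketch.

```python
# One cold-start SFT sample (illustrative sketch; field names are assumptions).
# The response interleaves natural-language reasoning, a <code> block, and the
# sandbox feedback wrapped in <interpreter> tags, per ReTool's trajectory format.
sft_sample = {
    "prompt": "What is the sum of all primes below 100?",
    "response": (
        "Let me compute this with code.\n"
        "<code>\n"
        "print(sum(n for n in range(2, 100)\n"
        "          if all(n % d for d in range(2, int(n**0.5) + 1))))\n"
        "</code>\n"
        "<interpreter>\n"
        "1060\n"
        "</interpreter>\n"
        "So the sum of all primes below 100 is 1060."
    ),
}
```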
Dynamic Interaction and Policy Optimization (RL)
Within verl's RL framework, the model dynamically inserts code blocks during inference and interacts with a sandbox environment in real time, producing hybrid trajectories that interleave natural-language reasoning with code snippets. When a code termination marker is detected, the code is sent to the sandbox for asynchronous execution, and the execution result (successful output or error message) is fed back to the model to guide subsequent reasoning. This "think-execute-feedback" cycle, combined with a reward based on final-answer accuracy, lets the model optimize its tool-calling strategy on its own, improving both reasoning efficiency and computational accuracy.
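A minimal sketch of this cycle, assuming an LLM client with a `generate` coroutine and a sandbox with an `execute` coroutine; those names, the stop marker, and the tags are assumptions for illustration, not verl's actual API.

```python
import re

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)

async def retool_rollout(llm, sandbox, prompt, max_turns=8):
    """Illustrative think-execute-feedback loop (names are assumptions)."""
    trajectory = prompt
    for _ in range(max_turns):
        # Generate until the code termination marker (or end of sequence).
        # Stop strings are typically excluded from the returned completion.
        completion = await llm.generate(trajectory, stop=["</code>"])
        trajectory += completion
        match = CODE_BLOCK.search(completion + "</code>")
        if match is None:
            break  # no code block: the model has produced its final answer
        # Execute asynchronously so the GPU is not blocked on the sandbox.
        result = await sandbox.execute(match.group(1))
        trajectory += f"</code>\n<interpreter>\n{result}\n</interpreter>\n"
    return trajectory
```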
Since the agent interacts with the environment through tool calls, the GPU would otherwise sit idle while waiting for tool-call results. To avoid this, we adopt a coroutine-based asynchronous mechanism that executes each sample's inference request asynchronously, improving training throughput. To support asynchronous rollout at the request level, the inference engine (server) and the agent (client) are architecturally separated into a server-based system:
*Server-based architecture diagram, from the verl documentation: https://verl.readthedocs.io/en/latest/advance/agent_loop.html*
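To make the coroutine-based mechanism concrete, the sketch below runs one rollout coroutine per sample with asyncio, reusing the illustrative `retool_rollout` above. While one sample awaits its sandbox result, the event loop keeps generating for the other samples, so the inference server stays busy.

```python
import asyncio

async def rollout_batch(llm, sandbox, prompts):
    # One coroutine per sample; awaiting a sandbox result for one sample
    # yields control so generation for the other samples can proceed.
    tasks = [retool_rollout(llm, sandbox, p) for p in prompts]
    return await asyncio.gather(*tasks)
```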
See the table below for the components of the verl server-based architecture and their roles; refer to the architecture diagram above for the workflow.
| Component | Role |
| --- | --- |
| AgentLoop | Client; implements the agent's logic |
| AsyncLLMServerManager | Inference gateway; provides the `generate` interface for AgentLoop |
| AsyncSglangServer / AsyncvLLMServer | Server; each instance is connected to one DP group of the inference engine |
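A hedged sketch of how these pieces fit together on the client side: the agent loop only talks to AsyncLLMServerManager's `generate` interface, and the manager routes each request to an AsyncSglangServer or AsyncvLLMServer instance. The class and method shapes here are assumptions based on the table above, not verl's exact signatures.

```python
class MathToolAgentLoop:
    """Client-side agent loop (illustrative sketch, not verl's exact API)."""

    def __init__(self, server_manager, sandbox):
        # AsyncLLMServerManager: the client only needs its generate() interface.
        self.server_manager = server_manager
        self.sandbox = sandbox

    async def run(self, prompt):
        # Delegate to the think-execute-feedback loop sketched earlier; each
        # generate() call is routed by the manager to a server instance that
        # fronts one DP group of the inference engine.
        return await retool_rollout(self.server_manager, self.sandbox, prompt)
```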