Project Lead: Zehui Chen

Equal Core Contributor: Jinhao Jiang, Fangkai Jiao, Zehui Chen

Contributor: Xuesong Yao, Baoquan Zhong, Zhiheng Xi, Xiangsheng Li, Hailei Gong, Kang Liu, Zhengyin Du, Bingjie Wang, Yixuan Qin, Feier Zhang, Chao He

Supervision: Wayne Xin Zhao, Jiecao Chen, Yuxuan Wang

Affiliation: ByteDance Seed, Renmin University of China, NTU, Fudan University

Date: 2025.8.22

<aside> 💡

We introduce S1-Search, a 32B model that sets a new state-of-the-art among open-source models in its size class, achieving 25.9% on BrowseComp, 43.6% on BrowseComp-ZH, 65.2% on GAIA, and 71% on xbench-DeepSearch. This substantially narrows the performance gap with commercial deep-research systems.

Our key contribution lies in scaling RL for end-to-end deep search, along both the data and training dimensions. On the data side, we propose a difficulty-controllable and highly scalable synthesis pipeline that constructs multi-hop, fuzzified knowledge subgraphs to enforce robust evidence integration. On the training side, we extend RL exploration to 100 interaction steps and adopt a 128K context window, raising the average training context from 10K to 30K tokens and enabling reliable problem solving at lengths up to 90K. To further improve long-horizon RL training, we design a semi-asynchronous partial rollout mechanism to improve efficiency and apply strict reward shaping to suppress invalid actions and stabilize learning.

With these simple yet effective designs, our model not only surpasses previous counterparts by a large margin but also achieves performance comparable to commercial models. We expect this work to establish a stronger deep-research baseline at the 32B scale and to contribute to the advancement of research in long-horizon agent reasoning.

</aside>

[Figures: benchmark results on BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and GAIA]

Scaling of Deep Search

S1-Search is an agent trained end-to-end with RL that is skilled at using browsing tools to find and aggregate information for solving long-horizon, complex questions. It currently has two simple web-browsing tools: a search engine tool and an open-page tool (we do not introduce more tools for now and leave that for future work). In each turn, the model first outputs a chain of thought (CoT), followed by either a function call or the final answer. The function call is parsed and executed, and the result is returned to the model as a tool response. This design greatly simplifies the RL rollout implementation and enables prefix caching, which is especially useful in long-context scenarios.
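To make the interaction protocol concrete, the following is a minimal sketch of this rollout loop. The `<tool_call>` tag format, the `generate` callable, and the tool interfaces are illustrative assumptions, not the actual implementation.

```python
import json
import re
from typing import Callable, Optional

# Sketch of the per-turn loop: CoT + (tool call | final answer),
# then execute the call and append the tool response.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_function_call(text: str) -> Optional[dict]:
    """Return {"name": ..., "arguments": {...}} if the turn contains a call."""
    m = TOOL_CALL_RE.search(text)
    return json.loads(m.group(1)) if m else None

def rollout(generate: Callable[[list], str],
            tools: dict[str, Callable[..., str]],
            question: str,
            max_steps: int = 100) -> Optional[str]:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # The model emits a CoT followed by either a tool call or the answer.
        output = generate(messages)
        messages.append({"role": "assistant", "content": output})

        call = parse_function_call(output)
        if call is None:
            return output  # no tool call -> final answer, trajectory ends

        # Execute the tool (e.g. search / open_page) and append its result.
        # Because the context only ever grows, the inference engine can
        # reuse the KV cache of the shared prefix across turns.
        tool = tools.get(call["name"])
        result = tool(**call["arguments"]) if tool else f"Unknown tool: {call['name']}"
        messages.append({"role": "tool", "content": result})
    return None  # step budget exhausted without a final answer
```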

The model is trained primarily with the GRPO algorithm, implemented with verl. We adopt only an outcome reward as the final reward: a judgment produced by Qwen2.5-72B that compares the model's predicted answer against the reference answer. We do not introduce an additional penalty on the number of tool calls, which we find can harm the model's ability to double-check the correctness of sources across web pages.
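As a rough illustration, the outcome reward might look like the sketch below. The prompt wording and the `judge` callable are assumptions for illustration; we only state that Qwen2.5-72B judges the predicted answer against the reference answer.

```python
# Sketch of a binary outcome reward from an LLM judge (assumed interface).
JUDGE_TEMPLATE = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Predicted answer: {prediction}
Does the predicted answer match the reference answer? Reply "yes" or "no"."""

def outcome_reward(judge, question: str, prediction: str, reference: str) -> float:
    verdict = judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction))
    # Binary outcome reward: 1 for a judged match, 0 otherwise. There is
    # no penalty on the number of tool calls, so the policy remains free
    # to re-open pages and double-check its sources.
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```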

In earlier experiments and previous studies, the number of tool calls within a rollout was limited to 30, which was found empirically to be sufficient for most cases. However, we observed that models trained under this constraint, while performing well on simpler benchmarks such as Frames, failed to generalize to more challenging tasks such as BrowseComp.

This limitation can be attributed to two primary factors:

1. The training data is not hard enough: most questions can be solved within a few tool calls, so the model is never pushed to learn deep, long-horizon exploration.
2. The 30-step cap, together with a limited context window, truncates exactly the kind of extended search trajectories that challenging tasks such as BrowseComp require.

To address these challenges, we scale along both the data and training dimensions. On the data side, we curate training sets of more difficult problems (e.g., multi-hop reasoning tasks with anonymized and fuzzified entity information, inspired by BrowseComp) to ensure that the model must explore deeper before reaching the correct solution. On the training side, we increase the maximum number of RL interaction steps to 100 and allow the model to infer with a 128K context window. This significantly expands the effective training horizon: the average context length used during training rises from 10K to 30K tokens, and we observe that the model successfully solves problems even in 90K-token contexts, where earlier models failed.
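Concretely, the training-side change amounts to raising two rollout limits. The key names below are hypothetical placeholders for illustration, not actual verl configuration fields.

```python
# Illustrative rollout limits for the scaled setup (hypothetical key names).
ROLLOUT_LIMITS = {
    "max_tool_calls": 100,             # raised from the earlier cap of 30
    "max_context_tokens": 128 * 1024,  # 128K window; average training context
                                       # grows from ~10K to ~30K tokens
}
```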

1. Scaling of Data Synthesis

Data Synthesis