Shuyao Xu, Xinyu Zhu, Bryan Hooi, Yu Meng
<aside> ✨
Inference-time scaling has emerged as a powerful technique for enhancing Large Language Model (LLM) performance on complex tasks. A promising paradigm uses the LLM itself as a generative aggregator to select the best answer from multiple parallel candidates. However, this method faces limitations due to the LLM's constrained context window, which restricts the number of candidates it can evaluate simultaneously. To address this challenge, we introduce a novel approach that combines an LLM aggregator with a tournament-style selection process. This enables effective distillation of the best solution from an extensive candidate pool, significantly advancing our ability to solve the most challenging problems.
📝 This is a research preview. Additional results will be released soon in a preprint.
🌐 This page is also available at https://www.notion.so/Tournament-Test-Time-Scaling-to-Solve-the-Hardest-Problems-S-2a3622b08b6780709b2ddb37ad30e2ab
</aside>
Researchers have been exploring how to trade additional inference-time compute for improved LLM performance on difficult tasks, a paradigm known as test-time scaling.
One approach is sequential test-time scaling, where the model generates extended chain-of-thought reasoning before producing the final response. Prominent work in this direction includes DeepSeek-R1 [1], OpenAI o1 [2], and Qwen QwQ models [3].
Another paradigm is parallel test-time scaling, which typically involves generating multiple responses in parallel and then selecting the best response. There are generally three methods for selection:
For the hardest problems, such as those in Humanity's Last Exam (HLE) [8], where the majority answer is often incorrect, traditional approaches face significant limitations:
Our verdict: We need a method that enables LLM aggregators to work effectively with large numbers of candidates.
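To make the idea from the abstract concrete, here is a minimal sketch of how an LLM aggregator could be wrapped in a tournament-style (knockout) selection so that it never compares more candidates than fit in its context window. The `llm_pick_best` callable and `group_size` parameter are hypothetical stand-ins, not the authors' implementation; the only point of the sketch is the bracket structure.

```python
import random
from typing import Callable, List

def tournament_select(
    candidates: List[str],
    llm_pick_best: Callable[[List[str]], str],
    group_size: int = 8,
) -> str:
    """Select one answer from a large candidate pool via repeated
    small-group aggregation (a knockout tournament).

    `llm_pick_best` is assumed to take a small list of candidate
    solutions (at most `group_size`, so they fit in the context
    window) and return the one it judges best.
    """
    pool = list(candidates)
    random.shuffle(pool)  # avoid any systematic ordering bias
    while len(pool) > 1:
        winners = []
        # Split the current pool into context-window-sized groups and
        # let the LLM aggregator pick one winner per group.
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[0] if len(group) == 1 else llm_pick_best(group))
        pool = winners
    return pool[0]
```

With `group_size = 8`, for example, 512 candidates are reduced to a single answer in three rounds (512 → 64 → 8 → 1), so each aggregator call only ever compares eight solutions at a time.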
To validate this motivation, we conducted extensive experiments on the HLE Integer 100 dataset, a random subset of the full HLE dataset [8], using Qwen3-4B-Instruct-2507.

Figure 1: Comparison of Pass@k and Maj@k metrics on HLE Integer 100 dataset using Qwen3-4B-Instruct-2507. Pass@k measures the percentage of questions with at least one correct response among k samples, while Maj@k measures the accuracy when selecting the majority answer. The growing gap demonstrates that while correct answers exist in the candidate pool (Pass@k increases), majority voting fails to identify them (Maj@k remains flat). This validates the need for better aggregation methods beyond simple majority voting.
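To make the two metrics in Figure 1 concrete, below is a minimal sketch of how Pass@k and Maj@k can be computed from k sampled answers per question, assuming exact-match grading against a single reference answer. The function names and the `accuracy` helper are illustrative, not the authors' evaluation code.

```python
from collections import Counter
from typing import Callable, List, Sequence

def pass_at_k(samples: Sequence[str], reference: str) -> bool:
    """Pass@k: at least one of the k sampled answers matches the reference."""
    return any(s == reference for s in samples)

def maj_at_k(samples: Sequence[str], reference: str) -> bool:
    """Maj@k: the most frequent sampled answer matches the reference
    (ties broken arbitrarily by Counter's insertion order)."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == reference

def accuracy(
    per_question_samples: List[Sequence[str]],
    references: List[str],
    metric: Callable[[Sequence[str], str], bool],
) -> float:
    """Fraction of questions on which the chosen metric succeeds."""
    hits = [metric(s, r) for s, r in zip(per_question_samples, references)]
    return sum(hits) / len(hits)

# Usage: accuracy(samples, refs, pass_at_k) vs. accuracy(samples, refs, maj_at_k)
# gives the two curves compared in Figure 1.
```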