Shuyao Xu, Xinyu Zhu, Bryan Hooi, Yu Meng
<aside> ✨
Inference-time scaling has emerged as a powerful technique for enhancing Large Language Model (LLM) performance on complex tasks. A promising paradigm uses the LLM itself as a generative aggregator to select the best answer from multiple parallel candidates. However, this method faces limitations due to the LLM's constrained context window, which restricts the number of candidates it can evaluate simultaneously. To address this challenge, we introduce a novel approach that combines an LLM aggregator with a tournament-style selection process. This enables effective distillation of the best solution from an extensive candidate pool, significantly advancing our ability to solve the most challenging problems.
📝 This is a research preview. Additional results will be released soon in a preprint.
🌐 This page is also available at https://www.notion.so/Tournament-Test-Time-Scaling-to-Solve-the-Hardest-Problems-S-2a3622b08b6780709b2ddb37ad30e2ab
</aside>
Researchers have been exploring how to trade additional inference-time compute for improved LLM performance on difficult tasks, a paradigm known as test-time scaling.
One approach is sequential test-time scaling, where the model generates extended chain-of-thought reasoning before producing the final response. Prominent work in this direction includes DeepSeek-R1 [1], OpenAI o1 [2], and Qwen QwQ models [3].
Another paradigm is parallel test-time scaling, which typically involves generating multiple responses in parallel and then selecting the best response. There are generally three methods for selection:
For the hardest problems, such as those in Humanity's Last Exam (HLE) [8], where the majority answer is often incorrect, traditional approaches face significant limitations:
Our verdict: We need a method that enables LLM aggregators to work effectively with large numbers of candidates.
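To make the idea from the abstract concrete, here is a minimal sketch of how an LLM aggregator could be wrapped in a tournament-style (knockout) selection so that it never compares more candidates than fit in its context window. The `llm_pick_best` callable and `group_size` parameter are hypothetical stand-ins, not the authors' implementation; the only point of the sketch is the bracket structure.

```python
import random
from typing import Callable, List

def tournament_select(
    candidates: List[str],
    llm_pick_best: Callable[[List[str]], str],
    group_size: int = 8,
) -> str:
    """Select one answer from a large candidate pool via repeated
    small-group aggregation (a knockout tournament).

    `llm_pick_best` is assumed to take a small list of candidate
    solutions (at most `group_size`, so they fit in the context
    window) and return the one it judges best.
    """
    pool = list(candidates)
    random.shuffle(pool)  # avoid any systematic ordering bias
    while len(pool) > 1:
        winners = []
        # Split the current pool into context-window-sized groups and
        # let the LLM aggregator pick one winner per group.
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[0] if len(group) == 1 else llm_pick_best(group))
        pool = winners
    return pool[0]
```

With `group_size = 8`, for example, 512 candidates are reduced to a single answer in three rounds (512 → 64 → 8 → 1), so each aggregator call only ever compares eight solutions at a time.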
To validate this motivation, we conducted extensive experiments on the HLE Integer 100 dataset, a random subset of the full HLE dataset [8], using Qwen3-4B-Instruct-2507.

Figure 1: Comparison of Pass@k and Maj@k metrics on HLE Integer 100 dataset using Qwen3-4B-Instruct-2507. Pass@k measures the percentage of questions with at least one correct response among k samples, while Maj@k measures the accuracy when selecting the majority answer. The growing gap demonstrates that while correct answers exist in the candidate pool (Pass@k increases), majority voting fails to identify them (Maj@k remains flat). This validates the need for better aggregation methods beyond simple majority voting.
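To make the two metrics in Figure 1 concrete, below is a minimal sketch of how Pass@k and Maj@k can be computed from k sampled answers per question, assuming exact-match grading against a single reference answer. The function names and the `accuracy` helper are illustrative, not the authors' evaluation code.

```python
from collections import Counter
from typing import Callable, List, Sequence

def pass_at_k(samples: Sequence[str], reference: str) -> bool:
    """Pass@k: at least one of the k sampled answers matches the reference."""
    return any(s == reference for s in samples)

def maj_at_k(samples: Sequence[str], reference: str) -> bool:
    """Maj@k: the most frequent sampled answer matches the reference
    (ties broken arbitrarily by Counter's insertion order)."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == reference

def accuracy(
    per_question_samples: List[Sequence[str]],
    references: List[str],
    metric: Callable[[Sequence[str], str], bool],
) -> float:
    """Fraction of questions on which the chosen metric succeeds."""
    hits = [metric(s, r) for s, r in zip(per_question_samples, references)]
    return sum(hits) / len(hits)

# Usage: accuracy(samples, refs, pass_at_k) vs. accuracy(samples, refs, maj_at_k)
# gives the two curves compared in Figure 1.
```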