RouterArena: Building the Evaluation Foundation for LLM Routing

Yifan Lu Rixin Liu Jiayi Yuan* Xingqi Cui Shenrun Zhang Hongyi Liu Jiarong Xing**

*: Equal Contributions, Rice University

📄 Arxiv: paper | 🔗 GitHub: RouterArena | 🤗 Huggingface: Dataset


<aside>

TL;DR

As LLMs continue to diversify, model routers become increasingly critical for connecting various models and achieving the best performance-cost trade-off. RouterArena is the first open platform and leaderboard for rigorous and comprehensive router evaluation, supporting both open-source and commercial routers. It provides datasets constructed in a principled way, extensive evaluation metrics, an automated evaluation framework, and a live leaderboard. We aim for RouterArena to serve as a foundation for the community to evaluate, understand, and advance the next generation of routing systems.

</aside>

The Diversifying Landscape of LLMs


Figure from paper: Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review

For years, our community has pursued the goal of building a single, general-purpose foundation model capable of handling all questions and tasks, and this effort has achieved remarkable success. As scaling laws took hold, these models rapidly expanded to trillions of parameters, and they now surpass human performance on a wide range of benchmarks.

However, it is becoming increasingly clear that this scaling trend may not be sustainable. First, we are hitting the data wall: high-quality training data is running out, as highlighted by Ilya Sutskever, OpenAI’s co-founder, at NeurIPS 2024. Future scaling will depend on generating new, high-quality data, which often requires costly human labeling and curation. Second, on the compute side, scaling up isn’t just about throwing more money at GPUs; larger clusters also make it much harder to keep these massive distributed systems stable, efficient, and reliable.

As a result, the LLM landscape is diversifying. In addition to chasing large general-purpose models, people are also exploring smaller, more efficient, and specialized ones. A good example is the Qwen family, which now includes nine categories, such as Qwen3-Coder, Qwen3-Image, and Qwen3-Guard. These models range from 0.6B to 480B parameters, and many specialized variants stay under 30B, which lets them handle relatively simple questions more efficiently.

This shift is further accelerated by startups and open-source initiatives embracing model specialization and customization. For instance, Thinking Machines Lab is building personalized AI systems, while rLLM provides an open framework for training domain-specific or user-tailored agents. Together, these efforts mark a clear transition from a “one-model-for-all” paradigm to a diverse ecosystem of LLMs, ranging from massive generalists to compact specialists.

The Key is Model Routers