Team: Yujie Zhao, Lanxiang Hu, Yang Wang, Zhijing Wu, Junbo Huang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao
Affiliations: University of California, San Diego; Intel Corporation
Paper link: https://arxiv.org/pdf/2510.11062
(Est. 3–5 minute read)
From Claude Code–style “team coding assistants” to mixture-of-agents and debate frameworks, orchestration is moving from demos to default practice. Meanwhile, OpenAI has made agents first-class citizens, launching AgentKit and framing DevDay around building software that runs and cooperates inside the chat itself. Despite this momentum, the predominant prompt-only orchestration of these agents often plateaus in performance, remaining vulnerable to brittleness and inter-agent misalignment.
At the same time, reinforcement learning (RL) is becoming the most popular method for training LLM agents. To understand this potential, it is worth examining precisely how adding on-policy RL addresses the core weaknesses of prompt-only multi-agent systems (MAS) and unlocks significant collaborative gains.
Observation from prior studies: MAS yields only limited gains when two failure modes dominate: