Team: Yujie Zhao, Lanxiang Hu, Yang Wang, Zhijing Wu, Junbo Huang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Affiliations: University of California, San Diego; Intel Corporation

Paper link: https://arxiv.org/pdf/2510.11062

(Est. 3–5 minutes read)


TL;DR


Multi-agent systems are having a moment, but they remain limited.

From Claude Code–style "team coding assistants" to mixture-of-agents and debate frameworks, orchestration is moving from demos to default practice. Meanwhile, OpenAI has made agents first-class citizens, launching AgentKit and framing DevDay around building software that runs and cooperates inside the chat itself. Despite this momentum, the predominantly prompt-only orchestration of these agents often plateaus, leaving systems vulnerable to brittleness and inter-agent misalignment.

Why does MAS + RL deliver gains?

On the other hand, RL is becoming the most popular method for agentic LLM training. To understand its potential, it is worth examining precisely how adding on-policy RL addresses the core weaknesses of prompt-only MAS and unlocks these significant collaborative gains.
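To make "adding on-policy RL" concrete, here is a minimal, self-contained sketch of the general idea: sample a joint rollout from the current agent policies, score it with a shared reward, and push up the log-probabilities of that joint behavior (REINFORCE-style). This is an illustration only, not the paper's algorithm; the toy two-agent coordination task, the `TinyPolicy` class, `shared_reward`, and `TARGET` are all assumptions made for the example.

```python
# Toy illustration (assumed, not the paper's method): on-policy REINFORCE over a
# two-agent coordination task with a single shared episode reward.
import torch
import torch.nn as nn

torch.manual_seed(0)

N_ACTIONS = 4        # each agent picks one of 4 discrete "messages" (toy assumption)
TARGET = (1, 2)      # the joint action the shared reward prefers (toy assumption)

class TinyPolicy(nn.Module):
    """One agent's policy: learnable logits over a fixed, observation-free action space."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(N_ACTIONS))
    def forward(self):
        return torch.distributions.Categorical(logits=self.logits)

agents = [TinyPolicy(), TinyPolicy()]
opt = torch.optim.Adam([p for a in agents for p in a.parameters()], lr=0.1)

def shared_reward(actions):
    # Both agents are rewarded only when they coordinate on the target joint action.
    return 1.0 if tuple(actions) == TARGET else 0.0

for step in range(200):
    # On-policy rollout: sample the *current* policies and record their log-probs.
    dists = [a() for a in agents]
    actions = [d.sample() for d in dists]
    logps = torch.stack([d.log_prob(x) for d, x in zip(dists, actions)])
    R = shared_reward([a.item() for a in actions])

    # REINFORCE: reinforce the joint trajectory in proportion to the shared reward.
    loss = -(logps.sum() * R)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The two policies should drift toward the coordinated joint action.
print("learned joint action:", [int(torch.argmax(a.logits)) for a in agents])
```

The point of the sketch is the training signal, not the model: because the update is on-policy and the reward is shared, both agents are optimized against the behavior they actually exhibit together, which is exactly the kind of feedback prompt-only orchestration never receives.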

Why does “prompt-only MAS” often stall — and how does RL fix it?

Observation from prior studies: MAS often yields only limited gains when two failure modes dominate: