Team: Yujie Zhao, Lanxiang Hu, Yang Wang, Zhijing Wu, Junbo Huang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao
Affiliations: University of California, San Diego; Intel Corporation
Paper link: https://arxiv.org/pdf/2510.11062
(Est. 3–5 minute read)
From Claude Code–style “team coding assistants” to mixture-of-agents and debate frameworks, orchestration is moving from demos to default practice. Meanwhile, OpenAI has made agents first-class citizens, launching AgentKit and framing DevDay around building software that runs and cooperates inside the chat itself. Despite this momentum, the predominant prompt-only orchestration of these agents often plateaus in performance, remaining vulnerable to brittleness and inter-agent misalignment.
At the same time, reinforcement learning (RL) is becoming the most popular method for training LLM agents. To understand this potential, it is worth examining precisely how adding on-policy RL addresses the core weaknesses of prompt-only multi-agent systems (MAS) and unlocks significant collaborative gains.
Observation from prior studies: MAS yields only limited gains when two failure modes dominate: