Author: Junnan Li

GPT-OSS-120B is a model with unique characteristics that set it apart from its contemporaries. When paired with a standard agent framework such as ReAct, it often fails miserably. However, when its native function-calling API is driven by black-box agent scaffolding, it exhibits powerful capabilities, achieving the top open-source result on the MCP Universe benchmark with a success rate of 25.54%.

This article presents an open agent scaffolding design that can fully realize the potential of GPT-OSS. We demonstrate that a simple yet specialized agent, which we call HarmonyReAct, improves the success rate of GPT-OSS-120B to 31.17%, on par with Claude-4.0-Sonnet! Paired with the same HarmonyReAct agent, even the smaller GPT-OSS-20B achieves 24.24% on MCP Universe, outperforming much larger models such as Qwen3-Coder-480B and DeepSeek-V3.1.


The Harmony Format: A Model-Agent Co-Design Philosophy

Central to understanding and effectively utilizing GPT-OSS is the Harmony Format. It is not merely a prompt template but a philosophy of model-agent co-design, in which the ReAct (Reason and Act) pattern is baked into the model during training. The core components of ReAct (reason, act, observation, and answer) are mapped to three distinct channels within the Harmony Format:

  1. analysis: the model's internal reasoning, corresponding to the reason step;
  2. commentary: tool calls and their returned observations, corresponding to the act and observation steps;
  3. final: the user-facing answer.

Similar to how a ReAct agent chains multiple steps, the Harmony Format structures interaction by switching between these channels. A typical sequence is as follows:

<|start|>assistant<|channel|>analysis<|message|>
{reasoning}
<|end|>
<|start|>assistant<|channel|>commentary to=functions.{tool_name} <|constrain|>json<|message|>
{tool_argument}
<|call|>
<|start|>functions.{tool_name} to=assistant<|channel|>commentary<|message|>
{tool_result}
<|end|>
<|start|>assistant<|channel|>analysis<|message|>
{reasoning}...
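
To make this concrete, the two helper functions below render one reason, act, observe cycle as raw Harmony text. They are a minimal sketch written for this article (the get_weather tool and its arguments are invented for illustration), not the official Harmony renderer, whose whitespace and token handling may differ:

import json

def render_tool_call(reasoning: str, tool_name: str, arguments: dict) -> str:
    # One reasoning turn on the analysis channel, followed by a JSON-constrained
    # tool call on the commentary channel addressed to functions.{tool_name}.
    return (
        f"<|start|>assistant<|channel|>analysis<|message|>{reasoning}<|end|>"
        f"<|start|>assistant<|channel|>commentary to=functions.{tool_name} "
        f"<|constrain|>json<|message|>{json.dumps(arguments)}<|call|>"
    )

def render_tool_result(tool_name: str, result: dict) -> str:
    # The tool's observation is sent back to the assistant on the commentary channel.
    return (
        f"<|start|>functions.{tool_name} to=assistant<|channel|>commentary"
        f"<|message|>{json.dumps(result)}<|end|>"
    )

# Example: one full cycle as it would appear in the model's context.
turn = render_tool_call(
    "The user asked for the weather; I should call the weather tool.",
    "get_weather",
    {"city": "Singapore"},
)
turn += render_tool_result("get_weather", {"temp_c": 31, "condition": "cloudy"})
print(turn)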

By training the model with this structure, GPT-OSS demonstrates strong intrinsic agentic capabilities, including planning, task decomposition, and self-verification. Despite these strengths, we identified two primary challenges that hinder its performance, for which we have developed effective solutions.


Challenge 1: Output Format Adherence

The first challenge is that the model often fails to adhere strictly to the Harmony Format in its outputs. These deviations fall into two categories:

  1. Syntactic Errors: The model may omit or malform special tokens (e.g., <|channel|> or <|start|>), resulting in improperly formatted channel identifiers such as assistantcommentary. These errors can largely be resolved by implementing a more robust parser (a minimal sketch of such a parser follows this list).
  2. Semantic Errors: These are more difficult to address with rule-based methods. The model may misplace content in an incorrect channel (e.g., placing the final answer within the analysis channel) or fail to generate required content entirely (e.g., omitting a tool call despite explicit instruction).
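
To illustrate what a more robust parser can look like, the sketch below tolerates the syntactic deviations from point 1: every special token in a block header is treated as optional, so a malformed header such as assistantcommentary still resolves to the assistant role and the commentary channel. This is an illustrative sketch written for this article, not the exact parser used in our agent:

import re

# Tolerant block-header pattern: <|start|>, <|channel|>, <|constrain|>, and <|message|>
# are all optional, and the "to=..." recipient may appear before or after the channel name.
HEADER = re.compile(
    r"(?:<\|start\|>)?\s*"
    r"(?P<role>assistant|functions\.[\w.-]+)"    # sender: the model or a tool
    r"(?:\s+to=[\w.-]+)?"                        # optional recipient, e.g. "to=assistant"
    r"(?:\s*<\|channel\|>\s*)?"                  # channel token may be missing entirely
    r"(?P<channel>analysis|commentary|final)"
    r"(?:\s+to=[\w.-]+)?"                        # recipient may also follow the channel name
    r"\s*(?:<\|constrain\|>\w+)?\s*(?:<\|message\|>)?"
)

def parse_blocks(raw: str):
    # Split a raw completion into (role, channel, content) blocks,
    # tolerating missing or garbled special tokens in the headers.
    blocks = []
    headers = list(HEADER.finditer(raw))
    for i, match in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(raw)
        content = raw[match.end():end].strip()
        # Strip whichever terminator the model happened to emit, if any.
        content = re.sub(r"<\|(end|call|return)\|>\s*$", "", content).strip()
        blocks.append((match.group("role"), match.group("channel"), content))
    return blocks

# Both the well-formed and the token-dropping variants parse to the same block:
print(parse_blocks("<|start|>assistant<|channel|>analysis<|message|>Check the docs first.<|end|>"))
print(parse_blocks("assistantanalysis Check the docs first."))

A production parser would additionally guard against role and channel keywords appearing inside message content, but the permissive header pattern is the key idea.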

To address these errors, we implemented the following methodology: