Author: Junnan Li
GPT-OSS-120B is a model with unique characteristics that set it apart from its contemporaries. When paired with a standard agent framework such as ReAct, it often fails miserably. However, when its native function-calling API is leveraged through black-box agent scaffolding, it exhibits powerful capabilities, achieving the top open-source result on the MCP Universe benchmark with a success rate of 25.54%.
This article presents an open agent-scaffolding design that can fully realize the potential of GPT-OSS. We demonstrate that a simple yet specialized agent improves the success rate of GPT-OSS-120B to 31.17%, on par with Claude-4.0-Sonnet! Paired with this HarmonyReAct agent, even the smaller GPT-OSS-20B achieves 24.24% on MCP Universe, outperforming much larger models such as Qwen3-Coder-480B and DeepSeek-V3.1.
Central to understanding and effectively utilizing GPT-OSS is the Harmony Format. This is not merely a template but a philosophy of model-agent co-design, where the ReAct (Reason and Act) pattern is integrated into the model during its training phase. The core components of ReAct (reason, act, observation, and answer) are mapped to three distinct channels within the Harmony Format:

- `analysis`: corresponds to the model's reasoning and planning process (*reason*).
- `commentary`: facilitates tool use and the ingestion of tool outputs (*act* and *observation*).
- `final`: used for delivering the concluding answer (*answer*).

Similar to how a ReAct agent chains multiple steps, the Harmony Format structures interaction by switching between these channels. A typical sequence is as follows:
```
<|start|>assistant<|channel|>analysis<|message|>
{Reasoning}
<|end|>
<|start|>assistant<|channel|>commentary to=functions.{tool_name} <|constrain|>json<|message|>
{tool_argument}
<|call|>
<|start|>functions.{tool_name} to=assistant<|channel|>commentary<|message|>
{tool_result}
<|end|>
<|start|>assistant<|channel|>analysis<|message|>
{Reasoning}...
```
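To make the channel switching concrete, here is a minimal rendering sketch. It is illustrative only, built from the token sequence above rather than the official `openai-harmony` renderer; the function names and the example tool are our own assumptions.

```python
import json

# Special tokens taken from the Harmony sequence above.
START, END, CALL = "<|start|>", "<|end|>", "<|call|>"

def render_analysis(reasoning: str) -> str:
    # Reasoning is emitted on the `analysis` channel.
    return f"{START}assistant<|channel|>analysis<|message|>{reasoning}{END}"

def render_tool_call(tool_name: str, arguments: dict) -> str:
    # Tool calls go out on the `commentary` channel, addressed to the tool.
    return (f"{START}assistant<|channel|>commentary to=functions.{tool_name} "
            f"<|constrain|>json<|message|>{json.dumps(arguments)}{CALL}")

def render_tool_result(tool_name: str, result: str) -> str:
    # Tool outputs flow back on `commentary`, addressed to the assistant.
    return (f"{START}functions.{tool_name} to=assistant"
            f"<|channel|>commentary<|message|>{result}{END}")

# One reason -> act -> observe step, mirroring the sequence above:
turn = (render_analysis("I should look up the weather first.")
        + render_tool_call("get_weather", {"city": "Singapore"})
        + render_tool_result("get_weather", '{"temp_c": 31}'))
```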
Because the model is trained with this structure, GPT-OSS demonstrates strong intrinsic agentic capabilities, including planning, task decomposition, and self-verification. Despite these strengths, we identified two primary challenges that hinder its performance, for which we have developed effective solutions.
A major challenge is that the model often fails to adhere strictly to the Harmony Format in its outputs. These deviations can be categorized into two types:

1. Token-level errors: the model drops special tokens (e.g., `<|channel|>` or `<|start|>`), resulting in improperly formatted channel identifiers such as `assistantcommentary`. These errors can largely be resolved by implementing a more robust parser, sketched after this list.
2. Structural errors: the model places content in the wrong channel (e.g., in the `analysis` channel) or fails to generate required content entirely (e.g., omitting a tool call despite explicit instruction).
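As an illustration, a tolerant parser along these lines can recover the channel even when delimiter tokens are dropped. This is our own sketch, not the scaffolding's actual implementation; the regex and function name are assumptions:

```python
import re

# Accept both well-formed headers ("assistant<|channel|>commentary") and
# degenerate ones where special tokens were dropped ("assistantcommentary").
CHANNEL_RE = re.compile(
    r"(?:<\|start\|>)?assistant(?:<\|channel\|>)?\s*(analysis|commentary|final)"
)

def parse_channel(raw: str) -> str | None:
    """Return the first channel name found, tolerating missing special tokens."""
    match = CHANNEL_RE.search(raw)
    return match.group(1) if match else None

assert parse_channel("<|start|>assistant<|channel|>analysis<|message|>...") == "analysis"
assert parse_channel("assistantcommentary to=functions.search ...") == "commentary"
assert parse_channel("garbage") is None
```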
To address these errors, we implemented the following methodology:

**Enforce the Analysis Channel:** We force the model to begin its reasoning process by appending `<|start|>assistant<|channel|>analysis` to the prompt at each turn. This primes the model to output its reasoning first, which improves the structural integrity of its subsequent response, as in the sketch below.
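A minimal sketch of this priming step, assuming a generic raw text-completion callable (`complete` is a placeholder, not a specific API):

```python
# Header from the article: forces the next turn to start in the analysis channel.
ANALYSIS_HEADER = "<|start|>assistant<|channel|>analysis"

def generate_turn(prompt: str, complete) -> str:
    """Prime the model with the analysis header; `complete` is any
    raw-completion callable returning the model's continuation."""
    completion = complete(prompt + ANALYSIS_HEADER)
    # Re-attach the header so downstream parsing sees a well-formed turn.
    return ANALYSIS_HEADER + completion
```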
**Implement a Self-Correction Mechanism:** If the agent's parser fails to identify either a tool call or a final answer in the model's response, we append a corrective instruction to the `analysis` channel for the next turn:
"Cannot find <|channel|>commentary or <|channel|>final. In your next step, be careful with the channel format."
This same feedback loop is applied to any tool-parsing errors. This mechanism allows the model to recognize its own formatting error and self-correct in the subsequent step.
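Combining the pieces above, the feedback loop might look like the following sketch; `CHANNEL_RE` and `generate_turn` come from the earlier snippets, and the retry budget is an assumption, not the agent's actual setting:

```python
CORRECTION = ("Cannot find <|channel|>commentary or <|channel|>final. "
              "In your next step, be careful with the channel format.")

def run_step(prompt: str, complete, max_retries: int = 2) -> str:
    """One agent step that retries with corrective feedback on format errors."""
    response = ""
    for _ in range(max_retries + 1):
        response = generate_turn(prompt, complete)
        channels = set(CHANNEL_RE.findall(response))
        if channels & {"commentary", "final"}:
            return response  # a tool call or final answer was found
        # Feed the formatting error back on the analysis channel.
        prompt += ("<|start|>assistant<|channel|>analysis<|message|>"
                   + CORRECTION + "<|end|>")
    return response  # still malformed after retries; caller handles it
```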