**Yuqian Fu†, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu†, Dongbin Zhao**

\*Co-first Author | †Project Lead | Published on Mar. 27, 2026. (Work in Progress)

Code: https://github.com/hhh675597/revisiting_opd

Paper: https://arxiv.org/abs/2603.25562


<aside>

TL;DR


On-policy distillation (OPD) has become an increasingly common component in post-training pipelines for reasoning and agentic language models. Recent public reports from Thinking Machines Lab, Qwen3, MiMo-V2-Flash, and GLM-5 suggest a shared shift toward supervision on model-generated trajectories, or closely related on-policy distillation variants, as a complement to both off-policy distillation and reinforcement learning [3][4][5][6]. This trend is easy to understand from a systems perspective: once the student is expected to reason or act on its own rollouts, the training signal has to remain informative under the prefix distribution induced by the student, not only under teacher trajectories. This raises a basic implementation question: what objective is OPD actually optimizing, and what changes when the sequence-level reverse KL is replaced by a token-level approximation?

1. Token-level vs sequence-level OPD

We first recall the objective behind OPD. For a prompt $x$, the sequence-level reverse-KL objective is

$$ J_{\text{OPD}}(\theta) = \mathbb{E}_{x\sim D}\left[ D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x)\,\|\,q(\cdot \mid x)\right) \right], $$

where $\pi_\theta$ and $q$ are the student and teacher models, respectively. Using the score-function identity, its gradient can be written as
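As a concrete sanity check, the reverse-KL objective can be estimated by Monte Carlo under the student's own samples. The sketch below uses a toy single-step categorical student and teacher (the distributions and sample count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-step case: student pi and teacher q are categoricals
# over a tiny vocabulary (values here are illustrative only).
pi = np.array([0.7, 0.2, 0.1])   # student pi_theta
q  = np.array([0.4, 0.4, 0.2])   # teacher q

# Exact reverse KL: sum_y pi(y) * log(pi(y) / q(y)).
kl_exact = float(np.sum(pi * np.log(pi / q)))

# Monte Carlo estimate under the *student's* samples, as in OPD:
# E_{y ~ pi}[log pi(y) - log q(y)].
samples = rng.choice(len(pi), size=200_000, p=pi)
kl_mc = float(np.mean(np.log(pi[samples]) - np.log(q[samples])))

print(kl_exact, kl_mc)  # the two agree up to sampling noise
```

The key point for what follows is that the expectation is taken under $\pi_\theta$, so the estimator only ever sees student-generated samples.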

$$ \nabla_\theta J_{\text{OPD}}(\theta) = \mathbb{E}_{x,\, y\sim \pi_\theta(\cdot \mid x)}\left[ \big(\log \pi_\theta(y \mid x)-\log q(y \mid x)\big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]. $$

For each decoding step $t$, let $c_t = (x, y_{<t})$ denote the current context, $g_t = \nabla_\theta \log \pi_\theta(y_t \mid c_t)$ the score-function gradient on token $y_t$, and

$$ r_t = \log \frac{\pi_\theta(y_t \mid c_t)}{q(y_t \mid c_t)}. $$

Using the autoregressive factorization

$$ \log \pi_\theta(y \mid x) - \log q(y \mid x) = \sum_{t'=1}^{T} r_{t'}, \qquad \nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} g_t, $$

we obtain a sequence-level estimator

$$ \hat g_{\text{seq}} = \sum_{t=1}^{T} \left(\sum_{t'=1}^{T} r_{t'}\right) g_t. $$

For $t' < t$, we have $\mathbb{E}[r_{t'} g_t] = 0$ because $r_{t'}$ depends only on the prefix before step $t$, while