**Yuqian Fu†, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu†, Dongbin Zhao**

\*Co-first Author | †Project Lead | Published on Mar. 27, 2026. (Work in Progress)

Code: https://github.com/hhh675597/revisiting_opd

Paper: https://arxiv.org/abs/2603.25562


<aside>

TL;DR


On-policy distillation (OPD) has become an increasingly common component in post-training pipelines for reasoning and agentic language models. Recent public reports from Thinking Machines Lab, Qwen3, MiMo-V2-Flash, and GLM-5 suggest a shared shift toward supervision on model-generated trajectories, or closely related on-policy distillation variants, as a complement to both off-policy distillation and reinforcement learning [3][4][5][6]. This trend is easy to understand from a systems perspective: once the student is expected to reason or act on its own rollouts, the training signal has to remain informative under the prefix distribution induced by the student, not only under teacher trajectories. This raises a basic implementation question: what objective is OPD actually optimizing, and what changes when the sequence-level reverse KL is replaced by a token-level approximation?

1. Token-level vs sequence-level OPD

We first recall the objective behind OPD. For a prompt $x$, the sequence-level reverse-KL objective is

$$ J_{\text{OPD}}(\theta) = \mathbb{E}_{x\sim D}\left[ D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x)\,\|\,q(\cdot \mid x)\right) \right], $$

where $\pi_\theta$ and $q$ are the student and teacher models, respectively. Using the score-function identity, its gradient can be written as
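As a concrete sanity check, the reverse-KL objective can be estimated by Monte Carlo under the student's own samples. The sketch below uses a toy single-step categorical student and teacher (the distributions and sample count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-step case: student pi and teacher q are categoricals
# over a tiny vocabulary (values here are illustrative only).
pi = np.array([0.7, 0.2, 0.1])   # student pi_theta
q  = np.array([0.4, 0.4, 0.2])   # teacher q

# Exact reverse KL: sum_y pi(y) * log(pi(y) / q(y)).
kl_exact = float(np.sum(pi * np.log(pi / q)))

# Monte Carlo estimate under the *student's* samples, as in OPD:
# E_{y ~ pi}[log pi(y) - log q(y)].
samples = rng.choice(len(pi), size=200_000, p=pi)
kl_mc = float(np.mean(np.log(pi[samples]) - np.log(q[samples])))

print(kl_exact, kl_mc)  # the two agree up to sampling noise
```

The key point for what follows is that the expectation is taken under $\pi_\theta$, so the estimator only ever sees student-generated samples.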

$$ \nabla_\theta J_{\text{OPD}}(\theta) = \mathbb{E}_{x,\, y\sim \pi_\theta(\cdot \mid x)}\left[ \big(\log \pi_\theta(y \mid x)-\log q(y \mid x)\big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]. $$

For each decoding step $t$, let $c_t = (x, y_{<t})$ denote the current context, $g_t = \nabla_\theta \log \pi_\theta(y_t \mid c_t)$ the score-function gradient on token $y_t$, and

$$ r_t = \log \frac{\pi_\theta(y_t \mid c_t)}{q(y_t \mid c_t)}. $$

Using the autoregressive factorization

$$ \log \pi_\theta(y \mid x) - \log q(y \mid x) = \sum_{t'=1}^{T} r_{t'}, \qquad \nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} g_t, $$

we obtain a sequence-level estimator

$$ \hat g_{\text{seq}} = \sum_{t=1}^{T} \left(\sum_{t'=1}^{T} r_{t'}\right) g_t. $$

For $t' < t$, we have $\mathbb{E}[r_{t'} g_t] = 0$ because $r_{t'}$ depends only on the prefix before step $t$, while