Authors: Jiacai Liu Yu Shen Zhuo Jiang $^\dagger$ Yuqian Fu**
***Co-First Authors. $^\dagger$ Independent Researcher.
**Work done at ByteDance Seed. First published on February 15, 2026.
<aside> 💡
In this blog post, we revisit the theory of on-policy distillation (OPD) and present several practical issues that arise in agent training, along with their solutions, when implementing OPD for model merging. Specifically:
After achieving stable, long-term OPD training, we brought the student model up to the teacher model's performance, reaching 100% parity on both the training and test datasets. Section 3 discusses further insights from our OPD experiments.
</aside>
As large language models grow in size and computational cost, efficient training methods that produce capable smaller models, or merge multiple expert models into one, have become increasingly important. On-policy distillation (OPD) is a post-training approach that combines on-policy training with dense reward signals, addressing fundamental limitations of both traditional knowledge distillation and reinforcement learning for language models.
We begin with the optimization objective of OPD. Suppose the training prompts are sampled from a distribution $\mathcal{D}$, and for each prompt $x$ we have access to a corresponding teacher model $\pi_{T(x)}$. OPD then minimizes the reverse KL divergence between the current parameterized model $\pi_\theta$ (the student model) and the teacher model, i.e.,
$$ \begin{align}\text{OPD\,:}\quad \underset{\theta}{\min}\,\,\mathcal{L} \left(\theta\right) =\mathbb{E} _{x\sim \mathcal{D}}\left[ \mathrm{KL}\left( \pi _{\theta}\left( \cdot |x \right) ,\pi _{T\left( x \right)}\left( \cdot |x \right) \right) \right]. \end{align} $$
By the definition of KL divergence, one can show that
$$ \begin{align} \mathrm{KL}\left( \pi _{\theta}\left( \cdot |x \right) ,\pi _{T\left( x \right)}\left( \cdot |x \right) \right) =-\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log\frac{\pi_{T(x)}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} \right],\end{align} $$
where $s_t:=(x,y_0,\dots,y_{t-1})$ and $a_t:=y_t$. Plugging this into (1) and negating the objective, so that the minimization becomes a maximization, yields
$$ \small{\begin{align}\text{OPD\,:}\quad \underset{\theta}{\max}\,\,\mathcal{V} \left(\theta \right) &=\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log\frac{\pi_{T(x)}\left(a_t|s_t\right)}{\pi_{\theta}\left(a_t|s_t\right)} \right] \\ &=\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log \pi_{T(x)}\left(a_t|s_t\right)+\mathcal{H}\left(\pi_{\theta}\left(\cdot|s_t\right)\right) \right],\end{align}} $$
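which implies that OPD is a special entropy-regularized, finite-horizon RL problem whose immediate reward $r(s_t,a_t)$ is the teacher's log probability $\log \pi_{T(x)}(a_t|s_t)$. The second equality holds because $\mathbb{E}_{a_t\sim \pi_{\theta}(\cdot|s_t)}\left[-\log \pi_{\theta}(a_t|s_t)\right]=\mathcal{H}\left(\pi_{\theta}(\cdot|s_t)\right)$ at every state $s_t$.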
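The identities above can be checked numerically on a toy problem. The sketch below is only an illustration, not the training setup used in this post: it assumes a single prompt, a fixed horizon with no end-of-sequence token, and small random tabular policies (`make_policy`, `VOCAB`, and `HORIZON` are hypothetical names), so that every sequence can be enumerated exactly.

```python
import itertools
import numpy as np

VOCAB, HORIZON = 3, 3  # assumed toy sizes, small enough to enumerate all sequences


def make_policy(seed):
    """Random tabular autoregressive policy: prefix tuple -> next-token probabilities."""
    rng = np.random.default_rng(seed)
    table = {}
    for t in range(HORIZON):
        for prefix in itertools.product(range(VOCAB), repeat=t):
            logits = rng.normal(size=VOCAB)
            table[prefix] = np.exp(logits) / np.exp(logits).sum()
    return table


def seq_prob(policy, y):
    """Probability of the full sequence y, via the autoregressive chain rule."""
    return float(np.prod([policy[y[:t]][y[t]] for t in range(len(y))]))


def entropy(p):
    """Shannon entropy of one next-token distribution."""
    return float(-(p * np.log(p)).sum())


student, teacher = make_policy(0), make_policy(1)
seqs = list(itertools.product(range(VOCAB), repeat=HORIZON))

# Sequence-level reverse KL: sum_y pi_theta(y) * log(pi_theta(y) / pi_T(y)).
kl_seq = sum(
    seq_prob(student, y) * np.log(seq_prob(student, y) / seq_prob(teacher, y))
    for y in seqs
)

# Token-level form: minus the expected sum of per-token log ratios.
kl_tokens = -sum(
    seq_prob(student, y)
    * sum(np.log(teacher[y[:t]][y[t]] / student[y[:t]][y[t]]) for t in range(HORIZON))
    for y in seqs
)

# Entropy-regularized form: teacher log-prob reward plus student entropy per state.
v_entropy = sum(
    seq_prob(student, y)
    * sum(np.log(teacher[y[:t]][y[t]]) + entropy(student[y[:t]]) for t in range(HORIZON))
    for y in seqs
)

print(kl_seq, kl_tokens, -v_entropy)  # the three values should agree up to rounding
```

Because the expectation is computed by exact enumeration rather than sampling, the three quantities match to floating-point precision; with sampled rollouts they would only agree in expectation.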
Since OPD is a special RL problem, we can use policy gradient methods to update the student model. We state the policy gradient of OPD below, along with its proof.
Theorem 1 (Policy Gradient of OPD). For arbitrary parameter $\theta$, the policy gradient of OPD is
$$ \begin{align}\nabla _{\theta}\mathcal{V} \left( \theta \right) =\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E} _{y\sim \pi _{\theta}\left( \cdot |x \right)}\left[ \sum_{t=0}^{\left| y \right|-1}{\hat{A}_{\theta}\left( s_t,a_t \right) \cdot \nabla _{\theta}\log \pi _{\theta}\left( a_t|s_t \right)} \right],\end{align} $$
where $s_t:=(x,y_{<t}),a_t:=y_t,$ and the advantage $\hat{A}_{\theta}\left( s_t,a_t \right)$ can be one of the following:
(i) $\sum_{t^{\prime}=0}^{\left| y \right|-1}{\log \frac{\pi _{T\left( x \right)}\left( a_{t^{\prime}}|s_{t^{\prime}} \right)}{\pi _{\theta}\left( a_{t^{\prime}}|s_{t^{\prime}} \right)}}=\log \frac{\pi _{T\left( x \right)}\left( y|x \right)}{\pi _{\theta}\left( y|x \right)}$ : the log ratio of the full trajectory.