TL;DR:

In this blog post, we revisit the theory of on-policy distillation (OPD) and present some practical issues in agent training, along with their solutions, that arise when implementing OPD for model merging. Specifically:

We show that OPD is a particular form of entropy-regularized RL. The REINFORCE algorithm applied to OPD produces an unbiased policy gradient through the accumulated log-ratio-to-go, but it suffers from high variance and is often impractical. By contrast, using only the immediate per-token log ratio produces a partial, biased OPD gradient, yet converges faster because its variance is lower. We demonstrate, however, that this bias does not affect convergence, as optimality is still guaranteed at the stationary point.
We show that, when the teacher and student use different tokenizers, computing the per-token log ratio on a teacher token sequence that was manually constructed from an existing student token sequence can give incorrect or unexpected results, because the constructed sequence may deviate from the token sequences the teacher itself would produce. This can cause the teacher model to assign extremely low probabilities to certain tokens and lead to unforeseen training issues, even when the final concatenated text remains unchanged.
We present a concrete case of OPD reward hacking, where the student learns to exploit undesirable patterns from the teacher, causing the real reward to collapse. To address this, we mask the OPD loss on trajectories containing format errors (where the hacking occurs) and add a policy gradient term based on the outcome reward to prevent other potential exploits.

After achieving stable, long-term OPD training, we bring the student model up to the teacher model's performance, reaching 100% parity on both the training and test datasets. We also discuss further insights from our OPD experiments in Section 3.

</aside>

1. On-Policy Distillation

1.1 Objective

As large language models continue to grow in size and computational requirements, efficient training methods that produce capable smaller models, or merge multiple expert models into one, have become increasingly important. On-policy distillation (OPD) is a post-training approach that combines the advantages of on-policy training with dense reward signals, addressing fundamental limitations of both traditional knowledge distillation and reinforcement learning for language models.

We begin with the optimization objective of OPD. Suppose the training prompts are sampled from a distribution $\mathcal{D}$, and for each prompt $x$ we have access to a corresponding teacher model $\pi_{T(x)}$. OPD then minimizes the reverse KL divergence between the current parameterized model $\pi_\theta$ (also known as the student model) and the teacher model, i.e.,

$$ \begin{align}\text{OPD\,:}\quad \underset{\theta}{\min}\,\,\mathcal{L} \left(\theta\right) =\mathbb{E} _{x\sim \mathcal{D}}\left[ \mathrm{KL}\left( \pi _{\theta}\left( \cdot |x \right) ,\pi _{T\left( x \right)}\left( \cdot |x \right) \right) \right]. \end{align} $$

By the definition of KL divergence, one can show that

$$ \begin{align} \mathrm{KL}\left( \pi _{\theta}\left( \cdot |x \right) ,\pi _{T\left( x \right)}\left( \cdot |x \right) \right) =-\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log\frac{\pi_{T(x)}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} \right],\end{align} $$

where $s_t:=(x,y_0,\dots,y_{t-1})$ and $a_t:=y_t$. Substituting this into (1) and flipping the sign turns the minimization into the equivalent maximization

$$ \small{\begin{align}\text{OPD\,:}\quad \underset{\theta}{\max}\,\,\mathcal{V} \left(\theta \right) &=\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log\frac{\pi_{T(x)}\left(a_t|s_t\right)}{\pi_{\theta}\left(a_t|s_t\right)} \right] \\ &=\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=0}^{\left|y\right|-1}\log \pi_{T(x)}\left(a_t|s_t\right)+\mathcal{H}\left(\pi_{\theta}\left(\cdot|s_t\right)\right) \right],\end{align}} $$

which implies that OPD is a special entropy-regularized, finite-horizon RL problem in which the immediate reward $r(s_t,a_t)$ is the teacher’s log probability $\log \pi_{T(x)}(a_t|s_t)$.
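The identities above can be checked numerically on a toy problem. The sketch below is only an illustration under our own assumptions: a single implicit prompt, a vocabulary of 3 tokens, a fixed response length of 3, and randomly generated tabular policies. The names `random_policy`, `seq_prob`, and `compare_objectives` are hypothetical helpers, not part of any training code. It enumerates every response, computes the sequence-level reverse KL exactly, and compares it with the summed per-token log ratio and with the entropy-regularized form.

```python
import itertools
import math
import random

VOCAB, HORIZON = 3, 3  # toy sizes, assumed for illustration only


def random_policy(rng):
    """Map every prefix (a tuple of tokens) to a next-token distribution."""
    policy = {}
    for t in range(HORIZON):
        for prefix in itertools.product(range(VOCAB), repeat=t):
            weights = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(VOCAB)]
            total = sum(weights)
            policy[prefix] = [w / total for w in weights]
    return policy


def seq_prob(policy, y):
    """pi(y|x) as the product of per-token probabilities pi(a_t|s_t)."""
    return math.prod(policy[y[:t]][y[t]] for t in range(len(y)))


def compare_objectives(student, teacher):
    """Return (reverse KL, V via log ratios, V via teacher log prob + entropy)."""
    kl = v_ratio = v_entropy = 0.0
    for y in itertools.product(range(VOCAB), repeat=HORIZON):
        p = seq_prob(student, y)  # weight of y under the student
        kl += p * math.log(p / seq_prob(teacher, y))
        for t in range(HORIZON):
            ps, pt = student[y[:t]], teacher[y[:t]]
            # per-token log ratio log(pi_T / pi_theta) at (s_t, a_t)
            v_ratio += p * math.log(pt[y[t]] / ps[y[t]])
            # teacher log prob as reward, plus the student's entropy at s_t
            entropy = -sum(q * math.log(q) for q in ps)
            v_entropy += p * (math.log(pt[y[t]]) + entropy)
    return kl, v_ratio, v_entropy


rng = random.Random(0)
student, teacher = random_policy(rng), random_policy(rng)
kl, v_ratio, v_entropy = compare_objectives(student, teacher)
# The three printed numbers should agree up to floating-point rounding.
print(kl, -v_ratio, -v_entropy)
```

Exhaustive enumeration is only feasible at this toy scale; in actual training the expectation over $y\sim\pi_\theta(\cdot|x)$ is estimated from sampled rollouts.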

1.2 Policy Gradient of OPD

Since OPD is a special RL problem, we can use policy gradient methods to update the student model. We state the policy gradient of OPD below, followed by its proof.

Theorem 1 (Policy Gradient of OPD). For arbitrary parameter $\theta$, the policy gradient of OPD is

$$ \begin{align}\nabla _{\theta}\mathcal{V} \left( \theta \right) =\mathbb{E} _{x\sim \mathcal{D}}\mathbb{E} _{y\sim \pi _{\theta}\left( \cdot |x \right)}\left[ \sum_{t=0}^{\left| y \right|-1}{\hat{A}_{\theta}\left( s_t,a_t \right) \cdot \nabla _{\theta}\log \pi _{\theta}\left( a_t|s_t \right)} \right],\end{align} $$

where $s_t:=(x,y_{<t}),a_t:=y_t,$ and the advantage $\hat{A}_{\theta}\left( s_t,a_t \right)$ can be one of the following:

(i) $\sum_{t^{\prime}=0}^{\left| y \right|-1}{\log \frac{\pi _{T\left( x \right)}\left( a_{t^{\prime}}|s_{t^{\prime}} \right)}{\pi _{\theta}\left( a_{t^{\prime}}|s_{t^{\prime}} \right)}}=\log \frac{\pi _{T\left( x \right)}\left( y|x \right)}{\pi _{\theta}\left( y|x \right)}$ : the log ratio of the full trajectory.