Authors:
KeZhao Liu, SYSU HCP Lab. liukzh9@mail2.sysu.edu.cn
YiMing Liu*, SYSU HCP Lab. letusgo126@126.com
Jason Klein Liu, Independent Researcher. jasonkleinlove@gmail.com
Other Contributors in the Discussion:
Jiacai Liu, Fudan University.
Hongyu Zang.
## Contributions
The main academic contributions of this paper lie in an in-depth theoretical analysis and empirical exploration of KL-divergence-regularized reinforcement learning algorithms, specifically in the following aspects:
### Unbiased estimation of the gradient of the KL divergence rather than of its value, and the conditions for effective estimation
Directly calculating the KL divergence as a loss function based on the definition over the entire vocabulary is theoretically feasible but computationally expensive. In practice, sampling-based methods are needed, retaining only the probability values of each token in the sampled trajectory.
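To make the contrast concrete, here is a minimal PyTorch sketch (toy shapes and random logits of my own choosing, not the paper's code) of the per-token KL computed exactly over the full vocabulary versus keeping only the log-probabilities of the sampled tokens:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 1000                       # assumed batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)              # current policy logits over the vocabulary
ref_logits = torch.randn(B, T, V)          # frozen reference policy logits
tokens = torch.randint(V, (B, T))          # token ids of the sampled trajectory

logp = F.log_softmax(logits, dim=-1)
ref_logp = F.log_softmax(ref_logits, dim=-1)

# Exact per-token KL(pi_theta || pi_ref): a sum over the whole vocabulary, O(B*T*V).
kl_exact = (logp.exp() * (logp - ref_logp)).sum(dim=-1)               # shape (B, T)

# Sampling-based practice: keep only the log-probabilities of the generated tokens.
logp_tok = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)          # log pi_theta(y_t)
ref_logp_tok = ref_logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # log pi_ref(y_t)
k1_tok = logp_tok - ref_logp_tok           # single-sample per-token estimate of the same KL
```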
As a loss function, it is more appropriate to construct an unbiased estimate of the gradient of the KL divergence than of the KL divergence value itself. In [classification_KL](#classification_KL), this paper derives the analytical expression of the KL divergence gradient under sampling. The condition for an effective estimate is that the effective sample size must be sufficiently large: many samples drawn from a stable policy.
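For reference, the score-function identity behind that derivation can be sketched as follows (the precise statement and conditions are in [classification_KL](#classification_KL)):

$$
\nabla_\theta \,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
= \nabla_\theta \sum_y \pi_\theta(y)\,\log\frac{\pi_\theta(y)}{\pi_{\mathrm{ref}}(y)}
= \mathbb{E}_{y\sim\pi_\theta}\!\left[\left(\log\pi_\theta(y)-\log\pi_{\mathrm{ref}}(y)\right)\nabla_\theta\log\pi_\theta(y)\right],
$$

where the second equality uses $\sum_y \pi_\theta(y)\,\nabla_\theta\log\pi_\theta(y)=\nabla_\theta\sum_y\pi_\theta(y)=0$. Averaging the term inside the expectation over sampled trajectories therefore gives an unbiased estimate of the KL gradient, and its quality hinges on having enough samples from a sufficiently stable policy.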
### k1, k2, k3
In [klequal](#klequal), it is proven that the k1 estimate used in the reward function is mathematically equivalent to the k2 estimate used as a loss function, and in [KL_loss_reward](#KL_loss_reward) the superiority of the k2 loss function over the k3 loss function is demonstrated.
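As a toy numerical check of the first claim (my own simplified setup: a single categorical policy, no advantage or baseline term, shared KL coefficient omitted), the REINFORCE surrogate for a k1 reward penalty and the sample-based k2 loss produce identical gradients:

```python
import torch

torch.manual_seed(0)
V, N = 50, 4096                                   # toy vocabulary size and number of sampled tokens
theta = torch.randn(V, requires_grad=True)        # logits of the current policy pi_theta
ref_logits = torch.randn(V)                       # logits of the frozen reference policy

logp = torch.log_softmax(theta, dim=-1)
ref_logp = torch.log_softmax(ref_logits, dim=-1)

x = torch.multinomial(logp.exp(), N, replacement=True)   # tokens sampled from pi_theta
delta = logp[x] - ref_logp[x]                             # k1 = log pi_theta(x) - log pi_ref(x)

# (a) k1 as a reward penalty: score-function (REINFORCE) surrogate loss
loss_k1_reward = (delta.detach() * logp[x]).mean()
grad_a, = torch.autograd.grad(loss_k1_reward, theta, retain_graph=True)

# (b) k2 as a direct loss, differentiating only the sampled log-probabilities
loss_k2 = (0.5 * delta ** 2).mean()
grad_b, = torch.autograd.grad(loss_k2, theta)

print(torch.allclose(grad_a, grad_b, atol=1e-6))  # True: the two gradients coincide sample by sample
```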
In [k3 reward](#k3-reward), the expression of the reward function obtained by reparameterizing the k3 loss function in the GRPO algorithm is derived.
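For concreteness, a minimal sketch of how such a k3-style per-token KL term is commonly implemented (the function name, shapes, and coefficient value below are my own illustration, not the paper's code):

```python
import torch

def k3_kl_term(logp: torch.Tensor, ref_logp: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimate of KL(pi_theta || pi_ref), used as a loss term.

    logp, ref_logp: log-probabilities of the sampled tokens under the current
    and reference policies, respectively.
    """
    log_ratio = ref_logp - logp                # log r with r = pi_ref(y_t) / pi_theta(y_t)
    return log_ratio.exp() - log_ratio - 1.0   # r - 1 - log r, always >= 0

# Placeholder per-token log-probabilities of one sampled response
logp = torch.tensor([-1.2, -0.7, -2.3, -0.9])
ref_logp = torch.tensor([-1.0, -0.9, -2.0, -1.1])
beta = 0.04                                    # illustrative KL coefficient
kl_loss = beta * k3_kl_term(logp, ref_logp).mean()
```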
Counterexamples are provided showing that the variance of the k3 estimate is not always smaller than that of k1 and k2, refuting the claim in the existing literature of the unconditional superiority of the k3 estimate (e.g., [kl_approx](#kl_approx), [shao2024deepseekmath](#shao2024deepseekmath)).
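A small Monte Carlo harness (my own toy setup with unit-variance Gaussians, not the paper's experiment) makes it easy to probe this empirically; varying the gap between the sampling and reference distributions changes how the estimators' spreads compare:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_q, mu_p, n = 0.0, 1.0, 500_000        # q = N(mu_q, 1) is the sampling policy, p = N(mu_p, 1) the reference

x = rng.normal(mu_q, 1.0, size=n)        # samples drawn from q
logq = -0.5 * (x - mu_q) ** 2            # log-densities up to a shared constant that cancels in the ratio
logp = -0.5 * (x - mu_p) ** 2
logr = logp - logq                       # log r, with r = p(x) / q(x)

true_kl = 0.5 * (mu_q - mu_p) ** 2       # closed-form KL(q || p) for unit-variance Gaussians

k1 = -logr
k2 = 0.5 * logr ** 2
k3 = np.exp(logr) - 1.0 - logr
for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(f"{name}: mean = {k.mean():.4f} (true KL = {true_kl:.4f}), std = {k.std():.4f}")
```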
### Derivation and clarification of the PPO algorithm's reward function
In [subsec:equi_reward](#subsec:equi_reward), the original reward function formula of the PPO algorithm is derived, clarifying potentially ambiguous statements in the existing literature [ouyang2022training](#ouyang2022training), [stiennon2020learning](#stiennon2020learning), [jaques2019way](#jaques2019way). The PPO reward function is effective because, in practice, it happens to satisfy the condition of a sufficiently large effective sample size.
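As a reference point, a minimal sketch of the per-token reward shaping used in common InstructGPT-style PPO implementations (the function name, shapes, and the choice to add the reward-model score at the final token are illustrative assumptions):

```python
import torch

def ppo_token_rewards(rm_score: torch.Tensor,
                      logp: torch.Tensor,
                      ref_logp: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Every token gets the k1 penalty -beta * (log pi_theta - log pi_ref);
    the scalar reward-model score is added at the last token of the response."""
    rewards = -beta * (logp - ref_logp)        # shape (T,): per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score       # sequence-level RM score on the final token
    return rewards

# Placeholder per-token log-probabilities and RM score
logp = torch.tensor([-1.2, -0.8, -2.1])
ref_logp = torch.tensor([-1.0, -0.9, -2.0])
print(ppo_token_rewards(torch.tensor(0.7), logp, ref_logp))
```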