Kezhao Liu

Renbiao Liu

Mingtao Chen

Yiming Liu$^*$ liuym225@mail2.sysu.edu.cn


First discussed at: https://zhuanlan.zhihu.com/p/28735759256

Better visual version: k2 as loss.pdf

Code at: https://github.com/OpenRLHF/OpenRLHF/pull/797


Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation—a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $\boldsymbol{k_n}$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (like PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (like GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.


1. Introduction

The training of state-of-the-art Large Language Models (LLMs) is a multistage process. Following large-scale pretraining, models are refined through Supervised Fine-Tuning (SFT) to learn instruction-following behaviors. To further elevate these SFT models, a final post-training stage, Reinforcement Learning from Human Feedback (RLHF), is often employed. The objective of RLHF is twofold: it serves to align the model more closely with complex human values (Ouyang et al., 2022) and, increasingly, to push the performance limits in specialized reasoning tasks such as mathematics and code generation, as seen in models such as DeepSeekMath (Shao et al., 2024). A core component of this RLHF process is KL regularization, implemented through a loss term derived from the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). The KL loss serves not only to stabilize the training process but also to improve generalization by preventing the policy from overfitting the reward signal and deviating excessively from the initial SFT model (Ouyang et al., 2022; Stiennon et al., 2020).
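Concretely, the KL-regularized objective is commonly written in the following form (notation assumed here: $\pi_{\theta}$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen SFT/reference policy, $r(x, y)$ the reward, and $\beta$ the KL coefficient):

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big].$$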

Despite the critical role of the KL loss, its theoretical foundations in the optimization context remain underexplored. The choice of its specific mathematical form is often guided by principles from numerical value estimation, not from the perspective of gradient-based optimization. This category error has led to a proliferation of ad-hoc implementations and suboptimal algorithm designs, exemplified by recent methods like GRPO that adopt such estimators under the mistaken assumption that good value estimation properties translate to effective gradients. This paper argues that a gradient-centric perspective is essential for designing robust and effective RLHF algorithms.
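For concreteness, we recall the per-sample estimators of the reverse KL $D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})$ that are referred to as $k_1$, $k_2$, and $k_3$ below, assuming the standard definitions with $y \sim \pi_{\theta}(\cdot \mid x)$ and ratio $r = \pi_{\mathrm{ref}}(y \mid x)/\pi_{\theta}(y \mid x)$:

$$k_1 = -\log r, \qquad k_2 = \tfrac{1}{2}\,(\log r)^2, \qquad k_3 = (r - 1) - \log r.$$

From the value estimation viewpoint, $k_1$ and $k_3$ are unbiased estimators of the KL value while $k_2$ trades a small bias for lower variance; as the preceding paragraph argues, however, these properties do not by themselves guarantee effective gradients.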

We perform a systematic, gradient-based analysis of the KL loss to address these issues. We first establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $\boldsymbol{k_n}$ as a detached coefficient ('$k_n$ in reward') or as a direct loss function ('$k_n$ as loss'). This framework allows us to analyze any implementation by examining its equivalent gradient coefficient. Using this lens, we begin with the '$k_1$ as loss' case as a counterexample that exposes the pitfalls of the value estimation perspective. We then prove that the conventional '$k_1$ in reward' and the '$k_2$ as loss' formulations are, in fact, gradient-equivalent and represent the principled approach to reverse KL regularization. Finally, we analyze popular alternatives like '$k_3$ as loss', revealing their nature as biased approximations, and address a common but critical bug in their off-policy implementation.
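To make the two implementation styles concrete, the minimal PyTorch sketch below (hypothetical function and variable names; not the exact OpenRLHF code) contrasts '$k_1$ in reward', where the detached KL estimate is folded into the scalar reward, with '$k_2$ as loss', where gradients are propagated through the KL term itself:

```python
import torch

def k1_in_reward(logp, ref_logp, reward, beta):
    """'k_1 in reward': the per-token estimate k1 = log(pi_theta / pi_ref) is
    detached and subtracted from the reward, so the KL penalty influences the
    gradient only through the score-function (policy-gradient) coefficient."""
    k1 = (logp - ref_logp).detach()
    return reward - beta * k1                  # shaped reward, no grad through k1

def k2_as_loss(logp, ref_logp, beta):
    """'k_2 as loss': k2 = 0.5 * (log r)^2 with r = pi_ref / pi_theta is used as
    a differentiable loss term, and gradients are propagated through it."""
    log_ratio = logp - ref_logp                # log(pi_theta / pi_ref) = -log r
    k2 = 0.5 * log_ratio ** 2
    return beta * k2.mean()

# Toy usage with per-token log-probabilities of sampled responses.
logp = torch.randn(4, 16, requires_grad=True)  # from the current policy
ref_logp = torch.randn(4, 16)                  # from the frozen reference policy
reward = torch.randn(4, 16)

shaped_reward = k1_in_reward(logp, ref_logp, reward, beta=0.1)
k2_as_loss(logp, ref_logp, beta=0.1).backward()
```

Under on-policy sampling, these two implementations produce the same expected gradient, which is precisely the equivalence established in this paper.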

Our main contributions are threefold:

  1. A Gradient-Centric Framework for KL Regularization. We revisit KL regularization in RLHF, shifting the focus from value estimation to gradient optimization. We demonstrate the necessity of this perspective using '$k_1$ as loss' as a powerful counterexample, showing how an unbiased value estimator yields a completely ineffective optimization signal.