Authors: Jiahao Yu*, Zelei Cheng*, Xian Wu, Xinyu Xing
Date: 2025-09-01
LLM-powered software engineering agents are rapidly advancing, showing great promise in automating complex coding tasks. However, as these agents tackle real-world problems, a core challenge has emerged: while we can generate many potential solutions to a problem (a strategy known as test-time scaling[1]), the performance gains are often limited if the solutions are too similar to one another[2].
This is because modern alignment techniques, such as Direct Preference Optimization (DPO), tend to inadvertently reduce the diversity of the model’s outputs. This “diversity collapse” means the model becomes overconfident in a narrow range of solutions, making it less likely to find the correct one for complex problems. If you ask an agent to generate ten solutions and it gives you the same idea repackaged ten times, you haven’t really explored the solution space.
To address this, we introduce EntroPO, an entropy-enhanced preference optimization method tailored for multi-turn, tool-using coding agents. EntroPO is designed to preserve policy diversity during fine-tuning, unlocking significant performance gains from test-time scaling.
EntroPO introduces two key innovations to overcome the limitations of existing approaches: an entropy-preserving preference optimization objective that keeps solution diversity alive during fine-tuning, and a backward-iteration formulation that extends this objective to multi-turn, tool-using agents.
Technically, EntroPO is built on a carefully designed training objective that optimizes the agent’s policy $\pi$. The goal is to maximize the expected utility $u(x,y)$ of the final solution while explicitly regularizing the policy to maintain diversity. The objective function is:
$\max_{\pi} \mathbb{E}[u(x,y) + \alpha \cdot H(\pi) - \beta \cdot H(\pi, \pi_{ref})]$
Here, $H(\pi)$ is the policy’s entropy, and $\alpha$ is a coefficient that scales its importance, directly encouraging the generation of diverse outputs. The term $H(\pi, \pi_{ref})$ is the cross-entropy between the learned policy and a reference policy $\pi_{ref}$, with $\beta$ as a coefficient that penalizes deviation from the original model’s behavior.
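To make the three terms concrete, here is a minimal numerical sketch for a categorical policy over a handful of candidate solutions. The candidate utilities, reference probabilities, and the values of $\alpha$ and $\beta$ below are invented purely for illustration and are not from the paper.

```python
import numpy as np

def entropo_objective(pi, pi_ref, utilities, alpha=0.1, beta=0.05):
    """E[u] + alpha * H(pi) - beta * H(pi, pi_ref) for discrete distributions."""
    pi = np.asarray(pi, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    utilities = np.asarray(utilities, dtype=float)

    expected_utility = np.sum(pi * utilities)        # E_{y ~ pi} u(x, y)
    entropy = -np.sum(pi * np.log(pi))               # H(pi): diversity bonus
    cross_entropy = -np.sum(pi * np.log(pi_ref))     # H(pi, pi_ref): anchor to reference
    return expected_utility + alpha * entropy - beta * cross_entropy

# Three hypothetical candidate patches with made-up utilities.
utilities = [0.2, 0.9, 0.5]
pi_ref = [0.5, 0.3, 0.2]                             # reference model's preferences
peaked = [0.01, 0.98, 0.01]                          # near-deterministic policy
spread = [0.15, 0.60, 0.25]                          # more diverse policy

print(entropo_objective(peaked, pi_ref, utilities))
print(entropo_objective(spread, pi_ref, utilities))  # entropy bonus rewards diversity
```

Comparing the two calls shows how the entropy term pushes back against collapsing all probability mass onto a single candidate.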
In the single-turn case, this objective has a closed-form optimal policy:
$\pi(y|x) \propto \pi_{ref}(y|x)^{\beta/\alpha} \exp\left(\frac{u(x,y)}{\alpha}\right)$
This shows how the final policy balances the original model’s predictions, the learned utility, and the entropy bonus.
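As a sketch of how this closed form could be evaluated over a discrete candidate set (again with invented utilities and reference probabilities), one can work in log space and normalize at the end:

```python
import numpy as np

def closed_form_policy(pi_ref, utilities, alpha=0.1, beta=0.05):
    """pi(y|x) proportional to pi_ref(y|x)^(beta/alpha) * exp(u(x,y)/alpha)."""
    pi_ref = np.asarray(pi_ref, dtype=float)
    utilities = np.asarray(utilities, dtype=float)

    # Work in log space and subtract the max for numerical stability.
    logits = (beta / alpha) * np.log(pi_ref) + utilities / alpha
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()      # normalize to a proper distribution

pi_ref = [0.5, 0.3, 0.2]                # illustrative reference probabilities
utilities = [0.2, 0.9, 0.5]             # illustrative utilities
print(closed_form_policy(pi_ref, utilities))
```

Larger $\alpha$ flattens the distribution (more exploration), while larger $\beta/\alpha$ pulls it back toward the reference model.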
In the multi-turn case, EntroPO uses a backward iteration approach to find the optimal policy, solving the problem recursively from the last step $(h = H)$ down to the first $(h = 1)$.
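For intuition, here is a tabular sketch of that backward pass on a tiny invented MDP, assuming utility arrives only at the final turn and using the soft-max (log-partition) value implied by the policy and Q-value definitions given below. The horizon, state/action counts, transitions, utilities, $\alpha$, and $\beta$ are all made up for illustration; the real agent acts over multi-turn coding trajectories.

```python
import numpy as np

alpha, beta = 0.1, 0.05
H = 3                                        # horizon (number of turns)
S, A = 4, 3                                  # toy numbers of states and actions
rng = np.random.default_rng(0)
utility = rng.uniform(size=(S, A))           # terminal utility at step H (made up)
next_state = rng.integers(S, size=(S, A))    # deterministic toy transitions
pi_ref = rng.dirichlet(np.ones(A), size=S)   # reference policy per state

V = np.zeros(S)                              # value beyond the horizon
policies = {}
for h in range(H, 0, -1):                    # backward: h = H, H-1, ..., 1
    # Q at step h: terminal utility at the last step, else value of the next state.
    Q = utility if h == H else V[next_state]
    # Unnormalized policy weights pi_ref^(beta/alpha) * exp(Q / alpha), in log space.
    log_w = (beta / alpha) * np.log(pi_ref) + Q / alpha
    log_Z = np.logaddexp.reduce(log_w, axis=1, keepdims=True)   # log of Z_h(s_h)
    policies[h] = np.exp(log_w - log_Z)      # policy at step h, per state
    V = alpha * log_Z.squeeze(axis=1)        # soft value used by the next backup

print(policies[1])                           # first-turn policy for each toy state
```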
The optimal policy $\pi_{\mathcal{M},h}$ at any step $h$ is defined in terms of the Q-value, $Q_{\mathcal{M},h}$:
$\pi_{\mathcal{M},h}(a_h|s_h) = \frac{\pi_{ref,h}(a_h|s_h)^{\beta/\alpha}}{Z_h(s_h)} \exp\left(\frac{Q_{\mathcal{M},h}(s_h, a_h)}{\alpha}\right)$
where $Z_h(s_h)$ is a normalization constant. The Q-value is defined by the value of the next state, $V_{\mathcal{M}, h+1}$, creating a recursive relationship: