Authors: Jiahao Yu*, Zelei Cheng*, Xian Wu, Xinyu Xing
Date: 2025-09-01
LLM-powered software engineering agents are rapidly advancing, showing great promise in automating complex coding tasks. However, as these agents tackle real-world problems, a core challenge has emerged: while we can generate many potential solutions to a problem (a strategy known as test-time scaling[1]), the performance gains are often limited if the solutions are too similar to one another[2].
This is because modern alignment techniques, such as Direct Preference Optimization (DPO), tend to inadvertently reduce the diversity of the model’s outputs. This “diversity collapse” means the model becomes overconfident in a narrow range of solutions, making it less likely to find the correct one for complex problems. If you ask an agent to generate ten solutions and it gives you the same idea repackaged ten times, you haven’t really explored the solution space.
To address this, we introduce EntroPO, an entropy-enhanced preference optimization method tailored for multi-turn, tool-using coding agents. EntroPO is designed to preserve policy diversity during fine-tuning, unlocking significant performance gains from test-time scaling.
EntroPO introduces two key innovations to overcome the limitations of existing approaches: an entropy-preserving preference optimization objective that keeps solution diversity alive during fine-tuning, and a backward-iteration formulation that extends this objective to multi-turn, tool-using agents.
Technically, EntroPO is built on a carefully designed training objective that optimizes the agent’s policy $\pi$. The goal is to maximize the expected utility $u(x,y)$ of the final solution while explicitly regularizing the policy to maintain diversity. The objective function is:
$\max_{\pi} \mathbb{E}[u(x,y) + \alpha \cdot H(\pi) - \beta \cdot H(\pi, \pi_{ref})]$
Here, $H(\pi)$ is the policy’s entropy, and $\alpha$ is a coefficient that scales its importance, directly encouraging the generation of diverse outputs. The term $H(\pi, \pi_{ref})$ is the cross-entropy between the learned policy and a reference policy $\pi_{ref}$, with $\beta$ as a coefficient that penalizes deviation from the original model’s behavior.
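To make the three terms concrete, here is a minimal numerical sketch for a categorical policy over a handful of candidate solutions. The candidate utilities, reference probabilities, and the values of $\alpha$ and $\beta$ below are invented purely for illustration and are not from the paper.

```python
import numpy as np

def entropo_objective(pi, pi_ref, utilities, alpha=0.1, beta=0.05):
    """E[u] + alpha * H(pi) - beta * H(pi, pi_ref) for discrete distributions."""
    pi = np.asarray(pi, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    utilities = np.asarray(utilities, dtype=float)

    expected_utility = np.sum(pi * utilities)        # E_{y ~ pi} u(x, y)
    entropy = -np.sum(pi * np.log(pi))               # H(pi): diversity bonus
    cross_entropy = -np.sum(pi * np.log(pi_ref))     # H(pi, pi_ref): anchor to reference
    return expected_utility + alpha * entropy - beta * cross_entropy

# Three hypothetical candidate patches with made-up utilities.
utilities = [0.2, 0.9, 0.5]
pi_ref = [0.5, 0.3, 0.2]                             # reference model's preferences
peaked = [0.01, 0.98, 0.01]                          # near-deterministic policy
spread = [0.15, 0.60, 0.25]                          # more diverse policy

print(entropo_objective(peaked, pi_ref, utilities))
print(entropo_objective(spread, pi_ref, utilities))  # entropy bonus rewards diversity
```

Comparing the two calls shows how the entropy term pushes back against collapsing all probability mass onto a single candidate.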
In the single-turn case, this objective has a closed-form optimal policy:
$\pi(y|x) \propto \pi_{ref}(y|x)^{\beta/\alpha} \exp\left(\frac{u(x,y)}{\alpha}\right)$
This shows how the final policy balances the original model’s predictions, the learned utility, and the entropy bonus.
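As a sketch of how this closed form could be evaluated over a discrete candidate set (again with invented utilities and reference probabilities), one can work in log space and normalize at the end:

```python
import numpy as np

def closed_form_policy(pi_ref, utilities, alpha=0.1, beta=0.05):
    """pi(y|x) proportional to pi_ref(y|x)^(beta/alpha) * exp(u(x,y)/alpha)."""
    pi_ref = np.asarray(pi_ref, dtype=float)
    utilities = np.asarray(utilities, dtype=float)

    # Work in log space and subtract the max for numerical stability.
    logits = (beta / alpha) * np.log(pi_ref) + utilities / alpha
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()      # normalize to a proper distribution

pi_ref = [0.5, 0.3, 0.2]                # illustrative reference probabilities
utilities = [0.2, 0.9, 0.5]             # illustrative utilities
print(closed_form_policy(pi_ref, utilities))
```

Larger $\alpha$ flattens the distribution (more exploration), while larger $\beta/\alpha$ pulls it back toward the reference model.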
In the multi-turn case, EntroPO uses a backward iteration approach to find the optimal policy, solving the problem recursively from the last step $(h = H)$ down to the first $(h = 1)$.
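For intuition, here is a tabular sketch of that backward pass on a tiny invented MDP, assuming utility arrives only at the final turn and using the soft-max (log-partition) value implied by the policy and Q-value definitions given below. The horizon, state/action counts, transitions, utilities, $\alpha$, and $\beta$ are all made up for illustration; the real agent acts over multi-turn coding trajectories.

```python
import numpy as np

alpha, beta = 0.1, 0.05
H = 3                                        # horizon (number of turns)
S, A = 4, 3                                  # toy numbers of states and actions
rng = np.random.default_rng(0)
utility = rng.uniform(size=(S, A))           # terminal utility at step H (made up)
next_state = rng.integers(S, size=(S, A))    # deterministic toy transitions
pi_ref = rng.dirichlet(np.ones(A), size=S)   # reference policy per state

V = np.zeros(S)                              # value beyond the horizon
policies = {}
for h in range(H, 0, -1):                    # backward: h = H, H-1, ..., 1
    # Q at step h: terminal utility at the last step, else value of the next state.
    Q = utility if h == H else V[next_state]
    # Unnormalized policy weights pi_ref^(beta/alpha) * exp(Q / alpha), in log space.
    log_w = (beta / alpha) * np.log(pi_ref) + Q / alpha
    log_Z = np.logaddexp.reduce(log_w, axis=1, keepdims=True)   # log of Z_h(s_h)
    policies[h] = np.exp(log_w - log_Z)      # policy at step h, per state
    V = alpha * log_Z.squeeze(axis=1)        # soft value used by the next backup

print(policies[1])                           # first-turn policy for each toy state
```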
The optimal policy $\pi_{\mathcal{M},h}$ at any step $h$ is defined in terms of the Q-value, $Q_{\mathcal{M},h}$:
$\pi_{\mathcal{M},h}(a_h|s_h) = \frac{\pi_{ref,h}(a_h|s_h)^{\beta/\alpha}}{Z_h(s_h)} \exp\left(\frac{Q_{\mathcal{M},h}(s_h, a_h)}{\alpha}\right)$
where $Z_h(s_h)$ is a normalization constant. The Q-value is defined by the value of the next state, $V_{\mathcal{M}, h+1}$, creating a recursive relationship: