Zitian Gao, Lynx Chen, Haoming Luo, Joey Zhou, Bryan Dai$^{\dagger}$

$\dagger$ : Corresponding author

GitHub: https://github.com/zitian-gao/one-shot-em

<aside> ✨

Using only a single piece of unlabeled data, can EM actually outperform reinforcement learning?

No labeled data, no complex reward design, and results in just 10 training steps. "Entropy Minimization" might be a better fit than reinforcement learning for rapidly upgrading large language models.

Introduction

The post-training phase of large language models (LLMs) has advanced rapidly [3, 9, 20–22, 24], with models like DeepSeek-R1 [4], Kimi-K1.5 [19], and the OpenAI o-series [13, 14] demonstrating remarkable reasoning abilities. However, preparing for Reinforcement Learning (RL) is never easy: it typically requires large amounts of high-quality, ground-truth-labeled data, along with carefully designed rule-based rewards to maximize advantage signals and prevent reward hacking.

In contrast, Entropy Minimization (EM) is entirely unsupervised. We trained 13,440 large language models with EM in order to eliminate training randomness as much as possible and to ensure that the experimental results and observed patterns are reliable. Our rigorous study demonstrates that with just a single piece of unlabeled data, EM already surpasses traditional RL in performance. Moreover, EM-trained models typically converge within just 10 training steps, far fewer than the thousands of steps often required for RL. EM rests on two simple, direct assumptions: (1) the sampling process during generation in large language models is inherently stochastic, and (2) correct answers generally have lower entropy than incorrect ones. Our study reveals that EM and RL share the same goal: unlocking the pretrained model’s latent potential without adding new knowledge [11]. Both rely on a process we call “token reranking” to maximize the model’s performance. We find that entropy minimization has the capacity to rival RL in the post-training phase.
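To make the objective concrete, here is a minimal sketch of a token-level entropy loss of the kind EM minimizes, written in PyTorch. The function name `token_entropy_loss` and the masking convention are our own illustrative choices, not the exact implementation from the repository.

```python
import torch
import torch.nn.functional as F

def token_entropy_loss(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the model's next-token distributions.

    logits:        (batch, seq_len, vocab) raw logits at each position.
    response_mask: (batch, seq_len) 1.0 for sampled response tokens,
                   0.0 for prompt/padding positions that should not count.
    Minimizing this quantity sharpens the distribution on tokens the model
    already tends to sample, i.e. the "token reranking" described above.
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)      # (B, T, V)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (B, T)
    return (entropy * response_mask).sum() / response_mask.sum().clamp(min=1.0)
```

Because the loss depends only on the model’s own sampled outputs, no labels or reward signals enter the computation.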

The evaluation results of the opening figure are detailed below:

[Figure: detailed evaluation results]

Reinforcement Learning (RL) has achieved remarkable success in the fine-tuning of large language models (LLMs) in recent years. However, the high cost of data annotation, the complexity of reward design, and the long training cycles have become major bottlenecks limiting the broader application of RL.

We propose an extremely simple yet effective unsupervised method: One-shot Entropy Minimization (EM). With just a single piece of unlabeled data and fewer than 10 training steps, EM can significantly improve LLM performance, sometimes even outperforming RL approaches that rely on thousands of annotated samples. This breakthrough may fundamentally reshape our understanding of LLM post-training.
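As a rough illustration of the recipe, the sketch below runs a few EM steps on a single unlabeled prompt: sample several completions, recompute their logits with gradients enabled, and minimize the mean token entropy of the responses. The model name, prompt, learning rate, and sampling settings are placeholders, not the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-7B"   # placeholder; any causal LM should work
PROMPT = "Solve: if 3x + 5 = 20, what is x? Show your reasoning."  # the single unlabeled example
NUM_STEPS, NUM_SAMPLES = 10, 8

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

prompt_ids = tokenizer(PROMPT, return_tensors="pt").input_ids.to(model.device)
prompt_len = prompt_ids.shape[1]

for step in range(NUM_STEPS):
    # 1) Sample several stochastic completions of the single unlabeled prompt.
    with torch.no_grad():
        sequences = model.generate(
            prompt_ids.repeat(NUM_SAMPLES, 1),
            do_sample=True, temperature=1.0,
            max_new_tokens=256, pad_token_id=tokenizer.eos_token_id,
        )

    # 2) Recompute logits for the sampled sequences with gradients enabled.
    logits = model(sequences).logits[:, :-1, :]       # logits[t] predicts token t+1
    mask = torch.zeros_like(sequences, dtype=torch.float)
    mask[:, prompt_len:] = 1.0                        # count only response positions
    mask = mask[:, 1:]                                # align with the shifted logits
    # (for brevity, trailing padding after EOS is not excluded here)

    # 3) Minimize the mean token-level entropy over the sampled responses.
    log_probs = F.log_softmax(logits.float(), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    loss = (entropy * mask).sum() / mask.sum().clamp(min=1.0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: mean token entropy = {loss.item():.4f}")
```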

From RL to EM: The Challenges of Fine-Tuning LLMs and a New Perspective

Today’s large language models (LLMs), after being pre-trained on massive datasets, exhibit impressive general capabilities. However, to push their performance to top-tier levels in specific, complex reasoning tasks (such as math, physics, or programming), post-training is required. The mainstream method for post-training has been reinforcement learning (RL), particularly RL with verifiable rewards (RLVR).
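For contrast, RLVR-style training hinges on a rule-based, verifiable reward, which in turn requires a ground-truth label for every training problem. A toy sketch of such a reward check follows; the boxed-answer convention and the `verifiable_reward` helper are illustrative, not a specific library's API.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward for RLVR: 1.0 if the last boxed answer in the
    model's response exactly matches the ground-truth label, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

# The reward can only be computed because the label "5" is known in advance;
# EM, by contrast, needs nothing beyond the model's own sampled responses.
print(verifiable_reward(r"... so the answer is \boxed{5}.", "5"))  # prints 1.0
```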