Author: Cunxiao Du, Xinyu Yang, Min Lin, Chao Du, and the team
Diffusion-based large language models (Diffusion LLMs) have recently attracted significant attention as a fundamentally different architecture from standard autoregressive (AR) models. Many view them as a potential successor to AR LLMs because of their various claimed advantages.
A natural question follows:
As probabilistic models for language data, how do Diffusion LLMs differ from AR LLMs when fitting natural language?
We argue that viewing Diffusion LLMs as any-order language models offers a powerful lens for analysis. Under this lens, we present two key insights about the limitations of Diffusion LLMs:
1. Natural language exhibits strong structural biases that make some orders, notably left-to-right (L2R) and right-to-left (R2L), significantly easier to model than a random order.
2. Because Diffusion LLMs optimize all orders uniformly, they do not naturally concentrate their capacity on these favorable orders, leading to a significantly looser approximation of the underlying data distribution (see the objective sketched after this list).
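To make the "all orders uniformly" point concrete, the masked-diffusion training objective can be written, under the any-order view, as an expectation over generation orders. The notation below (a permutation $\sigma$ of positions, a model $p_\theta$, sequence length $L$) is our own sketch, not taken from the original derivation:

$$
-\log p_\theta(x) \;\le\; \mathbb{E}_{\sigma \sim \mathrm{Uniform}(S_L)}\!\left[\sum_{t=1}^{L} -\log p_\theta\!\left(x_{\sigma(t)} \mid x_{\sigma(<t)}\right)\right],
$$

where $S_L$ is the set of all $L!$ permutations of the $L$ positions and $x_{\sigma(<t)}$ denotes the tokens revealed before step $t$ under order $\sigma$. The L2R order is the single permutation $\sigma = (1, 2, \dots, L)$; because the expectation weights every order equally, the objective spends as much capacity on hard random orders as on the favorable L2R and R2L ones.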
Following Ni et al. [1], we focus on the widely used masked diffusion.
The training process operates as follows: a random subset of tokens is replaced with a [MASK] token, and the model is trained to predict those tokens from the remaining unmasked context.
A simple example:
Original: The quick brown fox jumps over the lazy dog
Masking: The quick [MASK] fox jumps [MASK] the lazy dog
Task: Predict “brown” and “over” given the rest.
This formulation resembles masked language modeling, as in BERT.
A key property of (masked) diffusion over discrete tokens is that, at each step, the model is trained to predict a subset of masked / noised tokens given the remaining unmasked tokens.
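As a rough illustration of such a training step, here is a minimal PyTorch-style sketch (our own simplification, not the authors' code): a random fraction of positions is replaced by a [MASK] token and cross-entropy is computed only on the masked positions. The per-ratio reweighting term in the exact masked-diffusion ELBO is omitted for brevity, and `MASK_ID` and the `model` interface are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # hypothetical [MASK] token id; depends on the tokenizer


def masked_diffusion_step(model, tokens, optimizer):
    """One simplified training step: mask a random subset of tokens and
    train the model to predict them from the unmasked context."""
    B, L = tokens.shape

    # Sample a masking ratio per sequence, then mask each position independently.
    ratio = torch.rand(B, 1, device=tokens.device)
    mask = torch.rand(B, L, device=tokens.device) < ratio  # True = masked

    # Replace masked positions with the [MASK] token.
    noised = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    # The model sees the partially masked sequence with bidirectional attention
    # and returns per-position logits of shape (B, L, vocab_size).
    logits = model(noised)

    # Cross-entropy on the masked positions only: predict the original tokens.
    loss = F.cross_entropy(logits[mask], tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```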