Author: Shuibai Zhang Date: Oct 7, 2025
Inherent Limitations of Diffusion LLMs (DLLMs)
Claim. Despite their popularity, DLLMs face two fundamental structural challenges that limit their scalability compared to autoregressive (AR) LLMs:
- Unavailable KV caching due to bidirectional attention, which significantly reduces generation speed.
- Intractable sequence likelihood, making post-training and RLHF-style optimization difficult.
This blog examines why these limitations exist, explores current workarounds, and discusses their associated trade-offs.
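To make the second limitation concrete, contrast the two objectives schematically (my own summary, not a formula from a specific paper):

$$
\log p_\theta^{\text{AR}}(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),
\qquad
\log p_\theta^{\text{DLLM}}(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q(z \mid x)} \right],
$$

where $z$ denotes the intermediate (partially masked) latent sequences of the diffusion process. The AR side is an exact, cheap-to-evaluate log-likelihood; the DLLM side is only a variational lower bound, because the exact likelihood marginalizes over all masking trajectories. RLHF-style objectives that need per-sequence log-probabilities therefore have nothing exact to work with.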
KV Cache Works in AR but Fails in DLLMs
KV cache recap. During generation, Transformers compute per-token Keys and Values; caching them means previously generated tokens never have to be re-projected (or re-run through the network) at later decoding steps.
- AR models. Use a fixed causal mask (token i only attends to tokens ≤ i). Once a token’s K/V are computed, they never change as the model generates more tokens, so they can be reused (cached) efficiently across steps.
- DLLMs (masked diffusion / bidirectional attention). Tokens attend both left and right, and the effective attention pattern changes as tokens are unmasked/refined across diffusion steps. Whenever a token is revealed or a mask changes, the context seen by every other position changes, so K/V across the whole sequence must be recomputed. This breaks standard KV caching.
Consequence. Without KV caching, every refinement step re-encodes the entire sequence, so per-step compute scales poorly with length, and long-context generation becomes much slower than in AR systems.
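A minimal sketch of the contrast (illustrative single-head attention in NumPy, not any model's actual implementation; names like `W_q` and `d_model` are assumptions):

```python
# Why the K/V cache is valid under a causal mask but not under bidirectional refinement.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query against a stack of keys/values."""
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- AR decoding: K/V of earlier tokens never change, so we append once and reuse.
K_cache, V_cache = [], []
for step in range(5):
    x_t = rng.normal(size=d_model)           # embedding of the newest token
    K_cache.append(W_k @ x_t)                # computed once, reused at all later steps
    V_cache.append(W_v @ x_t)
    out = attend(W_q @ x_t, np.stack(K_cache), np.stack(V_cache))

# --- Masked diffusion: each denoising step may reveal tokens anywhere, and every
# position attends bidirectionally to the *current* sequence, so K/V for all
# positions are recomputed from scratch at every step.
seq = rng.normal(size=(16, d_model))         # current (partially unmasked) sequence
for step in range(5):
    K = seq @ W_k.T                           # no cache survives between steps
    V = seq @ W_v.T
    outs = np.stack([attend(W_q @ x_i, K, V) for x_i in seq])
    seq = seq + 0.1 * outs                    # stand-in for one refinement step
```

The AR loop appends one new K/V row per step, whereas the diffusion loop rebuilds K and V for all positions at every step; that recomputation is exactly what the block-wise scheme below tries to avoid.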
Workaround A — Block Diffusion
Idea. Decode the sequence block by block: attention is causal across blocks (each block sees only itself and earlier blocks), while diffusion with bidirectional attention runs within the current block. Since completed blocks never change, their K/V can be cached and reused. See: arXiv:2503.09573.
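A rough sketch of this decoding loop (my reading of the block-diffusion idea, not code from the paper; block size, step count, and the refinement rule are placeholders):

```python
# Blocks are generated left to right: K/V of completed blocks are frozen in a cache,
# while the block currently being denoised recomputes its own K/V at every step.
import numpy as np

rng = np.random.default_rng(0)
d, block_len, n_blocks, n_denoise_steps = 8, 4, 3, 5
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))

cached_K, cached_V = [], []                       # frozen K/V from finished blocks
for b in range(n_blocks):
    block = rng.normal(size=(block_len, d))       # fully masked block, to be denoised
    for step in range(n_denoise_steps):
        # Current block attends to the cached prefix plus itself (bidirectional within
        # the block), so only block_len K/V rows are recomputed per step.
        K = np.concatenate(cached_K + [block @ W_k.T]) if cached_K else block @ W_k.T
        V = np.concatenate(cached_V + [block @ W_v.T]) if cached_V else block @ W_v.T
        block = block + 0.1 * np.tanh(block @ K.T / np.sqrt(d)) @ V  # stand-in update
    # Block is finalized: its K/V join the cache and never change again.
    cached_K.append(block @ W_k.T)
    cached_V.append(block @ W_v.T)
```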
Pros:
- Enables KV caching for completed blocks: only the current block’s K/V are recomputed at each denoising step
- Delivers substantial speedups compared to fully bidirectional diffusion
Cons / Trade-offs: