Author: Shuibai Zhang Date: Oct 7, 2025
Inherent Limitations of Diffusion LLMs (DLLMs)
Claim. Despite their popularity, DLLMs face two fundamental structural challenges that limit their scalability compared to autoregressive (AR) LLMs:
- Unavailable KV caching due to bidirectional attention, which significantly reduces generation speed.
- Intractable sequence likelihood, making post-training and RLHF-style optimization difficult.
This blog examines why these limitations exist, explores current workarounds, and discusses their associated trade-offs.
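To make the second limitation concrete, contrast the two objectives schematically (my own summary, not a formula from a specific paper):

$$
\log p_\theta^{\text{AR}}(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),
\qquad
\log p_\theta^{\text{DLLM}}(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q(z \mid x)} \right],
$$

where $z$ denotes the intermediate (partially masked) latent sequences of the diffusion process. The AR side is an exact, cheap-to-evaluate log-likelihood; the DLLM side is only a variational lower bound, because the exact likelihood marginalizes over all masking trajectories. RLHF-style objectives that need per-sequence log-probabilities therefore have nothing exact to work with.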
KV Cache Works in AR but Fails in DLLMs
KV cache recap. During generation, Transformers compute per-token Keys and Values; caching them means previously generated tokens never have to be re-projected (or re-run through the network) at later decoding steps.
- AR models. Use a fixed causal mask (token i only attends to tokens ≤ i). Once a token’s K/V are computed, they never change as the model generates more tokens, so they can be reused (cached) efficiently across steps.
- DLLMs (masked diffusion / bidirectional attention). Tokens attend both left and right, and the effective attention pattern changes as tokens are unmasked/refined across diffusion steps. Whenever a token is revealed or a mask changes, the context seen by every other position changes, so K/V across the whole sequence must be recomputed. This breaks standard KV caching.
Consequence. Without KV caching, every refinement step re-encodes the entire sequence, so per-step compute scales poorly with length, and long-context generation becomes much slower than in AR systems.
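A minimal sketch of the contrast (illustrative single-head attention in NumPy, not any model's actual implementation; names like `W_q` and `d_model` are assumptions):

```python
# Why the K/V cache is valid under a causal mask but not under bidirectional refinement.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query against a stack of keys/values."""
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- AR decoding: K/V of earlier tokens never change, so we append once and reuse.
K_cache, V_cache = [], []
for step in range(5):
    x_t = rng.normal(size=d_model)           # embedding of the newest token
    K_cache.append(W_k @ x_t)                # computed once, reused at all later steps
    V_cache.append(W_v @ x_t)
    out = attend(W_q @ x_t, np.stack(K_cache), np.stack(V_cache))

# --- Masked diffusion: each denoising step may reveal tokens anywhere, and every
# position attends bidirectionally to the *current* sequence, so K/V for all
# positions are recomputed from scratch at every step.
seq = rng.normal(size=(16, d_model))         # current (partially unmasked) sequence
for step in range(5):
    K = seq @ W_k.T                           # no cache survives between steps
    V = seq @ W_v.T
    outs = np.stack([attend(W_q @ x_i, K, V) for x_i in seq])
    seq = seq + 0.1 * outs                    # stand-in for one refinement step
```

The AR loop appends one new K/V row per step, whereas the diffusion loop rebuilds K and V for all positions at every step; that recomputation is exactly what the block-wise scheme below tries to avoid.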
Workaround A — Block Diffusion
Idea. Decode the sequence block by block: attention is causal across blocks (each block sees only itself and earlier blocks), while diffusion with bidirectional attention runs within the current block. Since completed blocks never change, their K/V can be cached and reused. See: arXiv:2503.09573.
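A rough sketch of this decoding loop (my reading of the block-diffusion idea, not code from the paper; block size, step count, and the refinement rule are placeholders):

```python
# Blocks are generated left to right: K/V of completed blocks are frozen in a cache,
# while the block currently being denoised recomputes its own K/V at every step.
import numpy as np

rng = np.random.default_rng(0)
d, block_len, n_blocks, n_denoise_steps = 8, 4, 3, 5
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))

cached_K, cached_V = [], []                       # frozen K/V from finished blocks
for b in range(n_blocks):
    block = rng.normal(size=(block_len, d))       # fully masked block, to be denoised
    for step in range(n_denoise_steps):
        # Current block attends to the cached prefix plus itself (bidirectional within
        # the block), so only block_len K/V rows are recomputed per step.
        K = np.concatenate(cached_K + [block @ W_k.T]) if cached_K else block @ W_k.T
        V = np.concatenate(cached_V + [block @ W_v.T]) if cached_V else block @ W_v.T
        block = block + 0.1 * np.tanh(block @ K.T / np.sqrt(d)) @ V  # stand-in update
    # Block is finalized: its K/V join the cache and never change again.
    cached_K.append(block @ W_k.T)
    cached_V.append(block @ W_v.T)
```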
Pros:
- Enables KV caching for completed blocks: only the current block’s K/V are recomputed at each denoising step
- Delivers substantial speedups compared to fully bidirectional diffusion
Cons / Trade-offs: