Author: Shuibai Zhang  Date: Oct 7, 2025

<aside> 💡

TL;DR: Compared with autoregressive LLMs, diffusion LLMs cannot natively reuse a KV cache (their attention is bidirectional) and do not expose a tractable sequence likelihood, which slows generation and complicates post-training.

</aside>

Inherent Limitations of Diffusion LLMs (DLLMs)

Claim. Despite their popularity, DLLMs face two fundamental structural challenges that limit their scalability compared to autoregressive (AR) LLMs:

  1. No native KV caching: attention is bidirectional, so previously computed Keys/Values cannot simply be reused, which makes generation substantially slower.
  2. Intractable sequence likelihood: only a variational bound on the log-likelihood is available, which makes post-training and RLHF-style optimization difficult.

This blog examines why these limitations exist, explores current workarounds, and discusses their associated trade-offs.
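To make the second limitation concrete, compare the training objectives. The formulas below are a generic sketch (the notation is mine, not taken from any specific DLLM paper): an AR model exposes an exact, tractable log-likelihood via the chain rule, while a diffusion LLM is trained on a variational lower bound (ELBO) over latent noisy sequences, so the exact sequence likelihood that RLHF-style objectives want to score is not directly available.

```latex
% Exact AR log-likelihood, factorized token by token:
\log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)

% Diffusion LLM: only a variational lower bound on the data log-likelihood,
% with latent noisy sequences x_1, \dots, x_S and forward (noising) process q:
\log p_\theta(x_0) \;\ge\;
  \mathbb{E}_{q(x_{1:S} \mid x_0)}
  \left[ \log \frac{p_\theta(x_{0:S})}{q(x_{1:S} \mid x_0)} \right]
```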

KV Cache Works in AR but Fails in DLLMs

KV cache recap. During generation, a Transformer computes Keys and Values for every token; caching them means each decoding step only computes attention for the newly generated token against the stored K/V, instead of re-running attention over the whole prefix.

Why it fails in DLLMs. Causal attention in an AR model guarantees that a past token's K/V never change once computed, so the cache is append-only and always valid. A DLLM uses bidirectional attention and refines the entire sequence over many denoising steps: any token can change at any step, so previously computed K/V go stale and must be recomputed.

Consequence. Without KV caching, every denoising step recomputes attention over the full sequence, so per-step cost grows quadratically with length rather than linearly as in cached AR decoding, and long-context generation becomes much slower than in AR systems.
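A minimal single-head NumPy sketch of the contrast (projection weights, shapes, and the token-update rule are placeholder assumptions, not any particular model): in the AR loop the K/V cache only ever grows and old rows stay valid, while in the diffusion loop every denoising step recomputes K/V for the whole sequence because any token may have changed.

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention over whichever K/V rows are passed in."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# --- AR decoding: past tokens never change, so cached K/V rows stay valid ---
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for _ in range(4):
    x_new = rng.standard_normal(d)                 # embedding of the newly generated token
    K_cache = np.vstack([K_cache, x_new @ Wk])     # append-only cache update
    V_cache = np.vstack([V_cache, x_new @ Wv])
    out = attention((x_new @ Wq)[None, :], K_cache, V_cache)  # attend to past + self only

# --- DLLM denoising: any token may change each step, so cached K/V go stale ---
X = rng.standard_normal((16, d))                   # noisy / partially masked sequence
for _ in range(4):
    K, V = X @ Wk, X @ Wv                          # must be recomputed from scratch
    X = attention(X @ Wq, K, V)                    # bidirectional attention over all tokens
```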

Workaround A — Block Diffusion

Idea. Decode the sequence block by block: blocks are generated left to right with causal attention across blocks, while denoising within the current block stays bidirectional. Because earlier blocks are already finalized, their K/V can be cached and reused while decoding the current block. See: arXiv:2503.09573.
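A rough NumPy sketch of this decoding pattern (block size, number of denoising steps, weights, and the update rule are illustrative assumptions, not the interface of arXiv:2503.09573): finished blocks contribute cached K/V that the current block attends to, while K/V inside the current block are recomputed at each denoising step.

```python
import numpy as np

d, block_len, n_blocks, denoise_steps = 8, 4, 3, 5
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for b in range(n_blocks):
    X = rng.standard_normal((block_len, d))        # noisy / masked tokens of the current block
    for _ in range(denoise_steps):
        K_blk, V_blk = X @ Wk, X @ Wv              # recomputed only for the current block
        K = np.vstack([K_cache, K_blk])            # earlier blocks come from the cache
        V = np.vstack([V_cache, V_blk])
        X = attention(X @ Wq, K, V)                # bidirectional within block, causal to past blocks
    K_cache = np.vstack([K_cache, X @ Wk])         # block finalized: its K/V join the cache
    V_cache = np.vstack([V_cache, X @ Wv])
```

Note how the cache is extended only after a block is finalized; that is what keeps the reused K/V valid under block-level causality.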

Pros:

  - Earlier blocks provide a valid, append-only K/V cache, recovering much of the AR decoding speedup for long contexts.
  - Tokens within a block can still be denoised in parallel, preserving part of diffusion's parallel-generation appeal.

Cons / Trade-offs:

  - Attention within the current block remains bidirectional, so there is no caching inside a block; every denoising step recomputes that block's K/V.
  - Each block still requires multiple denoising steps, so decoding does not collapse to one forward pass per token as in AR models.
  - Block size becomes a new knob: small blocks approach AR behavior and give up parallel generation, while large blocks shrink the cached fraction and the speed benefit.