Team: Zirui Wu, Lin Zheng*, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Yansong Feng, Zhenguo Li, Victoria W., Guorui Zhou, Lingpeng Kong*

*: Equal Contribution

Affiliations: The University of Hong Kong, Kuaishou Technology, Huawei Noah's Ark Lab, Peking University

🤗 Huggingface 💻 Codebase

<aside> 📌

In this post, we introduce a simple yet effective method for diffusion language models to perform variable-length code infilling. Our approach features dynamic expansion and contraction of mask tokens during inference, enabling flexible length control without a predetermined canvas size.

</aside>

Effective Variable-length Generation for Infilling

Although Diffusion Language Models (DLMs) have recently gained significant attention, they face a critical limitation: they require a fixed-size canvas to be specified in advance, making variable-length generation a long-standing and difficult problem. This restriction arises from standard discrete diffusion formulations, which merely transition tokens between states in place over a canvas of predetermined size.

This limitation makes it challenging for DLMs to handle flexible generation in real-world applications such as infilling, where the length of the missing content would have to be specified a priori. To illustrate, we evaluate Dream-Coder-7B on code infilling tasks, where the model is asked to fill in the missing span given a prefix and suffix context. When the given mask length does not match the length of the canonical solution, the model struggles to infill the code, and pass@1 drops by 38% compared with oracle-length performance.

In this work, we present DreamOn (Diffusion Reasoning Model with Length Control), a novel discrete diffusion algorithm designed to address the variable-length generation challenge in code infilling. Our approach enables dynamic expansion and contraction of mask tokens during inference, providing flexible length control without requiring predetermined canvas sizes.

With too few masked tokens, diffusion models lack sufficient room for meaningful code infilling.

Too many masked tokens cause overgeneration of an unnecessary code snippet (depth > 0) that is incorrect.

DreamOn adds mask tokens as needed.

DreamOn deletes excess mask tokens.

We believe that enabling variable-length sequence generation opens new avenues for DLMs, unlocking their potential for more sophisticated applications including adaptive prompting, flexible infilling, and seamless editing workflows, particularly in programming contexts where content length is inherently unpredictable.

DreamOn: Masked Diffusion with Augmented States

DreamOn extends standard masked diffusion models by introducing two special states, <|expand|> and <|delete|>, to enable precise length control. We define them so that in the forward diffusion process, tokens in both <|expand|> and <|delete|> always transition to <|mask|>; during the backward process, <|expand|> is deterministically expanded into two <|mask|> tokens at the same position, while <|delete|> is removed from the sequence. This design allows the model to dynamically adjust the sequence length.
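
To make these backward-process rules concrete, here is a minimal Python sketch, not our exact released implementation, of how a sequence could be post-processed after each denoising step; the function name and token strings are illustrative.

```python
MASK, EXPAND, DELETE = "<|mask|>", "<|expand|>", "<|delete|>"

def adjust_length(tokens: list[str]) -> list[str]:
    """Apply the length-adjustment rules after one denoising step:
    <|expand|> unrolls into two <|mask|> tokens at the same position,
    while <|delete|> is removed from the sequence."""
    out = []
    for tok in tokens:
        if tok == EXPAND:
            out.extend([MASK, MASK])  # expand: one slot becomes two mask tokens
        elif tok == DELETE:
            continue                  # delete: drop this slot entirely
        else:
            out.append(tok)           # regular tokens and masks pass through
    return out

# Example: a partially denoised canvas with one <|expand|> and one <|delete|>
print(adjust_length(["return", EXPAND, "+", DELETE, MASK]))
# ['return', '<|mask|>', '<|mask|>', '+', '<|mask|>']
```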

To train the model with these special states, we construct an auxiliary sequence $\bold{z}_0$ from each original sequence $\bold{x}_0$ by 1) randomly merging token spans in $\bold{x}_0$ into <|expand|>, and 2) inserting a random number of tokens with the <|delete|> state. As illustrated in the diagram below, $\bold{z}_0$ typically differs in length from $\bold{x}_0$. We then train the masked diffusion model on $\bold{z}_0$ instead of diffusing over $\bold{x}_0$; by doing so, the model learns to recover not only regular tokens but also the special states from <|mask|>, achieving effective variable-length generation.

Diagram: constructing the augmented sequence $\bold{z}_0$ from $\bold{x}_0$ with the <|expand|> and <|delete|> states.
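
For illustration, the sketch below shows one way such an augmented sequence could be constructed; the two-token merge spans, the merge probability, and the number of inserted <|delete|> tokens are illustrative assumptions rather than our exact training hyperparameters.

```python
import random

MASK, EXPAND, DELETE = "<|mask|>", "<|expand|>", "<|delete|>"

def build_augmented_sequence(x0, merge_prob=0.15, max_deletes=3):
    """Construct an auxiliary sequence z_0 from an original sequence x_0 by
    1) randomly merging token spans of x_0 into a single <|expand|> token, and
    2) inserting a random number of <|delete|> tokens at random positions."""
    z0, i = [], 0
    while i < len(x0):
        # Merge a two-token span (each <|expand|> later unrolls back into
        # two <|mask|> tokens during the backward process).
        if i + 1 < len(x0) and random.random() < merge_prob:
            z0.append(EXPAND)
            i += 2
        else:
            z0.append(x0[i])
            i += 1
    # Insert extra <|delete|> tokens so the model learns to shrink the canvas.
    for _ in range(random.randint(0, max_deletes)):
        z0.insert(random.randrange(len(z0) + 1), DELETE)
    return z0

# Example: z_0 typically differs in length from x_0
x0 = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(build_augmented_sequence(x0))
```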

Implementation

Similar to <|mask|> in masked diffusion models, we define the introduced states <|expand|> and <|delete|> as special sentinel tokens in the tokenizer vocabulary, and train the model to denoise them just as if they were regular tokens. This formulation is appealing due to its ease of implementation — requiring no changes to the model architecture and supporting straightforward fine-tuning from pretrained masked diffusion models.
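
Concretely, with a Hugging Face tokenizer this amounts to registering the two sentinel tokens and resizing the token embeddings; the checkpoint path below is a placeholder for a pretrained masked diffusion model, not a specific released checkpoint.

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder path; substitute the pretrained masked diffusion checkpoint.
ckpt = "path/to/pretrained-masked-diffusion-model"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# <|mask|> already exists as a sentinel token in masked diffusion models;
# <|expand|> and <|delete|> are registered as additional special tokens.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|expand|>", "<|delete|>"]}
)

# Grow the embedding matrix to cover the newly added sentinel tokens.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```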

Training