S9 | Scaling Discrete Diffusion Language Models
Dimitri von Rütte (ETH) and Zhihan Yang (Cornell) present two papers on scaling laws of discrete diffusion LLMs that challenge the dominance of Masked Diffusion.
Dimitri von Rütte (ETH) and Zhihan Yang (Cornell) present "Scaling Behavior of Discrete Diffusion Language Models" (https://arxiv.org/abs/2512.10858) and "Scaling Beyond Masked Diffusion Language Models" (https://www.arxiv.org/abs/2602.15014), two recent papers that establish systematic scaling laws for uniform-state and hybrid discrete diffusion LLMs. Importantly, both challenge the dominance of Masked Diffusion.
S8 | The Diffusion Duality
Today, Subham Sahoo (IFM), Justin Deschenaux (EPFL), and Zhihan Yang (Cornell) present "The Diffusion Duality" (ICML 2025).
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude.
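The key insight above, that a uniform-state discrete process "naturally emerges from an underlying Gaussian diffusion", can be illustrated with a toy simulation (this is an illustrative sketch, not Duo's actual algorithm): add Gaussian noise to a one-hot token embedding and read the token back out with an argmax. At zero noise the original token survives; at high noise the readout approaches a uniform draw over the vocabulary.

```python
import numpy as np

# Toy illustration (not the paper's algorithm): Gaussian noise on a one-hot
# token embedding, read out via argmax, induces a uniform-state discrete
# corruption process. Higher noise pushes the argmax token toward uniform.

rng = np.random.default_rng(0)
V = 8                   # toy vocabulary size
token = 3               # original token id
x0 = np.eye(V)[token]   # one-hot embedding of the token

def argmax_token_dist(sigma, n=20000):
    """Empirical distribution of argmax(x0 + sigma * noise) over n draws."""
    noisy = x0 + sigma * rng.standard_normal((n, V))
    return np.bincount(noisy.argmax(axis=1), minlength=V) / n

# No noise: the readout always recovers the original token.
assert argmax_token_dist(0.0)[token] == 1.0

# Heavy noise: the readout is close to uniform (probability ~ 1/V per token).
p = argmax_token_dist(50.0)
print(p[token])  # ~ 1/8
```

Sweeping `sigma` between these extremes traces out the discrete corruption schedule induced by the continuous one, which is the correspondence Duo exploits for curriculum learning and distillation.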
S7 | Planned Diffusion
Daniel Israel and Tian Jin discuss Planned Diffusion. Planned diffusion speeds up text generation by planning with an autoregressive model and then generating multiple spans in parallel with diffusion while keeping quality nearly the same.
Daniel Israel and Tian Jin discuss planned diffusion, a hybrid text generation method where a language model first creates a short autoregressive “plan” that splits output into independent spans, then generates those spans in parallel with diffusion, achieving significantly faster generation while maintaining near-autoregressive quality.
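The plan-then-fill control flow described above can be sketched in a few lines. This is a minimal sketch of the idea only: `make_plan` and `diffusion_fill` are hypothetical stand-ins for an autoregressive planning pass and per-span diffusion decoding, respectively.

```python
from concurrent.futures import ThreadPoolExecutor

def make_plan(prompt):
    # Stand-in for the short AR "plan": split the answer into span prompts.
    return [f"{prompt} [span {i}]" for i in range(3)]

def diffusion_fill(span_prompt):
    # Stand-in for diffusion decoding of one independent span.
    return span_prompt.upper()

def planned_generate(prompt):
    plan = make_plan(prompt)             # sequential planning pass
    with ThreadPoolExecutor() as pool:   # spans denoised in parallel
        spans = list(pool.map(diffusion_fill, plan))
    return " ".join(spans)               # stitch spans back in plan order

print(planned_generate("hello"))
```

Because `pool.map` preserves input order, the stitched output follows the plan even though the spans are filled concurrently, which is where the speedup over token-by-token decoding comes from.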
S6 | TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu will discuss TiDAR, a hybrid decoding approach that combines diffusion-style parallel drafting with autoregressive verification for high quality and high throughput.
S5 | Esoteric Language Models
In this talk, Zhihan Yang presents Eso-LMs, a model family that unifies AR and diffusion language models, enabling exact likelihoods and KV caching while preserving parallel generation.
In this session, Zhihan Yang presents Eso-LMs, a new family of language models that unifies autoregressive and diffusion-based generation. The talk reframes masked diffusion models as any-order autoregressive models and shows how a causal attention design can overcome long-standing limitations of diffusion LMs. This perspective enables, for the first time, exact likelihood computation for masked diffusion models and introduces KV caching at inference time, while preserving their ability to generate tokens in parallel. Combined with an optimized sampling schedule, Eso-LMs achieve a new state of the art on the speed–quality Pareto frontier for unconditional text generation, delivering 14 to 65× faster inference on long contexts compared to standard diffusion models and 3 to 4× faster inference than prior semi-autoregressive approaches. The presentation highlights both the theoretical insights behind the model design and the practical gains in inference efficiency.
S4 | DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
In this talk, Shansan Gong will present DiffuCoder and discuss how diffusion language models enable global planning and iterative refinement for code generation.
In this session, Shansan Gong presents DiffuCoder, a 7B diffusion large language model for code generation. The talk explores how diffusion LLMs differ from autoregressive models in decoding behavior, showing flexible causality and a temperature-controlled generation order, and introduces coupled-GRPO, a diffusion-native RL method that improves code performance by +4.4% on EvalPlus while reducing AR bias.
S3 | OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
In this talk, John Nguyen presents OneFlow, a non-autoregressive multimodal model for concurrent text and image generation.
In our third session, John Nguyen presents OneFlow, a non-autoregressive multimodal model for concurrent text and image generation. OneFlow combines insertion-based edits (EditFlow) with flow matching for images, outperforming autoregressive and diffusion baselines while using up to 50% fewer training FLOPs.
S2 | Peptune: De Novo Generation of Therapeutic Peptides with Guided Discrete Diffusion
In this talk, Sophia Tang shows how discrete diffusion enables more controllable and efficient molecule generation.
Discrete diffusion models provide much greater control over the generation process, making them a strong alternative to autoregressive models. In this talk, Sophia Tang (University of Pennsylvania) presents her paper, “PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion,” explaining why and showing how discrete diffusion enables more controllable and efficient molecule generation.
S1 | Diffusion Language Models Beat AR in the Data-Constrained Regime
In this talk, Mihir Prabhudesai shows that diffusion LLMs excel in data-constrained settings by extracting more information from limited data.
The secret behind the success of current LLMs lies in the massive amount of data they are trained on. However, as these models continue to scale, we are rapidly approaching a data-constrained regime—where models demand more data than is available; after all, we only have one internet. In this talk, Mihir Prabhudesai shows that diffusion LLMs excel in such settings by extracting more information from limited data.