S9 | Scaling Discrete Diffusion Language Models
Dimitri von Rütte (ETH) and Zhihan Yang (Cornell) present two papers on scaling laws of discrete diffusion LLMs that challenge the dominance of Masked Diffusion.
Dimitri von Rütte (ETH) and Zhihan Yang (Cornell) present "Scaling Behavior of Discrete Diffusion Language Models" (https://arxiv.org/abs/2512.10858) and "Scaling Beyond Masked Diffusion Language Models" (https://www.arxiv.org/abs/2602.15014), two recent papers that establish systematic scaling laws for uniform-state and hybrid discrete diffusion LLMs. Importantly, both challenge the dominance of Masked Diffusion.
S8 | The Diffusion Duality
Today, Subham Sahoo (IFM), Justin Deschenaux (EPFL), and Zhihan Yang (Cornell) present "The Diffusion Duality" (ICML 2025).
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude.
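The key insight above, that a uniform-state discrete process "naturally emerges from an underlying Gaussian diffusion", can be illustrated with a toy simulation (this is an illustrative sketch, not Duo's actual algorithm): add Gaussian noise to a one-hot token embedding and read the token back out with an argmax. At zero noise the original token survives; at high noise the readout approaches a uniform draw over the vocabulary.

```python
import numpy as np

# Toy illustration (not the paper's algorithm): Gaussian noise on a one-hot
# token embedding, read out via argmax, induces a uniform-state discrete
# corruption process. Higher noise pushes the argmax token toward uniform.

rng = np.random.default_rng(0)
V = 8                   # toy vocabulary size
token = 3               # original token id
x0 = np.eye(V)[token]   # one-hot embedding of the token

def argmax_token_dist(sigma, n=20000):
    """Empirical distribution of argmax(x0 + sigma * noise) over n draws."""
    noisy = x0 + sigma * rng.standard_normal((n, V))
    return np.bincount(noisy.argmax(axis=1), minlength=V) / n

# No noise: the readout always recovers the original token.
assert argmax_token_dist(0.0)[token] == 1.0

# Heavy noise: the readout is close to uniform (probability ~ 1/V per token).
p = argmax_token_dist(50.0)
print(p[token])  # ~ 1/8
```

Sweeping `sigma` between these extremes traces out the discrete corruption schedule induced by the continuous one, which is the correspondence Duo exploits for curriculum learning and distillation.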
S7 | Planned Diffusion
Daniel Israel and Tian Jin discuss Planned Diffusion. Planned diffusion speeds up text generation by planning with an autoregressive model and then generating multiple spans in parallel with diffusion while keeping quality nearly the same.
Daniel Israel and Tian Jin discuss planned diffusion, a hybrid text generation method where a language model first creates a short autoregressive “plan” that splits output into independent spans, then generates those spans in parallel with diffusion, achieving significantly faster generation while maintaining near-autoregressive quality.
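The plan-then-fill control flow described above can be sketched in a few lines. This is a minimal sketch of the idea only: `make_plan` and `diffusion_fill` are hypothetical stand-ins for an autoregressive planning pass and per-span diffusion decoding, respectively.

```python
from concurrent.futures import ThreadPoolExecutor

def make_plan(prompt):
    # Stand-in for the short AR "plan": split the answer into span prompts.
    return [f"{prompt} [span {i}]" for i in range(3)]

def diffusion_fill(span_prompt):
    # Stand-in for diffusion decoding of one independent span.
    return span_prompt.upper()

def planned_generate(prompt):
    plan = make_plan(prompt)             # sequential planning pass
    with ThreadPoolExecutor() as pool:   # spans denoised in parallel
        spans = list(pool.map(diffusion_fill, plan))
    return " ".join(spans)               # stitch spans back in plan order

print(planned_generate("hello"))
```

Because `pool.map` preserves input order, the stitched output follows the plan even though the spans are filled concurrently, which is where the speedup over token-by-token decoding comes from.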
S6 | TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu will discuss TiDAR, a hybrid decoding approach that combines diffusion-style parallel drafting with autoregressive verification for high quality and high throughput.
S5 | Esoteric Language Models
In this talk, Zhihan Yang presents Eso-LMs, a model family that unifies AR and diffusion language models, enabling exact likelihoods and KV caching while preserving parallel generation.
In this session, Zhihan Yang presents Eso-LMs, a new family of language models that unifies autoregressive and diffusion-based generation. The talk reframes masked diffusion models as any-order autoregressive models and shows how a causal attention design can overcome long-standing limitations of diffusion LMs. This perspective enables, for the first time, exact likelihood computation for masked diffusion models and introduces KV caching at inference time, while preserving their ability to generate tokens in parallel. Combined with an optimized sampling schedule, Eso-LMs achieve a new state of the art on the speed–quality Pareto frontier for unconditional text generation, delivering 14 to 65× faster inference on long contexts compared to standard diffusion models and 3 to 4× faster inference than prior semi-autoregressive approaches. The presentation highlights both the theoretical insights behind the model design and the practical gains in inference efficiency.
S4 | DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
In this talk, Shansan Gong will present DiffuCoder and discuss how diffusion language models enable global planning and iterative refinement for code generation.
In this session, Shansan Gong presents DiffuCoder, a 7B diffusion large language model for code generation. The talk explores how diffusion LLMs differ from autoregressive models in decoding behavior, showing flexible causality and a temperature-controlled generation order, and introduces coupled-GRPO, a diffusion-native RL method that improves code performance by +4.4% on EvalPlus while reducing AR bias.
S3 | OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
In this talk, John Nguyen presents OneFlow, a non-autoregressive multimodal model for concurrent text and image generation.
In our third session, John Nguyen presents OneFlow, a non-autoregressive multimodal model for concurrent text and image generation. OneFlow combines insertion-based edits (EditFlow) with flow matching for images, outperforming autoregressive and diffusion baselines while using up to 50% fewer training FLOPs.
S2 | Peptune: De Novo Generation of Therapeutic Peptides with Guided Discrete Diffusion
In this talk, Sophia Tang shows how discrete diffusion enables more controllable and efficient molecule generation.
Discrete diffusion models provide much greater control over the generation process, making them a strong alternative to autoregressive models. In this talk, Sophia Tang (University of Pennsylvania) presents her paper, “PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion,” explaining why and showing how discrete diffusion enables more controllable and efficient molecule generation.
S1 | Diffusion Language Models Beat AR in the Data-Constrained Regime
In this talk, Mihir Prabhudesai shows that diffusion LLMs excel in data-constrained settings by extracting more information from limited data.
The secret behind the success of current LLMs lies in the massive amount of data they are trained on. However, as these models continue to scale, we are rapidly approaching a data-constrained regime—where models demand more data than is available; after all, we only have one internet. In this talk, Mihir Prabhudesai shows that diffusion LLMs excel in such settings by extracting more information from limited data.