Diffusion-Based LLMs: The Next Frontier in Language AI
The open-source large language model (LLM) landscape has long been dominated by autoregressive (AR) models such as Mistral 7B, Llama 3 8B, and Gemma 7B, all optimized for left-to-right generation. But a new challenger has emerged: diffusion-based LLMs like Dream 7B, LLaDA, and Mercury Coder. These models rewrite the rules of text generation by refining entire responses in parallel, promising faster speeds and novel capabilities. Let’s explore how these architectures differ and why diffusion models might reshape AI’s future.
How Autoregressive LLMs Work (And Why They Dominate)
Autoregressive models operate like skilled typists, predicting one token at a time based on the preceding context. When you ask an AR model to explain quantum physics, it:
Processes your prompt
Generates the first probable word (e.g., "Quantum")
Uses that word to predict the next ("mechanics")
Repeats until completion
Reference - https://hkunlp.github.io/blog/2025/dream/
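The four steps above can be sketched as a toy Python loop. The bigram table here is a stand-in for a real model's forward pass; everything else (names, the tiny vocabulary) is purely illustrative.

```python
# Minimal sketch of autoregressive decoding. The toy bigram table
# stands in for a real model's next-token distribution.
BIGRAMS = {
    "<s>": "Quantum",
    "Quantum": "mechanics",
    "mechanics": "describes",
    "describes": "nature",
    "nature": "</s>",
}

def next_token(context):
    # A real model would run a forward pass over the full context;
    # this toy only looks at the last token.
    return BIGRAMS.get(context[-1], "</s>")

def generate(prompt=("<s>",), max_len=10):
    tokens = list(prompt)
    for _ in range(max_len):
        tok = next_token(tokens)   # predict one token...
        if tok == "</s>":
            break
        tokens.append(tok)         # ...then feed it back as context
    return tokens[1:]

print(" ".join(generate()))
```

Note the strict dependency: token N cannot be produced until token N-1 exists, which is exactly what limits parallelism.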
This approach mirrors human writing but has inherent limitations. Just as you can’t edit paragraph one while writing paragraph three, AR models can’t revise earlier tokens once they are generated. This leads to issues like:
Error propagation: Early mistakes compound
Inefficiency: Sequential processing limits parallel computation
Context blindness: Difficulty revising global narrative structure
Yet AR models excel at fluency and coherence, especially when trained on massive datasets.
The Diffusion Revolution: Text as a Canvas
Diffusion models take inspiration from AI image generators like Stable Diffusion. Instead of painting pixels left-to-right, they:
Start with random "noise" (masked tokens)
Iteratively refine the entire text
Converge on coherent output
Reference - https://hkunlp.github.io/blog/2025/dream/
Dream 7B, for example, begins with a fully masked response to your query. At each step, it predicts all missing tokens simultaneously, allowing global optimization. This process resembles editing a draft:
First pass: "The [MASK] principle states that particles can [MASK] in two states."
Fifth pass: "The quantum principle states that particles can exist in two states."
Final output: "The superposition principle states that quantum particles can simultaneously exist in multiple states."
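The passes above can be mimicked with a toy mask-refinement loop. To keep it self-contained, the "model" is just a fixed target sentence with made-up per-position confidence scores; a real diffusion LM would predict every position jointly at each step and could also revise already-filled tokens.

```python
# Toy sketch of iterative unmasking. TARGET and CONFIDENCE are
# illustrative stand-ins for a model's joint predictions.
TARGET = ["The", "superposition", "principle", "states", "that",
          "quantum", "particles", "can", "exist", "in", "two", "states."]
CONFIDENCE = [0.9, 0.3, 0.8, 0.9, 0.9, 0.4, 0.7, 0.9, 0.5, 0.9, 0.4, 0.8]

def refine(threshold):
    # Reveal every position whose (toy) confidence clears the bar;
    # low-confidence positions stay masked until a later pass.
    return [t if c >= threshold else "[MASK]"
            for t, c in zip(TARGET, CONFIDENCE)]

draft = ["[MASK]"] * len(TARGET)        # start from pure "noise"
for threshold in (0.8, 0.6, 0.4, 0.0):  # lower the bar each pass
    draft = refine(threshold)
    print(" ".join(draft))
```

Early passes commit only to high-confidence tokens, mirroring how the first pass above leaves "[MASK]" where the model is still unsure.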
Autoregressive (AR) models generate text left to right, producing each token one after another. This makes decoding hard to parallelize, since the model can’t generate multiple parts of the text at once. Error correction is limited: once a mistake is made early in the sequence, it can affect everything that follows. AR models also require huge datasets to achieve high performance.
Diffusion models, on the other hand, use a global refinement process. They can update and improve the entire text over multiple passes, which allows much higher parallelism at inference time. Error correction is built into the process, as the model can revise earlier mistakes during each refinement step. Some early reports also suggest diffusion LLMs can reach competitive quality with less training data than AR models, though this claim is still being validated.
Case Study: Mercury Coder vs. Mistral 7B
Inception Labs’ Mercury Coder showcases diffusion’s potential. When tasked with writing Python code to split an image, it:
Generates a rough skeleton in 0.2s
Refines variable names and logic over 14 steps
Produces working code in 1.1s total
By comparison, Mistral 7B takes 3.4s for sequential generation. Early benchmarks suggest diffusion models achieve 5-10x speedups on long-form tasks, though AR still leads on short responses.
Reference - https://www.inceptionlabs.ai/introducing-mercury
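The latency gap comes down to what each architecture pays per pass: AR cost scales with token count, diffusion cost with refinement steps. A back-of-the-envelope model using the figures quoted above (the 200-token output length is an assumption, not a measured value):

```python
# Back-of-the-envelope latency model for the case study above.
# The token count is an illustrative assumption, not a benchmark.
tokens = 200                          # assumed length of the code snippet
ar_per_token = 3.4 / tokens           # Mistral 7B: one pass per token

steps = 14                            # Mercury Coder: refinement passes
diff_per_step = (1.1 - 0.2) / steps   # each pass updates all tokens at once

ar_total = tokens * ar_per_token           # recovers the 3.4s figure
diff_total = 0.2 + steps * diff_per_step   # recovers the 1.1s figure
print(f"AR: {ar_total:.1f}s  diffusion: {diff_total:.1f}s  "
      f"speedup: {ar_total / diff_total:.1f}x")
```

The key point: doubling the output length roughly doubles AR latency, while a diffusion model can keep the number of refinement passes fixed.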
The Road Ahead: Hybrid Models and New Possibilities
While diffusion models still have room to improve on basic grammar and fluency, their architectural advantages are compelling. New hybrid approaches are emerging that combine AR and diffusion techniques to get the best of both worlds. Early results show:
Better mathematical reasoning
Faster long-context processing
Improved factual consistency
Diffusion allows models to "think before speaking," breaking free from the step-by-step myopia of AR models. This could enable:
Multi-step reasoning: Solve complex problems iteratively
Controlled generation: Precisely steer output style/content
Post-editing: Revise existing text non-destructively
Conclusion: A Paradigm Shift in Progress
Autoregressive models won’t disappear: their fluency and scalability ensure continued dominance for chatbots and search. But diffusion’s emergence signals a critical evolution. Just as transformers replaced RNNs, we’re entering an era where global text optimization unlocks new AI capabilities. For developers, this means:
Experiment with both architectures: Use AR models for chat, diffusion models for reasoning
Watch for hybrid systems: These may blend the best of both worlds
Rethink pipelines: Diffusion enables batch processing of long documents
The future isn’t AR vs. diffusion, it’s creative synthesis. As these paradigms converge, we’ll see LLMs that write like humans and edit like seasoned publishers.