Digital Event Horizon
NVIDIA's latest innovation, Nemotron-Labs Diffusion, promises to revolutionize text generation with diffusion language models that improve on the performance, accuracy, and efficiency of traditional autoregressive models. Learn more about this new development and how it can transform your applications.
NVIDIA's Nemotron-Labs Diffusion is a new family of diffusion language models that promises to revolutionize text generation with improved performance, accuracy, and efficiency.
The model builds on previous work on Efficient-DLM and adds diffusion capabilities to an existing autoregressive (AR) model, so a single family of models covers both autoregressive and diffusion generation.
Nemotron-Labs Diffusion includes text models at 3B, 8B, and 14B scales, with an 8B-scale vision-language model available under the NVIDIA Source Code License.
The model operates in three generation modes: autoregressive mode, diffusion mode, and self-speculation mode, offering flexibility and performance advantages.
Nemotron-Labs Diffusion achieves improved average accuracy and inference speed compared to traditional AR models, with deployment support planned for SGLang and inference currently available through a GitHub issue tracker request.
NVIDIA has announced Nemotron-Labs Diffusion, a new family of diffusion language models that promises significant improvements in performance, accuracy, and efficiency over traditional autoregressive (AR) models.
Nemotron-Labs Diffusion builds on earlier diffusion language models, which have historically been held back by lower accuracy, limited compatibility with KV caching, and more difficult training. Recent work on Efficient-DLM, however, has shown that pretrained AR models can be converted into diffusion language models through continued pretraining and a switch to block-wise attention.
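To make "block-wise attention" concrete, here is a minimal sketch of one common block-wise pattern: a token attends to every token in its own block and to all earlier blocks, but not to later blocks. This pattern is an assumption used for illustration; the article does not spell out the exact mask used by Efficient-DLM or Nemotron-Labs Diffusion, and the function name `block_causal_mask` is hypothetical.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Build a block-wise attention mask (illustrative assumption, not NVIDIA's code).

    mask[q][k] is True when query position q may attend to key position k:
    attention is bidirectional inside a block and causal across blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    # Print a small mask: 1 = may attend, 0 = masked out.
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because earlier blocks never look at later ones under such a mask, their keys and values would not change once a block is finished, which is what makes this kind of pattern compatible with KV caching.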
Nemotron-Labs Diffusion takes this practical insight further by adding diffusion capabilities to an existing AR model. The model was trained with a joint AR and diffusion objective, so it retains what it learned during its initial AR training while gaining a parallel drafting capability from diffusion. This approach yields a single family of models that handles both autoregressive and diffusion generation, without the need for separate model families.
The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, all available under the commercially friendly NVIDIA Nemotron Open Model License. An 8B-scale vision-language model (VLM) is also available under the NVIDIA Source Code License, granting broad research flexibility.
One of the key features of Nemotron-Labs Diffusion is its ability to operate in three different generation modes: autoregressive mode, diffusion mode, and self-speculation mode. The autoregressive mode runs like a standard left-to-right LLM, while the diffusion mode generates block by block, refining the tokens in each block over multiple steps. Self-speculation mode uses diffusion to draft multiple candidate tokens, then uses autoregressive decoding to verify them.
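The draft-then-verify idea behind self-speculation mode can be sketched in a few lines. This is a toy illustration under stated assumptions, not NVIDIA's implementation: the names `self_speculate_step`, `toy_draft_block`, and `toy_ar_next_token` are hypothetical, verification here uses greedy exact matching, and the verifier is called once per token for clarity, whereas a real system would typically check a whole draft in a single forward pass.

```python
from typing import Callable, List


def self_speculate_step(
    prefix: List[int],
    draft_block: Callable[[List[int], int], List[int]],
    ar_next_token: Callable[[List[int]], int],
    block_size: int = 4,
) -> List[int]:
    """One draft-and-verify step; returns the tokens accepted in this step."""
    draft = draft_block(prefix, block_size)  # stand-in for the diffusion draft
    accepted: List[int] = []
    for token in draft:
        expected = ar_next_token(prefix + accepted)  # stand-in for AR verification
        if token != expected:
            # First mismatch: keep the verifier's token and discard the rest.
            accepted.append(expected)
            break
        accepted.append(token)
    return accepted


def toy_ar_next_token(seq: List[int]) -> int:
    """Toy verifier: the 'correct' continuation simply counts upward."""
    return (seq[-1] + 1) % 100


def toy_draft_block(seq: List[int], n: int) -> List[int]:
    """Toy drafter: counts upward but deliberately gets the third token wrong."""
    draft = [(seq[-1] + i + 1) % 100 for i in range(n)]
    if n >= 3:
        draft[2] = 0
    return draft


if __name__ == "__main__":
    print(self_speculate_step([5], toy_draft_block, toy_ar_next_token))
```

In this toy run the first two drafted tokens are accepted and the third is replaced by the verifier's choice, so the output always matches what plain autoregressive decoding would have produced while needing fewer verification rounds.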
Performance Highlights
Nemotron-Labs Diffusion 8B improves average accuracy by 1.2% compared with Qwen3 8B. Inference speed is measured in tokens per forward pass (TPF): diffusion mode reaches 2.6× higher TPF than AR models, while self-speculation pushes that further to 6× for linear self-speculation and 6.4× for quadratic self-speculation, with comparable accuracy across the evaluated tasks.
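To put the TPF figures in perspective, the short calculation below converts them into forward-pass counts for a fixed output length. It assumes that standard AR decoding emits one token per forward pass (TPF = 1) and reads the quoted multipliers as average TPF values; the helper `forward_passes` and the 256-token output length are illustrative choices, not part of NVIDIA's report. TPF counts forward passes only, so it is not the same as wall-clock speedup.

```python
import math


def forward_passes(num_tokens: int, tokens_per_forward: float) -> int:
    """Forward passes needed to emit num_tokens at a given average TPF."""
    if tokens_per_forward <= 0:
        raise ValueError("tokens_per_forward must be positive")
    return math.ceil(num_tokens / tokens_per_forward)


# TPF multipliers quoted in the article, relative to AR decoding (assumed TPF = 1).
REPORTED_TPF = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "linear self-speculation": 6.0,
    "quadratic self-speculation": 6.4,
}

if __name__ == "__main__":
    for mode, tpf in REPORTED_TPF.items():
        print(f"{mode}: {forward_passes(256, tpf)} forward passes for 256 tokens")
```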
Deployment and Inference
Deployment of Nemotron-Labs Diffusion models will soon be supported in the main branch of SGLang. At the time of this writing, support for inference is available through an issue tracker request on GitHub. What's neat is that the integration lets you serve the same checkpoint in three different ways, picked by a single line in your algorithm config.
The model fills in a 32-token block at a time by iteratively denoising it, and a confidence threshold decides which tokens are "good enough" to commit at each step. This approach lets developers switch seamlessly between the model they use today and Nemotron-Labs Diffusion in its various inference modes for ultra-fast generation speeds.
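The block-filling loop described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the actual SGLang integration: `denoise_block` and `toy_predict` are hypothetical names, the 0.9 threshold is an arbitrary example value, and the fallback that commits the single most confident token when nothing clears the threshold is an added assumption so the loop always makes progress.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been committed yet
Block = List[Optional[int]]


def denoise_block(
    predict: Callable[[Block], List[Tuple[int, float]]],
    block_size: int = 32,
    threshold: float = 0.9,
) -> Tuple[List[int], int]:
    """Fill one block by repeated denoising; returns (tokens, forward passes used)."""
    block: Block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        preds = predict(block)  # one forward pass: (token, confidence) per position
        steps += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        confident = [i for i in masked if preds[i][1] >= threshold]
        if not confident:
            # Assumed fallback: commit the single most confident masked position.
            confident = [max(masked, key=lambda i: preds[i][1])]
        for i in confident:
            block[i] = preds[i][0]
    return [tok for tok in block if tok is not None], steps


def toy_predict(block: Block) -> List[Tuple[int, float]]:
    """Toy model: position i predicts token i; odd positions are only
    confident once their left neighbour has been committed."""
    preds = []
    for i in range(len(block)):
        sure = i % 2 == 0 or block[i - 1] is not MASK
        preds.append((i, 1.0 if sure else 0.6))
    return preds


if __name__ == "__main__":
    tokens, steps = denoise_block(toy_predict)
    print(f"filled {len(tokens)} tokens in {steps} forward passes")
```

A lower threshold commits more tokens per step and finishes the block in fewer passes, at the risk of locking in weaker predictions; a higher threshold does the opposite.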
Conclusion
Nemotron-Labs Diffusion brings diffusion-style generation into a form that developers can actually use: open models, familiar AR compatibility, diffusion decoding, and self-speculative acceleration in one family. With Nemotron-Labs Diffusion, developers get a new way to draft, refine, verify, and accelerate text generation, without needing to alter their applications.
To get started, explore the Nemotron-Labs Diffusion model family, read the technical report, and try the available training recipe. The future of text generation has arrived, and NVIDIA is leading the charge.
Related Information:
https://www.digitaleventhorizon.com/articles/NVIDIA-Nemotron-Labs-Diffusion-Revolutionizes-Text-Generation-A-New-Era-for-Autoregressive-and-Diffusion-Models-deh.shtml
https://huggingface.co/blog/nvidia/nemotron-labs-diffusion
https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive
Published: Fri May 22 19:37:55 2026 by llama3.2 3B Q4_K_M