Updated · Computerworld · Jun 12
Google Unveils 26B DiffusionGemma, Claiming 4x Faster Text Generation

Summary

  • DiffusionGemma generates 256-token blocks in parallel instead of one token at a time, which Google says delivers up to 4x faster inference for local, low-latency text workloads.
  • The experimental open model uses diffusion-style iterative refinement, bidirectional attention and a 26B mixture-of-experts design that activates 3.8B parameters during inference.
  • 18GB VRAM is enough to run a quantized version on high-end consumer GPUs such as Nvidia's RTX 5090, and Google released it under Apache 2.0 on Hugging Face, GitHub and cloud platforms.
  • Google says the model is aimed at interactive coding, editing and other non-linear tasks, but it concedes returns fade in high-QPS cloud serving and output quality trails standard Gemma 4.
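The parameter and memory figures in the summary can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the bit widths are assumptions, since the source does not say which quantization format fits in 18GB. It counts weight storage alone and ignores activations, caches and runtime overhead.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS = 26e9     # total parameters in the mixture-of-experts model
ACTIVE_PARAMS = 3.8e9   # parameters activated during inference
VRAM_BUDGET_GB = 18     # VRAM quoted for the quantized version


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that are active during inference."""
    return active / total


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active share of parameters: {share:.1%}")
    # Bit widths below are assumed for illustration, not taken from the source.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "within" if gb <= VRAM_BUDGET_GB else "over"
        print(f"{bits}-bit weights: {gb:.1f} GB ({verdict} the {VRAM_BUDGET_GB} GB budget)")
```

Under these assumptions, only a width of roughly 4 bits per weight leaves the full 26B parameters under the quoted 18GB figure, with some room left for everything the estimate ignores.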

Insights

Will hyper-efficient models like DiffusionGemma ease the global GPU shortage, or will new AI capabilities simply accelerate demand for more powerful hardware?
With AI now 'dreaming' text like images, is this the end of sequential language models, or just a niche for specific tasks?

DiffusionGemma: Google’s 4x Faster, 26B-Parameter Diffusion LLM Redefines Local Text Generation

Overview

Google unveiled DiffusionGemma in June 2026, an experimental open large language model that departs from traditional token-by-token text generation. Built on the Gemma 4 26B Mixture-of-Experts architecture, DiffusionGemma uses diffusion-style iterative refinement instead of autoregressive decoding. It generates 256-token blocks in parallel, which Google says delivers up to 4x faster inference, and it activates only 3.8B of its 26B parameters during inference. Google positions the model for interactive, low-latency applications such as coding and editing, while conceding that output quality trails standard Gemma 4.
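To see why block-parallel generation can cut the number of model invocations, the following is a minimal counting sketch, not Google's implementation. It assumes each block is produced by a fixed number of refinement passes, a figure the source does not give; `refinement_steps` and both function names are hypothetical. Only the 256-token block size comes from the reporting.

```python
import math

BLOCK_SIZE = 256  # tokens generated per block, as reported for DiffusionGemma


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, refinement_steps: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """Forward passes if every block takes `refinement_steps` passes.

    `refinement_steps` is an assumed, illustrative parameter.
    """
    if num_tokens < 0 or refinement_steps <= 0 or block_size <= 0:
        raise ValueError("token count must be >= 0; steps and block size must be > 0")
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refinement_steps


if __name__ == "__main__":
    tokens = 1024
    print(f"autoregressive: {autoregressive_passes(tokens)} passes")
    # Step counts are arbitrary examples chosen to show the trend.
    for steps in (16, 64, 256):
        passes = block_diffusion_passes(tokens, steps)
        print(f"block diffusion, {steps} steps per block: {passes} passes")
```

In this model the pass count only drops when the number of refinement steps is smaller than the block size. Pass counts are also not wall-clock speedups, because a pass over a whole block does more work than a single-token pass, so the sketch should not be read as a derivation of the 4x figure Google quotes.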

...