Updated · The Keyword | Google Product and Technology News · Jun 10
DiffusionGemma Debuts 26B Open Model, Delivering 4x Faster GPU Text Generation
Summary
Apache 2.0-licensed DiffusionGemma launched as an experimental 26B Mixture-of-Experts model that generates text in 256-token blocks, targeting low-latency local AI workflows rather than standard production use.
Up to 4x faster generation comes from parallel diffusion decoding instead of token-by-token output, reaching 1,000-plus tokens per second on an Nvidia H100 and 700-plus on an RTX 5090.
Only 3.8B parameters activate during inference, letting the quantized model run within 18GB of VRAM on high-end consumer GPUs while using bi-directional attention for editing, code infilling and other non-linear tasks.
Google says output quality still trails standard Gemma 4, making DiffusionGemma better suited to research, rapid iteration and task-specific fine-tuning than maximum-quality deployments.
The release extends diffusion methods from images into text, but Google said the speed edge is strongest on single-accelerator, low-to-medium batch workloads and fades in high-QPS cloud serving.
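The 18GB VRAM figure in the summary can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not Google's published accounting. The parameter counts and the 18GB budget come from the summary, but the bit-widths are assumptions, because the source says only "quantized" without naming a format. The helper names are hypothetical, and the estimate covers weights only, ignoring activations and runtime overhead.

```python
# Back-of-envelope weight-memory estimate. The parameter counts come from the
# summary above; the bit-widths are ASSUMPTIONS, since the source only says
# "quantized" without naming a format.

TOTAL_PARAMS = 26e9    # total Mixture-of-Experts parameters (from the summary)
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference (from the summary)
VRAM_BUDGET_GB = 18.0  # VRAM figure quoted for the quantized model


def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_budget(num_params: float, bits_per_param: float, budget_gb: float) -> bool:
    """True if the weights alone fit; ignores activations and runtime overhead."""
    return weight_footprint_gb(num_params, bits_per_param) <= budget_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):  # hypothetical precisions, for comparison only
        size = weight_footprint_gb(TOTAL_PARAMS, bits)
        ok = fits_in_budget(TOTAL_PARAMS, bits, VRAM_BUDGET_GB)
        print(f"{bits:>2}-bit weights: {size:5.1f} GB, fits in {VRAM_BUDGET_GB:.0f} GB: {ok}")
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) fits within 18GB. The estimate uses the full 26B count rather than the 3.8B active parameters because a Mixture-of-Experts model typically keeps every expert resident in memory, even though only a fraction is used per token.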
DiffusionGemma: Google DeepMind’s 4x Speed Leap in Open, Parallel Text Generation with Gemma 4
Overview
DiffusionGemma, unveiled by Google DeepMind in June 2026, is an experimental open model for text generation built on the Gemma 4 architecture. Unlike conventional autoregressive models, which generate text one token at a time, DiffusionGemma generates entire blocks of text in parallel. Because each pass over the model's weights now serves many tokens instead of one, the main bottleneck shifts from memory bandwidth to compute, which makes generation much faster. On dedicated GPUs, DiffusionGemma can generate text up to four times faster than comparable autoregressive models, reaching more than 1,000 tokens per second on a single NVIDIA H100 Tensor Core GPU.
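The shift from memory bandwidth to compute can be illustrated by comparing arithmetic intensity, meaning floating-point operations per byte of weights read, for the two decoding styles. The sketch below is a simplified model, not a description of DiffusionGemma's actual kernels. It assumes roughly 2 FLOPs per active parameter per token, 16-bit weights, and, for the block case, that one pass over a 256-token block reads all 26B parameters because different tokens may route to different experts. It ignores attention, activations and the number of refinement passes per block, which the source does not state. The function name is hypothetical.

```python
# Simplified arithmetic-intensity comparison (FLOPs per byte of weights read).
# ASSUMPTIONS: ~2 FLOPs per active parameter per token, 16-bit weights, and a
# worst case in which a block pass reads every expert. Attention, activations
# and the number of refinement passes per block are ignored.

TOTAL_PARAMS = 26e9    # from the article summary
ACTIVE_PARAMS = 3.8e9  # from the article summary
BLOCK_TOKENS = 256     # block size from the article summary


def arithmetic_intensity(
    active_params: float,
    weights_read: float,
    tokens_per_pass: int,
    bytes_per_param: float = 2.0,
) -> float:
    """FLOPs performed per byte of weights loaded in one forward pass."""
    flops = 2.0 * active_params * tokens_per_pass
    bytes_moved = weights_read * bytes_per_param
    return flops / bytes_moved


if __name__ == "__main__":
    # Token-by-token decoding: one token per pass, only its active experts read.
    sequential = arithmetic_intensity(ACTIVE_PARAMS, ACTIVE_PARAMS, 1)
    # Block decoding: 256 tokens per pass, pessimistically reading all weights.
    block = arithmetic_intensity(ACTIVE_PARAMS, TOTAL_PARAMS, BLOCK_TOKENS)
    print(f"token-by-token: {sequential:.1f} FLOPs/byte")
    print(f"256-token block: {block:.1f} FLOPs/byte")
    print(f"ratio: {block / sequential:.1f}x")
```

Even in this pessimistic case, the block pass does far more computation per byte of weights read than token-by-token decoding, which is the direction of the shift described above. Whether a given GPU actually becomes compute-bound depends on hardware and implementation details that this model leaves out.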