Google releases Gemma 4 MTP drafters for faster inference

10 articles · The Keyword | Google Product and Technology News · May 5

The open-source Apache 2.0 release promises up to 3x speed gains and is available now on Hugging Face, Kaggle and Google AI Edge Gallery.
Google said the drafters rely on speculative decoding: a lightweight multi-token prediction (MTP) drafter proposes several tokens ahead, and the larger Gemma 4 model verifies them in parallel, so output quality and reasoning are not reduced.
The company said Gemma 4 surpassed 60 million downloads within weeks of its release. The update is aimed at faster local, mobile and cloud deployments across tools including Transformers, MLX, vLLM and Ollama.
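
To make the draft-and-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not Google's implementation and does not use any Gemma 4 API. The functions `toy_target` and `toy_drafter` are hypothetical stand-ins for the large model and the drafter, and the exact-match acceptance rule is a simplifying assumption (production systems typically verify all drafted tokens in one batched forward pass and may use probabilistic acceptance).

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token (greedy).
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, drafter: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Toy greedy speculative decoding: the drafter proposes k tokens, the target checks them."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))

        # 2. The target checks every drafted position. A real system would do this
        #    in a single batched forward pass; this sketch loops for clarity.
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])   # draft token matches the target's choice
            else:
                accepted.append(expected)   # first mismatch: keep the target's token, stop
                break
        else:
            # Every draft token was accepted, so the target contributes one extra token.
            accepted.append(target(tokens + draft))

        tokens.extend(accepted)
    return tokens[:goal]


# Hypothetical stand-in models (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 7 + 3) % 50


def toy_drafter(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    # The drafter is deliberately wrong whenever the context length is a multiple of 5.
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_drafter, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because a rejected draft token is always replaced by the target's own choice, the output matches plain greedy decoding with the target model, which is the sense in which quality is preserved. Any speedup depends on how often the drafter's guesses are accepted and on verification being cheaper than generating token by token.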

Why were crucial MTP speedup components excluded from Gemma 4's public open-source release?

How can users get the promised 3x speedup amid reports of bugs and zero performance gains?

Is MTP a temporary software patch or a long-term solution to AI's hardware bottleneck?