Updated · The Keyword | Google Product and Technology News · Jun 5
Google Releases Gemma 4 QAT Models, Cutting E2B Memory to 1GB on Edge Devices

3 articles · Updated · The Keyword | Google Product and Technology News · Jun 5

Summary

  • Google rolled out Gemma 4 checkpoints trained with quantization-aware training, aiming to run the models locally on everyday mobile devices, laptops and consumer GPUs with less quality loss than standard compression.
  • The text-only Gemma 4 E2B model now targets 1GB of memory, enabled by a mobile-specific format offered alongside Q4_0 checkpoints that shrink VRAM and storage needs while preserving model capability.
  • The mobile scheme uses static activations, channel-wise quantization, targeted 2-bit compression for token-generation components, and optimized embeddings and KV cache to reduce active memory and speed responses on edge hardware.
  • Hugging Face now hosts the weights, with support spanning llama.cpp, vLLM, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, MLX, Hugging Face Transformers and Unsloth for local deployment and fine-tuning.
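
For readers unfamiliar with the channel-wise quantization mentioned above, the following is a minimal, generic sketch of the idea: each output channel (row) of a weight matrix gets its own scale, so a channel with small weights is not flattened by a large one. This is an illustration only, not Google's mobile scheme or the Q4_0 format; the function names, the symmetric 4-bit range and the pure-Python lists are all assumptions made for the example. The quantize-then-dequantize round trip in `fake_quantize` is the kind of step quantization-aware training typically simulates during the forward pass.

```python
from typing import List, Tuple


def quantize_channel(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric quantization of one channel using its own scale (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero for an all-zero channel.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_channel(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def fake_quantize(matrix: List[List[float]], bits: int = 4) -> List[List[float]]:
    """Quantize then dequantize every row independently (channel-wise)."""
    result = []
    for row in matrix:
        codes, scale = quantize_channel(row, bits)
        result.append(dequantize_channel(codes, scale))
    return result


if __name__ == "__main__":
    # Two channels whose magnitudes differ by a factor of 100.
    weights = [[7.0, -3.0, 1.0], [0.07, -0.03, 0.01]]
    for row in weights:
        codes, scale = quantize_channel(row)
        print(codes, scale)
    print(fake_quantize(weights))
```

With per-channel scales, both rows in the usage example map to the same integer codes even though their magnitudes differ by 100x; under a single shared scale, the smaller row would round to all zeros.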

Insights

Is Google's openly licensed Gemma 4 poised to become the 'Android' of on-device AI?
Gemma 4 now fits in 1GB of RAM, but what is the hidden cost to your phone's battery life?
Can a single super-chip truly end the complex performance bottlenecks of on-device AI?