Google Releases Gemma 4 QAT Models, Cutting E2B Memory to 1GB on Edge Devices

The Keyword | Google Product and Technology News · Jun 5

Google rolled out Gemma 4 checkpoints trained with quantization-aware training, aiming to run the models locally on everyday mobile devices, laptops and consumer GPUs with less quality loss than standard compression.
The text-only Gemma 4 E2B model now targets a 1GB memory footprint, enabled by a mobile-specific format released alongside Q4_0 checkpoints that shrink VRAM and storage needs while preserving model capability.
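As a rough illustration of why 4-bit checkpoints shrink memory, the sketch below estimates weight storage for a Q4_0-style layout, assuming the block format used by llama.cpp (32 weights per block, each block storing one 16-bit scale plus 32 four-bit values, or 18 bytes). The parameter count is a hypothetical round number, not the actual size of Gemma 4 E2B, and the estimate covers weights only, not activations or KV cache.

```python
def q4_0_bytes(n_weights: int, block_size: int = 32) -> int:
    """Estimate bytes for n_weights in a Q4_0-style block layout."""
    n_blocks = -(-n_weights // block_size)  # ceiling division
    scale_bytes = 2                         # one fp16 scale per block
    packed_bytes = block_size // 2          # two 4-bit values per byte
    return n_blocks * (scale_bytes + packed_bytes)


def fp16_bytes(n_weights: int) -> int:
    """Bytes for the same weights stored as 16-bit floats."""
    return n_weights * 2


# Hypothetical parameter count, chosen only to make the arithmetic easy.
n_params = 1_000_000_000

q4 = q4_0_bytes(n_params)
fp16 = fp16_bytes(n_params)
bits_per_weight = q4 * 8 / n_params

print(f"fp16: {fp16 / 1e9:.2f} GB")
print(f"Q4_0: {q4 / 1e9:.4f} GB ({bits_per_weight} bits per weight)")
```

Under these assumptions the 4-bit layout costs 4.5 bits per weight, a little over a quarter of the fp16 size, which is why formats of this kind are the usual route to fitting models on consumer hardware.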
The mobile scheme uses static activations, channel-wise quantization, targeted 2-bit compression for token-generation components, and optimized embeddings and KV cache to reduce active memory and speed responses on edge hardware.
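To make "channel-wise quantization" concrete, here is a minimal generic sketch in plain Python. It is not Google's mobile scheme, whose details are not given here; it only shows the general idea of giving each output channel (here, each row of a weight matrix) its own scale so that one large-valued channel does not waste precision for the others. The function names, the symmetric rounding rule, and the 4-bit width are illustrative assumptions.

```python
def quantize_per_channel(weights, bits=4):
    """Symmetric quantization with one scale per row (channel)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer level and clamp to [-qmax, qmax].
        quantized.append(
            [max(-qmax, min(qmax, round(w / scale))) for w in row]
        )
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer levels back to approximate float weights."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


# Two channels with very different magnitudes (made-up values).
weights = [
    [0.10, -0.70, 0.35],
    [10.0, -5.00, 2.50],
]

q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)

for row, back, s in zip(weights, restored, scales):
    worst = max(abs(a - b) for a, b in zip(row, back))
    print(f"scale={s:.4f}  worst-case error={worst:.4f}")
```

Because each row gets its own scale, the small-magnitude channel keeps a rounding error proportional to its own range rather than to the largest value in the whole matrix. Quantization-aware training, as described in the article, exposes the model to this kind of rounding during training so it can adapt to it.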
Hugging Face now hosts the weights, with support spanning llama.cpp, vLLM, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, MLX, Hugging Face Transformers and Unsloth for local deployment and fine-tuning.