Small Hugging Face Models Hit 89.2% on GSM8K With Under 7B Parameters

3 articles · Updated · KDnuggets · May 21

Models under 7 billion parameters are now delivering reasoning, coding and multilingual performance once associated with 30B-plus systems. The article highlights Qwen3.5-4B, Phi-4-mini and Gemma 3 4B IT.
Gemma 3 4B's 89.2% on GSM8K and the 83.7% on ARC-C from Microsoft’s 3.8B Phi-4-mini anchor the shift, while Qwen3.5-4B adds a 262,144-token context window that can extend past 1 million tokens.
The gains come from higher-quality training data, distillation from larger reasoning models and newer architectures such as mixture-of-experts and Google’s mobile-focused MatFormer design.
Practical deployment is central: Phi-4-mini’s Q4_K_M file is 2.49 GB, Llama 3.2 3B needs about 2 GB in Q4, and DeepSeek-R1-Distill-Qwen-1.5B fits in roughly 1 GB.
The broader takeaway is that local AI is becoming viable for laptops, phones and edge devices, reducing the need for cloud APIs for many English, coding, structured-output and lightweight multilingual tasks.
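
The file sizes quoted above can be sanity-checked with simple arithmetic. The sketch below is an illustration, not something taken from the article: it backs out an effective bits-per-weight figure from Phi-4-mini's reported 2.49 GB at 3.8B parameters, then reuses it for the other two models. It assumes 1 GB = 10^9 bytes, takes the 3B and 1.5B parameter counts from the model names, and assumes the same effective bit width applies across models, which real quantized files only roughly follow. The function names are hypothetical.

```python
def effective_bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 10^9 bytes."""
    return file_size_gb * 8.0 / params_billions


def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate file size in GB, ignoring metadata and tokenizer overhead."""
    return params_billions * bits_per_weight / 8.0


# Figures reported in the summary: Phi-4-mini, 3.8B parameters, 2.49 GB at Q4_K_M.
bits = effective_bits_per_weight(2.49, 3.8)
print(f"effective bits per weight: {bits:.2f}")

# Nominal parameter counts read off the model names (an assumption).
for name, params_b in [("Llama 3.2 3B", 3.0), ("DeepSeek-R1-Distill-Qwen-1.5B", 1.5)]:
    print(f"{name}: ~{quantized_size_gb(params_b, bits):.2f} GB")
```

Under these assumptions the estimates come out close to the "about 2 GB" and "near 1 GB" figures cited, which suggests the reported sizes are mutually consistent for 4-bit quantization.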
