Google Unveils 3-Bit TurboQuant, Claiming Up to 8x LLM Speed Gains

7 articles · Updated · KDnuggets · May 15
  • Google’s TurboQuant targets a major LLM bottleneck by compressing key-value (KV) cache memory to 3 bits without retraining; according to the report, accuracy is preserved.
  • Two-stage compression drives that result: PolarQuant removes much of the usual quantization overhead, and QJL then corrects residual bias with an added 1-bit check (a generic sketch of the 3-bit quantization idea follows this list).
  • On H100 GPU setups with long-context workloads, Google says 3-bit TurboQuant can deliver up to an 8x speedup over 32-bit unquantized keys by cutting memory traffic.
  • A local TinyLlama test in the report showed the tradeoff more clearly on memory than speed: the KV cache fell from 42.45 MB to 7.86 MB, while runtime was slower on short prompts (see the back-of-envelope check after this list).
  • That gap suggests TurboQuant’s biggest payoff is in large-scale RAG systems and 32K-plus token contexts, where cache pressure and bandwidth limits are far more severe.
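The report does not include implementation details, but the idea in the second bullet can be illustrated with a generic sketch. The PyTorch snippet below uses hypothetical helper names and is not Google's PolarQuant/QJL code; it quantizes a KV tensor to 3-bit integer codes with per-channel scales and dequantizes it back. A production kernel would additionally pack the codes into 3-bit words to realize the memory saving.

import torch

def quantize_kv_3bit(kv: torch.Tensor):
    # Illustrative 3-bit asymmetric quantization of a KV cache slice.
    # Not the TurboQuant/PolarQuant/QJL pipeline -- just the generic idea:
    # map values onto 2**3 = 8 integer levels per (head, channel) and keep
    # the scale/offset needed to reconstruct them.
    levels = 2 ** 3 - 1                        # integer codes 0..7
    lo = kv.amin(dim=0, keepdim=True)          # per-(head, channel) minimum
    hi = kv.amax(dim=0, keepdim=True)          # per-(head, channel) maximum
    scale = (hi - lo).clamp(min=1e-8) / levels
    codes = torch.round((kv - lo) / scale).clamp(0, levels).to(torch.uint8)
    # A real kernel would pack these codes into 3-bit words; uint8 storage
    # here keeps the sketch simple.
    return codes, scale, lo

def dequantize_kv_3bit(codes, scale, lo):
    # Reconstruct an approximate KV tensor from the 3-bit codes.
    return codes.to(scale.dtype) * scale + lo

# Toy usage: a 128-token cache for 4 KV heads with head_dim 64.
kv = torch.randn(128, 4, 64)
codes, scale, lo = quantize_kv_3bit(kv)
kv_hat = dequantize_kv_3bit(codes, scale, lo)
print("max abs reconstruction error:", (kv - kv_hat).abs().max().item())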
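The memory figures in the TinyLlama bullet can be sanity-checked with simple arithmetic. Assuming the local baseline stores keys and values in 16-bit floats (the report does not state the baseline precision; Google's H100 comparison is against 32-bit keys), the reported reduction works out to roughly 3 bits per cached value:

# Back-of-envelope check of the reported KV cache shrinkage.
# Assumption (not stated in the digest): the local TinyLlama baseline
# stores keys and values in 16-bit floats.
baseline_mb = 42.45
quantized_mb = 7.86

ratio = quantized_mb / baseline_mb          # ~0.185, about a 5.4x reduction
effective_bits = ratio * 16                 # ~2.96 bits per cached value
print(f"compression ratio: {ratio:.3f}")
print(f"effective bits per value vs a 16-bit baseline: {effective_bits:.2f}")
# Close to the 3-bit code width, before counting per-group scale overhead.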
Google claims an 8x speedup, but the local test ran slower on short prompts. Is TurboQuant's memory saving worth the hidden performance costs?
Will software like TurboQuant end the gold rush for AI memory chips, or are its practical flaws too great to matter?