Google Shrinks AI Memory With No Accuracy Loss—But There's a Catch
Google Research has introduced TurboQuant, a compression algorithm that shrinks the key-value (KV) cache used in large language model inference by at least 6x with no accuracy loss in benchmarks. The KV cache is the temporary memory a model keeps for its ongoing conversation; as context windows stretch to millions of tokens, it becomes a serving bottleneck, consuming hundreds of gigabytes per session.

Unlike traditional quantization methods, TurboQuant avoids the overhead of storing extra quantization constants by combining two sub-algorithms: PolarQuant, which separates each vector's magnitude from its direction, and QJL, which encodes the residual error with a single sign bit. In tests on models such as Gemma and Mistral, the method showed no accuracy loss at up to 4x compression, even on challenging long-context tasks. It leaves model weights untouched and requires no retraining or fine-tuning.

If widely adopted, TurboQuant could substantially reduce the memory footprint of AI inference infrastructure, with knock-on effects for memory hardware markets. The catch: it has not yet been proven at massive production scale, and the "zero loss" claim applies only to inference-time KV cache compression, not to the model itself.
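To see why the KV cache balloons to hundreds of gigabytes, a back-of-envelope calculation helps. The function below uses the standard KV cache sizing formula; the model dimensions (80 layers, 8 KV heads of dimension 128) are illustrative assumptions, not figures from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size for one sequence.

    The leading 2x accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# A hypothetical large model serving a 1M-token context at fp16:
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=1_000_000)
print(f"{size / 1e9:.1f} GB")  # prints "327.7 GB"
```

Under these assumed dimensions a single million-token session sits in the hundreds-of-gigabytes range the article describes, and a 6x compression would bring it down to roughly 55 GB.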
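The two-stage idea can be sketched in a toy form: quantize a vector's direction separately from its magnitude (the PolarQuant step), then keep only the sign of each component's residual error (the QJL-style step). This is a minimal illustration of the concept, not the actual TurboQuant algorithm; the bit widths and the `residual_scale` correction factor are hypothetical choices:

```python
import numpy as np

def quantize_vector(v, dir_bits=4):
    """Toy polar quantization: store the exact norm, a coarsely
    quantized unit direction, and one sign bit of residual per component."""
    norm = np.linalg.norm(v)
    direction = v / norm if norm > 0 else v
    # Map direction components from [-1, 1] onto 2**dir_bits - 1 levels
    levels = 2 ** dir_bits - 1
    q_dir = np.round((direction + 1) / 2 * levels) / levels * 2 - 1
    # Residual error is reduced to a single sign bit per component
    residual_sign = np.sign(direction - q_dir).astype(np.int8)
    return norm, q_dir, residual_sign

def dequantize(norm, q_dir, residual_sign, residual_scale=0.01):
    # Reconstruct: quantized direction plus a fixed-size sign correction
    return norm * (q_dir + residual_scale * residual_sign)

v = np.array([0.6, 0.8])
norm, q_dir, sign = quantize_vector(v)
v_hat = dequantize(norm, q_dir, sign)
```

The sign-bit correction is what lets the scheme avoid storing per-block scale and zero-point constants that conventional quantizers need; here that saving is only gestured at, since a real implementation would also pack the quantized values into bit fields.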
