Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware
Summary
Google Research unveiled TurboQuant, a novel quantization algorithm that compresses large language models’ key-value (KV) caches by up to 6x. With 3.5-bit compression, near-zero accuracy loss, and no...