Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware

Summary

Google Research unveiled TurboQuant, a novel quantization algorithm that compresses large language models’ Key-Value caches by up to 6x. With 3.5-bit compression, near-zero accuracy loss, and no...
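To get a feel for why low-bit KV-cache compression matters for memory-constrained hardware, here is a rough back-of-the-envelope sketch. The model dimensions and the sizing formula below are illustrative assumptions, not TurboQuant's actual implementation:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size for a decoder-only transformer."""
    # Both keys and values are cached, hence the factor of 2.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8

# Hypothetical 7B-class model dimensions (assumed for illustration).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)
q35 = kv_cache_bytes(**cfg, bits_per_value=3.5)

print(f"fp16 KV cache:    {fp16 / 2**30:.2f} GiB")   # 4.00 GiB
print(f"3.5-bit KV cache: {q35 / 2**30:.2f} GiB")    # 0.88 GiB
print(f"ratio: {fp16 / q35:.1f}x")                   # ~4.6x
```

Note that 16-bit to 3.5-bit is only about a 4.6x reduction in raw value storage; the "up to 6x" figure in the announcement presumably accounts for additional savings beyond the per-value bit width.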
