用于大型语言模型快速推测解码的递归草拟器
原文英文,约200词,阅读约需1分钟。发表于: 。We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for large language models (LLMs) inference. The performance gains are...
ReDrafter是一种先进的推测解码方法,通过递归神经网络、动态树注意力算法和知识蒸馏三大技术显著加速大型语言模型推理。在Nvidia H100 GPU上,Vicuna推理加速达3.5倍,TensorRT-LLM实现2.5倍加速,Apple Silicon设备应用也达2.3倍加速。