LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Summary

This paper was accepted at the Efficient Systems for Foundation Models Workshop at ICML 2024. The inference of transformer-based large language models consists of two sequential stages: 1) a...
