Interleaved Reasoning for Large Language Models via Reinforcement Learning

Long chain-of-thought (CoT) significantly enhances large language models' (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased...

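Long chain-of-thought reasoning improves the reasoning capabilities of large language models, but it is inefficient and increases the time-to-first-token (TTFT). We propose a new training method that uses reinforcement learning to guide the model to interleave thinking and answering on multi-step questions. Experimental results show that the method reduces TTFT by 80% on average and improves Pass@1 accuracy by 19.3%.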
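
To make the two reported metrics concrete, here is a minimal sketch of how they might be computed on toy data. It assumes TTFT is measured as the time until the first user-visible answer token, and that Pass@1 is the fraction of problems whose single sampled response is correct. The `Token` type, the `make_trace` helper, the trace patterns, and the timestamps are all hypothetical illustrations, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    t: float          # seconds since the request was sent (hypothetical timestamps)
    is_answer: bool   # True for user-visible answer text, False for hidden thinking


def time_to_first_answer(trace: List[Token]) -> Optional[float]:
    """Return the timestamp of the first answer token, or None if there is none."""
    for tok in trace:
        if tok.is_answer:
            return tok.t
    return None


def pass_at_1(first_sample_correct: List[bool]) -> float:
    """Fraction of problems whose single sampled response is correct."""
    if not first_sample_correct:
        raise ValueError("need at least one problem")
    return sum(first_sample_correct) / len(first_sample_correct)


def make_trace(pattern: str, seconds_per_token: float = 0.5) -> List[Token]:
    """Build a toy trace from a pattern such as 'TTTA' (T = thinking, A = answer)."""
    return [
        Token(t=(i + 1) * seconds_per_token, is_answer=(ch == "A"))
        for i, ch in enumerate(pattern)
    ]


# Assumption: both traces have the same length and the same decoding speed.
think_then_answer = make_trace("TTTTTTTTAA")  # all thinking before any answer
interleaved = make_trace("TTATTATTAA")        # intermediate answers surface early

print(time_to_first_answer(think_then_answer))
print(time_to_first_answer(interleaved))
print(pass_at_1([True, False, True, True]))
```

In this toy setup the interleaved trace surfaces its first answer token much earlier than the think-then-answer trace, even though both generate the same number of tokens. That is the effect the TTFT reduction describes.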