From personal experience: watch how the training loss/reward and the test dataset performance change, and make sure they follow the same trend. If they do not, suspect reward hacking or an invalid loss function, and consider adjusting the learning rate...
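A minimal sketch of that trend check, assuming you log one training value (loss or reward) and one test metric per evaluation step. The helper names `trend` and `trends_agree`, the window size, and the tolerance are all hypothetical choices for illustration, not part of any library.

```python
from typing import Sequence


def trend(values: Sequence[float], window: int = 3) -> float:
    """Mean of the last `window` values minus the mean of the `window` before them."""
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window values")
    recent = sum(values[-window:]) / window
    earlier = sum(values[-2 * window:-window]) / window
    return recent - earlier


def trends_agree(
    train_signal: Sequence[float],
    test_metric: Sequence[float],
    window: int = 3,
    train_lower_is_better: bool = True,  # True for a loss, False for a reward
    tol: float = 0.0,                    # changes within +/- tol count as flat
) -> bool:
    """Return False when training improves while the test metric worsens, or vice versa.

    Assumes a higher test metric is better (e.g. accuracy).
    """
    train_gain = trend(train_signal, window)
    if train_lower_is_better:
        train_gain = -train_gain  # a falling loss is an improvement
    test_gain = trend(test_metric, window)

    # A flat series is treated as compatible with either direction.
    if abs(train_gain) <= tol or abs(test_gain) <= tol:
        return True
    return (train_gain > 0) == (test_gain > 0)
```

For example, a rising reward paired with a falling test score makes `trends_agree(reward, test_score, train_lower_is_better=False)` return `False`, which is the case worth investigating for reward hacking.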
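After reading the "manual autograd" part of the unsloth blog, I tried dissecting the model and found more places that could be optimized. torchview is a great tool.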