AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep...
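We design and implement AXLearn, a high-performance deep learning system that supports the training of large-scale models. AXLearn emphasizes modularity and support for heterogeneous hardware; its internal interfaces are strictly encapsulated, which enables rapid development and experimentation. We propose a method for quantifying modularity through lines-of-code (LoC) complexity, which ensures that the system maintains constant complexity as it is extended. AXLearn requires little code to integrate new features and delivers performance comparable to state-of-the-art systems. We also share our experience developing and operating it.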
