FastMoE

Mixture-of-Experts (MoE) shows strong potential for enlarging language models to trillions of parameters. However, training trillion-scale MoE models requires algorithm and system co-design to build a well-tuned, high-performance distributed training system.

We develop FastMoE, a distributed MoE training system based on PyTorch that supports both common accelerators, e.g. GPUs, and specialized supercomputers, such as the Sunway OceanLight supercomputer. The system provides a hierarchical interface for flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct PyTorch implementation of MoE models, FastMoE is heavily optimized for training speed through sophisticated high-performance techniques. The system supports placing different experts on multiple workers across multiple nodes, so the total number of experts scales linearly with the number of workers.
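As a rough illustration of the interface, the sketch below replaces a Transformer feed-forward block with an MoE layer. The class name `FMoETransformerMLP` and its keyword arguments (`num_expert`, `d_model`, `d_hidden`, `world_size`, `top_k`) follow FastMoE's public interface as we understand it, but this is a minimal sketch under those assumptions, not a verbatim recipe; please check the repository for the exact signatures.

```python
# Minimal sketch (assumed API): an MoE feed-forward layer standing in for a
# Transformer FFN block. FastMoE's kernels run on GPU, so tensors are placed
# on a CUDA device.
import torch
from fmoe import FMoETransformerMLP

d_model = 1024

# Each worker hosts `num_expert` local experts; with more workers
# (world_size > 1), the total expert count grows linearly.
moe_ffn = FMoETransformerMLP(
    num_expert=4,           # experts hosted on this worker
    d_model=d_model,        # Transformer hidden size
    d_hidden=4 * d_model,   # inner size of each expert MLP
    world_size=1,           # number of workers holding experts
    top_k=2,                # each token is routed to its top-2 experts
).cuda()

x = torch.randn(8, 512, d_model, device="cuda")  # (batch, sequence, hidden)
y = moe_ffn(x)                                   # same shape as x
print(y.shape)
```

In a multi-node setup, the same layer would be constructed with `world_size` equal to the number of expert-holding workers, which is how the expert count enlarges linearly with the number of workers.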

FastMoE gained 400+ stars on GitHub within 6 months of its first release and has been widely adopted by companies such as Alibaba, Huawei, and Baidu. BAAI's WuDao 2.0 model, a monster with 1.75 trillion parameters, is also powered by FastMoE.

Jiaao He
Ph.D. Student
Jidong Zhai
Professor
(Tenured Professor, Doctoral Supervisor)