While FastMoE enables distributed MoE model training with PyTorch, it suffers from inefficiency due to load imbalance and poor communication performance. Other state-of-the-art MoE systems, such as GShard from Google and BASE Layers from Facebook, share the same issues.