While FastMoE enables distributed MoE model training using PyTorch, it suffers from inefficiency due to load imbalance and poor communication performance. Other state-of-the-art systems for MoE, such as GShard from Google and BASE Layers from Facebook, share the same issues. We develop FasterMoE to address these issues and make distributed MoE training efficient.
Specifically, we extend the roofline model with communication and computation terms to model the performance of a distributed training task. Based on this model, combined with our observations of the training processes of real-world models, we propose the expert shadowing technique to minimize the impact of load imbalance: guided by the performance model, it broadcasts the parameters of the most popular experts instead of gathering those experts' inputs from all workers. Furthermore, to better utilize both communication and computation hardware, we break down the all-to-all operators and the NN computation into smaller groups and reschedule them so that communication and computation overlap. Beyond system-level design, we find that the expert selection process can be co-designed by model developers and system engineers to better utilize the interconnection between workers. We show an example design for a common two-layer tree topology, and we advocate such co-design to enable more efficient training of larger models.
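To illustrate the idea behind expert shadowing, the following is a minimal sketch of how a communication cost model might decide which experts to replicate. The function name, byte sizes, and bandwidth constant are illustrative assumptions, not FasterMoE's actual API or model:

```python
# Illustrative sketch: pick experts to "shadow" (replicate on every worker)
# when broadcasting their parameters is cheaper than shipping their inputs.
# All constants below are assumed values for demonstration only.

def shadow_candidates(tokens_per_expert, n_workers,
                      bytes_per_token=4096,
                      bytes_per_expert=64 * 1024 * 1024,
                      bandwidth=1e10):
    """Return indices of experts whose parameters are cheaper to broadcast
    than it is to send all of their input tokens across workers."""
    shadowed = []
    for expert, n_tokens in enumerate(tokens_per_expert):
        # Default plan: workers send the expert's input tokens to its owner
        # and receive the outputs back (hence the factor of 2).
        send_inputs = 2 * n_tokens * bytes_per_token / bandwidth
        # Shadowing plan: broadcast the expert's parameters to the other
        # workers, plus a gradient reduction of the same size in backward.
        broadcast_params = 2 * (n_workers - 1) * bytes_per_expert / bandwidth
        if broadcast_params < send_inputs:
            shadowed.append(expert)
    return shadowed

# A skewed token distribution where expert 0 is "hot":
counts = [500_000, 2_000, 1_500, 1_000]
print(shadow_candidates(counts, n_workers=4))  # only the hot expert: [0]
```

With a balanced load no expert clears the threshold, so shadowing only kicks in when popularity is skewed enough that replication pays for itself.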
Experiments show that FasterMoE achieves up to a 17.87x speedup over a data-parallel baseline enhanced by the ZeRO optimizer. Although BASE Layers and GShard both modify the model to achieve high performance, FasterMoE is up to 2.19x faster than them in terms of convergence time.