Training large language models (LLMs) typically requires costly GPUs, such as NVIDIA A100 or H100, which provide large high-bandwidth memory and fast interconnects like NVLink. The exorbitant expense of LLM training poses not only an economic challenge but also a societal one, as it restricts the ability to train LLMs from scratch to a select few organizations. There is therefore significant interest in democratizing access to LLM training.

This paper explores a potential solution: employing novel parallel strategies on more affordable accelerators. Budget-friendly GPUs such as the NVIDIA RTX 4090, while considerably cheaper than the A100 and comparable in computational power, are hindered by limited memory capacity and reduced interconnect bandwidth, making effective LLM training challenging. Conventional parallel strategies on such hardware incur either high communication costs or excessive memory usage.

We introduce MEPipe, a novel approach featuring a slice-level scheduling method for sequence pipeline parallelism that reduces memory consumption without incurring additional communication overhead. In addition, MEPipe employs fine-grained weight gradient computation to reduce idle time and mitigate imbalanced computation among slices.

MEPipe demonstrates up to 1.68× speedup (1.35× on average) on clusters of 64 NVIDIA RTX 4090 GPUs when training Llama models of varying sizes. It achieves 35% Model FLOPs Utilization (MFU) when training the Llama 13B model, making it 2.5× more cost-effective than A100 clusters.
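To make the memory argument concrete, below is a minimal sketch, assuming a slice-level one-forward-one-backward (1F1B-style) schedule; the function names, the `warmup` parameter, and the activation-counting model are illustrative assumptions, not MEPipe's actual interface. It contrasts a naive all-forward-then-backward schedule with a slice-level schedule on a single pipeline stage, counting how many slices' activations are live at once.

```python
# Illustrative sketch (not the authors' code): compare peak activation
# memory, measured in live slices, under two schedules on one stage.

def peak_live_naive(num_slices: int) -> int:
    """Run all forwards, then all backwards: every slice's activations
    are resident simultaneously before the first backward frees one."""
    live, peak = 0, 0
    for _ in range(num_slices):      # forward passes accumulate activations
        live += 1
        peak = max(peak, live)
    for _ in range(num_slices):      # backward passes free them
        live -= 1
    return peak

def peak_live_slice_level(num_slices: int, warmup: int) -> int:
    """After `warmup` forwards (standing in for pipeline depth ahead of
    this stage), interleave one backward per forward, so each new slice's
    activations replace a freed one."""
    live, peak = 0, 0
    for step in range(num_slices):
        live += 1                    # forward of slice `step`
        peak = max(peak, live)
        if step >= warmup:           # steady state: a backward frees a slice
            live -= 1
    return peak                      # warmup slices drain in the cooldown phase

if __name__ == "__main__":
    n, warmup = 16, 3
    print("naive peak live slices:      ", peak_live_naive(n))               # 16
    print("slice-level peak live slices:", peak_live_slice_level(n, warmup)) # 4
```

Under these assumptions, peak activation memory grows with the number of slices in the naive schedule but is bounded by the warmup depth in the slice-level one; deferring weight-gradient computation into otherwise-idle slots, as the abstract describes, is orthogonal to this and targets pipeline bubbles rather than memory.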