# MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training

Software artifact for the SC'25 paper "MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall".

### High-level Summary
When training large models on limited GPU resources, additional memory tiers such as DRAM and NVMe can be leveraged as swap spaces to accommodate large model sizes. While this democratizes the training of larger models, it incurs a significant slowdown due to the cost of data movement across memory tiers. *MLP-Offload* mitigates these multi-tier data management challenges through a series of novel design principles: (a) unified multi-level, multi-path asynchronous offloading using virtual tiers; (b) optimized virtual tier concurrency control for multi-path I/O; (c) cache-friendly ordering of model subgroup processing; and (d) delayed in-place mixed-precision gradient conversion during updates. These are complemented by an I/O performance model. Please refer to our paper for more details.
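
To build intuition for why multiple I/O paths can help, the following is a minimal sketch that splits a payload across two paths in proportion to their bandwidth, so that both transfers finish at the same time. It is an illustration only and not the I/O performance model from the paper; the function name `split_by_bandwidth`, the path names, and the bandwidth figures are all hypothetical.

```python
def split_by_bandwidth(total_gb: float, bandwidths_gbps: dict) -> dict:
    """Split total_gb across paths in proportion to each path's bandwidth."""
    total_bw = sum(bandwidths_gbps.values())
    return {name: total_gb * bw / total_bw for name, bw in bandwidths_gbps.items()}


# Hypothetical sustained bandwidths in GB/s (assumptions, not measured values).
paths = {"nvme": 6.0, "pfs": 2.0}
payload_gb = 64.0

shares = split_by_bandwidth(payload_gb, paths)

# Time if everything goes through the NVMe path alone.
single_path_s = payload_gb / paths["nvme"]
# Time when both paths are used concurrently: the slowest path decides.
multi_path_s = max(shares[name] / paths[name] for name in paths)

print(f"shares (GB): {shares}")
print(f"NVMe only: {single_path_s:.2f} s, NVMe + PFS: {multi_path_s:.2f} s")
```

With a proportional split, the transfer time approaches the payload size divided by the aggregate bandwidth, which is the best case that concurrency control over several paths can aim for.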

### Composition of Software Artifacts
The software artifacts in this repository are organized into four components:
1. **Megatron-DeepSpeed**: This repository implements the core transformer capabilities central to LLMs. The original [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) library was modified to inject DeepSpeed-specific runtime optimizations, and is available as [Megatron-DeepSpeed](https://github.com/deepspeedai/Megatron-DeepSpeed). This folder is a git submodule that points to our version of Megatron-DeepSpeed, which additionally profiles the performance of the different training phases.
2. **DeepSpeed**: This repository contains the DeepSpeed LLM training runtime, which provides various optimizations such as redundancy elimination (ZeRO), offloading, and subgroup sharding. For our targeted scenario of GPU-constrained environments, we consider the ZeRO-3 optimization, which partitions the model parameters, gradients, and optimizer states across all data-parallel ranks, leading to the most memory-efficient training setup. We specifically modify the `deepspeed/runtime/zero/stage3.py` file to run with *MLP-Offload*.
3. **scripts**: This folder contains the scripts to set up and install the required packages, launch experiments using the different approaches, and parse the logs to obtain the required performance metrics.
4. **logs**: This folder contains a sample subset of logs from the different approaches.
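
To see why ZeRO-3 partitioning alone is not enough on a small number of GPUs, here is a rough back-of-the-envelope estimate of the model-state footprint per rank. It assumes the commonly cited mixed-precision Adam accounting of 16 bytes per parameter (an assumption taken from the ZeRO literature, not from this repository), ignores activations and buffers, and uses a hypothetical count of 4 data-parallel ranks.

```python
# Assumed bytes per parameter for mixed-precision Adam:
# 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 params, momentum, variance).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(params_billion: float, dp_ranks: int) -> float:
    """Model-state size in GB (1e9 bytes) held by each data-parallel rank."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM
    return total_bytes / dp_ranks / 1e9


# Hypothetical setup: 4 data-parallel ranks.
for size_b in (40, 70, 120):
    print(f"{size_b}B parameters: {model_state_gb(size_b, 4):.0f} GB per rank")
```

Under these assumptions, even the 40B model needs about 160 GB of model state per rank, which is more than the 40 GB or 80 GB of an A100, so the remainder has to live in DRAM, NVMe, or remote storage.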

### Testbed Setup

#### Hardware Requirements
1. NVIDIA GPU-enabled node(s) with A100 or newer architectures
2. Node-local NVMe(s)
3. Remote storage, preferably a parallel file system

#### Software Requirements and Installation
1. Basic prerequisites include: [Python (>=3.10)](https://www.python.org/downloads/release/python-3100/); [CUDA toolkit version 12.3](https://developer.nvidia.com/cuda-12-3-0-download-archive); [GCC version 11.1+](https://gcc.gnu.org/install/)
2. All other software packages can be installed using [`scripts/installs.sh`](scripts/installs.sh).

### Evaluations

#### Running Experiments
Once the packages are installed, change `NVME_PATH` and `PFS_PATH` in [scripts/launch-expt.sh](scripts/launch-expt.sh) to point to the node-local disk and remote storage locations of the testbed.

Finally, running `bash scripts/launch-expt.sh` trains a 40B model, first using the vanilla DeepSpeed ZeRO-3 offloading engine and then using MLP-Offload. The logs become available in the `logs` directory.
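
Before launching a long run, it can be worth checking that both storage locations exist and are writable. The following is a minimal sketch of such a pre-flight check; the helper `check_offload_dir` is hypothetical and not part of this repository, and the paths you pass in should be whatever you set for `NVME_PATH` and `PFS_PATH`.

```python
import os
import tempfile


def check_offload_dir(path: str) -> bool:
    """Return True if path is an existing directory in which files can be created."""
    if not os.path.isdir(path):
        return False
    try:
        # Create and immediately discard a temporary file inside the directory.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


# Example with the system temporary directory; substitute your own locations.
print(check_offload_dir(tempfile.gettempdir()))
```

A check like this only confirms that the directory is usable; it says nothing about free capacity or bandwidth, which still need to be verified separately.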

#### Parsing Results
After the logs are created in the `logs` directory, the relevant performance profiles can be extracted with the parsing script [scripts/parse-res.py](scripts/parse-res.py). Since the generated log filenames contain the root path of the file system(s) used as suffix arguments, the `LOCAL_NVME_ROOT` and `PFS_ROOT` variables must be set to the root names of the file systems. These values default to `tmp` for our testbed's local NVMe and to `vast` for our VAST file system PFS cluster. We supply sample logs from one of our testbeds containing the profiles of the 40B to 120B models, representative of Figure 7 in the paper, which shows the average iteration time breakdown when scaling model sizes. The logs can be parsed as follows:
```bash
python scripts/parse-res.py
```
This produces the following output:

| Model(B) | Approach         | Elapsed(ms) | FWD(ms) | BWD(ms)  | UPDATE(ms) | SPEEDUP |
|----------|------------------|-------------|---------|----------|------------|---------|
| 40       | DeepSpeed ZeRO-3 | 242280.6    | 653.89  | 27473.02 | 213610.43  | 1.00    |
| 40       | MLP-Offload      | 101717.3    | 639.52  | 2043.55  | 98507.29   | 2.38    |
| 52       | DeepSpeed ZeRO-3 | 238597.6    | 512.46  | 28293.34 | 209336.10  | 1.00    |
| 52       | MLP-Offload      | 92173.0     | 514.48  | 1833.24  | 89407.41   | 2.59    |
| 70       | DeepSpeed ZeRO-3 | 370562.1    | 765.63  | 32905.03 | 336426.77  | 1.00    |
| 70       | MLP-Offload      | 151337.8    | 770.55  | 2946.07  | 147183.76  | 2.45    |
| 100      | DeepSpeed ZeRO-3 | 572027.2    | 1202.37 | 68341.33 | 501915.92  | 1.00    |
| 100      | MLP-Offload      | 275528.8    | 1205.63 | 4563.45  | 269246.05  | 2.08    |
| 120      | DeepSpeed ZeRO-3 | 550360.6    | 1165.51 | 73194.69 | 475480.44  | 1.00    |
| 120      | MLP-Offload      | 288178.5    | 1160.60 | 4201.09  | 282331.47  | 1.91    |
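
The SPEEDUP column is consistent with the ratio of the DeepSpeed ZeRO-3 elapsed time to the MLP-Offload elapsed time for the same model size. The short sketch below recomputes it from the table values as a sanity check; it is not the logic of `scripts/parse-res.py`, and the names `elapsed_ms` and `speedup` are chosen here for illustration.

```python
# Elapsed time per iteration in milliseconds, copied from the table above.
elapsed_ms = {
    40: {"DeepSpeed ZeRO-3": 242280.6, "MLP-Offload": 101717.3},
    52: {"DeepSpeed ZeRO-3": 238597.6, "MLP-Offload": 92173.0},
    70: {"DeepSpeed ZeRO-3": 370562.1, "MLP-Offload": 151337.8},
    100: {"DeepSpeed ZeRO-3": 572027.2, "MLP-Offload": 275528.8},
    120: {"DeepSpeed ZeRO-3": 550360.6, "MLP-Offload": 288178.5},
}


def speedup(model_b: int) -> float:
    """Baseline elapsed time divided by MLP-Offload elapsed time."""
    row = elapsed_ms[model_b]
    return row["DeepSpeed ZeRO-3"] / row["MLP-Offload"]


for model_b in elapsed_ms:
    print(f"{model_b}B: {speedup(model_b):.2f}x")
```

Rounded to two decimals, the ratios match the reported values of 2.38, 2.59, 2.45, 2.08, and 1.91.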