Add example of DeepCompile #967

Merged · 10 commits · Apr 16, 2025
3 changes: 3 additions & 0 deletions benchmarks/deepcompile/.gitignore
@@ -0,0 +1,3 @@
*.log
*.pyc
*.png
151 changes: 151 additions & 0 deletions benchmarks/deepcompile/README.md
@@ -0,0 +1,151 @@
# Benchmarks for DeepCompile

## Setup

These experiment scripts require 4 nodes, each with 8 A100/H100 GPUs.
We tested the scripts with Python 3.10.12 and CUDA 12.4.

### Libraries

In addition, you need to install the following:

- PyTorch v2.6.0
- DeepSpeed (v0.16.6 or newer)
- transformers
- accelerate
- datasets v3.1

Here is an example of the installation commands:

```bash
pip3 install torch==2.6.0 torchvision torchaudio
pip3 install transformers datasets==3.1 accelerate

# Install DeepSpeed
pip install deepspeed

# Clone this repository
git clone https://github.com/deepspeedai/DeepSpeedExamples
cd DeepSpeedExamples/benchmarks/deepcompile
```

You need to set these up on all nodes.

### Setup for multi-node runs

You need to set host names in `hostfile_n${NUM_NODES}`. The file should look like the following:

```
node-0 slots=8
node-1 slots=8
node-2 slots=8
node-3 slots=8
```

## Throughput evaluation

The following script runs the throughput benchmark. It sweeps the following conditions:

- Models: meta-llama/Meta-Llama-3-70B-Instruct, mistralai/Mixtral-8x7B-v0.1
- Batch size: 1, 2, 4
- Sequence length: 512, 1024, 2048
- Frameworks and settings:
  - DeepSpeed ZeRO3 (ZeRO3)
  - DeepSpeed ZeRO3 + Compiler (ZeRO3 (C))
  - FSDP (FSDP)
  - FSDP + Compiler (FSDP (C))
  - DeepCompile + proactive prefetching (DeepCompile (P))
  - DeepCompile + selective unsharding (DeepCompile (S))
  - DeepCompile + proactive prefetching + selective unsharding (DeepCompile (P+S))

The script downloads the models from the Hugging Face Model Hub. Please make sure that you have access to the models.

```bash
export PROFILE_DIR=/path/to/profile
bash run_bench.sh
```

The logs from our experiments are stored in the `logs/` directory. The summary of results is written to `profiles/result.txt`. You can copy the file to `results/acc_step_1` and plot the throughput with the following command.

```bash
python plot.py --result_dir results/acc_step_1 --metric throughput
```

Here are some example charts:

<table>
<tr>
<td><img src="results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs1.png" alt="Llama-3-70B/bs=1" width="300"></td>
<td><img src="results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png" alt="Mixtral-8x7B/bs=1" width="300"></td>
</tr>
</table>

The following script runs the benchmark with different numbers of gradient accumulation steps (2, 4, 8, 16).

The batch size and sequence length are fixed to 1 and 1024, respectively. (Note that FSDP does not work for this experiment.)

```bash
bash run_bench_acc.sh
```

You can use the same plotting script with `--acc_step_eval` to plot the results across gradient accumulation steps.

```bash
python plot.py --result_dir results/acc_step_1_16 --acc_step_eval --metric throughput
```

Here are some example charts:

<table>
<tr>
<td><img src="results/acc_step_1_16/throughput/chart_throughput_Llama-3-70B_np32_bs1.png" alt="Llama-3-70B/bs=1" width="300"></td>
<td><img src="results/acc_step_1_16/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png" alt="Mixtral-8x7B/bs=1" width="300"></td>
</tr>
</table>

## APIs and custom optimization passes

To enable DeepCompile, set `"deepcompile": true` in the `compile` section of your DeepSpeed configuration JSON:

```json
{
  "zero_optimization": {
    "stage": 3
  },
  "compile": {
    "deepcompile": true
  }
}
```

In your training script, call the `compile()` API to invoke DeepCompile. The function signature is:

```python
def compile(self, backend=get_accelerator().get_compile_backend(), compile_kwargs={}, schedule=None) -> None:
```
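
For reference, here is a minimal sketch of where that call fits in a training script. The toy model is illustrative only, and the config path assumes a JSON like the one above has been saved to `configs/ds_config.json`; this is not part of the benchmark scripts.

```python
import torch
import deepspeed

# Minimal sketch, not the benchmark code: a toy model just to show where
# compile() is invoked. Assumes configs/ds_config.json enables DeepCompile
# as in the configuration above.
model = torch.nn.Linear(1024, 1024)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="configs/ds_config.json",
)

# Invoke DeepCompile; backend and compile_kwargs fall back to their defaults,
# and no custom schedule is supplied.
engine.compile()
```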

You can pass a custom optimization schedule using the `schedule` argument. For example, to apply ZeRO-3-style partitioning and the optimizations described above, you can define the schedule as follows:

```python
schedule = []
schedule.append((0, [zero3_compile.add_z3_gather_release]))
schedule.append(
    (WARMUP,
     [zero3_compile.add_z3_gather_release, prefetch.schedule_prefetch, selective_gather.selective_gather]))
```

A schedule is defined as a list of tuples, where each tuple consists of:

- A step index (e.g., 0 or `WARMUP`), indicating when to apply the passes
- A list of optimization functions to apply at that step

In the example above, `add_z3_gather_release` is applied at step 0 to minimize memory usage. After a warmup phase (e.g., after the first few training iterations), additional optimizations such as prefetching and selective unsharding are applied based on profiled memory usage.
Each optimization pass takes a standardized set of arguments provided by DeepCompile. For details, please refer to the implementation of each pass:

- [ZeRO3 (All-gather and reduce-scatter insertion)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero3_compile.py)
- [Proactive prefetching](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/prefetch.py)
- [Selective unsharding](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/selective_gather.py)
- [Reduce-scatter insertion (ZeRO1)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero1_compile.py)
- [Adaptive offloading](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/offload_adam_states.py)
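
As an end-to-end illustration, the sketch below wires such a schedule into `compile()`. The import paths mirror the file locations linked above and the `WARMUP` value is arbitrary; both are assumptions, so check your DeepSpeed version for the exact module names.

```python
# Sketch only: the import paths and WARMUP value are assumptions, not verified API.
from deepspeed.compile.passes import prefetch, selective_gather, zero3_compile

WARMUP = 5  # step after which the profile-guided passes are applied

schedule = [
    (0, [zero3_compile.add_z3_gather_release]),
    (WARMUP, [zero3_compile.add_z3_gather_release,
              prefetch.schedule_prefetch,
              selective_gather.selective_gather]),
]

# `engine` is the DeepSpeed engine returned by deepspeed.initialize().
engine.compile(schedule=schedule)
```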
14 changes: 14 additions & 0 deletions benchmarks/deepcompile/configs/ddp_config.yaml.template
@@ -0,0 +1,14 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
machine_rank: {{ machine_rank }}
main_training_function: main
mixed_precision: bf16
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
33 changes: 33 additions & 0 deletions benchmarks/deepcompile/configs/ds_config.json.template
@@ -0,0 +1,33 @@
{
    {% if fp16 %}
    "fp16": {
        "enabled": true,
        "initial_scale_power": 8
    },
    {% else %}
    "bf16": {
        "enabled": true
    },
    {% endif %}
    "zero_optimization": {
        "stage": {{ zero_stage }},
        "sub_group_size": 100000000
    },
    "compile": {
        "deepcompile": {{ deepcompile }},
        "offload_activation": false,
        "offload_opt_states": false,
        "double_buffer": true,
        "symmetric_memory": false,
        "free_activation": false,
        "debug_log": {{ debug_log }},
        "sync_before_reduce": {{ sync_before_reduce }},
        "sync_after_reduce": {{ sync_after_reduce }}
    },
    "gradient_accumulation_steps": {{ gradient_accumulation_steps }},
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
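
For illustration, a hypothetical way to render this template with Jinja2 is sketched below; the benchmark's run scripts may fill these values differently, and the specific values shown are assumptions.

```python
# Hypothetical rendering of ds_config.json.template; values are illustrative only.
from jinja2 import Template

with open("configs/ds_config.json.template") as f:
    template = Template(f.read())

rendered = template.render(
    fp16=False,                      # falls through to the bf16 branch
    zero_stage=3,
    deepcompile="true",              # strings keep the rendered JSON literals unquoted
    debug_log="false",
    sync_before_reduce="false",
    sync_after_reduce="false",
    gradient_accumulation_steps=1,
)

with open("configs/ds_config.json", "w") as f:
    f.write(rendered)
```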
19 changes: 19 additions & 0 deletions benchmarks/deepcompile/configs/ds_config.yaml.template
@@ -0,0 +1,19 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  {%- if zero_stage == 3 %}
  zero3_init_flag: true
  {%- endif %}
  deepspeed_config_file: configs/ds_config.json
distributed_type: DEEPSPEED
machine_rank: {{ machine_rank }}
main_training_function: main
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
28 changes: 28 additions & 0 deletions benchmarks/deepcompile/configs/fsdp_config.yaml.template
@@ -0,0 +1,28 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  {%- if zero_stage == 3 %}
  fsdp_sharding_strategy: FULL_SHARD
  {%- else %}
  fsdp_sharding_strategy: SHARD_GRAD_OP
  {%- endif %}
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: {{ machine_rank }}
main_training_function: main
mixed_precision: bf16
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
@@ -0,0 +1,6 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: NO
main_training_function: main
mixed_precision: bf16
use_cpu: false