
Commit b76c7cc

tohtana and tjruwase authored
Add example of DeepCompile (#967)
* import files for deepcompile benchmark
* add figures
* add figures
* update document
* fix links to images
* add images
* specify deepspeed version

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
1 parent 7b34e07 commit b76c7cc

27 files changed (+1729, -0 lines)

Diff for: benchmarks/deepcompile/.gitignore

+3
@@ -0,0 +1,3 @@
*.log
*.pyc
*.png

Diff for: benchmarks/deepcompile/README.md

+151
@@ -0,0 +1,151 @@
# Benchmarks for DeepCompile

## Setup

These experiment scripts require 4 nodes, each with 8 A100/H100 GPUs.
We tested the scripts with Python 3.10.12 and CUDA 12.4.
### Libraries

In addition, you need to install the following:

- PyTorch v2.6.0
- DeepSpeed (v0.16.6 or newer)
- transformers
- accelerate
- datasets v3.1

Here is an example of the installation commands:
```bash
pip3 install torch==2.6.0 torchvision torchaudio
pip3 install transformers datasets==3.1 accelerate

# Install DeepSpeed
pip install deepspeed

# Clone this repository
git clone https://github.com/deepspeedai/DeepSpeedExamples
cd DeepSpeedExamples/benchmarks/deepcompile
```

You need to set these up on all nodes.

### Setup for multi-node runs

You need to set the host names in `hostfile_n${NUM_NODES}`. The file should look like the following:

```
node-0 slots=8
node-1 slots=8
node-2 slots=8
node-3 slots=8
```

## Evaluation on throughput

The following script runs the throughput benchmark. It sweeps the following conditions:

- Models: meta-llama/Meta-Llama-3-70B-Instruct, mistralai/Mixtral-8x7B-v0.1
- Batch size: 1, 2, 4
- Sequence length: 512, 1024, 2048
- Frameworks and settings:
  - DeepSpeed ZeRO3 (ZeRO3)
  - DeepSpeed ZeRO3 + Compiler (ZeRO3 (C))
  - FSDP (FSDP)
  - FSDP + Compiler (FSDP (C))
  - DeepCompile + proactive prefetching (DeepCompile (P))
  - DeepCompile + selective unsharding (DeepCompile (S))
  - DeepCompile + proactive prefetching + selective unsharding (DeepCompile (P+S))

The script downloads the models from the Hugging Face Model Hub. Please make sure that you have access to the models.
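
For gated models such as Llama-3, you may also need to authenticate with Hugging Face first. One way to do this, assuming the `huggingface_hub` CLI is available (it is typically installed alongside `transformers`), is:

```bash
# Log in with an access token that has been granted access to the models
huggingface-cli login
```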

```bash
export PROFILE_DIR=/path/to/profile
bash run_bench.sh
```

The logs from our experiments are stored in the `logs/` directory. The summary of results is written to `profiles/result.txt`. You can copy that file to `results/acc_step_1` and plot the throughput with the following commands.
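
As a minimal sketch, the copy step could look like this (the destination directory matches the plotting command below):

```bash
mkdir -p results/acc_step_1
cp profiles/result.txt results/acc_step_1/
```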

```bash
python plot.py --result_dir results/acc_step_1 --metric throughput
```

Here are some example charts:

<table>
<tr>
<td><img src="results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs1.png" alt="Llama-3-70B/bs=1" width="300"></td>
<td><img src="results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png" alt="Mixtral-8x7B/bs=1" width="300"></td>
</tr>
</table>

The following script runs the benchmark with different numbers of gradient accumulation steps (2, 4, 8, 16).
The batch size and sequence length are fixed to 1 and 1024, respectively. (Note that FSDP doesn't work for this experiment.)

```bash
bash run_bench_acc.sh
```

You can use the same plotting script with `--acc_step_eval` to plot the results along gradient accumulation steps.

```bash
python plot.py --result_dir results/acc_step_1_16 --acc_step_eval --metric throughput
```

Here are some example charts:

<table>
<tr>
<td><img src="results/acc_step_1_16/throughput/chart_throughput_Llama-3-70B_np32_bs1.png" alt="Llama-3-70B/bs=1" width="300"></td>
<td><img src="results/acc_step_1_16/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png" alt="Mixtral-8x7B/bs=1" width="300"></td>
</tr>
</table>

## APIs and custom optimization passes

To enable DeepCompile, simply set `"deepcompile": true` in the `compile` section of your DeepSpeed configuration JSON:

```json
{
  "zero_optimization": {
    "stage": 3
  },
  "compile": {
    "deepcompile": true
  }
}
```

In your training script, call the `compile()` API to invoke DeepCompile. The function signature is:

```python
def compile(self, backend=get_accelerator().get_compile_backend(), compile_kwargs={}, schedule=None) -> None:
```
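
As a rough sketch (assuming `compile()` is invoked on the DeepSpeed engine returned by `deepspeed.initialize()`, and using placeholder variable names), a call with the default settings might look like:

```python
import deepspeed

# model, optimizer, and ds_config are defined elsewhere in your training script
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                optimizer=optimizer,
                                                config=ds_config)

# Invoke DeepCompile with the default backend and no custom schedule
engine.compile()
```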

You can pass a custom optimization schedule using the `schedule` argument. For example, to apply ZeRO-3-style partitioning and the optimizations described above, you can define the schedule as follows:
```python
# Import path assumed from the pass implementations linked at the end of this section
from deepspeed.compile.passes import zero3_compile, prefetch, selective_gather

WARMUP = 5  # example warmup step count (an assumption; any step after the profiling warmup works)

schedule = []
schedule.append((0, [zero3_compile.add_z3_gather_release]))
schedule.append(
    (WARMUP,
     [zero3_compile.add_z3_gather_release, prefetch.schedule_prefetch, selective_gather.selective_gather]))
```

A schedule is defined as a list of tuples, where each tuple consists of:

- A step index (e.g., 0 or `WARMUP`), indicating when to apply the passes
- A list of optimization functions to apply at that step

In the example above, `add_z3_gather_release` is applied at step 0 to minimize memory usage. After a warmup phase (e.g., after the first few training iterations), additional optimizations such as prefetching and selective unsharding are applied based on profiled memory usage.
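
The schedule defined above can then be passed to `compile()` via its `schedule` argument; for example (again using a placeholder engine variable):

```python
# Apply the custom pass schedule when compiling the engine
engine.compile(schedule=schedule)
```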

Each optimization pass takes a standardized set of arguments provided by DeepCompile. For details, please refer to the implementation of each pass:

- [ZeRO3 (All-gather and reduce-scatter insertion)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero3_compile.py)
- [Proactive prefetching](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/prefetch.py)
- [Selective unsharding](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/selective_gather.py)
- [Reduce-scatter insertion (ZeRO1)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero1_compile.py)
- [Adaptive offloading](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/offload_adam_states.py)
+14

@@ -0,0 +1,14 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
machine_rank: {{ machine_rank }}
main_training_function: main
mixed_precision: bf16
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
+33

@@ -0,0 +1,33 @@
{
  {% if fp16 %}
  "fp16": {
    "enabled": true,
    "initial_scale_power": 8
  },
  {% else %}
  "bf16": {
    "enabled": true
  },
  {% endif %}
  "zero_optimization": {
    "stage": {{ zero_stage }},
    "sub_group_size": 100000000
  },
  "compile": {
    "deepcompile": {{ deepcompile }},
    "offload_activation": false,
    "offload_opt_states": false,
    "double_buffer": true,
    "symmetric_memory": false,
    "free_activation": false,
    "debug_log": {{ debug_log }},
    "sync_before_reduce": {{ sync_before_reduce }},
    "sync_after_reduce": {{ sync_after_reduce }}
  },
  "gradient_accumulation_steps": {{ gradient_accumulation_steps }},
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
+19

@@ -0,0 +1,19 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  {%- if zero_stage == 3 %}
  zero3_init_flag: true
  {%- endif %}
  deepspeed_config_file: configs/ds_config.json
distributed_type: DEEPSPEED
machine_rank: {{ machine_rank }}
main_training_function: main
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
+28

@@ -0,0 +1,28 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  {%- if zero_stage == 3 %}
  fsdp_sharding_strategy: FULL_SHARD
  {%- else %}
  fsdp_sharding_strategy: SHARD_GRAD_OP
  {%- endif %}
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: {{ machine_rank }}
main_training_function: main
mixed_precision: bf16
num_machines: {{ num_machines }}
num_processes: {{ num_processes }}
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
@@ -0,0 +1,6 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: NO
main_training_function: main
mixed_precision: bf16
use_cpu: false
