diff --git a/benchmarks/deepcompile/.gitignore b/benchmarks/deepcompile/.gitignore
new file mode 100644
index 000000000..d73b31758
--- /dev/null
+++ b/benchmarks/deepcompile/.gitignore
@@ -0,0 +1,3 @@
+*.log
+*.pyc
+*.png
diff --git a/benchmarks/deepcompile/README.md b/benchmarks/deepcompile/README.md
new file mode 100644
index 000000000..4e580a47a
--- /dev/null
+++ b/benchmarks/deepcompile/README.md
@@ -0,0 +1,151 @@
+# Benchmarks for DeepCompile
+
+## Setup
+
+These experiment scripts require 4 nodes, each with 8 A100/H100 GPUs.
+We tested the scripts with Python 3.10.12 and CUDA 12.4.
+
+### Libraries
+
+In addition, you need to install the following:
+
+- PyTorch v2.6.0
+- DeepSpeed (v0.16.6 or newer)
+- transformers
+- accelerate
+- datasets v3.1
+
+Here is an example of the installation commands:
+
+```bash
+pip3 install torch==2.6.0 torchvision torchaudio
+pip3 install transformers datasets==3.1 accelerate
+
+# Install DeepSpeed
+pip install deepspeed
+
+# Clone this repository
+git clone https://github.com/deepspeedai/DeepSpeedExamples
+cd DeepSpeedExamples/benchmarks/deepcompile
+```
+
+You need to set these up on all nodes.
+
+### Setup for multi-node runs
+
+You need to list the host names in `hostfile_n${NUM_NODES}`. The file should look like the following:
+
+```
+node-0 slots=8
+node-1 slots=8
+node-2 slots=8
+node-3 slots=8
+```
+
+## Throughput evaluation
+
+The following script runs the throughput benchmark. It sweeps over the following conditions:
+
+- Models: meta-llama/Meta-Llama-3-70B-Instruct, mistralai/Mixtral-8x7B-v0.1
+- Batch size: 1, 2, 4
+- Sequence length: 512, 1024, 2048
+- Frameworks and settings:
+ - DeepSpeed ZeRO3 (ZeRO3)
+ - DeepSpeed ZeRO3 + Compiler (ZeRO3 (C))
+ - FSDP (FSDP)
+ - FSDP + Compiler (FSDP (C))
+ - DeepCompile + proactive prefetching (DeepCompile (P))
+ - DeepCompile + selective unsharding (DeepCompile (S))
+ - DeepCompile + proactive prefetching + selective unsharding (DeepCompile (P+S))
+
+The script downloads the models from the Hugging Face Hub. Please make sure that you have access to these models.
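+
+If the nodes are not yet authenticated to the Hugging Face Hub, you can log in programmatically. The snippet below is only an illustrative sketch; it assumes the `huggingface_hub` package (installed as a dependency of `transformers`) and a valid access token.
+
+```python
+# Minimal sketch: authenticate to the Hugging Face Hub before running the benchmark.
+from huggingface_hub import login
+
+login()  # prompts for a token; alternatively, pass login(token="hf_...")
+```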
+
+```bash
+export PROFILE_DIR=/path/to/profile
+bash run_bench.sh
+```
+
+The logs resulting from our experiments are stored in the `logs/` directory. The summary of results is output to `profiles/result.txt`. You can copy the file to `results/acc_step_1` (as `results.txt`, the default file name expected by `plot.py`) and plot the throughput with the following command.
+
+```bash
+python plot.py --result_dir results/acc_step_1 --metric throughput
+```
+
+Here are some example charts:
+
+*(Example charts: `results/acc_step_1/throughput/`, throughput for Llama-3-70B and Mixtral-8x7B at batch sizes 1, 2, and 4.)*
+
+The following script runs the benchmark with different numbers of gradient accumulation steps (2, 4, 8, 16).
+
+The batch size and sequence length are fixed at 1 and 1024, respectively. (Note that FSDP does not work for this experiment.)
+
+```bash
+bash run_bench_acc.sh
+```
+
+You can use the same plotting script with `--acc_step_eval` to plot the results against the number of gradient accumulation steps.
+
+```bash
+python plot.py --result_dir results/acc_step_1_16 --acc_step_eval --metric throughput
+```
+
+Here are some example charts:
+
+*(Example charts: `results/acc_step_1_16/throughput/`, throughput for Llama-3-70B and Mixtral-8x7B at batch size 1.)*
+
+## APIs and custom optimization passes
+
+To enable DeepCompile, simply set `"deepcompile": true` in the `compile` section of your DeepSpeed configuration JSON:
+
+```json
+{
+  …
+  "zero_optimization": {
+    "stage": 3
+  },
+  "compile": {
+    "deepcompile": true
+  },
+  …
+}
+```
+
+In your training script, call the `compile()` API to invoke DeepCompile. The function signature is:
+
+```python
+def compile(self, backend=get_accelerator().get_compile_backend(), compile_kwargs={}, schedule=None) -> None:
+```
+
+You can pass a custom optimization schedule using the schedule argument. For example, to apply ZeRO-3-style partitioning and the optimizations described above, you can define the schedule as follows:
+
+```python
+schedule = []
+schedule.append((0, [zero3_compile.add_z3_gather_release]))
+schedule.append(
+    (WARMUP,
+     [zero3_compile.add_z3_gather_release, prefetch.schedule_prefetch, selective_gather.selective_gather]))
+```
+
+A schedule is defined as a list of tuples, where each tuple consists of:
+
+- A step index (e.g., 0 or `WARMUP`), indicating the training step at which to apply the passes
+- A list of optimization functions to apply at that step
+
+In the example above, `add_z3_gather_release` is applied at step 0 to minimize memory usage. After a warmup phase (e.g., after the first few training iterations), additional optimizations such as prefetching and selective unsharding are applied based on profiled memory usage.
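+
+Putting these together, the following is a minimal sketch of passing a custom schedule to `compile()`. It is illustrative only: the tiny stand-in model, the `WARMUP` value, and the config path are assumptions, and it assumes `compile()` is exposed on the engine returned by `deepspeed.initialize`. The pass import path follows the links below.
+
+```python
+import torch
+import deepspeed
+from deepspeed.compile.passes import zero3_compile, prefetch, selective_gather
+
+WARMUP = 5  # assumption: switch to the profile-guided passes after 5 steps
+
+schedule = [
+    (0, [zero3_compile.add_z3_gather_release]),
+    (WARMUP, [zero3_compile.add_z3_gather_release,
+              prefetch.schedule_prefetch,
+              selective_gather.selective_gather]),
+]
+
+model = torch.nn.Linear(1024, 1024)  # stand-in for a real model
+engine, _, _, _ = deepspeed.initialize(model=model,
+                                       model_parameters=model.parameters(),
+                                       config="configs/ds_config.json")
+engine.compile(schedule=schedule)
+```
+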
+Each optimization pass takes a standardized set of arguments provided by DeepCompile. For details, please refer to the implementation of each pass:
+
+- [ZeRO3 (All-gather and reduce-scatter insertion)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero3_compile.py)
+- [Proactive prefetching](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/prefetch.py)
+- [Selective unsharding](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/selective_gather.py)
+- [Reduce-scatter insertion (ZeRO1)](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/zero1_compile.py)
+- [Adaptive offloading](https://github.com/deepspeedai/DeepSpeed/blob/tohtana/deepcompile/deepspeed/compile/passes/offload_adam_states.py)
diff --git a/benchmarks/deepcompile/configs/ddp_config.yaml.template b/benchmarks/deepcompile/configs/ddp_config.yaml.template
new file mode 100644
index 000000000..947b06949
--- /dev/null
+++ b/benchmarks/deepcompile/configs/ddp_config.yaml.template
@@ -0,0 +1,14 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: MULTI_GPU
+machine_rank: {{ machine_rank }}
+main_training_function: main
+mixed_precision: bf16
+num_machines: {{ num_machines }}
+num_processes: {{ num_processes }}
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
diff --git a/benchmarks/deepcompile/configs/ds_config.json.template b/benchmarks/deepcompile/configs/ds_config.json.template
new file mode 100644
index 000000000..b5eb1589c
--- /dev/null
+++ b/benchmarks/deepcompile/configs/ds_config.json.template
@@ -0,0 +1,33 @@
+{
+ {% if fp16 %}
+ "fp16": {
+ "enabled": true,
+ "initial_scale_power": 8
+ },
+ {% else %}
+ "bf16": {
+ "enabled": true
+ },
+ {% endif %}
+ "zero_optimization": {
+ "stage": {{ zero_stage }},
+ "sub_group_size": 100000000
+ },
+ "compile": {
+ "deepcompile": {{ deepcompile }},
+ "offload_activation": false,
+ "offload_opt_states": false,
+ "double_buffer": true,
+ "symmetric_memory": false,
+ "free_activation": false,
+ "debug_log": {{ debug_log }},
+ "sync_before_reduce": {{ sync_before_reduce }},
+ "sync_after_reduce": {{ sync_after_reduce }}
+ },
+ "gradient_accumulation_steps": {{ gradient_accumulation_steps }},
+ "gradient_clipping": "auto",
+ "steps_per_print": 2000,
+ "train_batch_size": "auto",
+ "train_micro_batch_size_per_gpu": "auto",
+ "wall_clock_breakdown": false
+}
\ No newline at end of file
diff --git a/benchmarks/deepcompile/configs/ds_config.yaml.template b/benchmarks/deepcompile/configs/ds_config.yaml.template
new file mode 100644
index 000000000..f130fbea7
--- /dev/null
+++ b/benchmarks/deepcompile/configs/ds_config.yaml.template
@@ -0,0 +1,19 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+deepspeed_config:
+ deepspeed_multinode_launcher: standard
+ {%- if zero_stage == 3 %}
+ zero3_init_flag: true
+ {%- endif %}
+ deepspeed_config_file: configs/ds_config.json
+distributed_type: DEEPSPEED
+machine_rank: {{ machine_rank }}
+main_training_function: main
+num_machines: {{ num_machines }}
+num_processes: {{ num_processes }}
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
\ No newline at end of file
diff --git a/benchmarks/deepcompile/configs/fsdp_config.yaml.template b/benchmarks/deepcompile/configs/fsdp_config.yaml.template
new file mode 100644
index 000000000..ec1cebaea
--- /dev/null
+++ b/benchmarks/deepcompile/configs/fsdp_config.yaml.template
@@ -0,0 +1,28 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+fsdp_config:
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+ fsdp_backward_prefetch: BACKWARD_PRE
+ fsdp_cpu_ram_efficient_loading: true
+ fsdp_forward_prefetch: false
+ fsdp_offload_params: false
+ {%- if zero_stage == 3 %}
+ fsdp_sharding_strategy: FULL_SHARD
+ {%- else %}
+ fsdp_sharding_strategy: SHARD_GRAD_OP
+ {%- endif %}
+ fsdp_state_dict_type: SHARDED_STATE_DICT
+ fsdp_sync_module_states: true
+ fsdp_use_orig_params: true
+machine_rank: {{ machine_rank }}
+main_training_function: main
+mixed_precision: bf16
+num_machines: {{ num_machines }}
+num_processes: {{ num_processes }}
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
diff --git a/benchmarks/deepcompile/configs/singlegpu_config.yaml.template b/benchmarks/deepcompile/configs/singlegpu_config.yaml.template
new file mode 100644
index 000000000..8763d4d2a
--- /dev/null
+++ b/benchmarks/deepcompile/configs/singlegpu_config.yaml.template
@@ -0,0 +1,6 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: NO
+main_training_function: main
+mixed_precision: bf16
+use_cpu: false
diff --git a/benchmarks/deepcompile/gen_chart_acc_steps.py b/benchmarks/deepcompile/gen_chart_acc_steps.py
new file mode 100644
index 000000000..8b3cbd920
--- /dev/null
+++ b/benchmarks/deepcompile/gen_chart_acc_steps.py
@@ -0,0 +1,263 @@
+import argparse
+import re
+import pandas as pd
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+def throughput_calculator(micro_batch_size, acc_steps, np, elapsed_time_per_iter,
+ hidden_size, num_attention_heads, num_key_value_heads,
+ ffn_hidden_size, num_layers, padded_vocab_size, seq_len,
+ topk: int, swiglu: bool, checkpoint_activations: bool):
+ batch_size = micro_batch_size * acc_steps * np
+ samples_per_second = batch_size / elapsed_time_per_iter
+
+ head_dim = hidden_size // num_attention_heads
+ gqa = num_attention_heads // num_key_value_heads
+ ffn_multiplier = 3 if swiglu else 2
+ macs_per_flops = 2
+
+ pre_and_post_mha_gemm_macs = batch_size * num_layers * (1 + (2 // gqa) + 1) * (hidden_size**2) * seq_len
+ mha_bgemm_macs = batch_size * num_layers * 2 * head_dim * num_attention_heads * (seq_len**2)
+ ffn_gemm_macs = batch_size * num_layers * ffn_multiplier * ffn_hidden_size * hidden_size * seq_len * topk
+ logit_lmhead_gemm_macs = batch_size * padded_vocab_size * hidden_size * seq_len
+
+ fwd_macs = pre_and_post_mha_gemm_macs + mha_bgemm_macs + ffn_gemm_macs + logit_lmhead_gemm_macs
+ bwd_macs = 2 * fwd_macs
+ fwd_bwd_macs = fwd_macs + bwd_macs
+
+ if checkpoint_activations:
+ fwd_bwd_macs += fwd_macs
+
+ flops_per_iteration = fwd_bwd_macs * macs_per_flops
+ tflops = flops_per_iteration / (elapsed_time_per_iter * np * (10**12))
+ return samples_per_second, tflops
+
+
+model_info = {
+ "meta-llama/Meta-Llama-3-8B": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True # Meta-Llama-3ではswigluが使われていると仮定
+ },
+ "meta-llama/Meta-Llama-3-70B-Instruct": {
+ "hidden_size": 8192,
+ "num_attention_heads": 64,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 32768,
+ "num_layers": 80,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True # Meta-Llama-3ではswigluが使われていると仮定
+ },
+ "mistralai/Mixtral-8x7B-v0.1": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 2, # MixtralではMoEで2エキスパート
+ "swiglu": False # Mistralはswigluを使っていないと仮定
+ }
+}
+
+parser = argparse.ArgumentParser(description="Plot performance metrics.")
+parser.add_argument("--metric", choices=["iteration_time", "throughput", "flops", "mfu", "peak_mem"], required=True,
+ help="Metric to plot: 'iteration_time', 'flops', 'mfu', or 'peak_mem'")
+parser.add_argument("--result_dir", type=str, required=True, help="Path to the directory containing results.txt")
+parser.add_argument("--result_file", type=str, default="results.txt", help="Name of the result file")
+args = parser.parse_args()
+
+
+# Parse the log lines
+pattern = re.compile(
+ r"(?P<time>\d+) (?P<model>[\w./-]+) ds=(?P<ds>\w+) np=(?P<np>\d+) batch_size=(?P<batch_size>\d+) "
+ r"seq=(?P<seq>\d+) acc=(?P<acc>\d+) ac=(?P<ac>\w+) compile=(?P<compile>\w+) iteration time: (?P<iteration_time>[\d.]+) "
+ r"alloc_mem: (?P<alloc_mem>\d+) peak_mem: (?P<peak_mem>\d+)"
+)
+pattern_ctime = re.compile(
+ r"(?P<time>\d+) (?P<model>[\w./-]+) ds=(?P<ds>\w+) np=(?P<np>\d+) batch_size=(?P<batch_size>\d+) "
+ r"seq=(?P<seq>\d+) acc=(?P<acc>\d+) ac=(?P<ac>\w+) compile=(?P<compile>\w+) passes=(?P<passes>[\w,_]+) compile_time=(?P<compile_time>[\d.]+) iteration time: (?P<iteration_time>[\d.]+) "
+ r"alloc_mem: (?P<alloc_mem>\d+) peak_mem: (?P<peak_mem>\d+)"
+)
+pattern_cs = re.compile(
+ r"(?P<time>\d+) (?P<model>[\w./-]+) ds=(?P<ds>\w+) np=(?P<np>\d+) batch_size=(?P<batch_size>\d+) "
+ r"seq=(?P<seq>\d+) acc=(?P<acc>\d+) ac=(?P<ac>\w+) compile=(?P<compile>\w+) schedule=(?P<schedule>\w+) passes=(?P<passes>[\w,_]+) compile_time=(?P<compile_time>[\d.]+) iteration time: (?P<iteration_time>[\d.]+) "
+ r"alloc_mem: (?P<alloc_mem>\d+) peak_mem: (?P<peak_mem>\d+)"
+)
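+
+# Example of a log line matched by `pattern` above (schematic; reconstructed
+# from the regex itself, not copied from a real log):
+#   1700000000 meta-llama/Meta-Llama-3-70B-Instruct ds=True np=32 batch_size=1 seq=1024 acc=1 ac=True compile=False iteration time: 1.234 alloc_mem: 123456 peak_mem: 654321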
+
+file = Path(args.result_dir) / args.result_file
+matches = []
+with open(file) as f:
+ for line in f:
+ match = pattern.match(line)
+ if not match:
+ match = pattern_ctime.match(line)
+ if not match:
+ match = pattern_cs.match(line)
+ if not match:
+ print(f"Not matched: {line}")
+ if match:
+ d = match.groupdict()
+ if "passes" not in d:
+ d["passes"] = ""
+ if "compile_time" not in d:
+ d["compile_time"] = 0
+ if "schedule" not in d:
+ d["schedule"] = d["compile"]
+ matches.append(d)
+
+df = pd.DataFrame(matches)
+
+# Type conversions
+df["ds"] = df["ds"] == "True"
+df["compile"] = df["compile"] == "True"
+df["np"] = df["np"].astype(int)
+df["batch_size"] = df["batch_size"].astype(int) # batch_sizeをfloatに変換
+df["seq"] = df["seq"].astype(int)
+df["iteration_time"] = df["iteration_time"].astype(float) # iteration_timeをfloatに変換
+df["alloc_mem"] = df["alloc_mem"].astype(float)
+df["peak_mem"] = df["peak_mem"].astype(float)
+df["acc"] = df["acc"].astype(int) # accも明示的にint型へ
+df["ac"] = df["ac"] == "True" # acを真偽値に変換
+df["compile_time"] = df["compile_time"].astype(float)
+df["schedule"] = df["schedule"] == "True"
+
+
+# Per-model computation and plotting
+grouped = df.groupby(["model", "np", "batch_size"])
+
+theoretical_peak = 312 # theoretical peak performance in TFLOPS (A100)
+
+
+LABEL_ZERO3 = "ZeRO3"
+LABEL_ZERO3_C = "ZeRO3 (C)"
+LABEL_FSDP = "FSDP"
+LABEL_DC_PS = "DeepCompile (P+S)"
+LABEL_DC_P = "DeepCompile (P)"
+LABEL_DC_S = "DeepCompile (S)"
+
+for (model, np, batch_size), group in grouped:
+ group = group.sort_values("acc")
+ acc_labels = group["acc"].unique()
+
+ print(f"acc_labels: {acc_labels}")
+
+ metric_values = {LABEL_ZERO3: [0] * len(acc_labels),
+ LABEL_ZERO3_C: [0] * len(acc_labels),
+ LABEL_FSDP: [0] * len(acc_labels),
+ LABEL_DC_PS: [0] * len(acc_labels),
+ LABEL_DC_P: [0] * len(acc_labels),
+ LABEL_DC_S: [0] * len(acc_labels)}
+
+ for _, row in group.iterrows():
+
+ if row["ds"] and not row["compile"]:
+ category = LABEL_ZERO3
+ elif not row["ds"] and not row["compile"]:
+ category = LABEL_FSDP
+ elif row["ds"] and row["compile"]:
+ if not row["schedule"]:
+ category = LABEL_ZERO3_C
+ elif row["passes"] == "" or row["passes"] == 'prefetch,selective_gather':
+ category = LABEL_DC_PS
+ # print(f"found prefetch,selective_gather")
+ elif row["passes"] == 'prefetch':
+ category = LABEL_DC_P
+ # print(f"found prefetch")
+ elif row["passes"] == 'selective_gather':
+ category = LABEL_DC_S
+ # print(f"found selective_gather")
+ else:
+ print(f"Unknown category: {row}")
+ continue
+ else:
+ print(f"Unknown category: {row}")
+ continue
+
+ acc_index = list(acc_labels).index(row["acc"])
+ if args.metric == "iteration_time":
+ metric_values[category][acc_index] = row["iteration_time"]
+ elif args.metric == "peak_mem":
+ metric_values[category][acc_index] = row["peak_mem"] / (1024**3)
+ elif args.metric == "throughput":
+ metric_values[category][acc_index] = row["batch_size"] * row["seq"] * row["acc"] / row["iteration_time"]
+ elif args.metric in ["flops", "mfu"]:
+ # Compute FLOPs using the model configuration
+ model_params = model_info[row["model"]]
+ samples_per_second, tflops = throughput_calculator(
+ micro_batch_size=row["batch_size"],
+ acc_steps=row["acc"], # from the log
+ np=row["np"],
+ elapsed_time_per_iter=row["iteration_time"],
+ hidden_size=model_params["hidden_size"],
+ num_attention_heads=model_params["num_attention_heads"],
+ num_key_value_heads=model_params["num_key_value_heads"],
+ ffn_hidden_size=model_params["ffn_hidden_size"],
+ num_layers=model_params["num_layers"],
+ padded_vocab_size=model_params["padded_vocab_size"],
+ seq_len=row["seq"],
+ topk=model_params["topk"],
+ swiglu=model_params["swiglu"], # from the model definition
+ checkpoint_activations=row["ac"] # from the log
+ )
+ if args.metric == "flops":
+ metric_values[category][acc_index] = tflops
+ elif args.metric == "mfu":
+ metric_values[category][acc_index] = tflops / theoretical_peak
+
+ # Build the chart
+ x = range(len(acc_labels))
+ width = 0.15 # bar width
+ ylabel = {
+ "iteration_time": "Iteration Time (s)",
+ "flops": "TFLOPS",
+ "throughput": "Throughput (tokens/s/GPU)",
+ "mfu": "MFU",
+ "peak_mem": "Peak Memory (GB)"
+ }[args.metric]
+
+ plt.figure(figsize=(10, 8))
+ adjust = - 0.5 * width
+ plt.bar([i - width*2 + adjust for i in x], metric_values[LABEL_ZERO3], width, label=LABEL_ZERO3, alpha=0.7)
+ plt.bar([i - width + adjust for i in x], metric_values[LABEL_ZERO3_C], width, label=LABEL_ZERO3_C, alpha=0.7)
+ plt.bar([i + adjust for i in x], metric_values[LABEL_FSDP], width, label=LABEL_FSDP, alpha=0.7)
+ plt.bar([i + width + adjust for i in x], metric_values[LABEL_DC_P], width, label=LABEL_DC_P, alpha=0.7)
+ plt.bar([i + width*2 + adjust for i in x], metric_values[LABEL_DC_S], width, label=LABEL_DC_S, alpha=0.7)
+ plt.bar([i + width*3 + adjust for i in x], metric_values[LABEL_DC_PS], width, label=LABEL_DC_PS, alpha=0.7)
+
+ gain_zero3 = [metric_values[LABEL_DC_PS][i] / metric_values[LABEL_ZERO3][i] for i in range(len(acc_labels))]
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_ZERO3} metric_values: {metric_values[LABEL_ZERO3]} gain_zero3: {gain_zero3}")
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_DC_PS} metric_values: {metric_values[LABEL_DC_PS]}")
+
+ model = model.split('/')[1]
+ model = model.replace("Meta-Llama-3-8B", "Llama-3-8B")
+ model = model.replace("Meta-Llama-3-70B-Instruct", "Llama-3-70B")
+ model = model.replace("Mixtral-8x7B-v0.1", "Mixtral-8x7B")
+
+ plt.title(f"Model: {model}, #GPUs: {np}, Batch Size: {batch_size}", fontsize=24)
+ plt.xlabel("Acc Steps", fontsize=24)
+ plt.ylabel(ylabel, fontsize=24)
+ plt.xticks(x, acc_labels, fontsize=24)
+
+ if args.metric == "peak_mem":
+ plt.ylim(0, 80)
+
+ plt.yticks(fontsize=20)
+ plt.legend(loc="lower right", fontsize=18)
+ plt.grid(axis="y")
+
+ # Save the figure
+ metric_name = args.metric
+ model = model.replace("/", "_")
+ chart_dir = Path(args.result_dir) / Path(metric_name)
+ chart_dir.mkdir(parents=True, exist_ok=True)
+ conf_str = f"{metric_name}_{model}_np{np}_bs{batch_size}"
+ img_path = chart_dir / f"chart_{conf_str}.png"
+ plt.savefig(str(img_path))
+ plt.close()
diff --git a/benchmarks/deepcompile/generate_conf.py b/benchmarks/deepcompile/generate_conf.py
new file mode 100644
index 000000000..b901bd9d5
--- /dev/null
+++ b/benchmarks/deepcompile/generate_conf.py
@@ -0,0 +1,52 @@
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+
+import argparse
+from jinja2 import Template
+from pathlib import Path
+
+def get_args():
+ parser = argparse.ArgumentParser(description='Config generation')
+
+ parser.add_argument('--machine_rank', type=int, help='machine_rank')
+ parser.add_argument('--num_machines', type=int, help='num_machines')
+ parser.add_argument('--num_processes', type=int, help='num_processes')
+ parser.add_argument('--zero_stage', type=int, choices=[0, 1, 2, 3], help='ZeRO stage')
+ parser.add_argument('--fp16', action='store_true', help='Use fp16')
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1)
+ parser.add_argument('--deepcompile', action='store_true', help='Use deepcompile')
+ parser.add_argument('--debug_log', action='store_true', help='Debug log')
+ parser.add_argument('--sync_before_reduce', action='store_true', help='Sync before reduce')
+ parser.add_argument('--sync_after_reduce', action='store_true', help='Sync after reduce')
+ parser.add_argument('--sync_before_allgather', action='store_true', help='Sync before allgather')
+ parser.add_argument('--sync_after_allgather', action='store_true', help='Sync after allgather')
+
+ parser.add_argument('--template_file', type=Path, help='Template file')
+ parser.add_argument('--output_file', type=Path, help='Output file')
+
+ return parser.parse_args()
+
+
+def main(args):
+ with open(args.template_file, 'r') as f:
+ template = Template(f.read())
+
+ with open(args.output_file, 'w') as f:
+ f.write(template.render(machine_rank=args.machine_rank,
+ num_machines=args.num_machines,
+ num_processes=args.num_processes,
+ zero_stage=args.zero_stage,
+ fp16=args.fp16,
+ gradient_accumulation_steps=args.gradient_accumulation_steps,
+ deepcompile=str(args.deepcompile).lower(),
+ debug_log=str(args.debug_log).lower(),
+ sync_before_reduce=str(args.sync_before_reduce).lower(),
+ sync_after_reduce=str(args.sync_after_reduce).lower(),
+ sync_before_allgather=str(args.sync_before_allgather).lower(),
+ sync_after_allgather=str(args.sync_after_allgather).lower()))
+
+if __name__ == '__main__':
+ args = get_args()
+ main(args)
diff --git a/benchmarks/deepcompile/hostfile_n4 b/benchmarks/deepcompile/hostfile_n4
new file mode 100644
index 000000000..6d23cdd7f
--- /dev/null
+++ b/benchmarks/deepcompile/hostfile_n4
@@ -0,0 +1,4 @@
+node-0 slots=8
+node-1 slots=8
+node-2 slots=8
+node-3 slots=8
diff --git a/benchmarks/deepcompile/plot.py b/benchmarks/deepcompile/plot.py
new file mode 100644
index 000000000..e55fa1e37
--- /dev/null
+++ b/benchmarks/deepcompile/plot.py
@@ -0,0 +1,258 @@
+import argparse
+import re
+import pandas as pd
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+def throughput_calculator(micro_batch_size, acc_steps, np, elapsed_time_per_iter,
+ hidden_size, num_attention_heads, num_key_value_heads,
+ ffn_hidden_size, num_layers, padded_vocab_size, seq_len,
+ topk: int, swiglu: bool, checkpoint_activations: bool):
+ batch_size = micro_batch_size * acc_steps * np
+ samples_per_second = batch_size / elapsed_time_per_iter
+
+ head_dim = hidden_size // num_attention_heads
+ gqa = num_attention_heads // num_key_value_heads
+ ffn_multiplier = 3 if swiglu else 2
+ macs_per_flops = 2
+
+ pre_and_post_mha_gemm_macs = batch_size * num_layers * (1 + (2 // gqa) + 1) * (hidden_size**2) * seq_len
+ mha_bgemm_macs = batch_size * num_layers * 2 * head_dim * num_attention_heads * (seq_len**2)
+ ffn_gemm_macs = batch_size * num_layers * ffn_multiplier * ffn_hidden_size * hidden_size * seq_len * topk
+ logit_lmhead_gemm_macs = batch_size * padded_vocab_size * hidden_size * seq_len
+
+ fwd_macs = pre_and_post_mha_gemm_macs + mha_bgemm_macs + ffn_gemm_macs + logit_lmhead_gemm_macs
+ bwd_macs = 2 * fwd_macs
+ fwd_bwd_macs = fwd_macs + bwd_macs
+
+ if checkpoint_activations:
+ fwd_bwd_macs += fwd_macs
+
+ flops_per_iteration = fwd_bwd_macs * macs_per_flops
+ tflops = flops_per_iteration / (elapsed_time_per_iter * np * (10**12))
+ return samples_per_second, tflops
+
+
+model_info = {
+ "meta-llama/Meta-Llama-3-8B": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True
+ },
+ "meta-llama/Meta-Llama-3-70B-Instruct": {
+ "hidden_size": 8192,
+ "num_attention_heads": 64,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 32768,
+ "num_layers": 80,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True
+ },
+ "mistralai/Mixtral-8x7B-v0.1": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 2,
+ "swiglu": False
+ }
+}
+
+parser = argparse.ArgumentParser(description="Plot performance metrics.")
+parser.add_argument("--metric", choices=["iteration_time", "throughput", "flops", "mfu", "peak_mem"], required=True,
+ help="Metric to plot: 'iteration_time', 'flops', 'mfu', or 'peak_mem'")
+parser.add_argument("--result_dir", type=str, required=True, help="Path to the directory containing results.txt")
+parser.add_argument("--result_file", type=str, default="results.txt", help="Name of the result file")
+parser.add_argument("--acc_step_eval", action="store_true", help="Evaluate the accuracy of the model")
+args = parser.parse_args()
+
+
+pattern = re.compile(
+ r"(?P<time>\d+) (?P<model>[\w./-]+) ds=(?P<ds>\w+) np=(?P<np>\d+) batch_size=(?P<batch_size>\d+) "
+ r"seq=(?P<seq>\d+) zero_stage=(?P<zero_stage>\d+) acc=(?P<acc>\d+) ac=(?P<ac>\w+) compile=(?P<compile>\w+) deepcompile=(?P<deepcompile>\w+) "
+ r"passes=(?P<passes>[\w,_]+) compile_time=(?P<compile_time>[\d.]+) iteration time: (?P<iteration_time>[\d.]+) "
+ r"alloc_mem: (?P<alloc_mem>\d+) peak_mem: (?P<peak_mem>\d+)"
+)
+file = Path(args.result_dir) / args.result_file
+matches = []
+with open(file) as f:
+ for line in f:
+ match = pattern.match(line)
+ if not match:
+ print(f"Not matched: {line}")
+ if match:
+ d = match.groupdict()
+ if "passes" not in d:
+ d["passes"] = ""
+ if "compile_time" not in d:
+ d["compile_time"] = 0
+ if "deepcompile" not in d:
+ d["deepcompile"] = d["compile"]
+ matches.append(d)
+
+df = pd.DataFrame(matches)
+print(df)
+
+df["ds"] = df["ds"] == "True"
+df["compile"] = df["compile"] == "True"
+df["np"] = df["np"].astype(int)
+df["batch_size"] = df["batch_size"].astype(int)
+df["seq"] = df["seq"].astype(int)
+df["iteration_time"] = df["iteration_time"].astype(float)
+df["alloc_mem"] = df["alloc_mem"].astype(float)
+df["peak_mem"] = df["peak_mem"].astype(float)
+df["acc"] = df["acc"].astype(int)
+df["ac"] = df["ac"] == "True"
+df["compile_time"] = df["compile_time"].astype(float)
+df["deepcompile"] = df["deepcompile"] == "True"
+
+
+grouped = df.groupby(["model", "np", "batch_size"])
+
+# We used A100
+theoretical_peak = 312
+
+LABEL_ZERO3 = "ZeRO3"
+LABEL_ZERO3_C = "ZeRO3 (C)"
+LABEL_FSDP = "FSDP"
+LABEL_FSDP_C = "FSDP (C)"
+LABEL_DC_PS = "DeepCompile (P+S)"
+LABEL_DC_P = "DeepCompile (P)"
+LABEL_DC_S = "DeepCompile (S)"
+
+for (model, np, batch_size), group in grouped:
+
+ sort_group_name = "acc" if args.acc_step_eval else "seq"
+
+ group = group.sort_values(sort_group_name)
+ labels = group[sort_group_name].unique()
+
+ metric_values = {LABEL_ZERO3: [0] * len(labels),
+ LABEL_ZERO3_C: [0] * len(labels),
+ LABEL_FSDP: [0] * len(labels),
+ LABEL_FSDP_C: [0] * len(labels),
+ LABEL_DC_PS: [0] * len(labels),
+ LABEL_DC_P: [0] * len(labels),
+ LABEL_DC_S: [0] * len(labels)}
+
+ for _, row in group.iterrows():
+ if row["ds"] and not row["compile"]:
+ category = LABEL_ZERO3
+ elif not row["ds"]:
+ if row["compile"]:
+ category = LABEL_FSDP_C
+ else:
+ category = LABEL_FSDP
+ elif row["ds"] and row["compile"]:
+ if not row["deepcompile"]:
+ category = LABEL_ZERO3_C
+ elif row["passes"] == "" or row["passes"] == 'prefetch,selective_gather':
+ category = LABEL_DC_PS
+ elif row["passes"] == 'prefetch':
+ category = LABEL_DC_P
+ elif row["passes"] == 'selective_gather':
+ category = LABEL_DC_S
+ else:
+ print(f"Unknown category1 : {row}")
+ continue
+ else:
+ print(f"Unknown category2 : {row}")
+ continue
+
+ group_index = list(labels).index(row[sort_group_name])
+ if args.metric == "iteration_time":
+ metric_values[category][group_index] = row["iteration_time"]
+ elif args.metric == "peak_mem":
+ metric_values[category][group_index] = row["peak_mem"] / (1024**3)
+ elif args.metric == "throughput":
+ metric_values[category][group_index] = row["batch_size"] * row["seq"] / row["iteration_time"] * row["acc"]
+ elif args.metric in ["flops", "mfu"]:
+ model_params = model_info[row["model"]]
+ samples_per_second, tflops = throughput_calculator(
+ micro_batch_size=row["batch_size"],
+ acc_steps=row["acc"],
+ np=row["np"],
+ elapsed_time_per_iter=row["iteration_time"],
+ hidden_size=model_params["hidden_size"],
+ num_attention_heads=model_params["num_attention_heads"],
+ num_key_value_heads=model_params["num_key_value_heads"],
+ ffn_hidden_size=model_params["ffn_hidden_size"],
+ num_layers=model_params["num_layers"],
+ padded_vocab_size=model_params["padded_vocab_size"],
+ seq_len=row["seq"],
+ topk=model_params["topk"],
+ swiglu=model_params["swiglu"],
+ checkpoint_activations=row["ac"]
+ )
+ if args.metric == "flops":
+ metric_values[category][group_index] = tflops
+ elif args.metric == "mfu":
+ metric_values[category][group_index] = tflops / theoretical_peak
+
+ x = range(len(labels))
+ width = 0.1
+ ylabel = {
+ "iteration_time": "Iteration Time (s)",
+ "flops": "TFLOPS",
+ "throughput": "Throughput (tokens/s/GPU)",
+ "mfu": "MFU",
+ "peak_mem": "Peak Memory (GB)"
+ }[args.metric]
+
+ if args.metric == "peak_mem":
+ plt.figure(figsize=(7, 8))
+ else:
+ plt.figure(figsize=(10, 8))
+ adjust = - .0 * width
+ plt.bar([i - width*3 + adjust for i in x], metric_values[LABEL_ZERO3], width, label=LABEL_ZERO3, alpha=0.7)
+ plt.bar([i - width*2 + adjust for i in x], metric_values[LABEL_ZERO3_C], width, label=LABEL_ZERO3_C, alpha=0.7)
+ plt.bar([i - width + adjust for i in x], metric_values[LABEL_FSDP], width, label=LABEL_FSDP, alpha=0.7)
+ plt.bar([i + adjust for i in x], metric_values[LABEL_FSDP_C], width, label=LABEL_FSDP_C, alpha=0.7)
+ plt.bar([i + width + adjust for i in x], metric_values[LABEL_DC_P], width, label=LABEL_DC_P, alpha=0.7)
+ plt.bar([i + width*2 + adjust for i in x], metric_values[LABEL_DC_S], width, label=LABEL_DC_S, alpha=0.7)
+ plt.bar([i + width*3 + adjust for i in x], metric_values[LABEL_DC_PS], width, label=LABEL_DC_PS, alpha=0.7)
+
+ gain_zero3 = [metric_values[LABEL_DC_PS][i] / metric_values[LABEL_ZERO3][i] for i in range(len(labels))]
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_ZERO3} metric_values: {metric_values[LABEL_ZERO3]} gain_zero3: {gain_zero3}")
+ gain_fsdp = [0 if metric_values[LABEL_FSDP][i] == 0 else metric_values[LABEL_DC_PS][i] / metric_values[LABEL_FSDP][i] for i in range(len(labels))]
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_FSDP} metric_values: {metric_values[LABEL_FSDP]} gain_fsdp: {gain_fsdp}")
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_DC_PS} metric_values: {metric_values[LABEL_DC_PS]}")
+
+ model = model.split('/')[1]
+ model = model.replace("Meta-Llama-3-8B", "Llama-3-8B")
+ model = model.replace("Meta-Llama-3-70B-Instruct", "Llama-3-70B")
+ model = model.replace("Mixtral-8x7B-v0.1", "Mixtral-8x7B")
+
+ plt.title(f"{model}, #GPUs: {np}, Batch Size: {batch_size}", fontsize=20)
+ if args.acc_step_eval:
+ plt.xlabel("Accumulation Steps", fontsize=20)
+ else:
+ plt.xlabel("Sequence Length", fontsize=20)
+ plt.ylabel(ylabel, fontsize=20)
+ plt.xticks(x, labels, fontsize=20)
+ plt.yticks(fontsize=20)
+
+ if args.metric == "peak_mem":
+ plt.ylim(0, 80)
+ plt.legend(loc="lower right", fontsize=16)
+ else:
+ plt.legend(loc="lower right", fontsize=18)
+
+ plt.grid(axis="y")
+
+ metric_name = args.metric
+ model = model.replace("/", "_")
+ chart_dir = Path(args.result_dir) / Path(metric_name)
+ chart_dir.mkdir(parents=True, exist_ok=True)
+ conf_str = f"{metric_name}_{model}_np{np}_bs{batch_size}"
+ img_path = chart_dir / f"chart_{conf_str}.png"
+ plt.savefig(str(img_path))
+ plt.close()
diff --git a/benchmarks/deepcompile/plot_common.py b/benchmarks/deepcompile/plot_common.py
new file mode 100644
index 000000000..8ebdadbeb
--- /dev/null
+++ b/benchmarks/deepcompile/plot_common.py
@@ -0,0 +1,251 @@
+import argparse
+import re
+import pandas as pd
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+def throughput_calculator(micro_batch_size, acc_steps, np, elapsed_time_per_iter,
+ hidden_size, num_attention_heads, num_key_value_heads,
+ ffn_hidden_size, num_layers, padded_vocab_size, seq_len,
+ topk: int, swiglu: bool, checkpoint_activations: bool):
+ batch_size = micro_batch_size * acc_steps * np
+ samples_per_second = batch_size / elapsed_time_per_iter
+
+ head_dim = hidden_size // num_attention_heads
+ gqa = num_attention_heads // num_key_value_heads
+ ffn_multiplier = 3 if swiglu else 2
+ macs_per_flops = 2
+
+ pre_and_post_mha_gemm_macs = batch_size * num_layers * (1 + (2 // gqa) + 1) * (hidden_size**2) * seq_len
+ mha_bgemm_macs = batch_size * num_layers * 2 * head_dim * num_attention_heads * (seq_len**2)
+ ffn_gemm_macs = batch_size * num_layers * ffn_multiplier * ffn_hidden_size * hidden_size * seq_len * topk
+ logit_lmhead_gemm_macs = batch_size * padded_vocab_size * hidden_size * seq_len
+
+ fwd_macs = pre_and_post_mha_gemm_macs + mha_bgemm_macs + ffn_gemm_macs + logit_lmhead_gemm_macs
+ bwd_macs = 2 * fwd_macs
+ fwd_bwd_macs = fwd_macs + bwd_macs
+
+ if checkpoint_activations:
+ fwd_bwd_macs += fwd_macs
+
+ flops_per_iteration = fwd_bwd_macs * macs_per_flops
+ tflops = flops_per_iteration / (elapsed_time_per_iter * np * (10**12))
+ return samples_per_second, tflops
+
+
+model_info = {
+ "meta-llama/Meta-Llama-3-8B": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True
+ },
+ "meta-llama/Meta-Llama-3-70B-Instruct": {
+ "hidden_size": 8192,
+ "num_attention_heads": 64,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 32768,
+ "num_layers": 80,
+ "padded_vocab_size": 32000,
+ "topk": 1,
+ "swiglu": True
+ },
+ "mistralai/Mixtral-8x7B-v0.1": {
+ "hidden_size": 4096,
+ "num_attention_heads": 32,
+ "num_key_value_heads": 8,
+ "ffn_hidden_size": 16384,
+ "num_layers": 32,
+ "padded_vocab_size": 32000,
+ "topk": 2,
+ "swiglu": False
+ }
+}
+
+parser = argparse.ArgumentParser(description="Plot performance metrics.")
+parser.add_argument("--metric", choices=["iteration_time", "throughput", "flops", "mfu", "peak_mem"], required=True,
+ help="Metric to plot: 'iteration_time', 'flops', 'mfu', or 'peak_mem'")
+parser.add_argument("--result_dir", type=str, required=True, help="Path to the directory containing results.txt")
+parser.add_argument("--result_file", type=str, default="results.txt", help="Name of the result file")
+args = parser.parse_args()
+
+
+pattern = re.compile(
+ r"(?P<time>\d+) (?P<model>[\w./-]+) ds=(?P<ds>\w+) np=(?P<np>\d+) batch_size=(?P<batch_size>\d+) "
+ r"seq=(?P<seq>\d+) zero_stage=(?P<zero_stage>\d+) acc=(?P<acc>\d+) ac=(?P<ac>\w+) compile=(?P<compile>\w+) schedule=(?P<schedule>\w+) "
+ r"passes=(?P<passes>[\w,_]+) compile_time=(?P<compile_time>[\d.]+) iteration time: (?P<iteration_time>[\d.]+) "
+ r"alloc_mem: (?P<alloc_mem>\d+) peak_mem: (?P<peak_mem>\d+)"
+)
+file = Path(args.result_dir) / args.result_file
+matches = []
+with open(file) as f:
+ for line in f:
+ match = pattern.match(line)
+ if not match:
+ print(f"Not matched: {line}")
+ if match:
+ d = match.groupdict()
+ if "passes" not in d:
+ d["passes"] = ""
+ if "compile_time" not in d:
+ d["compile_time"] = 0
+ if "schedule" not in d:
+ d["schedule"] = d["compile"]
+ matches.append(d)
+
+df = pd.DataFrame(matches)
+print(df)
+
+df["ds"] = df["ds"] == "True"
+df["compile"] = df["compile"] == "True"
+df["np"] = df["np"].astype(int)
+df["batch_size"] = df["batch_size"].astype(int)
+df["seq"] = df["seq"].astype(int)
+df["iteration_time"] = df["iteration_time"].astype(float)
+df["alloc_mem"] = df["alloc_mem"].astype(float)
+df["peak_mem"] = df["peak_mem"].astype(float)
+df["acc"] = df["acc"].astype(int)
+df["ac"] = df["ac"] == "True"
+df["compile_time"] = df["compile_time"].astype(float)
+df["schedule"] = df["schedule"] == "True"
+
+
+grouped = df.groupby(["model", "np", "batch_size"])
+
+# We used A100
+theoretical_peak = 312
+
+LABEL_ZERO3 = "ZeRO3"
+LABEL_ZERO3_C = "ZeRO3 (C)"
+LABEL_FSDP = "FSDP"
+LABEL_FSDP_C = "FSDP (C)"
+LABEL_DC_PS = "DeepCompile (P+S)"
+LABEL_DC_P = "DeepCompile (P)"
+LABEL_DC_S = "DeepCompile (S)"
+
+for (model, np, batch_size), group in grouped:
+ group = group.sort_values("seq")
+ seq_labels = group["seq"].unique()
+
+ metric_values = {LABEL_ZERO3: [0] * len(seq_labels),
+ LABEL_ZERO3_C: [0] * len(seq_labels),
+ LABEL_FSDP: [0] * len(seq_labels),
+ LABEL_FSDP_C: [0] * len(seq_labels),
+ LABEL_DC_PS: [0] * len(seq_labels),
+ LABEL_DC_P: [0] * len(seq_labels),
+ LABEL_DC_S: [0] * len(seq_labels)}
+
+ for _, row in group.iterrows():
+ if row["ds"] and not row["compile"]:
+ category = LABEL_ZERO3
+ elif not row["ds"]:
+ if row["compile"]:
+ category = LABEL_FSDP_C
+ else:
+ category = LABEL_FSDP
+ elif row["ds"] and row["compile"]:
+ if not row["schedule"]:
+ category = LABEL_ZERO3_C
+ elif row["passes"] == "" or row["passes"] == 'prefetch,selective_gather':
+ category = LABEL_DC_PS
+ elif row["passes"] == 'prefetch':
+ category = LABEL_DC_P
+ elif row["passes"] == 'selective_gather':
+ category = LABEL_DC_S
+ else:
+ print(f"Unknown category1 : {row}")
+ continue
+ else:
+ print(f"Unknown category2 : {row}")
+ continue
+
+ seq_index = list(seq_labels).index(row["seq"])
+ if args.metric == "iteration_time":
+ metric_values[category][seq_index] = row["iteration_time"]
+ elif args.metric == "peak_mem":
+ metric_values[category][seq_index] = row["peak_mem"] / (1024**3)
+ elif args.metric == "throughput":
+ metric_values[category][seq_index] = row["batch_size"] * row["seq"] / row["iteration_time"]
+ elif args.metric in ["flops", "mfu"]:
+ model_params = model_info[row["model"]]
+ samples_per_second, tflops = throughput_calculator(
+ micro_batch_size=row["batch_size"],
+ acc_steps=row["acc"],
+ np=row["np"],
+ elapsed_time_per_iter=row["iteration_time"],
+ hidden_size=model_params["hidden_size"],
+ num_attention_heads=model_params["num_attention_heads"],
+ num_key_value_heads=model_params["num_key_value_heads"],
+ ffn_hidden_size=model_params["ffn_hidden_size"],
+ num_layers=model_params["num_layers"],
+ padded_vocab_size=model_params["padded_vocab_size"],
+ seq_len=row["seq"],
+ topk=model_params["topk"],
+ swiglu=model_params["swiglu"],
+ checkpoint_activations=row["ac"]
+ )
+ if args.metric == "flops":
+ metric_values[category][seq_index] = tflops
+ elif args.metric == "mfu":
+ metric_values[category][seq_index] = tflops / theoretical_peak
+
+ x = range(len(seq_labels))
+ width = 0.1
+ ylabel = {
+ "iteration_time": "Iteration Time (s)",
+ "flops": "TFLOPS",
+ "throughput": "Throughput (tokens/s/GPU)",
+ "mfu": "MFU",
+ "peak_mem": "Peak Memory (GB)"
+ }[args.metric]
+
+ if args.metric == "peak_mem":
+ plt.figure(figsize=(7, 8))
+ else:
+ plt.figure(figsize=(10, 8))
+ adjust = - .0 * width
+ plt.bar([i - width*3 + adjust for i in x], metric_values[LABEL_ZERO3], width, label=LABEL_ZERO3, alpha=0.7)
+ plt.bar([i - width*2 + adjust for i in x], metric_values[LABEL_ZERO3_C], width, label=LABEL_ZERO3_C, alpha=0.7)
+ plt.bar([i - width + adjust for i in x], metric_values[LABEL_FSDP], width, label=LABEL_FSDP, alpha=0.7)
+ plt.bar([i + adjust for i in x], metric_values[LABEL_FSDP_C], width, label=LABEL_FSDP_C, alpha=0.7)
+ plt.bar([i + width + adjust for i in x], metric_values[LABEL_DC_P], width, label=LABEL_DC_P, alpha=0.7)
+ plt.bar([i + width*2 + adjust for i in x], metric_values[LABEL_DC_S], width, label=LABEL_DC_S, alpha=0.7)
+ plt.bar([i + width*3 + adjust for i in x], metric_values[LABEL_DC_PS], width, label=LABEL_DC_PS, alpha=0.7)
+
+ gain_zero3 = [metric_values[LABEL_DC_PS][i] / metric_values[LABEL_ZERO3][i] for i in range(len(seq_labels))]
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_ZERO3} metric_values: {metric_values[LABEL_ZERO3]} gain_zero3: {gain_zero3}")
+ gain_fsdp = [0 if metric_values[LABEL_FSDP][i] == 0 else metric_values[LABEL_DC_PS][i] / metric_values[LABEL_FSDP][i] for i in range(len(seq_labels))]
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_FSDP} metric_values: {metric_values[LABEL_FSDP]} gain_fsdp: {gain_fsdp}")
+ print(f"model {model} np {np} batch_size {batch_size} {LABEL_DC_PS} metric_values: {metric_values[LABEL_DC_PS]}")
+
+ model = model.split('/')[1]
+ model = model.replace("Meta-Llama-3-8B", "Llama-3-8B")
+ model = model.replace("Meta-Llama-3-70B-Instruct", "Llama-3-70B")
+ model = model.replace("Mixtral-8x7B-v0.1", "Mixtral-8x7B")
+
+ plt.title(f"{model}, #GPUs: {np}, Batch Size: {batch_size}", fontsize=20)
+ plt.xlabel("Sequence Length", fontsize=20)
+ plt.ylabel(ylabel, fontsize=20)
+ plt.xticks(x, seq_labels, fontsize=20)
+ plt.yticks(fontsize=20)
+
+ if args.metric == "peak_mem":
+ plt.ylim(0, 80)
+ plt.legend(loc="lower right", fontsize=16)
+ else:
+ plt.legend(loc="lower right", fontsize=18)
+
+ plt.grid(axis="y")
+
+ metric_name = args.metric
+ model = model.replace("/", "_")
+ chart_dir = Path(args.result_dir) / Path(metric_name)
+ chart_dir.mkdir(parents=True, exist_ok=True)
+ conf_str = f"{metric_name}_{model}_np{np}_bs{batch_size}"
+ img_path = chart_dir / f"chart_{conf_str}.png"
+ plt.savefig(str(img_path))
+ plt.close()
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs1.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs1.png
new file mode 100644
index 000000000..3aa22100a
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs1.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs2.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs2.png
new file mode 100644
index 000000000..666df8570
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs2.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs4.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs4.png
new file mode 100644
index 000000000..92909c148
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Llama-3-70B_np32_bs4.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png
new file mode 100644
index 000000000..6758d6fce
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs2.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs2.png
new file mode 100644
index 000000000..0f42e564e
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs2.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs4.png b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs4.png
new file mode 100644
index 000000000..f9bbf34ae
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1/throughput/chart_throughput_Mixtral-8x7B_np32_bs4.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Llama-3-70B_np32_bs1.png b/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Llama-3-70B_np32_bs1.png
new file mode 100644
index 000000000..557af85a4
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Llama-3-70B_np32_bs1.png differ
diff --git a/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png b/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png
new file mode 100644
index 000000000..c3f992b1f
Binary files /dev/null and b/benchmarks/deepcompile/results/acc_step_1_16/throughput/chart_throughput_Mixtral-8x7B_np32_bs1.png differ
diff --git a/benchmarks/deepcompile/run.sh b/benchmarks/deepcompile/run.sh
new file mode 100644
index 000000000..57da03193
--- /dev/null
+++ b/benchmarks/deepcompile/run.sh
@@ -0,0 +1,225 @@
+#!/bin/bash
+
+
+NUM_NODES=${NUM_NODES:-$(wc -l < /job/hostfile)}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
+NUM_PROCESSES=$((${NUM_NODES} * ${NGPUS_PER_NODE}))
+
+BACKEND="deepspeed"
+MODEL="meta-llama/Meta-Llama-3-8B"
+ZERO_STAGE=3
+COMPILE=0
+PASSES="ALL"
+EXTRA_OPTS=""
+
+EAGER=0
+DEEPCOMPILE=0
+GRADIENT_ACCUMULATION_STEPS=1
+ACTIVATION_CHECKPOINTING=1
+BATCH_SIZE=1
+SEQ_LENGTH=512
+DEBUG_LOG=0
+SYNC_BEFORE_REDUCE=0
+SYNC_AFTER_REDUCE=0
+SYNC_BEFORE_ALLGATHER=0
+SYNC_AFTER_ALLGATHER=0
+
+echo "NUM_NODES: ${NUM_NODES} NGPUS_PER_NODE: ${NGPUS_PER_NODE} NUM_PROCESSES: ${NUM_PROCESSES}"
+
+while [[ $# -gt 0 ]]; do
+ case $1 in
+ --host-ip)
+ HOST_IP="$2"
+ shift 2
+ ;;
+ --backend)
+ BACKEND="$2"
+ shift 2
+ ;;
+ --zero-stage)
+ ZERO_STAGE="$2"
+ shift 2
+ ;;
+ --batch-size)
+ BATCH_SIZE="$2"
+ EXTRA_OPTS="${EXTRA_OPTS} --batch_size $2"
+ shift 2
+ ;;
+ --seq-length)
+ SEQ_LENGTH="$2"
+ EXTRA_OPTS="${EXTRA_OPTS} --seq_length $2"
+ shift 2
+ ;;
+ --gradient-accumulation-steps)
+ GRADIENT_ACCUMULATION_STEPS="$2"
+ EXTRA_OPTS="${EXTRA_OPTS} --gradient_accumulation_steps $2"
+ shift 2
+ ;;
+ --activation-checkpointing)
+ ACTIVATION_CHECKPOINTING=1
+ EXTRA_OPTS="${EXTRA_OPTS} --activation_checkpointing"
+ shift
+ ;;
+ --compile)
+ COMPILE=1
+ EXTRA_OPTS="${EXTRA_OPTS} $1"
+ shift
+ ;;
+ --eager)
+ EAGER=1
+ EXTRA_OPTS="${EXTRA_OPTS} --backend eager"
+ shift
+ ;;
+ --deepcompile)
+ DEEPCOMPILE=1
+ shift
+ ;;
+ --passes)
+ PASSES="$2"
+ EXTRA_OPTS="${EXTRA_OPTS} $1 $2"
+ shift 2
+ ;;
+ --profile)
+ EXTRA_OPTS="${EXTRA_OPTS} $1"
+ shift
+ ;;
+ --profile-dir)
+ EXTRA_OPTS="${EXTRA_OPTS} --profile_dir $2"
+ shift 2
+ ;;
+ --model)
+ MODEL="$2"
+ shift 2
+ ;;
+ --num-layers)
+ EXTRA_OPTS="${EXTRA_OPTS} --num_layers $2"
+ shift 2
+ ;;
+ --debug-log)
+ DEBUG_LOG=1
+ shift
+ ;;
+ --sync-before-reduce)
+ SYNC_BEFORE_REDUCE=1
+ shift
+ ;;
+ --sync-after-reduce)
+ SYNC_AFTER_REDUCE=1
+ shift
+ ;;
+ --sync-before-allgather)
+ SYNC_BEFORE_ALLGATHER=1
+ shift
+ ;;
+ --sync-after-allgather)
+ SYNC_AFTER_ALLGATHER=1
+ shift
+ ;;
+ *)
+ EXTRA_OPTS="${EXTRA_OPTS} $1"
+ shift
+ ;;
+ esac
+done
+
+
+
+export NCCL_DEBUG=WARN
+
+CONFIG_TEMPLATE=configs/ds_config.yaml.template
+if [ "${BACKEND}" == "fsdp" ]; then
+ CONFIG_TEMPLATE=configs/fsdp_config.yaml.template
+elif [ "${BACKEND}" == "ddp" ]; then
+ CONFIG_TEMPLATE=configs/ddp_config.yaml.template
+elif [ "${BACKEND}" == "singlegpu" ]; then
+ CONFIG_TEMPLATE=configs/singlegpu_config.yaml.template
+elif [ "${BACKEND}" != "deepspeed" ]; then
+ echo "Invalid backend: ${BACKEND}"
+ exit 1
+fi
+
+if [ "${BACKEND}" != "deepspeed" ]; then
+ ZERO_STAGE=0
+fi
+
+echo "HOST_IP: ${HOST_IP}"
+echo "NUM_NODES: ${NUM_NODES}"
+echo "NUM_PROCESSES: ${NUM_PROCESSES}"
+echo "BACKEND: ${BACKEND}"
+echo "ZERO_STAGE: ${ZERO_STAGE}"
+echo "MODEL: ${MODEL}"
+echo "GRADIENT_ACCUMULATION_STEPS: ${GRADIENT_ACCUMULATION_STEPS}"
+echo "EXTRA_OPTS: ${EXTRA_OPTS}"
+
+MACHINE_RANK=$(hostname | sed 's/[^0-9]*//g')
+
+python generate_conf.py \
+ --machine_rank ${MACHINE_RANK} \
+ --num_machines ${NUM_NODES} \
+ --num_processes ${NUM_PROCESSES} \
+ --zero_stage ${ZERO_STAGE} \
+ --template_file ${CONFIG_TEMPLATE} \
+ --output_file configs/config.yaml
+
+GAS_OPTS="--gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS}"
+
+if [ "${BACKEND}" == "deepspeed" ]; then
+ DEEPCOMPILE_OPTS=""
+ if [ "${DEEPCOMPILE}" == "1" ]; then
+ DEEPCOMPILE_OPTS="--deepcompile"
+ fi
+
+ DEBUG_LOG_OPTS=""
+ if [ "${DEBUG_LOG}" == "1" ]; then
+ DEBUG_LOG_OPTS="--debug_log"
+ fi
+
+ SYNC_BEFORE_REDUCE_OPTS=""
+ if [ "${SYNC_BEFORE_REDUCE}" == "1" ]; then
+ SYNC_BEFORE_REDUCE_OPTS="--sync_before_reduce"
+ fi
+
+ SYNC_AFTER_REDUCE_OPTS=""
+ if [ "${SYNC_AFTER_REDUCE}" == "1" ]; then
+ SYNC_AFTER_REDUCE_OPTS="--sync_after_reduce"
+ fi
+
+ SYNC_BEFORE_ALLGATHER_OPTS=""
+ if [ "${SYNC_BEFORE_ALLGATHER}" == "1" ]; then
+ SYNC_BEFORE_ALLGATHER_OPTS="--sync_before_allgather"
+ fi
+
+ SYNC_AFTER_ALLGATHER_OPTS=""
+ if [ "${SYNC_AFTER_ALLGATHER}" == "1" ]; then
+ SYNC_AFTER_ALLGATHER_OPTS="--sync_after_allgather"
+ fi
+
+ python generate_conf.py \
+ --machine_rank ${MACHINE_RANK} \
+ --num_machines ${NUM_NODES} \
+ --num_processes ${NUM_PROCESSES} \
+ --zero_stage ${ZERO_STAGE} \
+ --gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEPS} \
+ ${DEEPCOMPILE_OPTS} ${DEBUG_LOG_OPTS} \
+ ${SYNC_BEFORE_REDUCE_OPTS} ${SYNC_AFTER_REDUCE_OPTS} \
+ ${SYNC_BEFORE_ALLGATHER_OPTS} ${SYNC_AFTER_ALLGATHER_OPTS} \
+ --template_file configs/ds_config.json.template \
+ --output_file configs/ds_config.json
+fi
+
+# Replace "," with "_" in PASSES
+PASSES=$(echo $PASSES | tr ',' '_')
+LOG_DIR=logs
+mkdir -p ${LOG_DIR}
+LOG_FILE=${LOG_DIR}/debug_n${MACHINE_RANK}_${MODEL##*/}_${BACKEND}_np${NUM_PROCESSES}z${ZERO_STAGE}c${COMPILE}dc${DEEPCOMPILE}E${EAGER}b${BATCH_SIZE}seq${SEQ_LENGTH}g${GRADIENT_ACCUMULATION_STEPS}a${ACTIVATION_CHECKPOINTING}p${PASSES}.log
+echo "Logging to ${LOG_FILE}"
+
+${HOME}/.local/bin/accelerate launch --main_process_ip ${HOST_IP} --main_process_port 12345 \
+--num_machines ${NUM_NODES} --num_processes ${NUM_PROCESSES} --machine_rank ${MACHINE_RANK} \
+--config_file configs/config.yaml \
+run_acc_lm.py \
+--model_name "${MODEL}" \
+--zero_stage ${ZERO_STAGE} \
+${GAS_OPTS} \
+${EXTRA_OPTS} \
+2>&1 | tee ${LOG_FILE}
diff --git a/benchmarks/deepcompile/run_bench.sh b/benchmarks/deepcompile/run_bench.sh
new file mode 100644
index 000000000..174e34951
--- /dev/null
+++ b/benchmarks/deepcompile/run_bench.sh
@@ -0,0 +1,50 @@
+PROFILE_DIR=${PROFILE_DIR:-"profiles"}
+mkdir -p ${PROFILE_DIR}
+PROFILE_OPTS="--profile --profile-dir ${PROFILE_DIR}"
+COMPILE_OPTS="--compile"
+DC_OPTS="--compile --deepcompile"
+ACC_OPTS="--gradient-accumulation-steps 1"
+AC_OPTS="--activation-checkpointing"
+
+MODEL="meta-llama/Meta-Llama-3-70B-Instruct"
+BATCH_SIZE_OPTS=(1 2 4)
+SEQ_LENGTH_OPTS=(512 1024 2048)
+for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+ for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+ # skip if batch size is 4 and seq length is 2048, as it causes OOM
+ if [ ${BATCH_SIZE} -eq 4 ] && [ ${SEQ_LENGTH} -eq 2048 ]; then
+ continue
+ fi
+
+ ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} ${ACC_OPTS} ${AC_OPTS}"
+ bash ./run_multinode.sh --backend deepspeed ${ARGS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend fsdp ${ARGS}
+ bash ./run_multinode.sh --backend fsdp ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes prefetch,selective_gather
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes prefetch
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes selective_gather
+
+ cp -r logs ${PROFILE_DIR}/
+ done
+done
+
+MODEL="mistralai/Mixtral-8x7B-v0.1"
+BATCH_SIZE_OPTS=(1 2 4)
+SEQ_LENGTH_OPTS=(512 1024 2048)
+for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+ for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+ ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} ${ACC_OPTS} ${AC_OPTS}"
+ bash ./run_multinode.sh --backend deepspeed ${ARGS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend fsdp ${ARGS}
+ bash ./run_multinode.sh --backend fsdp ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes prefetch,selective_gather
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes prefetch
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS} --passes selective_gather
+
+ cp -r logs ${PROFILE_DIR}/
+ done
+done
+
diff --git a/benchmarks/deepcompile/run_bench_acc.sh b/benchmarks/deepcompile/run_bench_acc.sh
new file mode 100644
index 000000000..a3b66844d
--- /dev/null
+++ b/benchmarks/deepcompile/run_bench_acc.sh
@@ -0,0 +1,42 @@
+PROFILE_DIR=${PROFILE_DIR:-profiles}
+mkdir -p ${PROFILE_DIR}
+PROFILE_OPTS="--profile --profile-dir ${PROFILE_DIR}"
+COMPILE_OPTS="--compile"
+N3Z_OPTS="--compile --deepcompile"
+AC_OPTS="--activation-checkpointing"
+
+MODEL="meta-llama/Meta-Llama-3-70B-Instruct"
+BATCH_SIZE_OPTS=(1)
+SEQ_LENGTH_OPTS=(1024)
+ACC_OPTS=(2 4 8 16)
+for ACC_STEP in ${ACC_OPTS[@]}; do
+ for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+ for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+ ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} ${AC_OPTS} ${PROFILE_OPTS} --gradient-accumulation-steps ${ACC_STEP}"
+ bash ./run_multinode.sh --backend deepspeed ${ARGS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes prefetch,selective_gather
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes prefetch
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes selective_gather
+ cp -r logs ${PROFILE_DIR}/
+ done
+ done
+done
+
+MODEL="mistralai/Mixtral-8x7B-v0.1"
+BATCH_SIZE_OPTS=(1)
+SEQ_LENGTH_OPTS=(1024)
+ACC_OPTS=(2 4 8 16)
+for ACC_STEP in ${ACC_OPTS[@]}; do
+ for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+ for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+ ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} ${AC_OPTS} ${PROFILE_OPTS} --gradient-accumulation-steps ${ACC_STEP}"
+ bash ./run_multinode.sh --backend deepspeed ${ARGS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${COMPILE_OPTS}
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes prefetch,selective_gather
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes prefetch
+ bash ./run_multinode.sh --backend deepspeed ${ARGS} ${N3Z_OPTS} --passes selective_gather
+ cp -r logs ${PROFILE_DIR}/
+ done
+ done
+done
diff --git a/benchmarks/deepcompile/run_bench_lm.py b/benchmarks/deepcompile/run_bench_lm.py
new file mode 100644
index 000000000..f175d84d7
--- /dev/null
+++ b/benchmarks/deepcompile/run_bench_lm.py
@@ -0,0 +1,270 @@
+import os
+import argparse
+import time
+from datetime import datetime
+from contextlib import nullcontext
+from typing import List
+
+import torch
+from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, enable_full_determinism
+from datasets import load_dataset, DownloadConfig
+from accelerate import Accelerator
+from torch.utils.data import DataLoader
+from torch.utils.data.distributed import DistributedSampler
+from torch.utils.data import SequentialSampler
+
+from datasets.utils.logging import disable_progress_bar
+
+from patch_phi3_moe import patch_phi3moe
+
+def get_args():
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model_name", type=str, default="meta-llama/Llama-2-7b-hf")
+ parser.add_argument("--batch_size", type=int, default=1)
+ parser.add_argument("--num_epochs", type=int, default=100)
+ parser.add_argument("--seq_length", type=int, default=512)
+ parser.add_argument("--learning_rate", type=float, default=2e-5)
+ parser.add_argument("--max_grad_norm", type=float, default=1.0)
+ parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
+ parser.add_argument("--activation_checkpointing", action="store_true")
+ parser.add_argument("--dataset_name", type=str, default="timdettmers/openassistant-guanaco")
+ parser.add_argument("--num_layers", type=int, default=0)
+ parser.add_argument("--attn_impl", type=str, default="spda")
+ parser.add_argument("--compile", action="store_true")
+ parser.add_argument("--passes", type=str, default=None)
+ parser.add_argument("--backend", type=str, default="inductor")
+ parser.add_argument("--offload_opt_states", action="store_true")
+ parser.add_argument("--profile", action="store_true")
+ parser.add_argument("--deterministic", action="store_true")
+ parser.add_argument("--profile_dir", type=str, default=None)
+ parser.add_argument("--bench_step", type=int, default=30)
+ parser.add_argument("--warmup_step", type=int, default=15)
+ parser.add_argument("--zero_stage", type=int, default=3)
+ parser.add_argument("--print_interval", type=int, default=1)
+ parser.add_argument("--save_weights", action="store_true")
+ parser.add_argument("--load_weights", action="store_true")
+
+ return parser.parse_args()
+
+
+def make_schedule(passes: List[str], warmup):
+    from deepspeed.compile.passes import zero3_compile, prefetch, selective_gather, offload_adam_states
+
+    schedule = []
+
+    if "offload_adam_states" in passes:
+        assert len(passes) == 1, "offload_adam_states should be the only pass"
+        schedule.append((0, [offload_adam_states.offload_adam_states_for_init, zero3_compile.add_z3_gather_release, offload_adam_states.move_opt_states_sync]))
+        schedule.append((5, [offload_adam_states.offload_adam_states_for_init, zero3_compile.add_z3_gather_release, offload_adam_states.move_opt_states]))
+    elif "offload_adam_states_sync" in passes:
+        assert len(passes) == 1, "offload_adam_states_sync should be the only pass"
+        schedule.append((0, [zero3_compile.add_z3_gather_release, offload_adam_states.move_opt_states_sync]))
+    else:
+        schedule.append((0, [zero3_compile.add_z3_gather_release]))
+        second_opt = [zero3_compile.add_z3_gather_release]
+        if "prefetch" in passes:
+            second_opt.append(prefetch.schedule_prefetch)
+        if "selective_gather" in passes:
+            second_opt.append(selective_gather.selective_gather)
+        schedule.append((warmup, second_opt))
+    return schedule
+
+
+def main():
+    args = get_args()
+    print(args)
+
+    if args.passes and "offload_adam_states" in args.passes:
+        # Limit CUDA allocator block splitting to reduce fragmentation while optimizer states are offloaded
+        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
+
+    if args.deterministic:
+        enable_full_determinism(1)
+        from torch._inductor import config
+        config.fallback_random = True
+
+    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
+    device = accelerator.device
+    is_deepspeed = accelerator.state.deepspeed_plugin is not None
+    print(f"Running on device: {device} is_deepspeed: {is_deepspeed}")
+
+    # Load model and tokenizer
+    if accelerator.is_main_process:
+        print("Loading model and tokenizer...")
+
+    model_name = args.model_name
+
+    model_weight_path = f"{model_name.split('/')[-1]}_cp_layer{args.num_layers}"
+    if args.load_weights:
+        model = AutoModelForCausalLM.from_pretrained(model_weight_path, trust_remote_code=True)
+    else:
+        if args.num_layers > 0:
+            model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
+            print(f"num_hidden_layers: {model_config.num_hidden_layers} -> {args.num_layers}")
+            model_config.num_hidden_layers = args.num_layers
+            model = AutoModelForCausalLM.from_config(model_config, trust_remote_code=True)
+        else:
+            model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+
+    if patch_phi3moe(model) and accelerator.is_main_process:
+        print("Patched Phi-3.5-MoE model")
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+    if args.save_weights and accelerator.is_main_process:
+        model.save_pretrained(model_weight_path)
+
+    if args.activation_checkpointing:
+        model.gradient_checkpointing_enable()
+
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Load dataset
+    if accelerator.is_main_process:
+        print("Loading dataset...")
+    else:
+        disable_progress_bar()
+
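+    # Note: the benchmark always trains on ag_news; the --dataset_name flag is unused here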
+    dataset = load_dataset('ag_news', split='train', download_config=DownloadConfig(disable_tqdm=True))
+
+    # The tokenizer was already loaded above; just pin the pad token to a fixed id
+    tokenizer.pad_token = tokenizer.convert_ids_to_tokens(2)
+
+    def tokenize_function(examples):
+        return tokenizer(examples['text'], padding='max_length', max_length=args.seq_length, truncation=True)
+
+    tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=1, keep_in_memory=True)
+    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
+
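+    # Shard the dataset so each rank sees a disjoint partition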
+    sampler = DistributedSampler(tokenized_dataset, num_replicas=accelerator.num_processes, rank=accelerator.process_index)
+    data_loader = DataLoader(tokenized_dataset, batch_size=args.batch_size, sampler=sampler, num_workers=4)
+
+    # Prepare optimizer
+    optimizer = torch.optim.AdamW(model.parameters(), lr=args.learning_rate)
+
+    # Prepare everything with accelerator
+    model, optimizer, data_loader = accelerator.prepare(model, optimizer, data_loader)
+    print(f"Model prepared: {model.__class__} optimizer: {optimizer.__class__}")
+
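+    # MoE token routing produces data-dependent shapes; let Dynamo capture them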
+ if "Mixtral" in model_name:
+ torch._dynamo.config.capture_dynamic_output_shape_ops = True
+ torch._dynamo.config.capture_scalar_outputs = True
+
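+    # DeepSpeed engines expose their own compile() entry point (DeepCompile); otherwise fall back to torch.compile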
+    if is_deepspeed:
+        if args.compile:
+            schedule = make_schedule(args.passes.split(","), warmup=5) if args.passes else None
+            model.compile(backend=args.backend, schedule=schedule)
+    else:
+        if args.compile:
+            model = torch.compile(model, backend=args.backend)
+
+    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
+    model_name = args.model_name.split("/")[-1]
+    exp_name = f"{model_name}_np{accelerator.num_processes}ds{1 if is_deepspeed else 0}" \
+               f"B{args.backend}z{args.zero_stage}" \
+               f"L{args.num_layers}" \
+               f"bs{args.batch_size}seq{args.seq_length}acc{args.gradient_accumulation_steps}ac{1 if args.activation_checkpointing else 0}" \
+               f"pass_{'none' if args.passes is None else args.passes.replace(',', '_')}_" \
+               f"os{1 if args.offload_opt_states else 0}" \
+               f"T{timestamp}"
+
+    # Create profiling output dirs; fall back to the experiment name as the trace
+    # directory when --profile_dir is unset, so prof_dir is always defined
+    prof_dir = f"{args.profile_dir}/{exp_name}" if args.profile_dir else exp_name
+    if args.profile_dir:
+        if accelerator.is_main_process:
+            os.makedirs(args.profile_dir, exist_ok=True)
+            if args.profile:
+                os.makedirs(prof_dir, exist_ok=True)
+        accelerator.wait_for_everyone()
+
+    # Profile only on the main rank: no wait, warm up for 10 accumulation cycles, then record 3 iterations
+    do_profile = args.profile and accelerator.is_main_process
+    prof_context = torch.profiler.profile(
+        activities=[
+            torch.profiler.ProfilerActivity.CPU,
+            torch.profiler.ProfilerActivity.CUDA,
+        ],
+        schedule=torch.profiler.schedule(wait=0, warmup=10*args.gradient_accumulation_steps, active=3, repeat=1),
+        on_trace_ready=torch.profiler.tensorboard_trace_handler(prof_dir),
+    ) if do_profile else nullcontext()
+
+    # Training loop
+    model.train()
+    global_step = 0
+
+    iter_times = []
+
+    # See https://github.com/microsoft/DeepSpeed/issues/6793
+    acc_context = nullcontext if is_deepspeed else accelerator.accumulate
+
+    stop = False
+    with prof_context as prof:
+        for epoch in range(args.num_epochs):
+            start_iter = time.time()
+
+            for step, batch in enumerate(data_loader):
+                input_ids = batch['input_ids'].to(device)
+                attention_mask = batch['attention_mask'].to(device)
+
+                with acc_context(model):
+                    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids, use_cache=False)
+                    loss = outputs.loss
+
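+                    # Query the accumulation boundary before backward/step, which advance the engine's micro-step counter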
+                    update_step = (is_deepspeed and model.is_gradient_accumulation_boundary()) \
+                        or (not is_deepspeed and accelerator.sync_gradients)
+                    accelerator.backward(loss)
+                    optimizer.step()
+                    optimizer.zero_grad()
+                    global_step += 1
+
+                    if update_step:
+                        if accelerator.is_main_process and global_step % (args.print_interval * args.gradient_accumulation_steps) == 0:
+                            print(f"Epoch {epoch+1}, Step {global_step}, Loss: {loss.item()} sync: {accelerator.sync_gradients} time: {time.time() - start_iter} alloc_mem: {torch.cuda.memory_allocated()} peak_mem: {torch.cuda.max_memory_allocated()}")
+
+                        iter_times.append(time.time() - start_iter)
+                        start_iter = time.time()
+
+                if do_profile:
+                    prof.step()
+
+                stop = global_step >= args.bench_step * args.gradient_accumulation_steps
+                if stop:
+                    break
+            if stop:
+                break
+
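+    # Drop warmup iterations (e.g., compilation) before averaging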
+    iter_times = iter_times[args.warmup_step:]
+
+    if accelerator.is_main_process:
+        compile_time_sum = 0
+        compile_time = 0
+        if args.compile and hasattr(model, "get_compile_time"):
+            compile_time = model.get_compile_time()
+            compile_time_sum = sum(t for _, _, _, t in compile_time)
+
+        is_deepcompile = is_deepspeed and model._config.compile_config.deepcompile
+        msg = f"{args.model_name} ds={is_deepspeed} np={accelerator.num_processes} batch_size={args.batch_size} seq={args.seq_length} zero_stage={args.zero_stage} acc={args.gradient_accumulation_steps} ac={args.activation_checkpointing} compile={args.compile} backend={args.backend} deepcompile={is_deepcompile} passes={args.passes} compile_time={compile_time_sum} iteration time: {sum(iter_times) / len(iter_times):.4f} alloc_mem: {torch.cuda.memory_allocated()} peak_mem: {torch.cuda.max_memory_allocated()}"
+        print(msg)
+
+        if args.profile_dir:
+            from pathlib import Path
+            filepath = Path(args.profile_dir) / "result.txt"
+            with open(filepath, "a") as f:
+                f.write(f"{timestamp} {msg}" + "\n")
+
+            if args.compile:
+                filepath = Path(args.profile_dir) / "compile_time.txt"
+                with open(filepath, "a") as f:
+                    msg = f"{msg} compile_time={compile_time_sum} {compile_time}"
+                    f.write(f"{timestamp} {msg}" + "\n")
+
+    # # Save the model
+    # if accelerator.is_main_process:
+    #     accelerator.wait_for_everyone()
+    #     unwrapped_model = accelerator.unwrap_model(model)
+    #     unwrapped_model.save_pretrained("fine_tuned_model", save_function=accelerator.save)
+    #     tokenizer.save_pretrained("fine_tuned_model")
+
+if __name__ == "__main__":
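+    # Raise Dynamo's cache limits so sweeping many configurations does not hit
+    # the recompile cap, and disable its DDP-specific graph splitting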
+    torch._dynamo.config.accumulated_cache_size_limit = 256
+    torch._dynamo.config.cache_size_limit = 128
+    torch._dynamo.config.optimize_ddp = False
+
+    main()
diff --git a/benchmarks/deepcompile/run_bench_offload.sh b/benchmarks/deepcompile/run_bench_offload.sh
new file mode 100644
index 000000000..ea72db195
--- /dev/null
+++ b/benchmarks/deepcompile/run_bench_offload.sh
@@ -0,0 +1,25 @@
+PROFILE_DIR=${PROFILE_DIR:-"profile_offload"}
+mkdir -p ${PROFILE_DIR}
+PROFILE_OPTS="--profile --profile-dir ${PROFILE_DIR}"
+COMPILE_OPTS="--compile"
+DC_OPTS="--compile --deepcompile"
+ACC_OPTS="--gradient-accumulation-steps 1"
+AC_OPTS="--activation-checkpointing"
+
+mkdir -p logs
+
+export LOG_BASE="logs_offload"
+mkdir -p ${LOG_BASE}
+
+MODEL="meta-llama/Meta-Llama-3-70B-Instruct"
+BATCH_SIZE_OPTS=(1)
+SEQ_LENGTH_OPTS=(1024)
+for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+    for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+        ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} ${ACC_OPTS} ${AC_OPTS} ${PROFILE_OPTS}"
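+        # ZeRO-3 baseline, ZeRO-3 with DeepSpeed optimizer-state offloading, then
+        # DeepCompile's offload_adam_states passes (default and synchronous variants)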
+        bash ./run.sh --backend deepspeed ${ARGS} --zero-stage 3
+        bash ./run.sh --backend deepspeed ${ARGS} --zero-stage 3 --ds-offload
+        bash ./run.sh --backend deepspeed ${ARGS} ${DC_OPTS} --zero-stage 3 --eager --passes offload_adam_states
+        bash ./run.sh --backend deepspeed ${ARGS} ${DC_OPTS} --zero-stage 3 --eager --passes offload_adam_states_sync
+    done
+done
diff --git a/benchmarks/deepcompile/run_bench_z1.sh b/benchmarks/deepcompile/run_bench_z1.sh
new file mode 100644
index 000000000..b5491e3fc
--- /dev/null
+++ b/benchmarks/deepcompile/run_bench_z1.sh
@@ -0,0 +1,21 @@
+PROFILE_DIR=${PROFILE_DIR:-profiles}
+mkdir -p ${PROFILE_DIR}
+PROFILE_OPTS="--profile --profile-dir ${PROFILE_DIR}"
+COMPILE_OPTS="--compile"
+DC_OPTS="--compile --deepcompile"
+ACC_OPTS="--gradient-accumulation-steps 1"
+AC_OPTS="--activation-checkpointing"
+
+MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
+BATCH_SIZE_OPTS=(1 2 4)
+SEQ_LENGTH_OPTS=(512 1024 2048)
+for BATCH_SIZE in ${BATCH_SIZE_OPTS[@]}; do
+    for SEQ_LENGTH in ${SEQ_LENGTH_OPTS[@]}; do
+        ARGS="--model ${MODEL} --batch-size ${BATCH_SIZE} --seq-length ${SEQ_LENGTH} --zero-stage 1 ${ACC_OPTS} ${AC_OPTS}"
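+        # Eager ZeRO-1 baseline, torch.compile baseline, and DeepCompile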
+        bash ./run_multinode.sh --backend deepspeed ${ARGS}
+        bash ./run_multinode.sh --backend deepspeed ${ARGS} ${COMPILE_OPTS}
+        bash ./run_multinode.sh --backend deepspeed ${ARGS} ${DC_OPTS}
+
+        cp -r logs ${PROFILE_DIR}/
+    done
+done
diff --git a/benchmarks/deepcompile/run_multinode.sh b/benchmarks/deepcompile/run_multinode.sh
new file mode 100644
index 000000000..6f3feba9a
--- /dev/null
+++ b/benchmarks/deepcompile/run_multinode.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+echo $*
+
+SCRIPT_DIR=$(dirname $(realpath $0))
+HOST_IP=$(hostname -i)
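+# Default to the number of hosts in the cluster-provided hostfile when NUM_NODES is unset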
+NUM_NODES=${NUM_NODES:-$(wc -l < /job/hostfile)}
+
+if [ "${NUM_NODES}" == "1" ]; then
+ # avoid dependency on pdsh when possible
+ cd ${SCRIPT_DIR}; bash ./run.sh --host-ip ${HOST_IP} $*
+else
+ ds_ssh -f hostfile_n${NUM_NODES} "cd ${SCRIPT_DIR}; bash ./run.sh --host-ip ${HOST_IP} $*"
+fi