
Commit a2b16da

zhuzilin authored
Add GLM-4.7-Flash example docs and 8xH100 training script (#1645)
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent fe1d0e6 commit a2b16da

7 files changed: +492, -2 lines changed

docs/en/examples/glm4.7-30B-A3B.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# GLM-4.7-Flash with 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to [Example: Qwen3-4B Model](qwen3-4B.md), replacing mentions of Qwen3-4B with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

To convert the Hugging Face checkpoint to torch_dist format:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```
## Run Training

Execute the training script:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Introduction

Here we briefly introduce the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.
#### MoE Configuration

GLM-4.7-Flash is a Mixture-of-Experts (MoE) model with 64 routed experts (top-4 activation) and 1 shared expert. It has 47 layers: 1 dense layer + 46 MoE layers.
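As a concrete illustration of what "64 routed experts, top-4 activation" means per token, here is a minimal sketch of generic top-k router selection. This is illustrative only: the names and details are ours, not Megatron's or slime's actual routing code, and the shared expert is simply noted as always active.

```python
import math

NUM_EXPERTS = 64   # routed experts in GLM-4.7-Flash
TOP_K = 4          # experts activated per token

def route(logits):
    """Pick the top-k experts for one token and softmax-normalize their weights.

    `logits` holds one router score per routed expert. The shared expert is
    not routed -- its output is always added. Illustrative sketch only.
    """
    assert len(logits) == NUM_EXPERTS
    top = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    m = max(logits[e] for e in top)          # subtract max for numerical stability
    exp = [math.exp(logits[e] - m) for e in top]
    z = sum(exp)
    return [(e, w / z) for e, w in zip(top, exp)]

# Example: a token whose router strongly prefers experts 3, 17, 42, 63
logits = [0.0] * NUM_EXPERTS
for e in (3, 17, 42, 63):
    logits[e] = 5.0
picks = route(logits)
print(sorted(e for e, _ in picks))  # the four preferred experts
```

Only these 4 of the 64 routed experts run for this token, which is why the model's active parameter count is far below its total parameter count.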
1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

   ```bash
   OPTIMIZER_ARGS=(
      ...
      --optimizer-cpu-offload
      --overlap-cpu-optimizer-d2h-h2d
      --use-precision-aware-optimizer
   )
   ```
2. Enable MoE optimization in Megatron. For single-node 8×H100, we use TP=1, EP=8:

   ```bash
   PERF_ARGS=(
      --tensor-model-parallel-size 1
      --pipeline-model-parallel-size 1
      --context-parallel-size 1
      --expert-model-parallel-size 8
      --expert-tensor-parallel-size 1
      ...
   )
   ```
3. Enable MoE optimization in SGLang with DP attention:

   ```bash
   SGLANG_ARGS=(
      --rollout-num-gpus-per-engine 8
      --sglang-mem-fraction-static 0.7
      --sglang-enable-dp-attention
      --sglang-dp-size 8
      --sglang-enable-dp-lm-head
      --sglang-moe-dense-tp-size 1
      ...
   )
   ```
#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding during inference to speed up rollout generation. To enable this, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This enables SGLang to use the model's MTP layer as a draft model for EAGLE-style speculative decoding. The MTP layer predicts multiple future tokens, and SGLang verifies them in parallel, leading to faster generation.
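The draft-then-verify idea can be illustrated with a toy greedy sketch. The `draft_model` and `target_model` callables here are hypothetical stand-ins: in GLM-4.7-Flash the MTP layer plays the draft role, and SGLang's real EAGLE verification is tree-based and batched, not this sequential loop.

```python
def speculative_step(prefix, draft_model, target_model, num_draft_tokens=3):
    """One toy speculative-decoding step: draft a few tokens cheaply, then keep
    the longest prefix of them that the target model agrees with (greedy version).
    """
    # 1) Draft phase: the cheap model proposes num_draft_tokens tokens.
    draft, seq = [], list(prefix)
    for _ in range(num_draft_tokens):
        t = draft_model(seq)
        draft.append(t)
        seq.append(t)

    # 2) Verify phase: accept draft tokens while the target model agrees.
    #    (In practice all draft positions are scored in ONE batched pass.)
    accepted, seq = [], list(prefix)
    for t in draft:
        if target_model(seq) != t:
            break
        accepted.append(t)
        seq.append(t)
    # Always emit one token from the target so decoding advances even when
    # every draft token is rejected.
    accepted.append(target_model(seq))
    return accepted

# Toy models over integer tokens: the target continues n -> n+1, while the
# draft only agrees when the last token is even.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if seq[-1] % 2 == 0 else seq[-1] + 2
print(speculative_step([0], draft, target))  # one accepted draft token + one target token
```

When the draft agrees fully, a step emits up to `num_draft_tokens + 1` tokens at once, which is where the speedup comes from; a rejection still yields one correct token.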
> ⚠️ **Note**: Speculative decoding requires additional GPU memory. If you encounter OOM issues, try reducing `--sglang-mem-fraction-static` or disabling speculative decoding.
#### MTP Training

slime also supports training MTP layers jointly with the main model for models that have MTP weight conversion implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```
- `--mtp-num-layers 1`: Tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: Enables gradient computation for MTP layers. Without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: Weight of the MTP loss relative to the main policy loss. Default is 0.2.
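In effect, the scaling factor down-weights the MTP loss before it is added to the policy loss. A one-line sketch of that relationship (hypothetical illustration of what `--mtp-loss-scaling-factor` controls, not slime's actual loss code):

```python
def combined_loss(policy_loss, mtp_loss, mtp_loss_scaling_factor=0.2):
    """Total training loss when MTP training is enabled: the MTP (draft-layer)
    loss is down-weighted relative to the main policy loss. Hypothetical sketch."""
    return policy_loss + mtp_loss_scaling_factor * mtp_loss

print(combined_loss(2.0, 1.5))  # 2.0 + 0.2 * 1.5
```

A small factor keeps the MTP head learning without letting its auxiliary objective dominate the main policy update.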
122+
> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (`# TODO: mtp` in upstream mbridge). You can still use MTP for speculative decoding during inference — SGLang handles MTP layers internally.
123+
>
124+
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh` as a reference.
125+
126+
### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path  # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```
Key modifications for multi-node:

- Place the model and data on a path accessible by all nodes.
- Set `MASTER_ADDR` to an address accessible by all nodes.
- Remove the CPU Adam configuration (the distributed optimizer reduces per-GPU memory usage).
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.
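A quick sanity check on a parallel layout like the one above can be sketched with generic Megatron-style bookkeeping (illustrative only; Megatron enforces its own, stricter validation):

```python
def check_parallelism(world_size, tp, pp, cp, ep, num_experts=64):
    """Generic sanity checks for a Megatron-style parallel layout.

    Data parallelism fills whatever is left after TP/PP/CP, and the routed
    experts must shard evenly across expert-parallel ranks. Illustrative
    sketch only, not slime's or Megatron's actual validation logic.
    """
    assert world_size % (tp * pp * cp) == 0, "TP*PP*CP must divide the world size"
    dp = world_size // (tp * pp * cp)
    assert num_experts % ep == 0, "experts must shard evenly across EP ranks"
    return {"dp": dp, "experts_per_ep_rank": num_experts // ep}

# 2 nodes x 8 H100 with the parallelism suggested above
print(check_parallelism(16, tp=4, pp=2, cp=2, ep=8))
```

With 16 GPUs, TP=4 × PP=2 × CP=2 consumes the whole world size (DP=1), and EP=8 places 8 of the 64 routed experts on each expert-parallel rank.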
When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, in a 24-GPU scenario:
```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```

docs/en/get_started/quick_start.md

Lines changed: 2 additions & 1 deletion
@@ -71,7 +71,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \
 
 When using Megatron as the training backend, you need to first convert Hugging Face format model weights to Megatron `torch_dist` format.
 
-First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use GLM4-9B model as an example, and it's similar for Qwen3-4B, Qwen3-30B-A3B, etc.
+First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use the GLM4-9B model as an example, and it's similar for Qwen3-4B, GLM-4.7-Flash, Qwen3-30B-A3B, etc.
 
 ```bash
 cd /root/slime
@@ -580,6 +580,7 @@ export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=$(ip -o -4 addr show | awk '$4 ~ /^10\.
 
 slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:
 
+- [Example: 8xH100 Training GLM-4.7-Flash](../examples/glm4.7-30B-A3B.md)
 - [Example: 64xH100 Training GLM-4.5](../examples/glm4.5-355B-A32B.md)
 - [Example: 128xH100 Training DeepSeek-R1](../examples/deepseek-r1.md)
 - The scripts such as `scripts/run_qwen3_30b_a3b.py`, `scripts/run_glm45_355b_a32b.py` also support multi-node training, though there is little documentation about them currently.

docs/en/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ slime is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
    :maxdepth: 1
    :caption: MoE
 
+   examples/glm4.7-30B-A3B.md
    examples/qwen3-30B-A3B.md
    examples/glm4.5-355B-A32B.md
    examples/deepseek-r1.md

docs/zh/examples/glm4.7-30B-A3B.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Training GLM-4.7-Flash on 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model; see [Example: Qwen3-4B](qwen3-4B.md) and replace the Qwen3-4B parts with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

You can convert the Hugging Face checkpoint to torch_dist format as follows:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```

## Run Training

Execute the training:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Overview

Here we briefly introduce the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.

#### MoE Configuration

GLM-4.7-Flash is an MoE (Mixture-of-Experts) model with 64 routed experts (top-4 activation) and 1 shared expert, for 47 layers in total: 1 dense layer + 46 MoE layers.

1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

   ```bash
   OPTIMIZER_ARGS=(
      ...
      --optimizer-cpu-offload
      --overlap-cpu-optimizer-d2h-h2d
      --use-precision-aware-optimizer
   )
   ```

2. Enable the MoE optimizations supported by Megatron; the single-node 8×H100 configuration is TP=1, EP=8:

   ```bash
   PERF_ARGS=(
      --tensor-model-parallel-size 1
      --pipeline-model-parallel-size 1
      --context-parallel-size 1
      --expert-model-parallel-size 8
      --expert-tensor-parallel-size 1
      ...
   )
   ```

3. Enable the MoE optimizations supported by SGLang, using DP attention:

   ```bash
   SGLANG_ARGS=(
      --rollout-num-gpus-per-engine 8
      --sglang-mem-fraction-static 0.7
      --sglang-enable-dp-attention
      --sglang-dp-size 8
      --sglang-enable-dp-lm-head
      --sglang-moe-dense-tp-size 1
      ...
   )
   ```

#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding at inference time to speed up rollout generation. To enable it, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This lets SGLang use the model's MTP layer as the draft model for EAGLE-style speculative decoding: the MTP layer predicts several future tokens, and SGLang verifies them in parallel, speeding up generation.

> ⚠️ **Note**: Speculative decoding uses additional GPU memory. If you run into OOM issues, try lowering `--sglang-mem-fraction-static` or disabling speculative decoding.

#### MTP Training

slime also supports training the MTP layer jointly with the main model, for models whose MTP weight conversion is implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add the MTP layer count to the model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```

- `--mtp-num-layers 1`: tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: enables gradient computation for the MTP layer; without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: the weight of the MTP loss relative to the main policy loss; the default is 0.2.

> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported, because the deepseek_v3 checkpoint bridge has not implemented MTP weight conversion (marked `# TODO: mtp` in upstream mbridge). Speculative decoding at inference time still works, since SGLang handles the MTP layer internally.
>
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh`.

### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path  # a path accessible from all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```

For a multi-node environment, make the following changes:

- Place the training model and data on a path accessible from all machines.
- Set a `MASTER_ADDR` that every machine can reach.
- Remove the CPU Adam configuration: with the distributed optimizer, the optimizer's share of GPU memory drops significantly across multiple nodes.
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.

When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, for a 24-GPU scenario:

```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```

docs/zh/get_started/quick_start.md

Lines changed: 2 additions & 1 deletion
@@ -70,7 +70,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \
 
 When using Megatron as the training backend, you first need to convert the Hugging Face format model weights to Megatron `torch_dist` format.
 
-First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for the supported models. You need to `source` the corresponding model script to load its configuration parameters into the current environment. Here we take the GLM4-9B model as an example; Qwen3-4B, Qwen3-30B-A3B, etc. are similar.
+First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for the supported models. You need to `source` the corresponding model script to load its configuration parameters into the current environment. Here we take the GLM4-9B model as an example; Qwen3-4B, GLM-4.7-Flash, Qwen3-30B-A3B, etc. are similar.
 
 ```bash
 cd /root/slime
@@ -577,5 +577,6 @@ ray job submit --address="http://127.0.0.1:8265" \
 
 slime has been deeply optimized for distributed training of large-scale Mixture-of-Experts (MoE) models. We provide some end-to-end training cases for reference:
 
+- [Example: Training GLM-4.7-Flash on 8xH100](../examples/glm4.7-30B-A3B.md)
 - [Example: Training GLM-4.5 on 64xH100](../examples/glm4.5-355B-A32B.md)
 - [Example: Training DeepSeek-R1 on 128xH100](../examples/deepseek-r1.md)

docs/zh/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ slime is the RL training framework behind GLM-4.7, GLM-4.6, and GLM-4.5. Beyond
    :maxdepth: 1
    :caption: MoE
 
+   examples/glm4.7-30B-A3B.md
    examples/qwen3-30B-A3B.md
    examples/glm4.5-355B-A32B.md
    examples/deepseek-r1.md
