
Commit a2b16da

zhuzilin authored
Add GLM-4.7-Flash example docs and 8xH100 training script (#1645)
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent fe1d0e6 commit a2b16da

7 files changed: +492, -2 lines changed

docs/en/examples/glm4.7-30B-A3B.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# GLM-4.7-Flash with 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to [Example: Qwen3-4B Model](qwen3-4B.md), replacing mentions of Qwen3-4B with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

To convert the Hugging Face checkpoint to torch_dist format:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```
## Run Training

Execute the training script:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Introduction

Here we briefly introduce the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.
#### MoE Configuration

GLM-4.7-Flash is a Mixture-of-Experts (MoE) model with 64 routed experts (top-4 activation) and 1 shared expert. It has 47 layers: 1 dense layer + 46 MoE layers.
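As a concrete illustration of what "64 routed experts, top-4 activation" means per token, here is a minimal sketch of generic top-k router selection. This is illustrative only: the names and details are ours, not Megatron's or slime's actual routing code, and the shared expert is simply noted as always active.

```python
import math

NUM_EXPERTS = 64   # routed experts in GLM-4.7-Flash
TOP_K = 4          # experts activated per token

def route(logits):
    """Pick the top-k experts for one token and softmax-normalize their weights.

    `logits` holds one router score per routed expert. The shared expert is
    not routed -- its output is always added. Illustrative sketch only.
    """
    assert len(logits) == NUM_EXPERTS
    top = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    m = max(logits[e] for e in top)          # subtract max for numerical stability
    exp = [math.exp(logits[e] - m) for e in top]
    z = sum(exp)
    return [(e, w / z) for e, w in zip(top, exp)]

# Example: a token whose router strongly prefers experts 3, 17, 42, 63
logits = [0.0] * NUM_EXPERTS
for e in (3, 17, 42, 63):
    logits[e] = 5.0
picks = route(logits)
print(sorted(e for e, _ in picks))  # the four preferred experts
```

Only these 4 of the 64 routed experts run for this token, which is why the model's active parameter count is far below its total parameter count.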
1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

   ```bash
   OPTIMIZER_ARGS=(
      ...
      --optimizer-cpu-offload
      --overlap-cpu-optimizer-d2h-h2d
      --use-precision-aware-optimizer
   )
   ```
2. Enable MoE optimization in Megatron. For single-node 8×H100, we use TP=1, EP=8:

   ```bash
   PERF_ARGS=(
      --tensor-model-parallel-size 1
      --pipeline-model-parallel-size 1
      --context-parallel-size 1
      --expert-model-parallel-size 8
      --expert-tensor-parallel-size 1
      ...
   )
   ```
3. Enable MoE optimization in SGLang with DP attention:

   ```bash
   SGLANG_ARGS=(
      --rollout-num-gpus-per-engine 8
      --sglang-mem-fraction-static 0.7
      --sglang-enable-dp-attention
      --sglang-dp-size 8
      --sglang-enable-dp-lm-head
      --sglang-moe-dense-tp-size 1
      ...
   )
   ```
#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding during inference to speed up rollout generation. To enable this, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This enables SGLang to use the model's MTP layer as a draft model for EAGLE-style speculative decoding. The MTP layer predicts multiple future tokens, and SGLang verifies them in parallel, leading to faster generation.
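The draft-then-verify idea can be illustrated with a toy greedy sketch. The `draft_model` and `target_model` callables here are hypothetical stand-ins: in GLM-4.7-Flash the MTP layer plays the draft role, and SGLang's real EAGLE verification is tree-based and batched, not this sequential loop.

```python
def speculative_step(prefix, draft_model, target_model, num_draft_tokens=3):
    """One toy speculative-decoding step: draft a few tokens cheaply, then keep
    the longest prefix of them that the target model agrees with (greedy version).
    """
    # 1) Draft phase: the cheap model proposes num_draft_tokens tokens.
    draft, seq = [], list(prefix)
    for _ in range(num_draft_tokens):
        t = draft_model(seq)
        draft.append(t)
        seq.append(t)

    # 2) Verify phase: accept draft tokens while the target model agrees.
    #    (In practice all draft positions are scored in ONE batched pass.)
    accepted, seq = [], list(prefix)
    for t in draft:
        if target_model(seq) != t:
            break
        accepted.append(t)
        seq.append(t)
    # Always emit one token from the target so decoding advances even when
    # every draft token is rejected.
    accepted.append(target_model(seq))
    return accepted

# Toy models over integer tokens: the target continues n -> n+1, while the
# draft only agrees when the last token is even.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if seq[-1] % 2 == 0 else seq[-1] + 2
print(speculative_step([0], draft, target))  # one accepted draft token + one target token
```

When the draft agrees fully, a step emits up to `num_draft_tokens + 1` tokens at once, which is where the speedup comes from; a rejection still yields one correct token.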
> ⚠️ **Note**: Speculative decoding requires additional GPU memory. If you encounter OOM issues, try reducing `--sglang-mem-fraction-static` or disabling speculative decoding.
#### MTP Training

slime also supports training MTP layers jointly with the main model for models that have MTP weight conversion implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```
- `--mtp-num-layers 1`: Tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: Enables gradient computation for MTP layers. Without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: Weight of the MTP loss relative to the main policy loss. Default is 0.2.
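In effect, the scaling factor down-weights the MTP loss before it is added to the policy loss. A one-line sketch of that relationship (hypothetical illustration of what `--mtp-loss-scaling-factor` controls, not slime's actual loss code):

```python
def combined_loss(policy_loss, mtp_loss, mtp_loss_scaling_factor=0.2):
    """Total training loss when MTP training is enabled: the MTP (draft-layer)
    loss is down-weighted relative to the main policy loss. Hypothetical sketch."""
    return policy_loss + mtp_loss_scaling_factor * mtp_loss

print(combined_loss(2.0, 1.5))  # 2.0 + 0.2 * 1.5
```

A small factor keeps the MTP head learning without letting its auxiliary objective dominate the main policy update.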
122+
> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (`# TODO: mtp` in upstream mbridge). You can still use MTP for speculative decoding during inference — SGLang handles MTP layers internally.
123+
>
124+
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh` as a reference.
125+
126+
### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path  # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```
Key modifications for multi-node:

- Place the model and data on a path accessible by all nodes.
- Set `MASTER_ADDR` to an address accessible by all nodes.
- Remove the CPU Adam configuration (the distributed optimizer reduces per-GPU memory usage).
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.
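A quick sanity check on a parallel layout like the one above can be sketched with generic Megatron-style bookkeeping (illustrative only; Megatron enforces its own, stricter validation):

```python
def check_parallelism(world_size, tp, pp, cp, ep, num_experts=64):
    """Generic sanity checks for a Megatron-style parallel layout.

    Data parallelism fills whatever is left after TP/PP/CP, and the routed
    experts must shard evenly across expert-parallel ranks. Illustrative
    sketch only, not slime's or Megatron's actual validation logic.
    """
    assert world_size % (tp * pp * cp) == 0, "TP*PP*CP must divide the world size"
    dp = world_size // (tp * pp * cp)
    assert num_experts % ep == 0, "experts must shard evenly across EP ranks"
    return {"dp": dp, "experts_per_ep_rank": num_experts // ep}

# 2 nodes x 8 H100 with the parallelism suggested above
print(check_parallelism(16, tp=4, pp=2, cp=2, ep=8))
```

With 16 GPUs, TP=4 × PP=2 × CP=2 consumes the whole world size (DP=1), and EP=8 places 8 of the 64 routed experts on each expert-parallel rank.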
When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, in a 24-GPU scenario:
```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```

docs/en/get_started/quick_start.md

Lines changed: 2 additions & 1 deletion
@@ -71,7 +71,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \
 
 When using Megatron as the training backend, you need to first convert Hugging Face format model weights to Megatron `torch_dist` format.
 
-First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use GLM4-9B model as an example, and it's similar for Qwen3-4B, Qwen3-30B-A3B, etc.
+First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use the GLM4-9B model as an example, and it's similar for Qwen3-4B, GLM-4.7-Flash, Qwen3-30B-A3B, etc.
 
 ```bash
 cd /root/slime
@@ -580,6 +580,7 @@ export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=$(ip -o -4 addr show | awk '$4 ~ /^10\.
 
 slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:
 
+- [Example: 8xH100 Training GLM-4.7-Flash](../examples/glm4.7-30B-A3B.md)
 - [Example: 64xH100 Training GLM-4.5](../examples/glm4.5-355B-A32B.md)
 - [Example: 128xH100 Training DeepSeek-R1](../examples/deepseek-r1.md)
 - The scripts such as `scripts/run_qwen3_30b_a3b.py`, `scripts/run_glm45_355b_a32b.py` also support multi-node training, though there is little documentation about them currently.

docs/en/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ slime is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
    :maxdepth: 1
    :caption: MoE
 
+   examples/glm4.7-30B-A3B.md
    examples/qwen3-30B-A3B.md
    examples/glm4.5-355B-A32B.md
    examples/deepseek-r1.md

docs/zh/examples/glm4.7-30B-A3B.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Training GLM-4.7-Flash on 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model; see [Example: Qwen3-4B](qwen3-4B.md) and replace the Qwen3-4B parts with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

You can convert the Hugging Face checkpoint to torch_dist format as follows:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```

## Run Training

Execute the training:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Overview

Here we briefly introduce the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.

#### MoE Configuration

GLM-4.7-Flash is an MoE (Mixture-of-Experts) model with 64 routed experts (top-4 activation) and 1 shared expert, for 47 layers in total: 1 dense layer + 46 MoE layers.

1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

   ```bash
   OPTIMIZER_ARGS=(
      ...
      --optimizer-cpu-offload
      --overlap-cpu-optimizer-d2h-h2d
      --use-precision-aware-optimizer
   )
   ```

2. Enable the MoE optimizations supported by Megatron; the single-node 8×H100 configuration is TP=1, EP=8:

   ```bash
   PERF_ARGS=(
      --tensor-model-parallel-size 1
      --pipeline-model-parallel-size 1
      --context-parallel-size 1
      --expert-model-parallel-size 8
      --expert-tensor-parallel-size 1
      ...
   )
   ```

3. Enable the MoE optimizations supported by SGLang, using DP attention:

   ```bash
   SGLANG_ARGS=(
      --rollout-num-gpus-per-engine 8
      --sglang-mem-fraction-static 0.7
      --sglang-enable-dp-attention
      --sglang-dp-size 8
      --sglang-enable-dp-lm-head
      --sglang-moe-dense-tp-size 1
      ...
   )
   ```

#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding at inference time to speed up rollout generation. To enable it, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This lets SGLang use the model's MTP layer as the draft model for EAGLE-style speculative decoding: the MTP layer predicts several future tokens, and SGLang verifies them in parallel, speeding up generation.

> ⚠️ **Note**: Speculative decoding uses additional GPU memory. If you run into OOM issues, try lowering `--sglang-mem-fraction-static` or disabling speculative decoding.

#### MTP Training

slime also supports training the MTP layer jointly with the main model, for models whose MTP weight conversion is implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add the MTP layer count to the model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```

- `--mtp-num-layers 1`: tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: enables gradient computation for the MTP layer; without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: the weight of the MTP loss relative to the main policy loss; the default is 0.2.

> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported, because the deepseek_v3 checkpoint bridge has not implemented MTP weight conversion (marked `# TODO: mtp` in upstream mbridge). Speculative decoding at inference time still works, since SGLang handles the MTP layer internally.
>
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh`.

### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path  # a path accessible from all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```

For a multi-node environment, make the following changes:

- Place the training model and data on a path accessible from all machines.
- Set a `MASTER_ADDR` that every machine can reach.
- Remove the CPU Adam configuration: with the distributed optimizer, the optimizer's share of GPU memory drops significantly across multiple nodes.
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.

When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, for a 24-GPU scenario:

```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```

docs/zh/get_started/quick_start.md

Lines changed: 2 additions & 1 deletion
@@ -70,7 +70,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \
 
 When using Megatron as the training backend, you first need to convert the Hugging Face format model weights to Megatron `torch_dist` format.
 
-First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for the supported models. You need to `source` the corresponding model script to load its configuration parameters into the current environment. Here we take the GLM4-9B model as an example; Qwen3-4B, Qwen3-30B-A3B, etc. are similar.
+First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for the supported models. You need to `source` the corresponding model script to load its configuration parameters into the current environment. Here we take the GLM4-9B model as an example; Qwen3-4B, GLM-4.7-Flash, Qwen3-30B-A3B, etc. are similar.
 
 ```bash
 cd /root/slime
@@ -577,5 +577,6 @@ ray job submit --address="http://127.0.0.1:8265" \
 
 slime has been deeply optimized for distributed training of large-scale Mixture-of-Experts (MoE) models. We provide some end-to-end training cases for reference:
 
+- [Example: Training GLM-4.7-Flash on 8xH100](../examples/glm4.7-30B-A3B.md)
 - [Example: Training GLM-4.5 on 64xH100](../examples/glm4.5-355B-A32B.md)
 - [Example: Training DeepSeek-R1 on 128xH100](../examples/deepseek-r1.md)

docs/zh/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ slime is the RL training framework behind GLM-4.7, GLM-4.6, and GLM-4.5. Beyond
    :maxdepth: 1
    :caption: MoE
 
+   examples/glm4.7-30B-A3B.md
    examples/qwen3-30B-A3B.md
    examples/glm4.5-355B-A32B.md
    examples/deepseek-r1.md
