Commit 2b72236: Merge branch 'main' into codex/slime-skills-alignment
(2 parents: acd61cc + a2b16da)
46 files changed: +3361 additions, −429 deletions

.github/workflows/pr-test.yml (1 addition, 1 deletion)

```diff
@@ -48,7 +48,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      info: [{"num_gpus": 4, "test_file": "test_qwen2.5_0.5B_gsm8k_async_short.py"}, {"num_gpus": 4, "test_file": "test_qwen2.5_0.5B_gsm8k_short.py"}]
+      info: [{"num_gpus": 4, "test_file": "test_qwen2.5_0.5B_gsm8k_async_short.py"}, {"num_gpus": 4, "test_file": "test_qwen2.5_0.5B_gsm8k_short.py"}, {"num_gpus": 8, "test_file": "test_qwen2.5_0.5B_sglang_config.py"}, {"num_gpus": 8, "test_file": "test_qwen2.5_0.5B_sglang_config_distributed.py"}]
   defaults:
     run:
       working-directory: ${{ github.workspace }}
```

.github/workflows/pr-test.yml.j2 (2 additions, 0 deletions)

```diff
@@ -4,6 +4,8 @@
     'tests': [
         {'test_file': 'test_qwen2.5_0.5B_gsm8k_async_short.py', 'num_gpus': 4},
         {'test_file': 'test_qwen2.5_0.5B_gsm8k_short.py', 'num_gpus': 4},
+        {'test_file': 'test_qwen2.5_0.5B_sglang_config.py', 'num_gpus': 8},
+        {'test_file': 'test_qwen2.5_0.5B_sglang_config_distributed.py', 'num_gpus': 8},
     ],
 },
 'e2e-test-fsdp': {
```

docker/patch/latest/sglang.patch (681 additions, 16 deletions; large diff not rendered)

docker/version.txt (1 addition, 1 deletion)

```diff
@@ -1 +1 @@
-nightly-dev-20260225a
+nightly-dev-20260227a
```

docs/en/developer_guide/debug.md (42 additions, 0 deletions)

When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and SGLang will not be initialized (automatically setting `debug_train_only=True`). This lets you fix the input to the training part in order to tune it, for example by switching between different parallelization strategies.
## INT4 / Compressed-Tensors Quantization Checkpoint Issues

When using INT4-quantized models (e.g., `compressed-tensors` with `W4A16`), the checkpoint's `config.json` contains a `quantization_config.ignore` list that specifies which parameters should **not** be quantized. During online weight updates (Megatron → SGLang), slime also reads this ignore list to decide which parameters to INT4-quantize. An incorrect ignore list can cause silent errors:

1. **MoE router weights (`mlp.gate.weight`) become all zeros**

   The MoE router weight (`mlp.gate.weight`, shape `[num_experts, hidden_size]`) is a plain 2D weight tensor, but it is **not** a Linear layer weight. If it is not in the ignore list, the online quantizer will INT4-quantize it into `weight_packed`, `weight_scale`, `weight_zero_point`, etc. However, SGLang does not expect quantized names for the router, so these parameters are silently skipped during `load_weights`, resulting in all-zero gate weights.

   **Fix**: Ensure `config.json` contains `"re:.*mlp\\.gate\\..*"` in the ignore list.

2. **Other non-Linear 2D weights**

   Similar issues can occur with any 2D `.weight` tensor that is not a true Linear layer, such as `model.embed_tokens.weight`. Always verify that the ignore list covers all non-Linear weights.

   **Recommended ignore patterns** (for GLM-style MoE models):

   ```json
   "ignore": [
       "lm_head",
       "model.embed_tokens.weight",
       "re:.*self_attn.*",
       "re:.*mlp\\.shared_experts.*",
       "re:.*mlp\\.gate_up_proj.*",
       "re:.*mlp\\.gate_proj.*",
       "re:.*mlp\\.up_proj.*",
       "re:.*mlp\\.down_proj.*",
       "re:.*eh_proj.*",
       "re:.*mlp\\.gate\\..*"
   ]
   ```
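The ignore list mixes literal names with `re:`-prefixed regular expressions. As a rough sketch of how a quantizer might consult such a list (a hypothetical helper, not slime's actual implementation; in particular, matching plain entries as substrings is an assumption):

```python
import re

def is_ignored(param_name, ignore_patterns):
    """Return True if the parameter should be skipped by the quantizer."""
    for pattern in ignore_patterns:
        if pattern.startswith("re:"):
            # Entries prefixed with "re:" are regular expressions.
            if re.match(pattern[3:], param_name):
                return True
        elif pattern in param_name:
            # Plain entries are matched as substrings (assumption).
            return True
    return False

ignore = ["lm_head", "model.embed_tokens.weight", r"re:.*mlp\.gate\..*"]
# The MoE router stays unquantized thanks to the "re:.*mlp\.gate\..*" entry:
print(is_ignored("model.layers.3.mlp.gate.weight", ignore))                 # True
# A true Linear expert weight is not ignored, so it gets quantized:
print(is_ignored("model.layers.3.mlp.experts.0.down_proj.weight", ignore))  # False
```

With such a check in place, forgetting the `mlp.gate` pattern is exactly what lets the router weight slip into the INT4 path described above.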
3. **Missing safetensors shards**

   Conversion tools may occasionally produce an incomplete checkpoint (e.g., a missing `model-00010-of-00093.safetensors`). After conversion, always verify:

   - The number of `.safetensors` files matches the expected count.
   - The `model.safetensors.index.json` contains entries for every layer.
   - Spot-check that critical layers (e.g., the first MoE layer) have the expected number of keys.
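The shard checks above can be scripted. A minimal sketch, assuming the standard Hugging Face `model.safetensors.index.json` layout whose `weight_map` maps tensor names to shard filenames:

```python
import json
import os

def verify_checkpoint(ckpt_dir):
    """Raise if any shard referenced by the index is missing on disk."""
    with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as f:
        index = json.load(f)
    # weight_map: {tensor_name: shard_filename}
    referenced = set(index["weight_map"].values())
    present = {f for f in os.listdir(ckpt_dir) if f.endswith(".safetensors")}
    missing = referenced - present
    if missing:
        raise RuntimeError(f"missing shards: {sorted(missing)}")
    return len(referenced)  # number of shards the index expects
```

A per-layer spot-check can then be done by filtering `index["weight_map"]` keys by a layer prefix (e.g., the first MoE layer) and counting them.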
4. **How to diagnose**

   - Use `--check-weight-update-equal` to verify that weights after a Megatron → SGLang sync match the expected values. If a parameter shows all zeros on the SGLang side, it was likely incorrectly quantized or missing from the checkpoint.
   - Use `--debug-rollout-only` with a small number of GPUs to quickly test whether SGLang can generate coherent text from the quantized checkpoint alone.

## Debug sglang illegal memory access (IMA)

When running large-scale RL, we occasionally encounter IMA errors in SGLang. Here are some debugging suggestions based on our experience:

docs/en/examples/glm4.7-30B-A3B.md (156 additions, 0 deletions)

# GLM-4.7-Flash with 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to [Example: Qwen3-4B Model](qwen3-4B.md), replacing mentions of Qwen3-4B with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

To convert the Hugging Face checkpoint to torch_dist format:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```

## Run Training

Execute the training script:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```
### Parameter Introduction

Here we briefly introduce the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.

#### MoE Configuration

GLM-4.7-Flash is a Mixture-of-Experts (MoE) model with 64 routed experts (top-4 activation) and 1 shared expert. It has 47 layers: 1 dense layer + 46 MoE layers.

1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

   ```bash
   OPTIMIZER_ARGS=(
       ...
       --optimizer-cpu-offload
       --overlap-cpu-optimizer-d2h-h2d
       --use-precision-aware-optimizer
   )
   ```

2. Enable MoE optimization in Megatron. For a single node with 8×H100, we use TP=1, EP=8:

   ```bash
   PERF_ARGS=(
       --tensor-model-parallel-size 1
       --pipeline-model-parallel-size 1
       --context-parallel-size 1
       --expert-model-parallel-size 8
       --expert-tensor-parallel-size 1
       ...
   )
   ```
3. Enable MoE optimization in SGLang with DP attention:

   ```bash
   SGLANG_ARGS=(
       --rollout-num-gpus-per-engine 8
       --sglang-mem-fraction-static 0.7
       --sglang-enable-dp-attention
       --sglang-dp-size 8
       --sglang-enable-dp-lm-head
       --sglang-moe-dense-tp-size 1
       ...
   )
   ```
#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding during inference to speed up rollout generation. To enable this, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This enables SGLang to use the model's MTP layer as a draft model for EAGLE-style speculative decoding. The MTP layer predicts multiple future tokens, and SGLang verifies them in parallel, leading to faster generation.

> ⚠️ **Note**: Speculative decoding requires additional GPU memory. If you encounter OOM issues, try reducing `--sglang-mem-fraction-static` or disabling speculative decoding.
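With `--sglang-speculative-eagle-topk 1`, the draft model proposes a single chain of tokens, so the draft-token budget is typically the number of draft steps plus one bonus token from the verifier. A quick sanity check for the values above (this relation is an assumption and covers only the chain, topk = 1 case):

```python
def check_eagle_chain_config(num_steps, num_draft_tokens):
    """For chain drafting (topk == 1): one drafted token per step,
    plus the bonus token from the target model's verification pass."""
    return num_draft_tokens == num_steps + 1

# --sglang-speculative-num-steps 2, --sglang-speculative-num-draft-tokens 3
print(check_eagle_chain_config(2, 3))  # True
```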
#### MTP Training

slime also supports training MTP layers jointly with the main model for models that have MTP weight conversion implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```

- `--mtp-num-layers 1`: Tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: Enables gradient computation for MTP layers. Without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: Weight of the MTP loss relative to the main policy loss. Default is 0.2.

> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (`# TODO: mtp` in upstream mbridge). You can still use MTP for speculative decoding during inference, since SGLang handles MTP layers internally.
>
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh` as a reference.
### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path  # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```

Key modifications for multi-node:

- Place the model and data on a path accessible by all nodes.
- Set `MASTER_ADDR` to an address accessible by all nodes.
- Remove the CPU Adam configuration (the distributed optimizer reduces per-GPU memory usage).
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.
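When adjusting the layout, the product TP × PP × CP must divide the GPU count; whatever remains is the data-parallel size. A simplified sanity check (Megatron's real validation also constrains EP against the data-parallel group, which this sketch ignores):

```python
def data_parallel_size(num_gpus, tp, pp, cp):
    """Return the implied data-parallel size for a TP/PP/CP layout."""
    denom = tp * pp * cp
    if num_gpus % denom != 0:
        raise ValueError(f"{num_gpus} GPUs cannot be tiled by TP*PP*CP={denom}")
    return num_gpus // denom

# 2 nodes x 8 H100 with TP=4, PP=2, CP=2:
print(data_parallel_size(16, 4, 2, 2))  # 1
```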
When the total number of GPUs is not a multiple or divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, in a 24-GPU scenario:

```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```

docs/en/get_started/quick_start.md (2 additions, 1 deletion)

````diff
@@ -71,7 +71,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \

 When using Megatron as the training backend, you need to first convert Hugging Face format model weights to Megatron `torch_dist` format.

-First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use GLM4-9B model as an example, and it's similar for Qwen3-4B, Qwen3-30B-A3B, etc.
+First, load the configuration file of the target model. The `slime/scripts/models` directory contains configuration files for supported models. You need to `source` the corresponding model script to load the configuration parameters into the current environment. Here we use GLM4-9B model as an example, and it's similar for Qwen3-4B, GLM-4.7-Flash, Qwen3-30B-A3B, etc.

 ```bash
 cd /root/slime
@@ -580,6 +580,7 @@ export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=$(ip -o -4 addr show | awk '$4 ~ /^10\.

 slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:

+- [Example: 8xH100 Training GLM-4.7-Flash](../examples/glm4.7-30B-A3B.md)
 - [Example: 64xH100 Training GLM-4.5](../examples/glm4.5-355B-A32B.md)
 - [Example: 128xH100 Training DeepSeek-R1](../examples/deepseek-r1.md)
 - The scripts such as `scripts/run_qwen3_30b_a3b.py`, `scripts/run_glm45_355b_a32b.py` also support multi-node training, though there is little documentation about it currently.
````

docs/en/index.rst (1 addition, 0 deletions)

```diff
@@ -32,6 +32,7 @@ slime is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
    :maxdepth: 1
    :caption: MoE

+   examples/glm4.7-30B-A3B.md
    examples/qwen3-30B-A3B.md
    examples/glm4.5-355B-A32B.md
    examples/deepseek-r1.md
```

docs/zh/developer_guide/debug.md (42 additions, 0 deletions)

When enabled, data is loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)` and sglang is not initialized (`debug_train_only=True` is set automatically). This lets you fix the input to the training part and tune it, for example by switching between parallelization strategies.

## INT4 / Compressed-Tensors Quantization Checkpoint Issues

When using an INT4-quantized model (e.g., `compressed-tensors` with `W4A16`), the checkpoint's `config.json` contains a `quantization_config.ignore` list specifying which parameters should **not** be quantized. During online weight updates (Megatron → SGLang), slime also reads this ignore list to decide which parameters need INT4 quantization. An incorrect ignore list leads to silent errors:

1. **MoE router weights (`mlp.gate.weight`) become all zeros**

   The MoE router weight (`mlp.gate.weight`, shape `[num_experts, hidden_size]`) is a plain 2D weight tensor, but it is **not** a Linear layer weight. If it is not in the ignore list, the online quantizer will INT4-quantize it into `weight_packed`, `weight_scale`, `weight_zero_point`, etc. However, SGLang does not load the router weight under quantized names, so these parameters are silently skipped during `load_weights`, leaving the gate weights all zero.

   **Fix**: Make sure the ignore list in `config.json` contains `"re:.*mlp\\.gate\\..*"`.

2. **Other non-Linear 2D weights**

   Similar issues can occur with any 2D `.weight` tensor that is not a true Linear layer, such as `model.embed_tokens.weight`. Always check that the ignore list covers all non-Linear weights.

   **Recommended ignore patterns** (for GLM-style MoE models):

   ```json
   "ignore": [
       "lm_head",
       "model.embed_tokens.weight",
       "re:.*self_attn.*",
       "re:.*mlp\\.shared_experts.*",
       "re:.*mlp\\.gate_up_proj.*",
       "re:.*mlp\\.gate_proj.*",
       "re:.*mlp\\.up_proj.*",
       "re:.*mlp\\.down_proj.*",
       "re:.*eh_proj.*",
       "re:.*mlp\\.gate\\..*"
   ]
   ```

3. **Missing safetensors shards**

   Conversion tools may occasionally produce an incomplete checkpoint (e.g., a missing `model-00010-of-00093.safetensors`). After conversion, always check:

   - Whether the number of `.safetensors` files matches the expected count.
   - Whether `model.safetensors.index.json` contains entries for every layer.
   - Spot-check that key layers (e.g., the first MoE layer) have the expected number of keys.

4. **How to diagnose**

   - Use `--check-weight-update-equal` to verify that the values after a Megatron → SGLang weight sync are correct. If a parameter is all zeros on the SGLang side, it was likely mis-quantized or missing from the checkpoint.
   - Use `--debug-rollout-only` with a small number of GPUs to quickly test whether SGLang can generate text normally from the quantized checkpoint.

## Debug sglang illegal memory access (IMA)

When running large-scale RL, we occasionally hit IMA issues in SGLang. Here are some of our debugging suggestions:
