# GLM-4.7-Flash with 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to [Example: Qwen3-4B Model](qwen3-4B.md), replacing mentions of Qwen3-4B with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

To convert the Hugging Face checkpoint to torch_dist format:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```

## Run Training

Execute the training script:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Introduction

This section briefly introduces the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.

#### MoE Configuration

GLM-4.7-Flash is a Mixture-of-Experts (MoE) model with 64 routed experts (top-4 activation) and 1 shared expert. It has 47 layers: 1 dense layer + 46 MoE layers.

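As a quick back-of-the-envelope check of this layout (plain arithmetic, not tied to any script), each MoE layer activates only a small fraction of its experts per token:

```shell
# Experts active per token in one MoE layer: top-4 routed + 1 always-on shared
ROUTED_TOPK=4
SHARED=1
ACTIVE=$(( ROUTED_TOPK + SHARED ))
echo "$ACTIVE of $(( 64 + SHARED )) experts active per token"
```

This sparsity is why the model trains and serves far more cheaply than a dense model of the same total parameter count.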
1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

    ```bash
    OPTIMIZER_ARGS=(
        ...
        --optimizer-cpu-offload
        --overlap-cpu-optimizer-d2h-h2d
        --use-precision-aware-optimizer
    )
    ```

2. Enable MoE optimization in Megatron. For single-node 8×H100, we use TP=1, EP=8:

    ```bash
    PERF_ARGS=(
        --tensor-model-parallel-size 1
        --pipeline-model-parallel-size 1
        --context-parallel-size 1
        --expert-model-parallel-size 8
        --expert-tensor-parallel-size 1
        ...
    )
    ```

3. Enable MoE optimization in SGLang with DP attention:

    ```bash
    SGLANG_ARGS=(
        --rollout-num-gpus-per-engine 8
        --sglang-mem-fraction-static 0.7
        --sglang-enable-dp-attention
        --sglang-dp-size 8
        --sglang-enable-dp-lm-head
        --sglang-moe-dense-tp-size 1
        ...
    )
    ```
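With `--expert-model-parallel-size 8`, the 64 routed experts are sharded evenly across the 8 ranks. A quick check of the arithmetic (not part of the script):

```shell
# Expert parallelism: 64 routed experts sharded over EP=8 ranks
NUM_EXPERTS=64
EP_SIZE=8
EXPERTS_PER_RANK=$(( NUM_EXPERTS / EP_SIZE ))
echo "$EXPERTS_PER_RANK routed experts hosted per GPU"
```

Each H100 thus holds the weights of 8 routed experts (the shared expert is typically replicated on every rank), which, together with CPU Adam offload, is what makes the single-node fit feasible.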

#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding during inference to speed up rollout generation. To enable this, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This enables SGLang to use the model's MTP layer as a draft model for EAGLE-style speculative decoding. The MTP layer predicts multiple future tokens, and SGLang verifies them in parallel, leading to faster generation.
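The three speculative values are not independent: for a chain-style draft (`topk 1`), the draft-token budget is typically `num_steps × topk + 1`, with the extra slot covering the root token. This is an assumption about the usual EAGLE configuration rather than something the script enforces, but the numbers above are consistent with it:

```shell
# EAGLE draft-token budget: steps * topk + 1 root token (assumed relationship)
STEPS=2
TOPK=1
DRAFT_TOKENS=$(( STEPS * TOPK + 1 ))
echo "$DRAFT_TOKENS"
```

If you raise `--sglang-speculative-num-steps`, adjust `--sglang-speculative-num-draft-tokens` to match.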

> ⚠️ **Note**: Speculative decoding requires additional GPU memory. If you encounter OOM issues, try reducing `--sglang-mem-fraction-static` or disabling speculative decoding.

#### MTP Training

slime also supports training MTP layers jointly with the main model for models that have MTP weight conversion implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```

- `--mtp-num-layers 1`: Tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: Enables gradient computation for MTP layers. Without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: Weight of the MTP loss relative to the main policy loss. Default is 0.2.
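Conceptually, the scaling factor mixes the auxiliary MTP objective into the policy loss. In illustrative notation (the symbol names are ours, not slime's):

```latex
% Joint training objective with the MTP auxiliary loss (illustrative)
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{policy}}
  + \lambda_{\text{MTP}} \, \mathcal{L}_{\text{MTP}},
\qquad \lambda_{\text{MTP}} = 0.2
```

A small \(\lambda_{\text{MTP}}\) keeps the MTP head useful for drafting without letting its loss dominate the policy update.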

> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (`# TODO: mtp` in upstream mbridge). You can still use MTP for speculative decoding during inference; SGLang handles MTP layers internally.
>
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh` as a reference.

### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```

Key modifications for multi-node:

- Place the model and data on a path accessible by all nodes.
- Set `MASTER_ADDR` to an address accessible by all nodes.
- Remove the CPU Adam configuration (the distributed optimizer shards optimizer states across more GPUs, so per-GPU memory usage drops).
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.
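Mirroring the parallelism suggested above, the Megatron side of a multi-node run might look like the following sketch (the values come from the list above; `PERF_ARGS` is the same array used in the single-node script, and the exact combination should be validated against your cluster):

```shell
# Hypothetical multi-node parallel layout: TP=4, PP=2, CP=2, EP=8
PERF_ARGS=(
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 2
    --context-parallel-size 2
    --expert-model-parallel-size 8
    --expert-tensor-parallel-size 1
)
```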

When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, in a 24-GPU scenario:

```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```