# GLM-4.7-Flash with 8×H100

## Environment Preparation

The environment setup, data, and checkpoint conversion are the same as for the Qwen3-4B model. You can refer to [Example: Qwen3-4B Model](qwen3-4B.md), replacing mentions of Qwen3-4B with GLM-4.7-Flash.

### Download Model

```bash
hf download THUDM/GLM-4.7-Flash --local-dir /root/GLM-4.7-Flash
```

### Convert Checkpoint

To convert the Hugging Face checkpoint to torch_dist format:

```bash
cd /root/slime
pip install -e . --no-deps
source scripts/models/glm4.7-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    "${MODEL_ARGS[@]}" \
    --hf-checkpoint /root/GLM-4.7-Flash/ \
    --save /root/GLM-4.7-Flash_torch_dist/
```

## Run Training

Execute the training script:

```bash
cd /root/slime
bash scripts/run-glm4.7-30B-A3B-8gpus.sh
```

### Parameter Introduction

This section briefly introduces the key parts of the [run-glm4.7-30B-A3B-8gpus.sh](https://github.com/THUDM/slime/blob/main/scripts/run-glm4.7-30B-A3B-8gpus.sh) script.

#### MoE Configuration

GLM-4.7-Flash is a Mixture-of-Experts (MoE) model with 64 routed experts (top-4 activation) and 1 shared expert. It has 47 layers: 1 dense layer + 46 MoE layers.

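As a quick back-of-the-envelope check of this layout (plain arithmetic, not tied to any script), each MoE layer activates only a small fraction of its experts per token:

```shell
# Experts active per token in one MoE layer: top-4 routed + 1 always-on shared
ROUTED_TOPK=4
SHARED=1
ACTIVE=$(( ROUTED_TOPK + SHARED ))
echo "$ACTIVE of $(( 64 + SHARED )) experts active per token"
```

This sparsity is why the model trains and serves far more cheaply than a dense model of the same total parameter count.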
1. To support running GLM-4.7-Flash on 8×H100, we need to enable Megatron's CPU Adam to save GPU memory:

    ```bash
    OPTIMIZER_ARGS=(
        ...
        --optimizer-cpu-offload
        --overlap-cpu-optimizer-d2h-h2d
        --use-precision-aware-optimizer
    )
    ```

2. Enable MoE optimization in Megatron. For single-node 8×H100, we use TP=1, EP=8:

    ```bash
    PERF_ARGS=(
        --tensor-model-parallel-size 1
        --pipeline-model-parallel-size 1
        --context-parallel-size 1
        --expert-model-parallel-size 8
        --expert-tensor-parallel-size 1
        ...
    )
    ```

3. Enable MoE optimization in SGLang with DP attention:

    ```bash
    SGLANG_ARGS=(
        --rollout-num-gpus-per-engine 8
        --sglang-mem-fraction-static 0.7
        --sglang-enable-dp-attention
        --sglang-dp-size 8
        --sglang-enable-dp-lm-head
        --sglang-moe-dense-tp-size 1
        ...
    )
    ```
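With `--expert-model-parallel-size 8`, the 64 routed experts are sharded evenly across the 8 ranks. A quick check of the arithmetic (not part of the script):

```shell
# Expert parallelism: 64 routed experts sharded over EP=8 ranks
NUM_EXPERTS=64
EP_SIZE=8
EXPERTS_PER_RANK=$(( NUM_EXPERTS / EP_SIZE ))
echo "$EXPERTS_PER_RANK routed experts hosted per GPU"
```

Each H100 thus holds the weights of 8 routed experts (the shared expert is typically replicated on every rank), which, together with CPU Adam offload, is what makes the single-node fit feasible.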

#### MTP Speculative Decoding (Inference Acceleration)

GLM-4.7-Flash includes 1 MTP (Multi-Token Prediction) layer, which can be used for speculative decoding during inference to speed up rollout generation. To enable this, add the following to `SGLANG_ARGS`:

```bash
SGLANG_ARGS=(
    ...
    # MTP speculative decoding (EAGLE)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 2
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 3
)
```

This enables SGLang to use the model's MTP layer as a draft model for EAGLE-style speculative decoding. The MTP layer predicts multiple future tokens, and SGLang verifies them in parallel, leading to faster generation.
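The three speculative values are not independent: for a chain-style draft (`topk 1`), the draft-token budget is typically `num_steps × topk + 1`, with the extra slot covering the root token. This is an assumption about the usual EAGLE configuration rather than something the script enforces, but the numbers above are consistent with it:

```shell
# EAGLE draft-token budget: steps * topk + 1 root token (assumed relationship)
STEPS=2
TOPK=1
DRAFT_TOKENS=$(( STEPS * TOPK + 1 ))
echo "$DRAFT_TOKENS"
```

If you raise `--sglang-speculative-num-steps`, adjust `--sglang-speculative-num-draft-tokens` to match.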

> ⚠️ **Note**: Speculative decoding requires additional GPU memory. If you encounter OOM issues, try reducing `--sglang-mem-fraction-static` or disabling speculative decoding.

#### MTP Training

slime also supports training MTP layers jointly with the main model for models that have MTP weight conversion implemented (e.g., MiMo, GLM-4.5). When enabled, the relevant arguments are:

```bash
# Add MTP layer count to model config
MODEL_ARGS+=(--mtp-num-layers 1)

# Enable MTP training
SPEC_ARGS=(
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
)
```

- `--mtp-num-layers 1`: Tells Megatron to load the MTP layer from the checkpoint.
- `--enable-mtp-training`: Enables gradient computation for MTP layers. Without this flag, the MTP layer is loaded but frozen.
- `--mtp-loss-scaling-factor 0.2`: Weight of the MTP loss relative to the main policy loss. Default is 0.2.
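Conceptually, the scaling factor mixes the auxiliary MTP objective into the policy loss. In illustrative notation (the symbol names are ours, not slime's):

```latex
% Joint training objective with the MTP auxiliary loss (illustrative)
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{policy}}
  + \lambda_{\text{MTP}} \, \mathcal{L}_{\text{MTP}},
\qquad \lambda_{\text{MTP}} = 0.2
```

A small \(\lambda_{\text{MTP}}\) keeps the MTP head useful for drafting without letting its loss dominate the policy update.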

> ⚠️ **Note**: MTP training for GLM-4.7-Flash is not yet supported because the deepseek_v3 checkpoint bridge does not include MTP weight conversion (`# TODO: mtp` in upstream mbridge). You can still use MTP for speculative decoding during inference; SGLang handles MTP layers internally.
>
> For models with full MTP training support (e.g., MiMo), see `scripts/run-mimo-7B-rl-eagle.sh` as a reference.

### Multi-Node Support

For multi-node training (e.g., 2×8 H100), use the multi-node script:

```bash
cd /root/slime
export BASE_DIR=/shared/path # accessible by all nodes
bash scripts/run-glm4.7-30B-A3B.sh
```

Key modifications for multi-node:

- Place the model and data on a path accessible by all nodes.
- Set `MASTER_ADDR` to an address accessible by all nodes.
- Remove the CPU Adam configuration (the distributed optimizer shards optimizer states across more GPUs, so per-GPU memory usage drops).
- Adjust parallelism: e.g., TP=4, PP=2, EP=8, CP=2.
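Mirroring the parallelism suggested above, the Megatron side of a multi-node run might look like the following sketch (the values come from the list above; `PERF_ARGS` is the same array used in the single-node script, and the exact combination should be validated against your cluster):

```shell
# Hypothetical multi-node parallel layout: TP=4, PP=2, CP=2, EP=8
PERF_ARGS=(
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 2
    --context-parallel-size 2
    --expert-model-parallel-size 8
    --expert-tensor-parallel-size 1
)
```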

When the total number of GPUs is neither a multiple nor a divisor of the total number of experts (64), you can use `--sglang-ep-num-redundant-experts` to add redundant experts. For example, in a 24-GPU scenario:

```bash
SGLANG_ARGS=(
    --rollout-num-gpus-per-engine 24
    --sglang-mem-fraction-static 0.7
    --sglang-ep-size 24
    --sglang-enable-dp-attention
    --sglang-dp-size 3
    --sglang-moe-dense-tp-size 1
    --sglang-enable-dp-lm-head
    --sglang-ep-num-redundant-experts 16
)
```