Commit d64af71
Author: Copilot
[docker] fix int4 qat for upgraded sglang
1 parent 55828f3 commit d64af71

File tree

8 files changed: +110 -183 lines changed


docker/patch/latest/sglang.patch

Lines changed: 25 additions & 1 deletion
```diff
@@ -973,7 +973,7 @@ index 00bd68755..5a3ca8a67 100644
 
     def get_routed_experts(
 diff --git a/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py b/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py
-index 4cbfed6f9..cd6c825f6 100644
+index 4cbfed6f9..88b452744 100644
 --- a/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py
 +++ b/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py
 @@ -499,7 +499,7 @@ class CompressedTensorsConfig(QuantizationConfig):
@@ -985,6 +985,16 @@ index 4cbfed6f9..cd6c825f6 100644
 
     def _is_mxint4a16(self, weight_quant: BaseModel, input_quant: BaseModel) -> bool:
         input_quant_none = input_quant is None
+@@ -968,6 +968,9 @@ class CompressedTensorsFusedMoEMethod(FusedMoEMethodBase):
+     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+         layer.scheme.process_weights_after_loading(layer)
+ 
++    def restore_weights_before_loading(self, layer: torch.nn.Module) -> None:
++        layer.scheme.restore_weights_before_loading(layer)
++
+     def create_weights(
+         self,
+         layer: torch.nn.Module,
 diff --git a/python/sglang/srt/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16_moe.py b/python/sglang/srt/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16_moe.py
 index 6264f36d0..bef31a374 100644
 --- a/python/sglang/srt/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16_moe.py
@@ -2438,3 +2448,17 @@ index 4636128fa..a9b61df39 100644
 }
 
 DENY_CLASSES = {
+diff --git a/python/sglang/srt/utils/weight_checker.py b/python/sglang/srt/utils/weight_checker.py
+index 3be16446e..1b2371c83 100644
+--- a/python/sglang/srt/utils/weight_checker.py
++++ b/python/sglang/srt/utils/weight_checker.py
+@@ -69,6 +69,9 @@ def _check_tensors(
+         actual_should_compare,
+         actual,
+     ) in zip(expect_tensors, actual_tensors, strict=True):
++        if ".cos_sin_cache" in expect_name:
++            # skip cos/sin cache, which is deterministic from shape and dtype and may have different shapes due to different implementations
++            continue
+         assert expect_name == actual_name, f"{expect_name=} {actual_name=}"
+         assert (
+             expect_should_compare == actual_should_compare
```

docs/en/developer_guide/debug.md

Lines changed: 42 additions & 0 deletions
@@ -50,6 +50,48 @@ Specifically, slime currently provides the following parameters for separate deb

When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)`, and SGLang will not be initialized (automatically setting `debug_train_only=True`). This lets you fix the input to the training part while tuning it, for example by switching between different parallelization strategies.

## INT4 / Compressed-Tensors Quantization Checkpoint Issues

When using INT4-quantized models (e.g., `compressed-tensors` with `W4A16`), the checkpoint's `config.json` contains a `quantization_config.ignore` list that specifies which parameters should **not** be quantized. During online weight updates (Megatron → SGLang), slime also reads this ignore list to decide which parameters to INT4-quantize. An incorrect ignore list can cause silent errors:

1. **MoE router weights (`mlp.gate.weight`) become all zeros**

   The MoE router weight (`mlp.gate.weight`, shape `[num_experts, hidden_size]`) is a plain 2D weight tensor, but it is **not** a Linear layer weight. If it is not in the ignore list, the online quantizer will INT4-quantize it into `weight_packed`, `weight_scale`, `weight_zero_point`, etc. However, SGLang does not expect quantized names for the router, so these parameters are silently skipped during `load_weights`, resulting in all-zero gate weights.

   **Fix**: Ensure `config.json` contains `"re:.*mlp\\.gate\\..*"` in the ignore list.

2. **Other non-Linear 2D weights**

   Similar issues can occur with any 2D `.weight` tensor that is not a true Linear layer, such as `model.embed_tokens.weight`. Always verify that the ignore list covers all non-Linear weights.

   **Recommended ignore patterns** (for GLM-style MoE models):

   ```json
   "ignore": [
       "lm_head",
       "model.embed_tokens.weight",
       "re:.*self_attn.*",
       "re:.*mlp\\.shared_experts.*",
       "re:.*mlp\\.gate_up_proj.*",
       "re:.*mlp\\.gate_proj.*",
       "re:.*mlp\\.up_proj.*",
       "re:.*mlp\\.down_proj.*",
       "re:.*eh_proj.*",
       "re:.*mlp\\.gate\\..*"
   ]
   ```
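An ignore list like the one above can be sanity-checked before conversion by testing parameter names against the rules directly. A minimal sketch (`is_ignored` is a hypothetical helper, not part of slime; it assumes `re:`-prefixed entries are regular expressions and other entries match as plain substrings):

```python
import re

def is_ignored(param_name: str, ignore: list[str]) -> bool:
    """Return True if a compressed-tensors style ignore list covers param_name."""
    for rule in ignore:
        if rule.startswith("re:"):
            # Entries prefixed with "re:" are regex patterns.
            if re.match(rule[len("re:"):], param_name):
                return True
        elif rule in param_name:
            # Other entries are treated as plain substring matches here.
            return True
    return False

ignore = [
    "lm_head",
    "model.embed_tokens.weight",
    "re:.*self_attn.*",
    "re:.*mlp\\.gate\\..*",
]

# The router weight is covered only because of the last regex entry.
print(is_ignored("model.layers.3.mlp.gate.weight", ignore))              # True
print(is_ignored("model.layers.3.mlp.experts.0.down_proj.weight", ignore))  # False
```

If the router weight comes back `False` against your checkpoint's actual ignore list, it will be quantized and then silently dropped during `load_weights` as described above.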

3. **Missing safetensors shards**

   Conversion tools may occasionally produce an incomplete checkpoint (e.g., a missing `model-00010-of-00093.safetensors`). After conversion, always verify:

   - The number of `.safetensors` files matches the expected count.
   - The `model.safetensors.index.json` contains entries for every layer.
   - Spot-check that critical layers (e.g., the first MoE layer) have the expected number of keys.
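The shard checks above can be scripted. A minimal sketch assuming the standard Hugging Face layout (an index file with a `weight_map` and shards named `model-XXXXX-of-YYYYY.safetensors`); `verify_checkpoint` is a hypothetical helper, not a slime API:

```python
import json
import re
from pathlib import Path

def verify_checkpoint(ckpt_dir: str) -> None:
    """Check a sharded safetensors checkpoint for missing shard files."""
    ckpt = Path(ckpt_dir)
    index = json.loads((ckpt / "model.safetensors.index.json").read_text())
    # Every shard file the index references must exist on disk.
    referenced = set(index["weight_map"].values())
    present = {p.name for p in ckpt.glob("*.safetensors")}
    missing = referenced - present
    assert not missing, f"missing shards: {sorted(missing)}"

    # Cross-check the shard count against the "-of-NNNNN" suffix.
    totals = {int(m.group(1)) for name in referenced
              if (m := re.search(r"-of-(\d+)\.safetensors$", name))}
    if totals:
        expected = totals.pop()
        assert len(present) >= expected, (
            f"expected {expected} shards, found {len(present)}"
        )
```

Checking per-layer key counts (the third bullet) requires model-specific knowledge and is left to a manual spot check.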

4. **How to diagnose**

   - Use `--check-weight-update-equal` to verify that weights after a Megatron → SGLang sync match the expected values. If a parameter shows all zeros on the SGLang side, it was likely incorrectly quantized or missing from the checkpoint.
   - Use `--debug-rollout-only` with a small number of GPUs to quickly test whether SGLang can generate coherent text from the quantized checkpoint alone.
## Debug sglang illegal memory access (IMA)

When running large-scale RL, we occasionally hit illegal memory access (IMA) errors in SGLang; here are some debugging suggestions based on our experience:

docs/zh/developer_guide/debug.md

Lines changed: 42 additions & 0 deletions
@@ -48,6 +48,48 @@ slime supports debugging the training part and the rollout part separately, enabling:

When enabled, data will be loaded from `args.load_debug_rollout_data.format(rollout_id=rollout_id)` and sglang will not be initialized (automatically setting `debug_train_only=True`). This way you can fix the input to the training part and tune it, for example by switching between parallelization strategies.

## INT4 / Compressed-Tensors Quantization Checkpoint Issues

When using an INT4-quantized model (e.g., `compressed-tensors` with `W4A16`), the checkpoint's `config.json` contains a `quantization_config.ignore` list that specifies which parameters should **not** be quantized. During online weight updates (Megatron → SGLang), slime also reads this ignore list to decide which parameters need INT4 quantization. An incorrect ignore list causes silent errors:

1. **MoE router weights (`mlp.gate.weight`) become all zeros**

   The MoE router weight (`mlp.gate.weight`, shape `[num_experts, hidden_size]`) is an ordinary 2D weight tensor, but it is **not** a Linear layer weight. If it is not in the ignore list, the online quantizer will INT4-quantize it into `weight_packed`, `weight_scale`, `weight_zero_point`, etc. However, SGLang does not load the router weight under these quantized names, so the parameters are silently skipped during `load_weights`, leaving the gate weights all zero.

   **Fix**: make sure the ignore list in `config.json` contains `"re:.*mlp\\.gate\\..*"`.

2. **Other non-Linear 2D weights**

   Similar issues can occur with any 2D `.weight` tensor that is not a true Linear layer, e.g. `model.embed_tokens.weight`. Always check that the ignore list covers all non-Linear weights.

   **Recommended ignore configuration** (using GLM-family MoE models as an example):

   ```json
   "ignore": [
       "lm_head",
       "model.embed_tokens.weight",
       "re:.*self_attn.*",
       "re:.*mlp\\.shared_experts.*",
       "re:.*mlp\\.gate_up_proj.*",
       "re:.*mlp\\.gate_proj.*",
       "re:.*mlp\\.up_proj.*",
       "re:.*mlp\\.down_proj.*",
       "re:.*eh_proj.*",
       "re:.*mlp\\.gate\\..*"
   ]
   ```

3. **Missing safetensors shards**

   Conversion tools may occasionally produce an incomplete checkpoint (e.g., a missing `model-00010-of-00093.safetensors`). After conversion, always check:

   - whether the number of `.safetensors` files matches the expected count;
   - whether `model.safetensors.index.json` contains entries for every layer;
   - by spot check, whether key layers (e.g., the first MoE layer) have the correct number of keys.

4. **How to diagnose**

   - Use `--check-weight-update-equal` to verify that values are correct after a Megatron → SGLang weight sync. If a parameter is all zeros on the SGLang side, it may have been quantized incorrectly or be missing from the checkpoint.
   - Use `--debug-rollout-only` with a small number of GPUs to quickly test whether SGLang can generate text normally from the quantized checkpoint alone.

## Debug sglang illegal memory access (IMA)

When running large-scale RL, we occasionally hit SGLang IMA issues; here are some of our debugging suggestions:

slime/utils/debug_utils/__init__.py

Whitespace-only changes.

slime/utils/debug_utils/display_debug_rollout_data.py

Lines changed: 0 additions & 73 deletions
This file was deleted.

slime/utils/debug_utils/replay_reward_fn.py

Lines changed: 0 additions & 50 deletions
This file was deleted.

slime/utils/debug_utils/send_to_sglang.py

Lines changed: 0 additions & 58 deletions
This file was deleted.

tools/convert_hf_to_int4_direct.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -283,7 +283,7 @@ def parse_args():
     parser.add_argument("--model-dir", type=str, required=True, help="local BF16 path")
     parser.add_argument("--save-dir", type=str, required=True)
     parser.add_argument("--group-size", type=int, default=32, help="Group Size")
-    parser.add_argument("--is-symmetric", type=bool, default=True, help="Is Symmetric")
+    parser.add_argument("--is-symmetric", action="store_true", help="Whether to use symmetric quantization")
     parser.add_argument(
         "--ignore-rules",
         nargs="+",
```
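The one-line change above fixes a common argparse pitfall: `type=bool` applies `bool()` to the raw command-line string, and any non-empty string (including `"False"`) is truthy, so the flag could never actually be turned off. A small demonstration:

```python
import argparse

# Broken variant: bool("False") is True, so the value is always True.
broken = argparse.ArgumentParser()
broken.add_argument("--is-symmetric", type=bool, default=True)
print(broken.parse_args(["--is-symmetric", "False"]).is_symmetric)  # True (surprising)

# Fixed variant: a real on/off flag.
fixed = argparse.ArgumentParser()
fixed.add_argument("--is-symmetric", action="store_true")
print(fixed.parse_args([]).is_symmetric)                  # False
print(fixed.parse_args(["--is-symmetric"]).is_symmetric)  # True
```

Note that `action="store_true"` defaults to `False`, so the flag's effective default flips from the old `default=True`; callers who relied on symmetric quantization by default now need to pass `--is-symmetric` explicitly.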

0 commit comments