
Commit 475cc1d

Authored by Ankur-singh, github-actions[bot], and hshrivastava-droid

[NVIDIA] Update GLM-5 NVFP4 B200 SGLang config (#1011)

* Update GLM-5 NVFP4 B200 SGLang config and benchmark script. Add tp4 ep1 conc-128 search-space entry for both 1k1k and 8k1k configs. Update benchmark script with new server launch flags: enable-dp-lm-head, disable-radix-cache, fp8_e4m3 kv-cache, NSA trtllm backends, flashinfer allreduce fusion, and tuned prefill/memory settings. Bump GLM-5 NVFP4 B200 tp4 concurrency to 256.
* Add perf-changelog entry for GLM-5 NVFP4 B200 SGLang config update.
* Update GLM-5 NVFP4 B200: tp8 conc=4, tp4 conc=4-256, cuda-graph-max-bs 256.
* Remove enable-dp-lm-head option from script.

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Ankur Singh <Ankur-singh@users.noreply.github.com>
Co-authored-by: hshrivastava-droid <hshrivastava@nvidia.com>
1 parent 49db200 commit 475cc1d

File tree

3 files changed: +27 −13 lines changed


.github/configs/nvidia-master.yaml

Lines changed: 4 additions & 2 deletions

```diff
@@ -1837,11 +1837,13 @@ glm5-fp4-b200-sglang:
     - isl: 1024
       osl: 1024
       search-space:
-        - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
+        - { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
+        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
     - isl: 8192
       osl: 1024
       search-space:
-        - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
+        - { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
+        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
 
 qwen3.5-fp8-b200-sglang-mtp:
   image: lmsysorg/sglang:v0.5.9-cu130
```
benchmarks/single_node/glm5_fp4_b200.sh

Lines changed: 14 additions & 11 deletions

```diff
@@ -33,23 +33,26 @@ fi
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
 
-# following https://huggingface.co/nvidia/GLM-5-NVFP4#usage recipe
-# except using latest nightly at the time of writing
-# since the recommended nightly image in that recipe doesn't exist.
-
 set -x
 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
     --trust-remote-code \
     --tensor-parallel-size=$TP \
-    --data-parallel-size 1 --expert-parallel-size 1 \
-    --tool-call-parser glm47 \
-    --reasoning-parser glm45 \
+    --data-parallel-size 1 --expert-parallel-size $EP_SIZE \
+    --disable-radix-cache \
     --quantization modelopt_fp4 \
-    --cuda-graph-max-bs $CONC --max-running-requests $CONC \
-    --mem-fraction-static 0.80 \
-    --chunked-prefill-size 131072 \
+    --kv-cache-dtype fp8_e4m3 \
+    --nsa-decode-backend trtllm \
+    --nsa-prefill-backend trtllm \
+    --moe-runner-backend flashinfer_trtllm \
+    --enable-flashinfer-allreduce-fusion \
+    --cuda-graph-max-bs 256 \
+    --max-prefill-tokens 32768 \
+    --chunked-prefill-size 32768 \
+    --mem-fraction-static 0.9 \
     --stream-interval 30 \
-    --model-loader-extra-config '{"enable_multithread_load": true}' $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+    --scheduler-recv-interval 10 \
+    --tokenizer-worker-num 6 \
+    --tokenizer-path $MODEL $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
```
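For readability, the server launch line after this commit can be reassembled from the diff above as a single sketch. This is a dry-run illustration, not the script itself: the `MODEL`, `PORT`, `TP`, and `EP_SIZE` values below are assumed placeholders (the real script derives them from its benchmark harness), and `$EVAL_CONTEXT_ARGS` and log redirection are omitted.

```shell
#!/usr/bin/env bash
# Hedged sketch of the post-commit launch command, reassembled from the diff.
# MODEL, PORT, TP, and EP_SIZE are illustrative placeholders.
MODEL="nvidia/GLM-5-NVFP4"   # assumed placeholder
PORT=30000                   # assumed placeholder
TP=4                         # matches the new tp4 search-space entry
EP_SIZE=1

ARGS=(
  --model-path="$MODEL" --host=0.0.0.0 --port="$PORT"
  --trust-remote-code
  --tensor-parallel-size="$TP"
  --data-parallel-size 1 --expert-parallel-size "$EP_SIZE"
  --disable-radix-cache
  --quantization modelopt_fp4
  --kv-cache-dtype fp8_e4m3
  --nsa-decode-backend trtllm
  --nsa-prefill-backend trtllm
  --moe-runner-backend flashinfer_trtllm
  --enable-flashinfer-allreduce-fusion
  --cuda-graph-max-bs 256
  --max-prefill-tokens 32768
  --chunked-prefill-size 32768
  --mem-fraction-static 0.9
  --stream-interval 30
  --scheduler-recv-interval 10
  --tokenizer-worker-num 6
  --tokenizer-path "$MODEL"
)

# Dry run: print the assembled command instead of launching the server.
echo PYTHONNOUSERSITE=1 python3 -m sglang.launch_server "${ARGS[@]}"
```

Collecting the flags into an array makes it easy to diff this invocation against the pre-commit one: the parser flags (`--tool-call-parser`, `--reasoning-parser`) are gone, and the prefill/memory knobs are pinned rather than tied to `$CONC`.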
perf-changelog.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -1307,3 +1307,12 @@
     - "Model: nvidia/Qwen3.5-397B-A17B-NVFP4"
     - "Configs: 1k1k (TP4 conc 4-128), 8k1k (TP4 conc 4-128)"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/820
+
+- config-keys:
+    - glm5-fp4-b200-sglang
+  description:
+    - "Update GLM-5 NVFP4 B200 SGLang benchmark script with optimized launch parameters"
+    - "Add TP4 search space with higher concurrency (128-256) for 1k1k and 8k1k configs"
+    - "Enable FP8 E4M3 KV cache, NSA backends (trtllm), flashinfer allreduce fusion, MoE flashinfer_trtllm runner"
+    - "Tune mem-fraction-static to 0.9, chunked-prefill-size to 32768, add tokenizer-worker-num 6"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1011
```
