
Commit 475cc1d

Authored by Ankur-singh, github-actions[bot], and hshrivastava-droid

[NVIDIA] Update GLM-5 NVFP4 B200 SGLang config (#1011)

* Update GLM-5 NVFP4 B200 SGLang config and benchmark script. Add tp4 ep1 conc-128 search-space entry for both 1k1k and 8k1k configs. Update benchmark script with new server launch flags: enable-dp-lm-head, disable-radix-cache, fp8_e4m3 kv-cache, NSA trtllm backends, flashinfer allreduce fusion, and tuned prefill/memory settings. Bump GLM-5 NVFP4 B200 tp4 concurrency to 256.
* Add perf-changelog entry for GLM-5 NVFP4 B200 SGLang config update.
* Update GLM-5 NVFP4 B200: tp8 conc=4, tp4 conc=4-256, cuda-graph-max-bs 256.
* Remove enable-dp-lm-head option from script.

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Ankur Singh <Ankur-singh@users.noreply.github.com>
Co-authored-by: hshrivastava-droid <hshrivastava@nvidia.com>
1 parent 49db200 commit 475cc1d

File tree

3 files changed: +27 −13 lines changed


.github/configs/nvidia-master.yaml

Lines changed: 4 additions & 2 deletions

```diff
@@ -1837,11 +1837,13 @@ glm5-fp4-b200-sglang:
     - isl: 1024
       osl: 1024
       search-space:
-        - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
+        - { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
+        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
     - isl: 8192
       osl: 1024
       search-space:
-        - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
+        - { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
+        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
 
 qwen3.5-fp8-b200-sglang-mtp:
   image: lmsysorg/sglang:v0.5.9-cu130
```
benchmarks/single_node/glm5_fp4_b200.sh

Lines changed: 14 additions & 11 deletions

```diff
@@ -33,23 +33,26 @@ fi
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
 
-# following https://huggingface.co/nvidia/GLM-5-NVFP4#usage recipe
-# except using latest nightly at the time of writing
-# since the recommended nightly image in that recipe doesn't exist.
-
 set -x
 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
     --trust-remote-code \
     --tensor-parallel-size=$TP \
-    --data-parallel-size 1 --expert-parallel-size 1 \
-    --tool-call-parser glm47 \
-    --reasoning-parser glm45 \
+    --data-parallel-size 1 --expert-parallel-size $EP_SIZE \
+    --disable-radix-cache \
     --quantization modelopt_fp4 \
-    --cuda-graph-max-bs $CONC --max-running-requests $CONC \
-    --mem-fraction-static 0.80 \
-    --chunked-prefill-size 131072 \
+    --kv-cache-dtype fp8_e4m3 \
+    --nsa-decode-backend trtllm \
+    --nsa-prefill-backend trtllm \
+    --moe-runner-backend flashinfer_trtllm \
+    --enable-flashinfer-allreduce-fusion \
+    --cuda-graph-max-bs 256 \
+    --max-prefill-tokens 32768 \
+    --chunked-prefill-size 32768 \
+    --mem-fraction-static 0.9 \
     --stream-interval 30 \
-    --model-loader-extra-config '{"enable_multithread_load": true}' $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+    --scheduler-recv-interval 10 \
+    --tokenizer-worker-num 6 \
+    --tokenizer-path $MODEL $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
```
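For readability, the server launch line after this commit can be reassembled from the diff above as a single sketch. This is a dry-run illustration, not the script itself: the `MODEL`, `PORT`, `TP`, and `EP_SIZE` values below are assumed placeholders (the real script derives them from its benchmark harness), and `$EVAL_CONTEXT_ARGS` and log redirection are omitted.

```shell
#!/usr/bin/env bash
# Hedged sketch of the post-commit launch command, reassembled from the diff.
# MODEL, PORT, TP, and EP_SIZE are illustrative placeholders.
MODEL="nvidia/GLM-5-NVFP4"   # assumed placeholder
PORT=30000                   # assumed placeholder
TP=4                         # matches the new tp4 search-space entry
EP_SIZE=1

ARGS=(
  --model-path="$MODEL" --host=0.0.0.0 --port="$PORT"
  --trust-remote-code
  --tensor-parallel-size="$TP"
  --data-parallel-size 1 --expert-parallel-size "$EP_SIZE"
  --disable-radix-cache
  --quantization modelopt_fp4
  --kv-cache-dtype fp8_e4m3
  --nsa-decode-backend trtllm
  --nsa-prefill-backend trtllm
  --moe-runner-backend flashinfer_trtllm
  --enable-flashinfer-allreduce-fusion
  --cuda-graph-max-bs 256
  --max-prefill-tokens 32768
  --chunked-prefill-size 32768
  --mem-fraction-static 0.9
  --stream-interval 30
  --scheduler-recv-interval 10
  --tokenizer-worker-num 6
  --tokenizer-path "$MODEL"
)

# Dry run: print the assembled command instead of launching the server.
echo PYTHONNOUSERSITE=1 python3 -m sglang.launch_server "${ARGS[@]}"
```

Collecting the flags into an array makes it easy to diff this invocation against the pre-commit one: the parser flags (`--tool-call-parser`, `--reasoning-parser`) are gone, and the prefill/memory knobs are pinned rather than tied to `$CONC`.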
perf-changelog.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -1307,3 +1307,12 @@
     - "Model: nvidia/Qwen3.5-397B-A17B-NVFP4"
     - "Configs: 1k1k (TP4 conc 4-128), 8k1k (TP4 conc 4-128)"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/820
+
+- config-keys:
+    - glm5-fp4-b200-sglang
+  description:
+    - "Update GLM-5 NVFP4 B200 SGLang benchmark script with optimized launch parameters"
+    - "Add TP4 search space with higher concurrency (128-256) for 1k1k and 8k1k configs"
+    - "Enable FP8 E4M3 KV cache, NSA backends (trtllm), flashinfer allreduce fusion, MoE flashinfer_trtllm runner"
+    - "Tune mem-fraction-static to 0.9, chunked-prefill-size to 32768, add tokenizer-worker-num 6"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1011
```
