
Commit 4b655dc

[AMD/ROCm] qwen3.5 mxfp4 support on mi355x + sglang (#1006)
* qwen3.5 mxfp4 on mi355x
* ROCM qwen3.5 fp4 support, sglang
* Update perf-changelog.yaml

---------

Signed-off-by: seungrokj <seungrok.jung@amd.com>
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
1 parent 475cc1d commit 4b655dc


4 files changed (+95, -1 lines)


.github/configs/amd-master.yaml

Lines changed: 18 additions & 0 deletions
```diff
@@ -203,6 +203,24 @@ qwen3.5-fp8-mi355x-sglang:
     search-space:
     - { tp: 8, conc-start: 4, conc-end: 64 }
 
+qwen3.5-fp4-mi355x-sglang:
+  image: lmsysorg/sglang:v0.5.10-rocm720-mi35x
+  model: amd/Qwen3.5-397B-A17B-MXFP4
+  model-prefix: qwen3.5
+  runner: mi355x
+  precision: fp4
+  framework: sglang
+  multinode: false
+  seq-len-configs:
+  - isl: 1024
+    osl: 1024
+    search-space:
+    - { tp: 4, conc-start: 4, conc-end: 256 }
+  - isl: 8192
+    osl: 1024
+    search-space:
+    - { tp: 4, conc-start: 4, conc-end: 256 }
+
 qwen3.5-fp8-mi300x-sglang:
   image: lmsysorg/sglang:v0.5.9-rocm720-mi30x
   model: Qwen/Qwen3.5-397B-A17B-FP8
```
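The new `qwen3.5-fp4-mi355x-sglang` entry runs TP=4 over two sequence-length configs (ISL 1024 and ISL 8192, both with OSL 1024), each sweeping concurrency from `conc-start: 4` to `conc-end: 256`. The sketch below enumerates the resulting benchmark points. It assumes the harness steps concurrency by doubling between the two bounds; that stepping rule is not part of this commit, and `expand_conc` and `sweep_points` are hypothetical helpers written only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand one search-space entry into concurrency levels.
# ASSUMPTION: the harness doubles concurrency from conc-start up to conc-end;
# the real stepping rule lives outside this commit.
expand_conc() {
  local start=$1 end=$2 c
  local -a levels=()
  # Guard against a non-positive start, which would never terminate when doubling.
  (( start > 0 )) || return 1
  for (( c = start; c <= end; c *= 2 )); do
    levels+=("$c")
  done
  echo "${levels[*]}"
}

# Print every (tp, isl, osl, conc) point for the qwen3.5-fp4-mi355x-sglang entry,
# using the values from the YAML above.
sweep_points() {
  local seq isl osl conc
  for seq in "1024 1024" "8192 1024"; do
    read -r isl osl <<< "$seq"
    for conc in $(expand_conc 4 256); do
      echo "tp=4 isl=$isl osl=$osl conc=$conc"
    done
  done
}

sweep_points
```

Under the doubling assumption, `expand_conc 4 256` yields `4 8 16 32 64 128 256`, so the entry would produce 7 concurrency levels per sequence-length config, 14 runs in total.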
Lines changed: 69 additions & 0 deletions
```diff
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+hf download "$MODEL"
+
+export SGLANG_USE_AITER=1
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+MEM_FRAC_STATIC=${MEM_FRAC_STATIC:-0.8}
+
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+fi
+
+# Start GPU monitoring (power, temperature, clocks every second)
+start_gpu_monitor
+
+set -x
+python3 -m sglang.launch_server --model-path=$MODEL --trust-remote-code \
+    --host=0.0.0.0 --port=$PORT \
+    --tensor-parallel-size=$TP \
+    --attention-backend aiter \
+    --mem-fraction-static $MEM_FRAC_STATIC \
+    --model-loader-extra-config '{"enable_multithread_load": true}' \
+    --watchdog-timeout 1200 \
+    --disable-radix-cache \
+    > $SERVER_LOG 2>&1 &
+
+SERVER_PID=$!
+
+# Wait for server to be ready
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" --sleep-interval 60
+
+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend vllm \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts "$((CONC * 10))" \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/
+
+# After throughput, run evaluation only if RUN_EVAL is true
+if [ "${RUN_EVAL}" = "true" ]; then
+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+# Stop GPU monitoring
+stop_gpu_monitor
+set +x
```
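The script fails fast through `check_env_vars` from `benchmark_lib.sh`, whose body is not part of this diff. The sketch below is a hypothetical stand-in (`check_env_vars_sketch`) showing one way such a guard can work in bash, using indirect expansion to look up each variable by name. It also reproduces the script's `--num-prompts "$((CONC * 10))"` arithmetic. The example values for `RANDOM_RANGE_RATIO` and `RESULT_FILENAME` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for check_env_vars (the real helper lives in
# benchmark_lib.sh and may behave differently). Returns non-zero and lists
# every required variable that is unset or empty.
check_env_vars_sketch() {
  local name
  local -a missing=()
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [[ -z "${!name:-}" ]]; then
      missing+=("$name")
    fi
  done
  if (( ${#missing[@]} > 0 )); then
    echo "Missing required env vars: ${missing[*]}" >&2
    return 1
  fi
}

# Illustrative values for one sweep point; RANDOM_RANGE_RATIO and
# RESULT_FILENAME are invented for this example.
export MODEL=amd/Qwen3.5-397B-A17B-MXFP4 TP=4 CONC=64 ISL=1024 OSL=1024
export RANDOM_RANGE_RATIO=0.8 RESULT_FILENAME=qwen35_fp4_tp4_conc64

if check_env_vars_sketch MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME; then
  # Same formula as the script: ten prompts per concurrent request slot.
  echo "num-prompts: $((CONC * 10))"   # prints "num-prompts: 640"
fi
```

Collecting every missing name before returning reports all gaps in one pass, rather than stopping at the first unset variable.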

perf-changelog.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -1316,3 +1316,9 @@
     - "Enable FP8 E4M3 KV cache, NSA backends (trtllm), flashinfer allreduce fusion, MoE flashinfer_trtllm runner"
     - "Tune mem-fraction-static to 0.9, chunked-prefill-size to 32768, add tokenizer-worker-num 6"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1011
+
+- config-keys:
+    - qwen3.5-fp4-mi355x-sglang
+  description:
+    - "Qwen3.5 fp4 support on SGL"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1006
```
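Each changelog entry refers to configs by key, here `qwen3.5-fp4-mi355x-sglang`, which this commit also adds as a top-level key in `.github/configs/amd-master.yaml`. A minimal sketch of a consistency check between the two files follows. `config_key_defined` is hypothetical and not part of the repository, and it assumes top-level keys sit alone on their line with no trailing whitespace or comment.

```shell
#!/usr/bin/env bash
# Hypothetical check (not in this commit): does a changelog config-key exist
# as a top-level key in a master config file?
# ASSUMPTION: top-level keys appear alone on their line, e.g. "qwen3.5-fp4-mi355x-sglang:".
config_key_defined() {
  local key=$1 config_file=$2
  # -F: fixed string, so the "." in "qwen3.5" is not treated as a regex wildcard.
  # -x: whole-line match, so indented nested keys such as "  image: ..." never match.
  grep -qxF -- "${key}:" "$config_file"
}

# Usage against the config file touched by this commit:
# config_key_defined qwen3.5-fp4-mi355x-sglang .github/configs/amd-master.yaml && echo "key found"
```

Matching a fixed, whole line keeps the check strict without pulling in a YAML parser; a real CI step would more likely parse both files properly.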

runners/launch_mi355x-amds.sh

Lines changed: 2 additions & 1 deletion
```diff
@@ -180,8 +180,9 @@ else
 "
 
 export VLLM_CACHE_ROOT="/it-share/gharunners/.cache/vllm"
+#--container-mount-home \
 
-if [[ "$FRAMEWORK" == "atom" ]]; then
+if [[ "$FRAMEWORK" == "atom" ]] || [[ "$FRAMEWORK" == "sglang" ]]; then
     SLRUM_HOME_MOUNT=""
 else
     SLRUM_HOME_MOUNT=" --container-mount-home "
```
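The runner change widens the condition that clears `SLRUM_HOME_MOUNT` (spelled that way in the script), so `sglang` jobs, like `atom` jobs, now get an empty value instead of `--container-mount-home`. If more frameworks need the same treatment, a `case` statement keeps the list in one place. The sketch below is an equivalent rewrite for illustration only, not the committed code; `home_mount_for` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Illustrative equivalent of the updated condition: frameworks listed in the
# first branch get an empty mount option; everything else gets the home mount.
home_mount_for() {
  case "$1" in
    atom|sglang) echo "" ;;
    *)           echo " --container-mount-home " ;;
  esac
}

FRAMEWORK=${FRAMEWORK:-sglang}
# Command substitution strips trailing newlines only, so the surrounding spaces survive.
SLRUM_HOME_MOUNT=$(home_mount_for "$FRAMEWORK")
echo "FRAMEWORK=$FRAMEWORK SLRUM_HOME_MOUNT='$SLRUM_HOME_MOUNT'"
```

Adding another framework then means extending the `atom|sglang` pattern rather than chaining another `||` test.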
