Commit 0ee5ff3

Optimize search space and upgrade image to v0.19.0 for MiniMax-M2.5 (#1003)

* Add TP2/EP2 for minimaxm2.5-fp8-mi355x-vllm: fewer GPUs means less inter-GPU communication overhead, and MoE expert parallelism across 2 GPUs is very efficient for this model.
* Optimize config for minimaxm2.5-fp8-mi355x-vllm
* Update perf-changelog for minimaxm2.5-fp8-mi355x-vllm
* Upgrade minimaxm2.5-fp8-mi355x-vllm image to v0.19.0; enable FP8 KV cache + AITER FA
* Optimize all-reduce
* Fix PR
* Fix the perf-changelog

Co-authored-by: zhutaoyu <zhutaoyu97@gmail.com>
1 parent bddbf40 commit 0ee5ff3

File tree: 3 files changed, +24 −7 lines

.github/configs/amd-master.yaml

Lines changed: 7 additions & 7 deletions

@@ -334,7 +334,7 @@ kimik2.5-fp4-mi355x-atom:
     - { tp: 4, conc-start: 4, conc-end: 128 }
 
 minimaxm2.5-fp8-mi355x-vllm:
-  image: vllm/vllm-openai-rocm:v0.18.0
+  image: vllm/vllm-openai-rocm:v0.19.0
   model: MiniMaxAI/MiniMax-M2.5
   model-prefix: minimaxm2.5
   runner: mi355x
@@ -345,15 +345,15 @@ minimaxm2.5-fp8-mi355x-vllm:
   - isl: 1024
     osl: 1024
     search-space:
-    - { tp: 2, conc-start: 4, conc-end: 64 }
-    - { tp: 4, conc-start: 4, conc-end: 64 }
-    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
+    - { tp: 2, ep: 2, conc-start: 2, conc-end: 512 }
+    - { tp: 4, ep: 4, conc-start: 4, conc-end: 256 }
+    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
   - isl: 8192
     osl: 1024
     search-space:
-    - { tp: 2, conc-start: 4, conc-end: 64 }
-    - { tp: 4, conc-start: 4, conc-end: 64 }
-    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
+    - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
+    - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
+    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
 
 minimaxm2.5-fp8-mi355x-atom:
   image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2

benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh

Lines changed: 3 additions & 0 deletions

@@ -25,6 +25,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
 fi
 
 export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
@@ -49,8 +50,10 @@ vllm serve $MODEL --port $PORT \
   $EP \
   --gpu-memory-utilization 0.95 \
   --max-model-len $MAX_MODEL_LEN \
+  --kv-cache-dtype fp8 \
   --block-size=32 \
   --no-enable-prefix-caching \
+  --attention-backend "ROCM_AITER_FA" \
   --trust-remote-code > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
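Taken together, the three additions produce a launch command along these lines. This is a sketch only: `$MODEL`, `$PORT`, and `$MAX_MODEL_LEN` come from the script's environment, and the `--tensor-parallel-size 2 --enable-expert-parallel` pair is an assumed expansion of the script's `$EP` for the new tp2/ep2 search-space point, not something the diff itself shows.

```shell
# Sketch of the assembled serve invocation after this commit.
# Assumption: $EP expands to TP2 + expert parallelism for the new tp2/ep2 point.
export VLLM_ROCM_USE_AITER=1                      # enable AITER kernels on ROCm
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4   # INT4 quick-reduce (the "optimize all reduce" change)

vllm serve "$MODEL" --port "$PORT" \
  --tensor-parallel-size 2 --enable-expert-parallel \
  --gpu-memory-utilization 0.95 \
  --max-model-len "$MAX_MODEL_LEN" \
  --kv-cache-dtype fp8 \
  --block-size=32 \
  --no-enable-prefix-caching \
  --attention-backend "ROCM_AITER_FA" \
  --trust-remote-code
```

The two additions that change kernels (FP8 KV cache and the AITER flash-attention backend) require the v0.19.0 image upgrade in the same commit; the quick-reduce quantization variable targets the small all-reduce payloads that remain once EP shrinks the TP group to 2.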

perf-changelog.yaml

Lines changed: 14 additions & 0 deletions

@@ -1244,4 +1244,18 @@
   - "Remove ISL 1024 / OSL 8192 seq-len config"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/947
 
+- config-keys:
+  - minimaxm2.5-fp8-mi355x-vllm
+  description:
+  - "Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space"
+  - "Add tp2 ep2 search-space entries (conc 2-256) for all seq-len configs"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1002
 
+- config-keys:
+  - minimaxm2.5-fp8-mi355x-vllm
+  description:
+  - "Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space"
+  - "Add tp2 ep2 search-space entries (conc 2-256) for all seq-len configs"
+  - "Upgrade vLLM image to v0.19.0"
+  - "Enable FP8 KV cache + AITER FA for minimaxm2.5-fp8-mi355x-vllm"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1003