Commit 79ea365

Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 (#935)
* Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200
* Update perf-changelog.yaml

Co-authored-by: ankursingh-nv <ankusingh@nvidia.com>
1 parent 6a5dad4 commit 79ea365

File tree: 2 files changed (+10 −1 lines)


benchmarks/single_node/kimik2.5_int4_b200.sh

Lines changed: 3 additions & 1 deletion

```diff
@@ -21,6 +21,7 @@ hf download "$MODEL"
 nvidia-smi
 
 export PYTHONNOUSERSITE=1
+export VLLM_USE_FLASHINFER_MOE_INT4=1
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
@@ -38,7 +39,8 @@ vllm serve $MODEL --host 0.0.0.0 --port $PORT \
   --tool-call-parser kimi_k2 \
   --compilation_config.pass_config.fuse_allreduce_rms true \
   --trust-remote-code \
-  --disable-log-requests > $SERVER_LOG 2>&1 &
+  --disable-log-requests \
+  --no-enable-prefix-caching > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
```

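The first hunk exports VLLM_USE_FLASHINFER_MOE_INT4=1 before `vllm serve` launches, so the server process inherits the flag from its environment. A minimal sketch of that env-gating pattern (the conditional check is illustrative, not taken from the benchmark script):

```shell
#!/bin/sh
# Export the flag before starting the server: a child process such as
# `vllm serve` only sees variables that were exported beforehand.
export VLLM_USE_FLASHINFER_MOE_INT4=1

# Illustrative guard (not from the repo): confirm the flag is set,
# defaulting to "0" when the variable is absent.
if [ "${VLLM_USE_FLASHINFER_MOE_INT4:-0}" = "1" ]; then
  echo "FlashInfer INT4 MoE kernels requested"
fi
```

Because the variable is exported rather than set inline, it also applies to any later restarts of the server within the same script.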
perf-changelog.yaml

Lines changed: 7 additions & 0 deletions

```diff
@@ -1048,3 +1048,10 @@
   - "Replace FP8 with combination of TP4 and TP8 config"
   - "Add --enable-flashinfer-allreduce-fusion to TP8"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/918
+
+- config-keys:
+    - kimik2.5-int4-b200-vllm
+  description:
+    - "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/935
+
```

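The changelog addition follows the same three-field shape as the neighboring entries: `config-keys` and `description` are lists, and `pr-link` is a URL. A minimal sketch of a shape check for such an entry (the `validate_entry` helper is hypothetical, not part of the InferenceX repo; field names are taken from the diff):

```python
# Hypothetical validator for one perf-changelog.yaml entry, mirroring the
# fields visible in the diff: config-keys, description, pr-link.
def validate_entry(entry: dict) -> bool:
    required = {"config-keys", "description", "pr-link"}
    if not required.issubset(entry):
        return False
    return (
        isinstance(entry["config-keys"], list)
        and isinstance(entry["description"], list)
        and entry["pr-link"].startswith("https://")
    )

# The entry added by this commit, as a plain dict.
entry = {
    "config-keys": ["kimik2.5-int4-b200-vllm"],
    "description": [
        "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark"
    ],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/935",
}
print(validate_entry(entry))  # True
```

A check like this could run in CI to catch malformed changelog entries before merge.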