MiniMax-M2.5: update B200 FP8 serving config#321
faradawn wants to merge 1 commit into `vllm-project:main`
Conversation
Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010: `--enable-expert-parallel` (tp:4/ep:4 validated, tp:2/ep:2 also supported), `--gpu-memory-utilization 0.90`, `--block-size 32`, `--kv-cache-dtype fp8`, `--stream-interval 20`, `--no-enable-prefix-caching`.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Code Review
This pull request updates the MiniMax-M2.5 documentation for B200 GPU configurations, transitioning to the latest vLLM image and adding several performance-tuning flags. However, the proposed configuration contains a parallelism inconsistency: setting tensor parallelism to 4 on a 4-GPU setup makes expert parallelism redundant. Furthermore, the update inadvertently removed critical tool-calling and reasoning parser flags and introduced an invalid argument, `--no-enable-prefix-caching`, which would cause server startup failures.
### B200 (FP8)

Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.
There is an inconsistency between the text, the command, and the PR description regarding parallelism:

- The text states the recommended configuration uses 4 GPUs.
- The command below uses `--tensor-parallel-size 4`.
- In vLLM, if the number of GPUs equals the tensor parallel size, expert parallelism is effectively disabled (EP=1), making `--enable-expert-parallel` redundant.
- The PR description mentions tp:4/ep:4 was validated, which requires 16 GPUs.

If the intention is to provide a 4-GPU recommendation that utilizes expert parallelism, the configuration should likely be `--tensor-parallel-size 2` (which defaults to EP=2 on 4 GPUs). Please clarify the intended hardware target.
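For concreteness, a 4-GPU launch along the lines this comment proposes might look like the sketch below. The model id `MiniMaxAI/MiniMax-M2.5` is an assumption on my part, and the flag set is copied from this PR's diff rather than independently benchmark-validated:

```shell
# Sketch only: TP=2 on 4 GPUs so expert parallelism has ranks to spread
# over (EP=2, per the review comment). Not a validated configuration.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
  --trust-remote-code
```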
```
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--block-size 32 \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code
```
This update introduces two issues:

- Regression: the tool-calling and reasoning parser flags (`--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`) have been removed. These are essential for the model's specialized features, such as structured tool use and the `<think>` block formatting.
- Invalid argument: `--no-enable-prefix-caching` is not a valid vLLM argument. Prefix caching is disabled by default in vLLM; including an unrecognized flag will cause the server to fail at startup. If you wish to ensure it is off, simply omit the `--enable-prefix-caching` flag.
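One way to catch this class of error before deploying is to check the server's own help text for the flag spelling. This is a generic sanity check (it assumes the `vllm` CLI is on `PATH` inside the image), not something the PR itself does:

```shell
# If this prints no matching flag, the spelling used in the config does
# not exist and `vllm serve` would reject it at startup.
vllm serve --help | grep "prefix-caching"
```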
Suggested change:

```diff
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
- --no-enable-prefix-caching \
+ --tool-call-parser minimax_m2 \
+ --reasoning-parser minimax_m2_append_think \
+ --enable-auto-tool-choice \
  --trust-remote-code
```
Summary

- `--enable-expert-parallel` (tp:4/ep:4 validated; tp:2/ep:2 also noted)
- `--gpu-memory-utilization 0.90`
- `--block-size 32`
- `--kv-cache-dtype fp8`
- `--stream-interval 20`
- `--no-enable-prefix-caching`
- Image tag bumped from `nightly` to `latest`

Based on SemiAnalysisAI/InferenceX#1010, which validated and widened the B200 FP8 search space for MiniMax-M2.5 (tp:4/ep:4 at conc 256–512, tp:2/ep:2 at conc 512).