
MiniMax-M2.5: update B200 FP8 serving config#321

Open
faradawn wants to merge 1 commit into vllm-project:main from faradawn:minimaxm25-b200-fp8

Conversation

@faradawn
Collaborator

@faradawn faradawn commented Apr 8, 2026

Summary

  • Add --enable-expert-parallel (tp:4/ep:4 validated; tp:2/ep:2 also noted)
  • Add --gpu-memory-utilization 0.90
  • Add --block-size 32
  • Add --kv-cache-dtype fp8
  • Add --stream-interval 20
  • Add --no-enable-prefix-caching
  • Switch image tag from nightly to latest

Based on SemiAnalysisAI/InferenceX#1010, which validated and widened the B200 FP8 search space for MiniMax-M2.5 (tp:4/ep:4 at conc 256–512, tp:2/ep:2 at conc 512).

Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010:
--enable-expert-parallel (tp:4/ep:4 validated, tp:2/ep:2 also supported),
--gpu-memory-utilization 0.90, --block-size 32, --kv-cache-dtype fp8,
--stream-interval 20, --no-enable-prefix-caching.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
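Taken together, the flags in this PR imply a serve invocation along the following lines. This is a sketch only: the model identifier and the `--tensor-parallel-size 4` value are assumptions (the PR lists flags but not the full command), and the review comments below dispute both `--no-enable-prefix-caching` and the TP/EP pairing.

```shell
# Sketch of the B200 FP8 serve command the PR's flag list describes.
# The model ID below is assumed, not stated in this PR, and the review
# questions --no-enable-prefix-caching and the tensor-parallel setting.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
  --no-enable-prefix-caching \
  --trust-remote-code
```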

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the MiniMax-M2.5 documentation for B200 GPU configurations, transitioning to the latest vLLM image and adding several performance-tuning flags. However, the proposed configuration contains a parallelism inconsistency where setting tensor parallelism to 4 on a 4-GPU setup makes expert parallelism redundant. Furthermore, the update inadvertently removed critical tool-calling and reasoning parser flags and introduced an invalid argument, '--no-enable-prefix-caching', which would cause server startup failures.


### B200 (FP8)

Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.


Severity: medium

There is an inconsistency between the text, the command, and the PR description regarding parallelism:

  • The text states the recommended configuration uses 4 GPUs.
  • The command below uses --tensor-parallel-size 4.
  • In vLLM, if the number of GPUs equals the tensor parallel size, expert parallelism is effectively disabled (EP=1), making --enable-expert-parallel redundant.
  • The PR description mentions tp:4/ep:4 was validated, which requires 16 GPUs.

If the intention is to provide a 4-GPU recommendation that utilizes expert parallelism, the configuration should likely be --tensor-parallel-size 2 (which defaults to EP=2 on 4 GPUs). Please clarify the intended hardware target.
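The reviewer's arithmetic can be sketched as a small shell helper. This mirrors the rule stated in the comment (EP degree = GPUs remaining after tensor parallelism), not vLLM's actual parallelism internals, and `effective_ep` is a hypothetical name introduced here for illustration.

```shell
# Effective expert-parallel degree per the reviewer's stated rule:
# the GPU groups left after tensor parallelism claims `tp` GPUs each.
effective_ep() {
  local num_gpus=$1 tp=$2
  echo $(( num_gpus / tp ))
}

effective_ep 4 4    # 4 GPUs, tp=4  -> 1 (expert parallelism effectively off)
effective_ep 4 2    # 4 GPUs, tp=2  -> 2
effective_ep 16 4   # tp:4/ep:4 as in the PR description -> needs 16 GPUs
```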

Comment on lines +46 to 52
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--block-size 32 \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code


Severity: medium

This update introduces two issues:

  1. Regression: The tool-calling and reasoning parser flags (--tool-call-parser, --reasoning-parser, and --enable-auto-tool-choice) have been removed. These are essential for the model's specialized features, such as structured tool use and the <think> block formatting.
  2. Invalid Argument: --no-enable-prefix-caching is not a valid vLLM argument. Prefix caching is disabled by default in vLLM; including an unrecognized flag will cause the server to fail at startup. If you wish to ensure it is off, simply omit the --enable-prefix-caching flag.
Suggested change

Before:
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--block-size 32 \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code

After:
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--block-size 32 \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code
