MiniMax-M2.5: update B200 FP8 serving config#321
faradawn wants to merge 1 commit into `vllm-project:main`
Conversation
Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010: `--enable-expert-parallel` (tp:4/ep:4 validated, tp:2/ep:2 also supported), `--gpu-memory-utilization 0.90`, `--block-size 32`, `--kv-cache-dtype fp8`, `--stream-interval 20`, `--no-enable-prefix-caching`.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Code Review
This pull request updates the MiniMax-M2.5 documentation for B200 GPU configurations, transitioning to the latest vLLM image and adding several performance-tuning flags. However, the proposed configuration contains a parallelism inconsistency: setting tensor parallelism to 4 on a 4-GPU setup makes expert parallelism redundant. Furthermore, the update inadvertently removed critical tool-calling and reasoning parser flags and introduced an invalid argument, `--no-enable-prefix-caching`, which would cause server startup failures.
### B200 (FP8)

Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.
There is an inconsistency between the text, the command, and the PR description regarding parallelism:

- The text states the recommended configuration uses 4 GPUs.
- The command below uses `--tensor-parallel-size 4`.
- In vLLM, if the number of GPUs equals the tensor parallel size, expert parallelism is effectively disabled (EP=1), making `--enable-expert-parallel` redundant.
- The PR description mentions tp:4/ep:4 was validated, which requires 16 GPUs.

If the intention is to provide a 4-GPU recommendation that utilizes expert parallelism, the configuration should likely be `--tensor-parallel-size 2` (which defaults to EP=2 on 4 GPUs). Please clarify the intended hardware target.
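For concreteness, a 4-GPU launch along the lines this comment proposes might look like the sketch below. The model id `MiniMaxAI/MiniMax-M2.5` is an assumption on my part, and the flag set is copied from this PR's diff rather than independently benchmark-validated:

```shell
# Sketch only: TP=2 on 4 GPUs so expert parallelism has ranks to spread
# over (EP=2, per the review comment). Not a validated configuration.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
  --trust-remote-code
```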
```
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--block-size 32 \
--kv-cache-dtype fp8 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code
```
This update introduces two issues:

- Regression: the tool-calling and reasoning parser flags (`--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`) have been removed. These are essential for the model's specialized features, such as structured tool use and the `<think>` block formatting.
- Invalid argument: `--no-enable-prefix-caching` is not a valid vLLM argument. Prefix caching is disabled by default in vLLM; including an unrecognized flag will cause the server to fail at startup. If you wish to ensure it is off, simply omit the `--enable-prefix-caching` flag.
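One way to catch this class of error before deploying is to check the server's own help text for the flag spelling. This is a generic sanity check (it assumes the `vllm` CLI is on `PATH` inside the image), not something the PR itself does:

```shell
# If this prints no matching flag, the spelling used in the config does
# not exist and `vllm serve` would reject it at startup.
vllm serve --help | grep "prefix-caching"
```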
Suggested change:

```diff
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
- --no-enable-prefix-caching \
+ --tool-call-parser minimax_m2 \
+ --reasoning-parser minimax_m2_append_think \
+ --enable-auto-tool-choice \
  --trust-remote-code
```
Summary

- `--enable-expert-parallel` (tp:4/ep:4 validated; tp:2/ep:2 also noted)
- `--gpu-memory-utilization 0.90`
- `--block-size 32`
- `--kv-cache-dtype fp8`
- `--stream-interval 20`
- `--no-enable-prefix-caching`
- Image tag bumped from `nightly` to `latest`

Based on SemiAnalysisAI/InferenceX#1010, which validated and widened the B200 FP8 search space for MiniMax-M2.5 (tp:4/ep:4 at conc 256–512, tp:2/ep:2 at conc 512).