13 changes: 9 additions & 4 deletions MiniMax/MiniMax-M2.5.md
@@ -34,16 +34,21 @@ MiniMax-M2.5 can be run on different GPU configurations. The recommended setup u

### B200 (FP8)

Recommended configuration uses 4 GPUs with tensor and expert parallelism. A 2-GPU configuration (`--tensor-parallel-size 2 --enable-expert-parallel`) is also supported.
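The 2-GPU variant mentioned above might look like the following. This is an untested sketch: it assumes the same image tag and model ID as the 4-GPU command below and adds only the two flags the text names; memory and cache tuning flags are omitted and would need adjusting per hardware.

```shell
# Hypothetical 2-GPU launch (sketch, not validated)
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code
```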
Contributor (medium):

There is an inconsistency between the text, the command, and the PR description regarding parallelism:

- The text states the recommended configuration uses 4 GPUs.
- The command below uses `--tensor-parallel-size 4`.
- In vLLM, if the number of GPUs equals the tensor-parallel size, expert parallelism is effectively disabled (EP=1), making `--enable-expert-parallel` redundant.
- The PR description mentions tp:4/ep:4 was validated, which requires 16 GPUs.

If the intention is to provide a 4-GPU recommendation that utilizes expert parallelism, the configuration should likely be `--tensor-parallel-size 2` (which defaults to EP=2 on 4 GPUs). Please clarify the intended hardware target.
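The arithmetic behind this point can be sketched as follows. The helper below is illustrative only, not vLLM's actual API; it mirrors the comment's claim that expert parallelism uses the GPUs left over after tensor parallelism.

```python
def effective_expert_parallel_size(num_gpus: int, tensor_parallel_size: int) -> int:
    """Illustrative only: expert-parallel degree as the GPUs remaining
    after tensor parallelism, per the reviewer's description."""
    if num_gpus % tensor_parallel_size != 0:
        raise ValueError("GPU count must be divisible by the TP size")
    return num_gpus // tensor_parallel_size

# 4 GPUs with --tensor-parallel-size 4: no room left for experts (EP=1).
print(effective_expert_parallel_size(4, 4))  # 1
# 4 GPUs with --tensor-parallel-size 2: EP=2, the suggested alternative.
print(effective_expert_parallel_size(4, 2))  # 2
# tp:4/ep:4 as in the PR description would need 4 * 4 = 16 GPUs.
print(4 * 4)  # 16
```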


```diff
 docker run --gpus all \
   -p 8000:8000 \
   --ipc=host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
-  vllm/vllm-openai:nightly MiniMaxAI/MiniMax-M2.5 \
+  vllm/vllm-openai:latest MiniMaxAI/MiniMax-M2.5 \
   --tensor-parallel-size 4 \
-  --tool-call-parser minimax_m2 \
-  --reasoning-parser minimax_m2_append_think \
-  --enable-auto-tool-choice \
+  --enable-expert-parallel \
+  --gpu-memory-utilization 0.90 \
+  --block-size 32 \
+  --kv-cache-dtype fp8 \
+  --stream-interval 20 \
+  --no-enable-prefix-caching \
   --trust-remote-code
```
Comment on lines +46 to 52
Contributor (medium):

This update introduces two issues:

1. Regression: The tool-calling and reasoning parser flags (`--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`) have been removed. These are essential for the model's specialized features, such as structured tool use and the `<think>` block formatting.
2. Invalid argument: `--no-enable-prefix-caching` is not a valid vLLM argument. Prefix caching is disabled by default in vLLM; including an unrecognized flag will cause the server to fail at startup. If you wish to ensure it is off, simply omit the `--enable-prefix-caching` flag.
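The second point can be illustrated with stock `argparse`. This is a simplified stand-in: vLLM builds its CLI on top of argparse, but its actual parser may accept flag spellings this sketch rejects, so treat it as a model of the failure mode, not of vLLM itself.

```python
import argparse

# A minimal stand-in for a server CLI that defines only the positive flag.
parser = argparse.ArgumentParser()
parser.add_argument("--enable-prefix-caching", action="store_true")

# The defined flag parses normally.
args = parser.parse_args(["--enable-prefix-caching"])
print(args.enable_prefix_caching)  # True

# A flag the parser does not define is rejected, analogous to a server
# failing at startup on an unrecognized argument.
try:
    parser.parse_args(["--no-enable-prefix-caching"])
except SystemExit:
    print("unrecognized argument: parse failed")
```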
Suggested change:

```diff
   --enable-expert-parallel \
   --gpu-memory-utilization 0.90 \
   --block-size 32 \
   --kv-cache-dtype fp8 \
   --stream-interval 20 \
-  --no-enable-prefix-caching \
+  --tool-call-parser minimax_m2 \
+  --reasoning-parser minimax_m2_append_think \
+  --enable-auto-tool-choice \
   --trust-remote-code
```
