[AMD] Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space #1002
chunfangamd wants to merge 4 commits into main from
Conversation
Fewer GPUs mean less inter-GPU communication overhead, and MoE expert parallelism across 2 GPUs is very efficient for this model.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```yaml
- { tp: 4, ep: 4, conc-start: 4, conc-end: 256 }
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 2, conc-start: 4, conc-end: 64 }
    - { tp: 4, conc-start: 4, conc-end: 64 }
    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
    - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
```
🔴 The tp:4/ep:4 conc-end values are accidentally swapped between the 1k1k and 8k1k ISL scenarios: 1k1k gets conc-end:256 while 8k1k gets conc-end:512, which is backwards. Longer context requires more KV-cache memory per request, so 8k1k should support fewer concurrent requests (lower conc-end); the values should be 512 for 1k1k and 256 for 8k1k, matching the stated PR intent and the tp:2/ep:2 entries in this same PR.
Extended reasoning...
What the bug is and how it manifests
In the minimaxm2.5-fp8-mi355x-vllm config, the new tp:4/ep:4 entries have their conc-end values reversed across the two ISL scenarios:
- isl: 1024 (1k1k): tp: 4, ep: 4, conc-end: 256
- isl: 8192 (8k1k): tp: 4, ep: 4, conc-end: 512
This means the benchmark will explore lower maximum concurrency for short contexts and higher maximum concurrency for long contexts — the opposite of what is physically appropriate.
The specific code path
The diff modifies the seq-len-configs for minimaxm2.5-fp8-mi355x-vllm. Under isl: 1024, conc-end: 256 was written for tp:4/ep:4; under isl: 8192, conc-end: 512. The two values appear to have been pasted into the wrong ISL buckets.
Why existing code doesn't prevent it
The YAML config is not validated for semantic ordering of conc-end across ISL scenarios. Both values are syntactically valid; only a human review catches that the relationship between them is inverted.
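A lightweight semantic lint could catch this class of inversion automatically. The sketch below is hypothetical (the benchmark harness's actual config loader is not part of this diff); it asserts that, for a given tp/ep pair, conc-end never increases as ISL grows:

```python
# Hypothetical lint for seq-len-configs: for a given (tp, ep) pair,
# conc-end should be non-increasing as ISL grows, since longer contexts
# consume more KV-cache memory per request.
def check_conc_end_ordering(seq_len_configs):
    """seq_len_configs: list of {"isl": int, "search-space": [dict, ...]}."""
    last = {}    # (tp, ep) -> (isl, conc-end) from the previous (shorter) ISL
    errors = []
    for cfg in sorted(seq_len_configs, key=lambda c: c["isl"]):
        for entry in cfg["search-space"]:
            key = (entry.get("tp"), entry.get("ep"))
            conc_end = entry["conc-end"]
            prev = last.get(key)
            if prev is not None and conc_end > prev[1]:
                errors.append(
                    f"tp={key[0]} ep={key[1]}: conc-end {conc_end} at isl "
                    f"{cfg['isl']} exceeds {prev[1]} at isl {prev[0]}"
                )
            last[key] = (cfg["isl"], conc_end)
    return errors

# The swapped values from this PR trip the check:
configs = [
    {"isl": 1024, "search-space": [{"tp": 4, "ep": 4, "conc-start": 4, "conc-end": 256}]},
    {"isl": 8192, "search-space": [{"tp": 4, "ep": 4, "conc-start": 4, "conc-end": 512}]},
]
print(check_conc_end_ordering(configs))
```

Such a check would flag the tp:4/ep:4 inversion at review time instead of relying on a human reader.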
What the impact would be
The benchmarking harness will test tp:4/ep:4 up to concurrency 512 at 8k ISL, where memory pressure is highest, likely causing OOM errors, degraded throughput, or misleading results. Conversely, it will stop at concurrency 256 for 1k ISL, potentially missing the optimal peak-throughput operating point for shorter contexts.
How to fix it
Swap the two conc-end values so that:
- isl: 1024, tp: 4, ep: 4 → conc-end: 512
- isl: 8192, tp: 4, ep: 4 → conc-end: 256
Step-by-step proof
- The PR description explicitly states the intent: "conc 2-512 for 1k1k and (conc 2-256) for 8k1k scenarios."
- The tp:2/ep:2 entries added in this same PR correctly implement that intent: 1k1k gets conc-end: 512, 8k1k gets conc-end: 256.
- The tp:4/ep:4 entries are reversed: 1k1k gets conc-end: 256, 8k1k gets conc-end: 512, contradicting point 1.
- The analogous minimaxm2.5-fp8-mi325x-vllm config (already in the file) confirms the expected pattern: tp:8/ep:8 has conc-end: 512 for isl: 1024 and conc-end: 256 for isl: 8192.
- Physical reasoning independently confirms this: 8k ISL requires ~8× more KV-cache memory per request than 1k ISL, so the system can sustain fewer concurrent requests under 8k load, i.e., conc-end for 8k1k must be ≤ conc-end for 1k1k at the same TP/EP.
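The ~8× figure in the last step follows from KV-cache size being linear in sequence length. A quick sanity check (the layer/head dimensions below are placeholders, not MiniMax-M2.5's actual architecture; the ratio is independent of them anyway):

```python
# KV-cache bytes per request: 2 tensors (K and V) per layer,
# kv_heads * head_dim elements per token, 1 byte per element for FP8.
# Model dimensions are placeholders, not MiniMax-M2.5's real values.
def kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128, dtype_bytes=1):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ratio = kv_cache_bytes(8192) / kv_cache_bytes(1024)
print(ratio)  # 8.0 -- ~8x more KV-cache memory per request at 8k ISL
```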
```yaml
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 2, conc-start: 4, conc-end: 64 }
    - { tp: 4, conc-start: 4, conc-end: 64 }
    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
    - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
```
🟡 The tp:8/ep:8 entries for both ISL scenarios (1k1k and 8k1k) have conc-start: 2 and conc-end: 2, collapsing the search space to a single concurrency point. This is a dramatic reduction from the previous conc-start: 32, conc-end: 256 range and is undocumented. If it is intentional (e.g., tp8/ep8 is known to be non-optimal and only needs a sanity check at conc=2), a brief comment in the YAML would prevent future confusion; if it is unintentional, conc-end should be raised to provide a meaningful search range.
Extended reasoning...
What the bug is: In the updated minimaxm2.5-fp8-mi355x-vllm config, both ISL scenarios now have { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }. When conc-start == conc-end, only a single concurrency value (conc=2) is benchmarked. The previous configuration used conc-start: 32, conc-end: 256, providing a meaningful 8-point search range.
The specific change: The diff shows the tp:8/ep:8 lines for both 1k1k (line 350) and 8k1k (line 356) changed from conc-start: 32, conc-end: 256 to conc-start: 2, conc-end: 2. The PR description only mentions "Add tp2 ep2 search-space entries" and makes no reference to the tp8/ep8 reduction.
Why existing code doesn't prevent it: The YAML schema appears to allow conc-start == conc-end as a degenerate but technically valid configuration. No validation prevents this single-point collapse.
Addressing the intentionality argument: One verifier argued this is intentional search-space optimization—narrowing tp8/ep8 to a reference point while widening tp2/ep2 (2-512) and tp4/ep4 (4-256). This is plausible: the PR title is "Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space," and the consistency across both ISL scenarios suggests deliberate authorship. However, the sibling minimaxm2.5-fp8-mi355x-atom config still uses conc-start: 32, conc-end: 256 for tp8/ep8, creating an inconsistency between frameworks with no explanation. No other config in this file uses conc-start == conc-end as a search space (other instances like tp: 1, conc-start: 64, conc-end: 64 in gptoss-fp4-mi300x-vllm are similarly unexplained, but that is a separate concern).
Step-by-step proof: Before the PR, tp8/ep8 in minimaxm2.5-fp8-mi355x-vllm would test concurrencies 32, 64, 128, 256 (assuming powers-of-2 steps). After the PR, only conc=2 is tested. Since tp2/ep2 now covers 2-512 and tp4/ep4 covers 4-256, there is substantial overlap in the low-concurrency range—but tp8/ep8's potential performance advantage at higher concurrencies (32-256) is now completely untested, making it impossible to determine if tp8/ep8 would outperform lower TP configurations at production-scale loads.
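The powers-of-2 stepping assumed in that proof can be sketched as follows (the harness's actual step rule is an assumption, not confirmed by this diff):

```python
def concurrency_sweep(start, end):
    """Assumed step rule: double the concurrency from conc-start up to conc-end."""
    points = []
    c = start
    while c <= end:
        points.append(c)
        c *= 2
    return points

print(concurrency_sweep(32, 256))  # [32, 64, 128, 256] -- the old tp8/ep8 range
print(concurrency_sweep(2, 2))     # [2] -- the new range collapses to one point
```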
How to fix: If intentional, add an inline comment such as # pinned to single sanity-check point; tp2/tp4 are the primary search targets. If unintentional, restore conc-end to a value like 256 to preserve meaningful search coverage.
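If the pin is intentional, the suggested inline annotation might look like the following (hypothetical comment text; the file's actual key names are as shown in the diff above):

```yaml
# tp8/ep8 pinned to a single sanity-check point; tp2/tp4 are the primary search targets
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
```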
@chunfangamd fixed by #1003
Add tp2 ep2 search-space entries (conc 2-512) for 1k1k and (conc 2-256) for 8k1k scenarios.
e2e Test: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23984616875