[AMD] Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space#1002

Closed
chunfangamd wants to merge 4 commits into main from chun/minimaxm2.5-fp8-single-node-opt-2

Conversation

@chunfangamd
Collaborator

Add tp2/ep2 search-space entries: conc 2-512 for the 1k1k scenario and conc 2-256 for the 8k1k scenario.

e2e Test: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23984616875

Fewer GPUs means less inter-GPU communication overhead, and MoE
expert parallelism across 2 GPUs is very efficient for this model.
@github-actions
Contributor

github-actions bot commented Apr 4, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

1 similar comment

Comment on lines +349 to +355
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 256 }
    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
  - isl: 8192
    osl: 1024
    search-space:
      - { tp: 2, conc-start: 4, conc-end: 64 }
      - { tp: 4, conc-start: 4, conc-end: 64 }
      - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
      - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
      - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
Contributor

🔴 The tp:4/ep:4 conc-end values are accidentally swapped between the 1k1k and 8k1k ISL scenarios: 1k1k gets conc-end:256 while 8k1k gets conc-end:512, which is backwards. Longer context requires more KV-cache memory per request, so 8k1k should support fewer concurrent requests (lower conc-end); the values should be 512 for 1k1k and 256 for 8k1k, matching the stated PR intent and the tp:2/ep:2 entries in this same PR.

Extended reasoning...

What the bug is and how it manifests

In the minimaxm2.5-fp8-mi355x-vllm config, the new tp:4/ep:4 entries have their conc-end values reversed relative to ISL scenario:

  • isl:1024 (1k1k): tp:4, ep:4, conc-end: 256
  • isl:8192 (8k1k): tp:4, ep:4, conc-end: 512

This means the benchmark will explore lower maximum concurrency for short contexts and higher maximum concurrency for long contexts — the opposite of what is physically appropriate.

The specific code path

The diff modifies the seq-len-configs for minimaxm2.5-fp8-mi355x-vllm. Under isl: 1024, conc-end: 256 was written for tp:4/ep:4. Under isl: 8192, conc-end: 512 was written for tp:4/ep:4. These two values appear to have been copy-pasted with the wrong values for the wrong ISL bucket.

Why existing code doesn't prevent it

The YAML config is not validated for semantic ordering of conc-end across ISL scenarios. Both values are syntactically valid; only a human review catches that the relationship between them is inverted.
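
One way to close that gap is a small semantic check in CI. Below is a minimal sketch, assuming the seq-len configs have already been loaded into plain dicts; the helper name and dict shape are assumptions here, not the repo's actual code:

```python
# Semantic check: for each (tp, ep) pair, conc-end must not increase
# as isl grows, since longer contexts need more KV-cache memory per
# request. The dict shape mirrors the YAML snippet under review.
def conc_end_violations(seq_len_configs):
    """Yield (tp, ep) pairs whose conc-end grows with isl."""
    seen = {}  # (tp, ep) -> conc-end at the largest isl seen so far
    for cfg in sorted(seq_len_configs, key=lambda c: c["isl"]):
        for entry in cfg["search-space"]:
            key = (entry.get("tp"), entry.get("ep"))
            if key in seen and entry["conc-end"] > seen[key]:
                yield key
            seen[key] = entry["conc-end"]

# The swapped tp4/ep4 entries from this PR trip the check:
configs = [
    {"isl": 1024, "search-space": [{"tp": 4, "ep": 4, "conc-end": 256}]},
    {"isl": 8192, "search-space": [{"tp": 4, "ep": 4, "conc-end": 512}]},
]
print(list(conc_end_violations(configs)))  # → [(4, 4)]
```
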

What the impact would be

The benchmarking harness will test tp:4/ep:4 up to concurrency 512 at 8k ISL, where memory pressure is highest, likely causing OOM errors, degraded throughput, or misleading results. Conversely, it will stop at concurrency 256 for 1k ISL, potentially missing the optimal peak-throughput operating point for shorter contexts.

How to fix it

Swap the two conc-end values so that:

  • isl:1024, tp:4, ep:4 → conc-end: 512
  • isl:8192, tp:4, ep:4 → conc-end: 256
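
In config form, the corrected entries would read as follows (a sketch of only the affected lines, with surrounding keys elided):

```yaml
# under isl: 1024 (1k1k)
- { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
# under isl: 8192 (8k1k)
- { tp: 4, ep: 4, conc-start: 4, conc-end: 256 }
```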

Step-by-step proof

  1. The PR description explicitly states the intent: "conc 2-512 for 1k1k and (conc 2-256) for 8k1k scenarios."
  2. The tp:2/ep:2 entries added in this same PR correctly implement that intent: 1k1k gets conc-end:512, 8k1k gets conc-end:256.
  3. The tp:4/ep:4 entries are reversed: 1k1k gets conc-end:256, 8k1k gets conc-end:512 — contradicting point 1.
  4. The analogous minimaxm2.5-fp8-mi325x-vllm config (already in the file) confirms the expected pattern: tp:8/ep:8 has conc-end:512 for isl:1024 and conc-end:256 for isl:8192.
  5. Physical reasoning independently confirms: 8k ISL requires ~8× more KV-cache memory per request than 1k ISL, so the system can sustain fewer concurrent requests under 8k load, i.e., conc-end for 8k1k must be ≤ conc-end for 1k1k at the same TP/EP.
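
The ~8× figure in step 5 is simple arithmetic; the sketch below uses illustrative transformer dimensions (placeholders, not MiniMax-M2.5's actual architecture) to show the scaling:

```python
# Back-of-envelope KV-cache size per request. The dimensions below are
# illustrative placeholders, NOT MiniMax-M2.5's real architecture.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 1  # fp8 KV cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
for isl in (1024, 8192):
    print(f"isl={isl}: {per_token * isl / 2**20:.0f} MiB per request")

# Whatever the real dimensions, per-request KV-cache grows linearly with
# context length, so 8k ISL costs 8x the memory of 1k ISL per request.
```
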

Comment on lines +350 to +356
    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
  - isl: 8192
    osl: 1024
    search-space:
      - { tp: 2, conc-start: 4, conc-end: 64 }
      - { tp: 4, conc-start: 4, conc-end: 64 }
      - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
      - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
      - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
      - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
Contributor

🟡 The tp:8/ep:8 entries for both ISL scenarios (1k1k and 8k1k) have conc-start:2 and conc-end:2, collapsing the search space to a single concurrency point. This is a dramatic reduction from the previous conc-start:32, conc-end:256 range and is undocumented—if intentional (e.g., tp8/ep8 is known non-optimal and just needs a sanity check at conc=2), a brief comment in the YAML would prevent future confusion; if unintentional, conc-end should be raised to provide a meaningful search range.

Extended reasoning...

What the bug is: In the updated minimaxm2.5-fp8-mi355x-vllm config, both ISL scenarios now have { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }. When conc-start == conc-end, only a single concurrency value (conc=2) is benchmarked. The previous configuration used conc-start: 32, conc-end: 256, providing a meaningful 8-point search range.

The specific change: The diff shows the tp:8/ep:8 lines for both 1k1k (line 350) and 8k1k (line 356) changed from conc-start: 32, conc-end: 256 to conc-start: 2, conc-end: 2. The PR description only mentions "Add tp2 ep2 search-space entries" and makes no reference to the tp8/ep8 reduction.

Why existing code doesn't prevent it: The YAML schema appears to allow conc-start == conc-end as a degenerate but technically valid configuration. No validation prevents this single-point collapse.

Addressing the intentionality argument: One verifier argued this is intentional search-space optimization—narrowing tp8/ep8 to a reference point while widening tp2/ep2 (2-512) and tp4/ep4 (4-256). This is plausible: the PR title is "Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space," and the consistency across both ISL scenarios suggests deliberate authorship. However, the sibling minimaxm2.5-fp8-mi355x-atom config still uses conc-start: 32, conc-end: 256 for tp8/ep8, creating an inconsistency between frameworks with no explanation. No other config in this file uses conc-start == conc-end as a search space (other instances like tp: 1, conc-start: 64, conc-end: 64 in gptoss-fp4-mi300x-vllm are similarly unexplained, but that is a separate concern).

Step-by-step proof: Before the PR, tp8/ep8 in minimaxm2.5-fp8-mi355x-vllm would test concurrencies 32, 64, 128, 256 (assuming powers-of-2 steps). After the PR, only conc=2 is tested. Since tp2/ep2 now covers 2-512 and tp4/ep4 covers 4-256, there is substantial overlap in the low-concurrency range—but tp8/ep8's potential performance advantage at higher concurrencies (32-256) is now completely untested, making it impossible to determine if tp8/ep8 would outperform lower TP configurations at production-scale loads.

How to fix: If intentional, add an inline comment such as # pinned to single sanity-check point; tp2/tp4 are the primary search targets. If unintentional, restore conc-end to a value like 256 to preserve meaningful search coverage.
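
Either resolution is a one-line change in the YAML (a sketch; the comment wording is only a suggestion):

```yaml
# If intentional: keep the pin but document it.
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }  # sanity-check point only; tp2/tp4 are the primary targets

# If unintentional: restore the previous meaningful range.
- { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
```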

Contributor

@functionstackx functionstackx left a comment

lgtm

@cquil11
Collaborator

cquil11 commented Apr 5, 2026

@chunfangamd fixed by #1003

@cquil11 cquil11 added the AMD label Apr 8, 2026
@cquil11 cquil11 changed the title Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space [AMD] Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space Apr 8, 2026