[AMD] Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space #1002
chunfangamd wants to merge 4 commits into main from
Conversation
Fewer GPUs mean less inter-GPU communication overhead, and MoE expert parallelism across 2 GPUs is very efficient for this model.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```yaml
- { tp: 4, ep: 4, conc-start: 4, conc-end: 256 }
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 2, conc-start: 4, conc-end: 64 }
    - { tp: 4, conc-start: 4, conc-end: 64 }
    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
    - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
```
🔴 The tp:4/ep:4 conc-end values are accidentally swapped between the 1k1k and 8k1k ISL scenarios: 1k1k gets conc-end:256 while 8k1k gets conc-end:512, which is backwards. Longer context requires more KV-cache memory per request, so 8k1k should support fewer concurrent requests (lower conc-end); the values should be 512 for 1k1k and 256 for 8k1k, matching the stated PR intent and the tp:2/ep:2 entries in this same PR.
Extended reasoning...
What the bug is and how it manifests
In the minimaxm2.5-fp8-mi355x-vllm config, the new tp:4/ep:4 entries have their conc-end values reversed across the two ISL scenarios:
- isl: 1024 (1k1k): tp: 4, ep: 4, conc-end: 256
- isl: 8192 (8k1k): tp: 4, ep: 4, conc-end: 512
This means the benchmark will explore lower maximum concurrency for short contexts and higher maximum concurrency for long contexts — the opposite of what is physically appropriate.
The specific code path
The diff modifies the seq-len-configs for minimaxm2.5-fp8-mi355x-vllm. Under isl: 1024, conc-end: 256 was written for tp:4/ep:4; under isl: 8192, conc-end: 512. The two values appear to have been pasted into the wrong ISL buckets.
Why existing code doesn't prevent it
The YAML config is not validated for semantic ordering of conc-end across ISL scenarios. Both values are syntactically valid; only a human review catches that the relationship between them is inverted.
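A lightweight semantic lint could catch this class of inversion automatically. The sketch below is hypothetical (the benchmark harness's actual config loader is not part of this diff); it asserts that, for a given tp/ep pair, conc-end never increases as ISL grows:

```python
# Hypothetical lint for seq-len-configs: for a given (tp, ep) pair,
# conc-end should be non-increasing as ISL grows, since longer contexts
# consume more KV-cache memory per request.
def check_conc_end_ordering(seq_len_configs):
    """seq_len_configs: list of {"isl": int, "search-space": [dict, ...]}."""
    last = {}    # (tp, ep) -> (isl, conc-end) from the previous (shorter) ISL
    errors = []
    for cfg in sorted(seq_len_configs, key=lambda c: c["isl"]):
        for entry in cfg["search-space"]:
            key = (entry.get("tp"), entry.get("ep"))
            conc_end = entry["conc-end"]
            prev = last.get(key)
            if prev is not None and conc_end > prev[1]:
                errors.append(
                    f"tp={key[0]} ep={key[1]}: conc-end {conc_end} at isl "
                    f"{cfg['isl']} exceeds {prev[1]} at isl {prev[0]}"
                )
            last[key] = (cfg["isl"], conc_end)
    return errors

# The swapped values from this PR trip the check:
configs = [
    {"isl": 1024, "search-space": [{"tp": 4, "ep": 4, "conc-start": 4, "conc-end": 256}]},
    {"isl": 8192, "search-space": [{"tp": 4, "ep": 4, "conc-start": 4, "conc-end": 512}]},
]
print(check_conc_end_ordering(configs))
```

Such a check would flag the tp:4/ep:4 inversion at review time instead of relying on a human reader.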
What the impact would be
The benchmarking harness will test tp:4/ep:4 up to concurrency 512 at 8k ISL, where memory pressure is highest, likely causing OOM errors, degraded throughput, or misleading results. Conversely, it will stop at concurrency 256 for 1k ISL, potentially missing the optimal peak-throughput operating point for shorter contexts.
How to fix it
Swap the two conc-end values so that:
- isl: 1024, tp: 4, ep: 4 → conc-end: 512
- isl: 8192, tp: 4, ep: 4 → conc-end: 256
Step-by-step proof
- The PR description explicitly states the intent: "conc 2-512 for 1k1k and (conc 2-256) for 8k1k scenarios."
- The tp:2/ep:2 entries added in this same PR correctly implement that intent: 1k1k gets conc-end: 512, 8k1k gets conc-end: 256.
- The tp:4/ep:4 entries are reversed: 1k1k gets conc-end: 256, 8k1k gets conc-end: 512, contradicting point 1.
- The analogous minimaxm2.5-fp8-mi325x-vllm config (already in the file) confirms the expected pattern: tp:8/ep:8 has conc-end: 512 for isl: 1024 and conc-end: 256 for isl: 8192.
- Physical reasoning independently confirms this: 8k ISL requires ~8× more KV-cache memory per request than 1k ISL, so the system can sustain fewer concurrent requests under 8k load, i.e., conc-end for 8k1k must be ≤ conc-end for 1k1k at the same TP/EP.
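The ~8× figure in the last step follows from KV-cache size being linear in sequence length. A quick sanity check (the layer/head dimensions below are placeholders, not MiniMax-M2.5's actual architecture; the ratio is independent of them anyway):

```python
# KV-cache bytes per request: 2 tensors (K and V) per layer,
# kv_heads * head_dim elements per token, 1 byte per element for FP8.
# Model dimensions are placeholders, not MiniMax-M2.5's real values.
def kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128, dtype_bytes=1):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ratio = kv_cache_bytes(8192) / kv_cache_bytes(1024)
print(ratio)  # 8.0 -- ~8x more KV-cache memory per request at 8k ISL
```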
```yaml
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 2, conc-start: 4, conc-end: 64 }
    - { tp: 4, conc-start: 4, conc-end: 64 }
    - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }
    - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
    - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
```
🟡 The tp:8/ep:8 entries for both ISL scenarios (1k1k and 8k1k) have conc-start: 2 and conc-end: 2, collapsing the search space to a single concurrency point. This is a dramatic reduction from the previous conc-start: 32, conc-end: 256 range and is undocumented. If it is intentional (e.g., tp8/ep8 is known to be non-optimal and only needs a sanity check at conc=2), a brief comment in the YAML would prevent future confusion; if it is unintentional, conc-end should be raised to provide a meaningful search range.
Extended reasoning...
What the bug is: In the updated minimaxm2.5-fp8-mi355x-vllm config, both ISL scenarios now have { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }. When conc-start == conc-end, only a single concurrency value (conc=2) is benchmarked. The previous configuration used conc-start: 32, conc-end: 256, providing a meaningful 8-point search range.
The specific change: The diff shows the tp:8/ep:8 lines for both 1k1k (line 350) and 8k1k (line 356) changed from conc-start: 32, conc-end: 256 to conc-start: 2, conc-end: 2. The PR description only mentions "Add tp2 ep2 search-space entries" and makes no reference to the tp8/ep8 reduction.
Why existing code doesn't prevent it: The YAML schema appears to allow conc-start == conc-end as a degenerate but technically valid configuration. No validation prevents this single-point collapse.
Addressing the intentionality argument: One verifier argued this is intentional search-space optimization—narrowing tp8/ep8 to a reference point while widening tp2/ep2 (2-512) and tp4/ep4 (4-256). This is plausible: the PR title is "Optimize MiniMax-M2.5 FP8 MI355X vLLM search-space," and the consistency across both ISL scenarios suggests deliberate authorship. However, the sibling minimaxm2.5-fp8-mi355x-atom config still uses conc-start: 32, conc-end: 256 for tp8/ep8, creating an inconsistency between frameworks with no explanation. No other config in this file uses conc-start == conc-end as a search space (other instances like tp: 1, conc-start: 64, conc-end: 64 in gptoss-fp4-mi300x-vllm are similarly unexplained, but that is a separate concern).
Step-by-step proof: Before the PR, tp8/ep8 in minimaxm2.5-fp8-mi355x-vllm would test concurrencies 32, 64, 128, 256 (assuming powers-of-2 steps). After the PR, only conc=2 is tested. Since tp2/ep2 now covers 2-512 and tp4/ep4 covers 4-256, there is substantial overlap in the low-concurrency range—but tp8/ep8's potential performance advantage at higher concurrencies (32-256) is now completely untested, making it impossible to determine if tp8/ep8 would outperform lower TP configurations at production-scale loads.
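The powers-of-2 stepping assumed in that proof can be sketched as follows (the harness's actual step rule is an assumption, not confirmed by this diff):

```python
def concurrency_sweep(start, end):
    """Assumed step rule: double the concurrency from conc-start up to conc-end."""
    points = []
    c = start
    while c <= end:
        points.append(c)
        c *= 2
    return points

print(concurrency_sweep(32, 256))  # [32, 64, 128, 256] -- the old tp8/ep8 range
print(concurrency_sweep(2, 2))     # [2] -- the new range collapses to one point
```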
How to fix: If intentional, add an inline comment such as # pinned to single sanity-check point; tp2/tp4 are the primary search targets. If unintentional, restore conc-end to a value like 256 to preserve meaningful search coverage.
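If the pin is intentional, the suggested inline annotation might look like the following (hypothetical comment text; the file's actual key names are as shown in the diff above):

```yaml
# tp8/ep8 pinned to a single sanity-check point; tp2/tp4 are the primary search targets
- { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
```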
@chunfangamd fixed by #1003
Add tp2 ep2 search-space entries (conc 2-512) for 1k1k and (conc 2-256) for 8k1k scenarios.
e2e Test: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23984616875