
Remove 1.5x MSA overhead and re-tune MiMo V2 Pro block configs #1076

Open

Prayer3th wants to merge 4 commits into main from tune/mimo-v2-pro-remove-msa-1.5x

Conversation

@Prayer3th
Collaborator

Summary

  • Remove the overly conservative total_bytes = int(total_bytes * 1.5) MSA overhead multiplier from the VMEM estimator in bench_fused_moe.py
  • Re-tune FusedMoE block configs for MiMo V2 Pro (384 experts, H=6144, intermediate=2048, ep=32) without the 1.5x factor
  • Key config changes at 1024+ tokens: bt doubled (16→32, 64→128), unlocking larger token blocks that were previously filtered by the conservative VMEM estimate
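The estimator change described above can be sketched as follows. This is an illustrative reconstruction, not the actual `bench_fused_moe.py` code: the names `BlockConfig`, `estimate_vmem_bytes`, and `fits`, and the scratch-size formula, are assumptions; only the removed `total_bytes = int(total_bytes * 1.5)` line and the `--tpu-vmem-headroom-ratio` mechanism come from this PR.

```python
# Hypothetical sketch of the VMEM estimator change. Names and the scratch
# formula are illustrative; only the removed 1.5x line and the headroom
# mechanism are taken from the PR.
from dataclasses import dataclass

@dataclass
class BlockConfig:
    bt: int  # token block size
    bf: int  # feed-forward (intermediate) block size
    bd: int  # model-dim block size

def estimate_vmem_bytes(cfg: BlockConfig, dtype_bytes: int = 2) -> int:
    """Raw scratch estimate for the three tiles held in VMEM."""
    total_bytes = dtype_bytes * (
        cfg.bt * cfg.bf + cfg.bf * cfg.bd + cfg.bt * cfg.bd
    )
    # Previously: total_bytes = int(total_bytes * 1.5)  # MSA overhead, removed
    return total_bytes

def fits(cfg: BlockConfig, vmem_limit: int, headroom_ratio: float = 0.90) -> bool:
    # Callers now tighten the raw estimate via --tpu-vmem-headroom-ratio
    # (and --tpu-vmem-estimate-scale) instead of a baked-in 1.5x factor.
    return estimate_vmem_bytes(cfg) <= int(vmem_limit * headroom_ratio)
```

With the 1.5x factor gone, a config such as `bt=32` at 1024 tokens that previously overshot the padded estimate can now pass the headroom check and be considered by the tuner.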

Benchmark Results (ISL=16K/OSL=1K, V7X EP32)

BSZ   Output tok/s   vs 05.06 baseline   TPOT (ms)   vs 05.06
64    869.62         +8.9%               49.32       -4.0%
128   928.71         +13.0%              70.17       -5.1%
256   928.44         +9.6%               70.21       -6.2%
512   974.34         +11.4%              71.83       -7.3%

Pure prefill (ISL=16K/OSL=1): +17.8% to +18.5% input throughput across all batch sizes.

Test plan

  • Re-tuned on V7X EP32 (4 pods × 8 chips) with --tpu-vmem-headroom-ratio 0.90
  • No OOM observed during tuning or serving
  • End-to-end serving benchmark (8 groups) shows consistent improvement over 05.06 baseline
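For reviewers, the shape of the re-tuned table can be sketched as below. The dict layout and key names are assumptions for illustration; the `bt`/`bf`/`bd` values at 1024t and 2048t, and `bt=128` at 4096t and above, are the changes reported in this PR, while the remaining values are placeholders, not the actual tuned configs.

```python
# Illustrative shape of a re-tuned FusedMoE config table for MiMo V2 Pro
# (384 experts, H=6144, intermediate=2048, EP32). Layout and key names are
# assumptions; only the bt/bf/bd changes called out in the PR are sourced.
MIMO_V2_PRO_EP32 = {
    # tokens: block config
    1024:  dict(bt=32,  bf=1024, bd=2048),  # bt 16 -> 32
    2048:  dict(bt=32,  bf=2048, bd=1024),  # bf 1024 -> 2048, bd 2048 -> 1024
    4096:  dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
    8192:  dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
    16384: dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
}
```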

🤖 Generated with Claude Code

Prayer3th and others added 2 commits May 13, 2026 17:51
The 1.5x factor was overly conservative and excluded valid block configs
that could improve performance. The estimator now reports raw scratch
sizes; callers can still tighten with --tpu-vmem-headroom-ratio and
--tpu-vmem-estimate-scale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-tuned after removing the 1.5x MSA overhead multiplier from the VMEM
estimator. Key changes for higher token counts:
- 1024t: bt 16→32
- 2048t: bf 1024→2048, bd 2048→1024
- 4096/8192/16384t: bt 64→128

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the FusedMoE kernel performance by refining the VMEM estimation logic and updating block configurations. By removing an overly conservative memory overhead factor, the system can now utilize larger token blocks, leading to measurable improvements in both throughput and latency for MiMo V2 Pro models.

Highlights

  • VMEM Estimator Optimization: Removed the conservative 1.5x MSA overhead multiplier from the VMEM estimator in the benchmark suite to allow for more efficient memory utilization.
  • FusedMoE Configuration Tuning: Re-tuned block configurations for MiMo V2 Pro (384 experts) to leverage the freed-up memory, resulting in significant throughput improvements.
  • Performance Gains: Achieved up to 18.5% improvement in input throughput for pure prefill tasks and consistent gains across various batch sizes.

Prayer3th and others added 2 commits May 13, 2026 23:47
Re-tuned with 1.5x MSA overhead removed from VMEM estimator.

EP8: all 9 token counts (64-16384) tuned successfully.
EP32: 64-4096 tuned, 8192/16384 use 4096 config as fallback
(TPU firmware crash on large token counts in 4-pod setup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EP32 8192 tuned successfully with headroom=0.85 on fresh pods.
16384 still crashes during warmup, using 8192 config as fallback.
Key change: btc 128→64 for 8192/16384 entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
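The fallback behavior described in the commits above (8192/16384 reusing a smaller tuned config when tuning crashes) can be sketched as a nearest-smaller lookup. This is a minimal illustration under the assumption that configs are keyed by token count; `pick_config` is a hypothetical name, not the actual tuner API.

```python
# Minimal sketch of the fallback described above: if a token count was not
# tuned (e.g. 16384 on EP32), reuse the config of the largest tuned token
# count at or below it. Function name and dict layout are illustrative.
def pick_config(configs: dict[int, dict], num_tokens: int) -> dict:
    tuned = sorted(t for t in configs if t <= num_tokens)
    if not tuned:
        raise KeyError(f"no tuned config at or below {num_tokens} tokens")
    return configs[tuned[-1]]
```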