Remove 1.5x MSA overhead and re-tune MiMo V2 Pro block configs #1076
Prayer3th wants to merge 4 commits into
Conversation
The 1.5x factor was overly conservative and excluded valid block configs that could improve performance. The estimator now reports raw scratch sizes; callers can still tighten with --tpu-vmem-headroom-ratio and --tpu-vmem-estimate-scale. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
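The estimator change described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual vLLM code: the helper names, the block-size formula, and the VMEM budget constant are assumptions; only the removal of the 1.5x pad and the two flag names (`--tpu-vmem-headroom-ratio`, `--tpu-vmem-estimate-scale`) come from the PR.

```python
# Hypothetical sketch of the estimator change: the estimator returns the raw
# scratch size, and callers apply their own safety factors via the two flags.

TPU_VMEM_BYTES = 128 * 1024 * 1024  # assumed per-core VMEM budget (illustrative)

def estimate_scratch_bytes(bt: int, bf: int, bd: int, dtype_bytes: int = 2) -> int:
    """Raw scratch estimate for a (bt, bf, bd) block config; no 1.5x MSA pad."""
    lhs = bt * bd * dtype_bytes   # token-block x reduction-block input tile
    rhs = bd * bf * dtype_bytes   # reduction-block x feature-block weight tile
    out = bt * bf * dtype_bytes   # token-block x feature-block output tile
    return lhs + rhs + out        # previously: int(total * 1.5)

def fits_in_vmem(raw_bytes: int, headroom_ratio: float = 0.90,
                 estimate_scale: float = 1.0) -> bool:
    """Caller-side check: scale the raw estimate, compare against headroom."""
    return raw_bytes * estimate_scale <= TPU_VMEM_BYTES * headroom_ratio
```

With the 1.5x pad gone, a config is only rejected when its scaled raw estimate actually exceeds the headroom-adjusted budget, which is what re-admits the larger `bt` blocks.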
Re-tuned after removing the 1.5x MSA overhead multiplier from the VMEM estimator. Key changes for higher token counts:
- 1024t: bt 16→32
- 2048t: bf 1024→2048, bd 2048→1024
- 4096/8192/16384t: bt 64→128

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
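The per-token-count changes above can be captured as a small delta table. The changed field values are taken directly from the commit message; the dict layout itself is illustrative and does not reflect the repo's actual config file format.

```python
# Delta of the re-tuned block-config fields per token count.
# Only fields the commit message says changed are listed; all other
# fields keep their previous values.
RETUNED_DELTAS = {
    1024:  {"bt": 32},                 # bt 16 -> 32
    2048:  {"bf": 2048, "bd": 1024},   # bf 1024 -> 2048, bd 2048 -> 1024
    4096:  {"bt": 128},                # bt 64 -> 128
    8192:  {"bt": 128},                # bt 64 -> 128
    16384: {"bt": 128},                # bt 64 -> 128
}

def apply_delta(base_config: dict, num_tokens: int) -> dict:
    """Merge the re-tuned fields for `num_tokens` over an existing config."""
    return {**base_config, **RETUNED_DELTAS.get(num_tokens, {})}
```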
Summary of Changes (Gemini Code Assist)

This pull request optimizes FusedMoE kernel performance by refining the VMEM estimation logic and updating block configurations. Removing an overly conservative memory overhead factor lets the system use larger token blocks, yielding measurable throughput and latency improvements for MiMo V2 Pro models.
Re-tuned with 1.5x MSA overhead removed from VMEM estimator. EP8: all 9 token counts (64-16384) tuned successfully. EP32: 64-4096 tuned, 8192/16384 use 4096 config as fallback (TPU firmware crash on large token counts in 4-pod setup). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EP32 8192 tuned successfully with headroom=0.85 on fresh pods. 16384 still crashes during warmup, using 8192 config as fallback. Key change: btc 128→64 for 8192/16384 entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
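The EP32 fallback behavior described above (16384 still crashes during warmup, so it reuses the 8192 config) can be sketched as a largest-tuned-entry lookup. The tuned token counts follow the PR's stated sweep (64–16384 in powers of two, nine entries, with 16384 untuned on EP32); the function itself is a hypothetical illustration of the fallback rule, not the repo's actual lookup code.

```python
# Token counts with a tuned EP32 config; 16384 is absent because tuning
# still crashes during warmup, per the commit message.
EP32_TUNED_TOKENS = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

def resolve_config_tokens(num_tokens: int) -> int:
    """Pick the tuned entry to use: the largest tuned token count that does
    not exceed the request, falling back to the smallest entry otherwise."""
    candidates = [t for t in EP32_TUNED_TOKENS if t <= num_tokens]
    return max(candidates) if candidates else min(EP32_TUNED_TOKENS)
```

Under this rule a 16384-token request resolves to the 8192 config, matching the fallback described in the commit.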
Summary
- Removed the `total_bytes = int(total_bytes * 1.5)` MSA overhead multiplier from the VMEM estimator in bench_fused_moe.py
- `bt` doubled (16→32, 64→128), unlocking larger token blocks that were previously filtered out by the conservative VMEM estimate

Benchmark Results (ISL=16K/OSL=1K, V7X EP32)
Pure prefill (ISL=16K/OSL=1): +17.8% ~ +18.5% input throughput across all batch sizes.
Test plan
--tpu-vmem-headroom-ratio 0.90

🤖 Generated with Claude Code