
Remove 1.5x MSA overhead and re-tune MiMo V2 Pro block configs #1076

Open

Prayer3th wants to merge 4 commits into main from tune/mimo-v2-pro-remove-msa-1.5x

Conversation

@Prayer3th
Collaborator

Summary

  • Remove the overly conservative total_bytes = int(total_bytes * 1.5) MSA overhead multiplier from the VMEM estimator in bench_fused_moe.py
  • Re-tune FusedMoE block configs for MiMo V2 Pro (384 experts, H=6144, intermediate=2048, ep=32) without the 1.5x factor
  • Key config changes at 1024+ tokens: bt doubled (16→32, 64→128), unlocking larger token blocks that were previously filtered by the conservative VMEM estimate
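The estimator change described above can be sketched as follows. This is an illustrative reconstruction, not the actual `bench_fused_moe.py` code: the names `BlockConfig`, `estimate_vmem_bytes`, and `fits`, and the scratch-size formula, are assumptions; only the removed `total_bytes = int(total_bytes * 1.5)` line and the `--tpu-vmem-headroom-ratio` mechanism come from this PR.

```python
# Hypothetical sketch of the VMEM estimator change. Names and the scratch
# formula are illustrative; only the removed 1.5x line and the headroom
# mechanism are taken from the PR.
from dataclasses import dataclass

@dataclass
class BlockConfig:
    bt: int  # token block size
    bf: int  # feed-forward (intermediate) block size
    bd: int  # model-dim block size

def estimate_vmem_bytes(cfg: BlockConfig, dtype_bytes: int = 2) -> int:
    """Raw scratch estimate for the three tiles held in VMEM."""
    total_bytes = dtype_bytes * (
        cfg.bt * cfg.bf + cfg.bf * cfg.bd + cfg.bt * cfg.bd
    )
    # Previously: total_bytes = int(total_bytes * 1.5)  # MSA overhead, removed
    return total_bytes

def fits(cfg: BlockConfig, vmem_limit: int, headroom_ratio: float = 0.90) -> bool:
    # Callers now tighten the raw estimate via --tpu-vmem-headroom-ratio
    # (and --tpu-vmem-estimate-scale) instead of a baked-in 1.5x factor.
    return estimate_vmem_bytes(cfg) <= int(vmem_limit * headroom_ratio)
```

With the 1.5x factor gone, a config such as `bt=32` at 1024 tokens that previously overshot the padded estimate can now pass the headroom check and be considered by the tuner.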

Benchmark Results (ISL=16K/OSL=1K, V7X EP32)

BSZ   Output tok/s   vs 05.06 baseline   TPOT (ms)   vs 05.06
64    869.62         +8.9%               49.32       -4.0%
128   928.71         +13.0%              70.17       -5.1%
256   928.44         +9.6%               70.21       -6.2%
512   974.34         +11.4%              71.83       -7.3%

Pure prefill (ISL=16K/OSL=1): +17.8% to +18.5% input throughput across all batch sizes.

Test plan

  • Re-tuned on V7X EP32 (4 pods × 8 chips) with --tpu-vmem-headroom-ratio 0.90
  • No OOM observed during tuning or serving
  • End-to-end serving benchmark (8 groups) shows consistent improvement over 05.06 baseline
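For reviewers, the shape of the re-tuned table can be sketched as below. The dict layout and key names are assumptions for illustration; the `bt`/`bf`/`bd` values at 1024t and 2048t, and `bt=128` at 4096t and above, are the changes reported in this PR, while the remaining values are placeholders, not the actual tuned configs.

```python
# Illustrative shape of a re-tuned FusedMoE config table for MiMo V2 Pro
# (384 experts, H=6144, intermediate=2048, EP32). Layout and key names are
# assumptions; only the bt/bf/bd changes called out in the PR are sourced.
MIMO_V2_PRO_EP32 = {
    # tokens: block config
    1024:  dict(bt=32,  bf=1024, bd=2048),  # bt 16 -> 32
    2048:  dict(bt=32,  bf=2048, bd=1024),  # bf 1024 -> 2048, bd 2048 -> 1024
    4096:  dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
    8192:  dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
    16384: dict(bt=128, bf=1024, bd=2048),  # bt 64 -> 128
}
```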

🤖 Generated with Claude Code

Prayer3th and others added 2 commits May 13, 2026 17:51
The 1.5x factor was overly conservative and excluded valid block configs
that could improve performance. The estimator now reports raw scratch
sizes; callers can still tighten with --tpu-vmem-headroom-ratio and
--tpu-vmem-estimate-scale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-tuned after removing the 1.5x MSA overhead multiplier from the VMEM
estimator. Key changes for higher token counts:
- 1024t: bt 16→32
- 2048t: bf 1024→2048, bd 2048→1024
- 4096/8192/16384t: bt 64→128

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the FusedMoE kernel performance by refining the VMEM estimation logic and updating block configurations. By removing an overly conservative memory overhead factor, the system can now utilize larger token blocks, leading to measurable improvements in both throughput and latency for MiMo V2 Pro models.

Highlights

  • VMEM Estimator Optimization: Removed the conservative 1.5x MSA overhead multiplier from the VMEM estimator in the benchmark suite to allow for more efficient memory utilization.
  • FusedMoE Configuration Tuning: Re-tuned block configurations for MiMo V2 Pro (384 experts) to leverage the freed-up memory, resulting in significant throughput improvements.
  • Performance Gains: Achieved up to 18.5% improvement in input throughput for pure prefill tasks and consistent gains across various batch sizes.

Prayer3th and others added 2 commits May 13, 2026 23:47
Re-tuned with 1.5x MSA overhead removed from VMEM estimator.

EP8: all 9 token counts (64-16384) tuned successfully.
EP32: 64-4096 tuned, 8192/16384 use 4096 config as fallback
(TPU firmware crash on large token counts in 4-pod setup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EP32 8192 tuned successfully with headroom=0.85 on fresh pods.
16384 still crashes during warmup, using 8192 config as fallback.
Key change: btc 128→64 for 8192/16384 entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
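The fallback behavior described in the commits above (8192/16384 reusing a smaller tuned config when tuning crashes) can be sketched as a nearest-smaller lookup. This is a minimal illustration under the assumption that configs are keyed by token count; `pick_config` is a hypothetical name, not the actual tuner API.

```python
# Minimal sketch of the fallback described above: if a token count was not
# tuned (e.g. 16384 on EP32), reuse the config of the largest tuned token
# count at or below it. Function name and dict layout are illustrative.
def pick_config(configs: dict[int, dict], num_tokens: int) -> dict:
    tuned = sorted(t for t in configs if t <= num_tokens)
    if not tuned:
        raise KeyError(f"no tuned config at or below {num_tokens} tokens")
    return configs[tuned[-1]]
```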