[WIP] feat: fused moe v2 kernel #1086
Key optimization over v1: scatter tokens directly from HBM to VMEM (skipping the a2a_s_x2_hbm intermediate buffer), saving one HBM round-trip and ~72MB HBM for MiMo V2 Pro EP32. Tokens double-buffered in VMEM enable scatter-FFN overlap in the pipelined expert loop.

Scope: bf16 tokens/weights, no quantization, no shared expert, no bias.

Single-device correctness verified (ep_size=1, bench-4 pod):
- small-test (d=768, E=4, k=2): rel_err=0.65%
- MiMo-V2-Pro (d=6144, E=16, k=8): rel_err=0.51%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
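The commit message reports rel_err figures without defining the metric. A minimal sketch of one common definition (relative L2 error against a reference output) is given here for orientation. The function names and the `tol` threshold are hypothetical and not taken from the PR, and the actual test may compute the error differently.

```python
import math


def rel_err(actual, expected):
    """Relative L2 error: ||actual - expected|| / ||expected||.

    Assumption: the PR's rel_err may use a different norm or reduction.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    diff_sq = sum((a - e) ** 2 for a, e in zip(actual, expected))
    ref_sq = sum(e ** 2 for e in expected)
    if ref_sq == 0.0:
        raise ValueError("reference output is all zeros")
    return math.sqrt(diff_sq / ref_sq)


def check(actual, expected, tol=0.02):
    """Return (error, passed). `tol` is a hypothetical threshold."""
    err = rel_err(actual, expected)
    return err, err <= tol
```

With bf16 tokens and weights, errors of a fraction of a percent against an f32 reference are plausible, which is why a sketch like this would use a loose tolerance rather than exact equality.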
Reverts the fori_loop, bf16 accumulation, and x hoisting changes, restoring the ep8-tested version: a Python for-loop with prologue/steady-state/epilogue, f32 b_y_acc_vmem, b_y_out_vmem staging, compute_tile outside expert_ffn_v2, and vmem_limit_bytes=64MB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
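The prologue/steady-state/epilogue structure with two VMEM token buffers can be illustrated with a plain-Python schedule. This is only a sketch of the ordering idea, not the kernel: `pipeline_schedule` and its output format are hypothetical, and the real loop issues DMAs and semaphore waits that are not modeled here.

```python
def pipeline_schedule(num_experts):
    """List the work issued at each pipeline step.

    Each step is a dict with optional "scatter" and "ffn" entries,
    each an (expert_index, buffer_slot) tuple. Two buffer slots are
    assumed, matching the double-buffering described in the PR.
    """
    if num_experts < 1:
        raise ValueError("num_experts must be >= 1")
    steps = []
    # Prologue: scatter tokens for expert 0 into slot 0; nothing to compute yet.
    steps.append({"scatter": (0, 0), "ffn": None})
    # Steady state: run the FFN for expert e from its slot while the
    # scatter for expert e + 1 fills the other slot.
    for e in range(num_experts - 1):
        steps.append({"scatter": (e + 1, (e + 1) % 2), "ffn": (e, e % 2)})
    # Epilogue: FFN for the last expert; no further scatter to issue.
    last = num_experts - 1
    steps.append({"scatter": None, "ffn": (last, last % 2)})
    return steps


for step in pipeline_schedule(4):
    print(step)
```

In every steady-state step the scatter and the FFN touch different slots, which is what allows them to overlap.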
Summary of Changes (Gemini Code Assist)

This pull request introduces a high-performance Fused MoE V2 kernel using JAX and Pallas. The implementation focuses on reducing memory overhead and improving throughput by scattering tokens directly from HBM to VMEM and utilizing double-buffering to overlap computation and data movement.
- Simplify wait_a2a_scatter_recv/send: use single semaphore wait
instead of fori_loop (fixes deadlock on ep>=8)
- Remove redundant b_y_out_vmem staging buffer, DMA directly from
b_y_acc_vmem
- Add test_multi.py with CLI config selection:
    python test_multi.py small         # ep=8 quick test
    python test_multi.py mimo-v2-pro   # ep=32 full config
Tested: ep=8 small PASS (rel_err=0.005494)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
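The CLI config selection described for test_multi.py could look roughly like the following. This is a sketch under assumptions: only the config names and ep sizes come from the commit message, while `CONFIGS`, `select_config`, and the dictionary layout are hypothetical and may not match the actual file.

```python
import argparse

# Config names and ep sizes are from the commit message; the structure
# of this table is an assumption made for illustration.
CONFIGS = {
    "small": {"ep_size": 8},         # quick test
    "mimo-v2-pro": {"ep_size": 32},  # full config
}


def select_config(argv=None):
    """Parse the positional config name and return (name, config dict)."""
    parser = argparse.ArgumentParser(description="Multi-device fused MoE v2 test")
    parser.add_argument("config", choices=sorted(CONFIGS), help="test configuration")
    args = parser.parse_args(argv)
    return args.config, CONFIGS[args.config]
```

A script entry point would call `select_config()` with no arguments so that argparse reads `sys.argv`; an unknown config name makes argparse exit with a usage error.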
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist