
perf: enable MoE GroupedGEMM for MoE models #2278

Open
seonjinn wants to merge 4 commits into main from sj/grouped-gemm

Conversation

@seonjinn
Contributor

Supports moe_grouped_gemm (grouped GEMM for MoE experts) through the MegatronConfig TypedDict and _apply_moe_config(). Enables it in every root MoE performance recipe (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K, Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child recipes inherit.
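The config surface described above can be sketched roughly as follows. This is a simplified illustration, not the repo's actual definition: the real MegatronConfig in nemo_rl carries many more keys, and only the moe_grouped_gemm field name comes from this PR.

```python
# Simplified sketch of the config surface; only moe_grouped_gemm is from this
# PR, everything else is illustrative.
from typing import TypedDict


class MegatronConfig(TypedDict, total=False):
    moe_grouped_gemm: bool  # use grouped GEMM kernels for MoE expert layers


# A root recipe would set the flag explicitly; child recipes inherit it.
recipe_cfg: MegatronConfig = {"moe_grouped_gemm": True}
print(recipe_cfg["moe_grouped_gemm"])  # True
```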

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
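A hedged sketch of how a recipe could enable the feature. The key name comes from this PR; the surrounding YAML nesting is an assumption about the recipe layout, not copied from the repo:

```yaml
# Hypothetical recipe override; only the moe_grouped_gemm key is from this PR,
# the policy/megatron_cfg nesting is assumed.
policy:
  megatron_cfg:
    moe_grouped_gemm: true
```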

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Wires moe_grouped_gemm (CUTLASS grouped GEMM for MoE experts) through
the MegatronConfig TypedDict and _apply_moe_config(). Enables it in
every root MoE performance recipe (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K,
Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child
recipes inherit.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn requested review from a team as code owners April 17, 2026 04:19
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seonjinn seonjinn requested review from a team as code owners April 17, 2026 05:10
@seonjinn
Contributor Author

/ok to test 1713776

@seonjinn seonjinn self-assigned this Apr 17, 2026
Collaborator

@terrykong terrykong left a comment


Summary

This PR adds moe_grouped_gemm support through the MegatronConfig TypedDict and _apply_moe_config(), enabling it in 6 root MoE performance recipes. The concept is sound and the feature is valuable, but the implementation has a critical issue with the config application pattern that could silently regress throughput for ~14 other MoE recipes.

Ship blocker

Silent performance regression: The .get("moe_grouped_gemm", False) pattern unconditionally overwrites Bridge's default of True with False for any config that omits the field. See inline comment for details and suggested fix.
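The failure mode can be demonstrated with a small sketch. The class and function names here are hypothetical stand-ins (the real code lives in _apply_moe_config() and the Megatron model config); the point is that .get(key, False) turns an omitted key into an explicit False, while a presence check preserves the upstream default:

```python
# Hypothetical sketch of the ship blocker; names are illustrative stand-ins
# for nemo_rl internals, assuming the Bridge-side default for
# moe_grouped_gemm is True as the review states.

BRIDGE_DEFAULT = True  # assumed upstream default


class MegatronModelConfig:
    def __init__(self):
        self.moe_grouped_gemm = BRIDGE_DEFAULT


def apply_moe_config_buggy(model_cfg, user_cfg: dict) -> None:
    # Problematic pattern: an omitted key silently becomes False,
    # clobbering the upstream default of True.
    model_cfg.moe_grouped_gemm = user_cfg.get("moe_grouped_gemm", False)


def apply_moe_config_fixed(model_cfg, user_cfg: dict) -> None:
    # Safer pattern: only override when the recipe sets the key explicitly.
    if "moe_grouped_gemm" in user_cfg:
        model_cfg.moe_grouped_gemm = user_cfg["moe_grouped_gemm"]


cfg = MegatronModelConfig()
apply_moe_config_buggy(cfg, {})   # recipe omits the key
print(cfg.moe_grouped_gemm)       # False -- silent regression

cfg = MegatronModelConfig()
apply_moe_config_fixed(cfg, {})   # recipe omits the key
print(cfg.moe_grouped_gemm)       # True -- default preserved
```

Recipes that set the key explicitly behave identically under both variants; only the ~14 recipes that omit it are affected.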

Suggestions

  • Performance evidence: Could you share before/after throughput data (tokens/sec or step time) on at least one representative recipe? A short convergence sanity check would also be helpful.
  • PR description: The "What does this PR do?", "Issues", and "Usage" sections still have template placeholders — consider filling them in.

Generated by Claude Code

Comment thread nemo_rl/models/megatron/setup.py Outdated
Comment thread nemo_rl/models/policy/__init__.py
seonjinn and others added 3 commits April 20, 2026 00:15
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to test 2520299
