
perf: enable MoE GroupedGEMM for MoE models #2278

Open
seonjinn wants to merge 4 commits into main from sj/grouped-gemm

Conversation

@seonjinn
Contributor

Supports moe_grouped_gemm (grouped GEMM for MoE experts) through the MegatronConfig TypedDict and _apply_moe_config(). Enables it in every root MoE performance recipe (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K, Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child recipes inherit.
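The config surface described above can be sketched roughly as follows. This is a simplified illustration, not the repo's actual definition: the real MegatronConfig in nemo_rl carries many more keys, and only the moe_grouped_gemm field name comes from this PR.

```python
# Simplified sketch of the config surface; only moe_grouped_gemm is from this
# PR, everything else is illustrative.
from typing import TypedDict


class MegatronConfig(TypedDict, total=False):
    moe_grouped_gemm: bool  # use grouped GEMM kernels for MoE expert layers


# A root recipe would set the flag explicitly; child recipes inherit it.
recipe_cfg: MegatronConfig = {"moe_grouped_gemm": True}
print(recipe_cfg["moe_grouped_gemm"])  # True
```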

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
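A hedged sketch of how a recipe could enable the feature. The key name comes from this PR; the surrounding YAML nesting is an assumption about the recipe layout, not copied from the repo:

```yaml
# Hypothetical recipe override; only the moe_grouped_gemm key is from this PR,
# the policy/megatron_cfg nesting is assumed.
policy:
  megatron_cfg:
    moe_grouped_gemm: true
```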

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Wires moe_grouped_gemm (CUTLASS grouped GEMM for MoE experts) through
the MegatronConfig TypedDict and _apply_moe_config(). Enables it in
every root MoE performance recipe (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K,
Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child
recipes inherit.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn requested review from a team as code owners April 17, 2026 04:19
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seonjinn seonjinn requested review from a team as code owners April 17, 2026 05:10
@seonjinn
Contributor Author

/ok to test 1713776

@seonjinn seonjinn self-assigned this Apr 17, 2026
Collaborator

@terrykong terrykong left a comment


Summary

This PR adds moe_grouped_gemm support through the MegatronConfig TypedDict and _apply_moe_config(), enabling it in 6 root MoE performance recipes. The concept is sound and the feature is valuable, but the implementation has a critical issue with the config application pattern that could silently regress throughput for ~14 other MoE recipes.

Ship blocker

Silent performance regression: The .get("moe_grouped_gemm", False) pattern unconditionally overwrites Bridge's default of True with False for any config that omits the field. See inline comment for details and suggested fix.
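The failure mode can be demonstrated with a small sketch. The class and function names here are hypothetical stand-ins (the real code lives in _apply_moe_config() and the Megatron model config); the point is that .get(key, False) turns an omitted key into an explicit False, while a presence check preserves the upstream default:

```python
# Hypothetical sketch of the ship blocker; names are illustrative stand-ins
# for nemo_rl internals, assuming the Bridge-side default for
# moe_grouped_gemm is True as the review states.

BRIDGE_DEFAULT = True  # assumed upstream default


class MegatronModelConfig:
    def __init__(self):
        self.moe_grouped_gemm = BRIDGE_DEFAULT


def apply_moe_config_buggy(model_cfg, user_cfg: dict) -> None:
    # Problematic pattern: an omitted key silently becomes False,
    # clobbering the upstream default of True.
    model_cfg.moe_grouped_gemm = user_cfg.get("moe_grouped_gemm", False)


def apply_moe_config_fixed(model_cfg, user_cfg: dict) -> None:
    # Safer pattern: only override when the recipe sets the key explicitly.
    if "moe_grouped_gemm" in user_cfg:
        model_cfg.moe_grouped_gemm = user_cfg["moe_grouped_gemm"]


cfg = MegatronModelConfig()
apply_moe_config_buggy(cfg, {})   # recipe omits the key
print(cfg.moe_grouped_gemm)       # False -- silent regression

cfg = MegatronModelConfig()
apply_moe_config_fixed(cfg, {})   # recipe omits the key
print(cfg.moe_grouped_gemm)       # True -- default preserved
```

Recipes that set the key explicitly behave identically under both variants; only the ~14 recipes that omit it are affected.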

Suggestions

  • Performance evidence: Could you share before/after throughput data (tokens/sec or step time) on at least one representative recipe? A short convergence sanity check would also be helpful.
  • PR description: The "What does this PR do?", "Issues", and "Usage" sections still have template placeholders — consider filling them in.

Generated by Claude Code

Comment thread nemo_rl/models/megatron/setup.py Outdated
Comment thread nemo_rl/models/policy/__init__.py
seonjinn and others added 3 commits April 20, 2026 00:15
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to test 2520299
