Add batch invariant grouped gemm kernel for inference #3783
santhnm2 wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
/claude review
Light review — code looks good overall. The kernel logic, stride handling for both trans_b modes, and the persistent-style tile scheduling are correct. The fallback stub for is_batch_invariant_mode_enabled() when the import fails is a clean pattern. Test coverage is thorough with correctness, batch invariance, determinism, and EP/TP integration tests.
One minor observation: HAVE_BATCH_INVARIANT is defined but never referenced — the dispatch relies entirely on the is_batch_invariant_mode_enabled() fallback stub. Unlike HAVE_FLASHINFER (which is used for assertions and guards elsewhere), HAVE_BATCH_INVARIANT is dead code. Consider either removing it or adding a guard (e.g., an assertion in _triton_batch_invariant_forward) for consistency with the HAVE_FLASHINFER pattern.
LGTM otherwise.
BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 64
bs_cpu = batch_sizes.cpu()
This will not work with CUDA graphs. Should we disable them in the transformer config?
Yeah, will add an assertion that it doesn't work with CUDA graphs.
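The incompatibility comes from `batch_sizes.cpu()`, which forces a device-to-host sync that cannot be captured inside a CUDA graph. A minimal sketch of the assertion the author says they will add might look like the following; the config field names are hypothetical, not the actual Megatron-Core attributes:

```python
def check_batch_invariant_compatible(config) -> None:
    """Reject configs that combine CUDA graphs with the batch-invariant kernel.

    Hypothetical guard: `batch_sizes.cpu()` reads group sizes on the host each
    call, which cannot be captured inside a CUDA graph.
    """
    if getattr(config, "enable_cuda_graph", False) and getattr(
        config, "batch_invariant_mode", False
    ):
        raise AssertionError(
            "Batch-invariant grouped GEMM is incompatible with CUDA graphs: "
            "it synchronizes batch_sizes to the host on every call."
        )
```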
What does this PR do ?
Adds a batch-invariant grouped GEMM kernel for BF16 inference, implemented in Triton.
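For readers unfamiliar with the operation, the semantics of a grouped GEMM can be sketched in pure Python as a reference (this is an illustration of what the Triton kernel computes, not the kernel itself): tokens are partitioned into groups by `batch_sizes`, and each group is multiplied by its own weight matrix.

```python
def grouped_gemm_reference(a, b, batch_sizes):
    """Reference grouped GEMM.

    a: list of token rows, each a list of K floats.
    b: list of per-group K x N weight matrices.
    batch_sizes: number of consecutive rows of `a` belonging to each group.
    """
    out, start = [], 0
    for size, weight in zip(batch_sizes, b):
        for row in a[start:start + size]:
            # One output row: out[n] = sum_k row[k] * weight[k][n].
            out.append(
                [sum(r * w for r, w in zip(row, col)) for col in zip(*weight)]
            )
        start += size
    return out
```

Batch invariance means each token's output depends only on its own group's weights and the reduction order, never on how many other tokens share the batch.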
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate the merge of your PR into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.