fix: add support check for gemm config for cutlass moe #2495

Merged
yzh119 merged 5 commits into flashinfer-ai:main from nv-yunzheq:fix_moe_dispatch
Feb 5, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Feb 4, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes
    • Added runtime checks to block unsupported kernel configurations and provide clearer, actionable error messages for incompatible settings.
    • Enforced output-size alignment and disallowed certain fusion options when a memory-limited execution mode is selected, preventing silent failures and aiding profiling/tuning.

@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the CUTLASS Mixture of Experts (MoE) GEMM kernels by integrating critical validation checks. The primary goal is to prevent misconfigurations, particularly when utilizing the NO_SMEM epilogue schedule, by enforcing output alignment requirements and ensuring compatibility with various epilogue fusion types. This proactive validation helps the autotuner identify and avoid invalid configurations during the profiling phase, ultimately leading to more stable and optimized GEMM operations.

Highlights

  • GEMM Configuration Validation: Introduced new validation checks for GEMM configurations within the moeGemmBiasAct and moeGemm functions to ensure compatibility with the NO_SMEM epilogue schedule.
  • Output Alignment Enforcement: Added a check to verify that the output dimension N is 256-bit aligned when using the NO_SMEM epilogue schedule, which is crucial for performance and correctness.
  • Epilogue Fusion Compatibility: Implemented a check to prevent the use of FINALIZE epilogue fusion type with the NO_SMEM epilogue schedule, as this combination is not supported.
  • Autotuner Support: These checks are designed to help the autotuner catch invalid tactics during profiling, improving the robustness and efficiency of the MoE GEMM operations.
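The 256-bit alignment rule in the highlights above translates into a different element count for each output width. The following is a minimal sketch of that arithmetic, not the code from the PR: `sizeofBits`, `minAlignmentElems`, and `isNoSmemAligned` are hypothetical names, and `sizeofBits` is a simplified stand-in for `cutlass::sizeof_bits<T>::value` that only handles whole-byte types.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical stand-in for cutlass::sizeof_bits<T>::value.
// Assumption: T occupies a whole number of bytes (no sub-byte types).
template <typename T>
constexpr int sizeofBits() {
  return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Number of output elements that fill one 256-bit unit.
template <typename OutputType>
constexpr int minAlignmentElems() {
  return 256 / sizeofBits<OutputType>();
}

// True when N is a whole multiple of that element count.
template <typename OutputType>
constexpr bool isNoSmemAligned(int64_t n) {
  return n % minAlignmentElems<OutputType>() == 0;
}

// uint16_t stands in for a 16-bit output type such as half or bfloat16.
static_assert(minAlignmentElems<uint8_t>() == 32, "8-bit outputs need N % 32 == 0");
static_assert(minAlignmentElems<uint16_t>() == 16, "16-bit outputs need N % 16 == 0");
static_assert(minAlignmentElems<uint32_t>() == 8, "32-bit outputs need N % 8 == 0");
```

Under these assumptions, an N of 4104 would pass for a 32-bit output but fail for a 16-bit one, which is the kind of shape the new check is meant to reject.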

Changelog
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
    • Added a TLLM_CHECK_WITH_INFO statement in moeGemmBiasAct to ensure output N is 256-bit aligned when NO_SMEM epilogue schedule is used without gated activation.
    • Added TLLM_CHECK_WITH_INFO statements in moeGemm to verify that NO_SMEM epilogue schedule is not used with FINALIZE fusion type and that output N is 256-bit aligned.
Activity
  • No activity has been recorded for this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

Added runtime validation in MOE GEMM dispatch to enforce NO_SMEM epilogue constraints: require output N alignment based on OutputType bit-width and disallow FINALIZE epilogue fusion when NO_SMEM is selected; checks run before dispatch and do not change public APIs.

Changes

Cohort: MOE GEMM Validation
File(s): csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
Summary: Added runtime checks in MoeGemmRunner::dispatchToArch (and callers) to: (1) if epilogue_schedule == NO_SMEM, require N % (256 / bits(OutputType)) == 0; (2) if epilogue_schedule == NO_SMEM, disallow epilogue_fusion_type == FINALIZE. Errors are emitted prior to dispatch; no public API/signature changes.
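The two conditions in this summary can be sketched as a single pre-dispatch validation step. This is an illustrative approximation only: the enums and `validateNoSmemConfig` are hypothetical simplifications of the types named in the PR, `std::invalid_argument` stands in for whatever `TLLM_CHECK_WITH_INFO` raises, and the output bit-width is passed as a plain integer rather than derived from `OutputType`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical, simplified mirrors of the enums named in the PR.
enum class EpilogueScheduleType { AUTO, NO_SMEM };
enum class EpilogueFusionType { NONE, FINALIZE };

// Rejects configurations that the NO_SMEM epilogue cannot run.
// Precondition (assumed): output_bits is in (0, 256], e.g. 8, 16, or 32.
inline void validateNoSmemConfig(EpilogueScheduleType schedule, EpilogueFusionType fusion,
                                 int64_t n, int output_bits) {
  assert(output_bits > 0 && output_bits <= 256);
  if (schedule != EpilogueScheduleType::NO_SMEM) {
    return;  // The constraints only apply to the NO_SMEM schedule.
  }
  // (2) FINALIZE fusion is not supported together with NO_SMEM.
  if (fusion == EpilogueFusionType::FINALIZE) {
    throw std::invalid_argument("NO_SMEM epilogue schedule is not supported with FINALIZE fusion");
  }
  // (1) N must be a multiple of the number of elements in 256 bits.
  int64_t const min_alignment = 256 / output_bits;
  if (n % min_alignment != 0) {
    throw std::invalid_argument("Output N " + std::to_string(n) +
                                " does not meet minimum alignment " +
                                std::to_string(min_alignment) + " for NO_SMEM epilogue");
  }
}

// Convenience wrapper that reports support as a boolean instead of throwing.
inline bool isSupported(EpilogueScheduleType schedule, EpilogueFusionType fusion, int64_t n,
                        int output_bits) {
  try {
    validateNoSmemConfig(schedule, fusion, n, output_bits);
  } catch (std::invalid_argument const&) {
    return false;
  }
  return true;
}
```

Placing the check before dispatch means an unsupported configuration fails with a message instead of reaching the kernel launch.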

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • djmmoss
  • yzh119
  • wenscarl

Poem

🐇 I hopped through code with careful paws,
Counting bits and checking clause by clause.
NO_SMEM needs N aligned just right,
FINALIZE forbidden when shared memory's not in sight.
A tiny hop, a safer run — kernels gleam in morning light.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Description check: ⚠️ Warning
    • Explanation: The description only contains the template placeholders without actual implementation details, rationale, or related issue references.
    • Resolution: Fill in the Description section explaining what the PR does and why; add any related issue links; confirm checklist items were completed.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed
    • Explanation: The title clearly describes the main change: adding validation checks for GEMM configuration in the cutlass MOE kernel dispatch logic.
  • Docstring Coverage: ✅ Passed
    • Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds necessary support checks for GEMM configurations in CUTLASS MoE kernels, specifically for NO_SMEM epilogue schedules. The checks ensure output alignment and prevent unsupported fusions, which is important for correctness and for the autotuner. While the changes are good, there's an opportunity to improve code clarity and reduce duplication. The alignment calculation is repeated, and I've suggested extracting it into a constant. Furthermore, since the alignment check logic is now duplicated in two functions, consider refactoring it into a shared helper function for better maintainability.
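To illustrate why such checks matter for an autotuner, here is a rough sketch of a profiling loop that skips tactics rejected by a support check. It assumes a failed check surfaces as an exception; `pickBestTactic`, `ProfileFn`, and the timing values are hypothetical and are not FlashInfer's actual autotuner interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical profiling callback: runs one tactic and returns its time in ms,
// or throws std::runtime_error when a support check rejects the tactic.
using ProfileFn = std::function<double(std::size_t tactic_index)>;

// Returns the index of the fastest tactic that did not throw, or -1 if every
// tactic was rejected.
inline int pickBestTactic(std::size_t num_tactics, ProfileFn const& profile) {
  int best = -1;
  double best_ms = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < num_tactics; ++i) {
    try {
      double const ms = profile(i);
      if (ms < best_ms) {
        best_ms = ms;
        best = static_cast<int>(i);
      }
    } catch (std::runtime_error const&) {
      // The tactic was rejected up front; skip it and keep profiling.
    }
  }
  return best;
}

// Small demonstration with made-up timings; a negative entry marks a tactic
// that the (simulated) support check rejects.
inline int demoPickBestTactic() {
  std::vector<double> const timings = {-1.0, 2.5, 1.5};
  return pickBestTactic(timings.size(), [&timings](std::size_t i) -> double {
    if (timings[i] < 0.0) {
      throw std::runtime_error("unsupported tactic");
    }
    return timings[i];
  });
}
```

Without an explicit check, an unsupported tactic could fail silently or be timed as if it were valid; rejecting it loudly lets the loop fall through to the next candidate.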

Comment on lines +986 to +992
  if (inputs.gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::NO_SMEM &&
      !isGatedActivation(inputs.activation_type)) {
    TLLM_CHECK_WITH_INFO(
        inputs.n % (256 / cutlass::sizeof_bits<OutputType>::value) == 0,
        "Output N %ld does not meet minimum alignment requirements for NO_SMEM epilogue %d",
        (long)inputs.n, (int)(256 / cutlass::sizeof_bits<OutputType>::value));
  }

medium

The calculation for the minimum alignment is performed twice inside this check. To improve readability and avoid this repetition, you can store the result in a const auto variable. This makes the code cleaner and easier to understand.

  if (inputs.gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::NO_SMEM &&
      !isGatedActivation(inputs.activation_type)) {
    const auto min_alignment = 256 / cutlass::sizeof_bits<OutputType>::value;
    TLLM_CHECK_WITH_INFO(
        inputs.n % min_alignment == 0,
        "Output N %ld does not meet minimum alignment requirements for NO_SMEM epilogue %d",
        (long)inputs.n, (int)min_alignment);
  }

Comment on lines +1028 to +1036
  if (inputs.gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::NO_SMEM) {
    TLLM_CHECK_WITH_INFO(inputs.gemm_config.epilogue_fusion_type !=
                             cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE,
                         "NO_SMEM epilogue schedule is not supported with FINALIZE fusion");
    TLLM_CHECK_WITH_INFO(
        inputs.n % (256 / cutlass::sizeof_bits<OutputType>::value) == 0,
        "Output N %ld does not meet minimum alignment requirements for NO_SMEM epilogue %d",
        (long)inputs.n, (int)(256 / cutlass::sizeof_bits<OutputType>::value));
  }

medium

Similar to the check in moeGemmBiasAct, the alignment calculation is repeated here. Extracting it into a const auto variable will improve readability. Since this alignment check logic is now present in two places, you might also consider creating a private helper function to encapsulate this check and avoid code duplication.

  if (inputs.gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::NO_SMEM) {
    TLLM_CHECK_WITH_INFO(inputs.gemm_config.epilogue_fusion_type !=
                             cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE,
                         "NO_SMEM epilogue schedule is not supported with FINALIZE fusion");
    const auto min_alignment = 256 / cutlass::sizeof_bits<OutputType>::value;
    TLLM_CHECK_WITH_INFO(
        inputs.n % min_alignment == 0,
        "Output N %ld does not meet minimum alignment requirements for NO_SMEM epilogue %d",
        (long)inputs.n, (int)min_alignment);
  }

@nv-yunzheq nv-yunzheq changed the title add support check for gemm config for cutlass moe fix: add support check for gemm config for cutlass moe Feb 4, 2026

  // For NoSmem epilogue schedule, output N must be 256-bit aligned.
  // For gated activation, this is automatic if the usual alignment requirement is met.
  // This check is here so the autotuner can catch invalid tactics during profiling.
  if (inputs.gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::NO_SMEM &&

Contributor

@djns99 djns99 Feb 4, 2026

If we put this in runGemm/dispatchToArch we don't need to have two copies of this check.

Maybe here since this is only relevant for SM90+

Collaborator Author

@nv-yunzheq nv-yunzheq Feb 5, 2026

dispatchToArch doesn't work, because inside that function we cannot know whether the activation is gated.
runGemm works, but to align with the logic in the MoE runner code, I think it's better to handle the gemm1 and gemm2 logic separately; that is clearer and matches the original logic.

Contributor

@djns99 djns99 Feb 5, 2026

We don't need to check isGatedActivation here.

  size_t const fc1_out_size =
      ((!use_ampere_activation_fusion) && is_gated_activation) ? inter_size * 2 : inter_size;

This line sets the value of N correctly. The original check only works with inter_size, so it needs to explicitly check that we are in the non-gated case.
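One way to read this point: because N already reflects gating, an alignment check on N itself needs no separate gated/non-gated branch. The sketch below wraps the quoted expression in a hypothetical helper, `fc1OutSize`, and pairs it with a hypothetical `isAligned`; the sizes used are made-up illustrative values.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper around the expression quoted above: a gated activation
// doubles the FC1 output width unless the Ampere activation fusion is used.
inline std::size_t fc1OutSize(std::size_t inter_size, bool is_gated_activation,
                              bool use_ampere_activation_fusion) {
  return ((!use_ampere_activation_fusion) && is_gated_activation) ? inter_size * 2 : inter_size;
}

// Alignment test applied directly to N, with no reference to gating.
inline bool isAligned(std::size_t n, std::size_t min_alignment) {
  return n % min_alignment == 0;
}
```

With an illustrative inter_size of 8 and a minimum alignment of 16 elements (a 16-bit output), the gated case yields N = 16 and passes, while the non-gated case yields N = 8 and fails, so the check on N alone distinguishes the two.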

Collaborator Author

Thanks. Updated to dispatchToArch

@aleozlx
Collaborator

aleozlx commented Feb 4, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !298 has been created, and the CI pipeline #43308250 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #43308250: canceled

@nv-yunzheq
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !298 has been updated with latest changes, and the CI pipeline #43311572 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #43311572: canceled

@nv-yunzheq
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !298 has been updated with latest changes, and the CI pipeline #43312738 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #43312738: 10/20 passed

Collaborator

@aleozlx aleozlx left a comment

tests clean

approved again

@yzh119 yzh119 merged commit 8655234 into flashinfer-ai:main Feb 5, 2026
31 of 36 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Feb 24, 2026
5 tasks

6 participants