
Add Relu2 activation support in CUTLASS MoE backend and fix autotuner async CUDA error handling#2897

Closed
askliar wants to merge 11 commits into flashinfer-ai:main from askliar:feature/add_relu2_for_default_backend

Conversation

@askliar
Contributor

@askliar askliar commented Mar 26, 2026

Summary

  • Add Relu2 (squared ReLU) activation support for the CUTLASS MoE GEMM path, enabling models that use Relu2 as their MoE gate activation (e.g. Nemotron-H MTP draft model) to run through the FlashInfer CUTLASS backend instead of throwing InvalidType at runtime.
  • SM121 (GB10) tile filtering for CUTLASS MoE GEMM — GB10 has ~99 KB SMEM per block (vs ~228 KB on GB200/SM120). Three of the four FP4 grouped GEMM tiles and several Ampere-style bf16 tiles exceed this budget and fail at runtime. This PR filters them out at the runtime tactic selection level so the autotuner never tries known-bad configurations.
  • Fix autotuner async CUDA error propagation — when a tactic probe launches a kernel that fails asynchronously (e.g. cudaErrorIllegalInstruction), the sticky error was not cleared and would surface later during CUDA graph capture or inference. Added torch.cuda.synchronize() drain and demoted the "Skipping tactic" log from WARNING to DEBUG since these failures are expected and recoverable.
  • Separate JIT cache key for SM121 — SM12.1 now normalizes to arch key 121f instead of sharing 120f with SM12.0, ensuring the JIT cache is isolated per architecture.

Depends on

This PR should be merged after #2913 (enable GDC for CUTLASS fused MoE PDL on SM12x), which adds the -DCUTLASS_ENABLE_GDC_FOR_SM100=1 compile flag that this PR's SM120 module generator relies on.

Changes

Relu2 activation (epilogue_helpers.h, moe_gemm_template_dispatch.h)

Two pieces were missing:

  1. epilogue_helpers.h had no EpilogueOpDefaultRelu2 tag struct or Epilogue partial specialization, so there was no CUTLASS epilogue type for Relu2.
  2. moeGemmBiasAct() had no case ActivationType::Relu2, causing it to fall through to InvalidType and throw.
The Relu2 functor itself (relu(x)²) already existed in fused_activations.h; this PR just wires it into the epilogue dispatch.
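For reference, Relu2 is just squared ReLU. A minimal PyTorch sketch of the semantics, for illustration only (the fused CUDA functor this PR wires up lives in fused_activations.h):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: relu(x)**2, the element-wise activation the new epilogue applies.
    return torch.square(torch.relu(x))

# relu2(torch.tensor([-1.0, 2.0])) == tensor([0., 4.])
```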

SM121 tile filtering (cutlass_heuristic.cpp, compilation_context.py)

GPU Arch SMEM/block
GB200 SM120 ~228 KB
GB10 SM121 ~99 KB
FP4 grouped GEMM (TMA Warp-Specialized path):
  • Added get_candidate_configs_sm121() that returns only CtaShape128x128x64B (~73 KB, fits in 99 KB). The other three SM120 tiles (128x128x128B, 256x128x64B, 128x256x64B) all exceed 99 KB.
  • Changed dispatch from sm >= 120 to sm == 120 plus a separate sm == 121 branch.

Ampere-style bf16/FP8 grouped GEMM (SM80 path):
  • Added a filter_sm121 lambda in get_candidate_tiles() that removes tiles where both M >= 128 and N >= 128, since these also exceed SM121's SMEM budget (illustrated in the sketch after this list).

JIT cache separation:
  • compilation_context.py: SM12.1 now returns (12, "1f") instead of (12, "0f"), giving GB10 the cache path ~/.cache/flashinfer/<version>/121f/, separate from GB200's 120f/.
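For illustration, a Python sketch of the two rules above. The actual tile filter lives in C++ (cutlass_heuristic.cpp) and the arch-key logic in compilation_context.py also checks the CUDA toolchain version, so the function names and simplified signatures below are hypothetical:

```python
def filter_sm121_tiles(tiles, sm: int):
    # Ampere-style (SM80 path) filter: on SM121, drop tile shapes whose CTA M and N
    # are both >= 128, since those exceed GB10's per-block shared-memory budget.
    if sm != 121:
        return tiles
    return [(m, n, k) for (m, n, k) in tiles if not (m >= 128 and n >= 128)]

def normalize_sm12x_arch_key(major: int, minor: int) -> tuple:
    # JIT cache key: SM12.1 (GB10) gets its own "1f" suffix instead of sharing "0f" with SM12.0.
    if major == 12 and minor == 1:
        return (12, "1f")
    return (12, "0f")

# Only shapes with M < 128 or N < 128 survive on SM121:
print(filter_sm121_tiles([(128, 128, 64), (256, 128, 64), (64, 128, 64)], sm=121))  # [(64, 128, 64)]
print(normalize_sm12x_arch_key(12, 1))  # (12, '1f') -> cache dir .../121f/
```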

Autotuner robustness (autotuner.py)

  • Demoted "Skipping tactic" log from WARNING to DEBUG — tile-incompatibility failures are expected on SM121 and should not spam logs.
  • Added torch.cuda.synchronize() drain after failed tactic probes to clear sticky async CUDA errors (e.g. cudaErrorIllegalInstruction from failed TMA WS GEMM probes) before they surface during CUDA graph capture.
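A minimal sketch of the drain pattern, assuming a profiling loop similar to the one in autotuner.py (variable and function names here are illustrative, not the exact code):

```python
import contextlib
import logging

import torch

logger = logging.getLogger("flashinfer.autotuner")

def probe_tactic(run_probe, tactic) -> bool:
    # Probe one candidate tactic; return False (and keep tuning) if it fails.
    try:
        run_probe(tactic)
        return True
    except Exception as e:
        # Tile-incompatibility failures are expected on SM121, so log quietly at DEBUG.
        logger.debug("[Autotuner]: Skipping tactic %s after profiling failure: %s", tactic, e)
        # Synchronize so a pending async CUDA error (e.g. cudaErrorIllegalInstruction)
        # is consumed here rather than surfacing during CUDA graph capture; the sync
        # itself may raise for the same reason, so suppress that as well.
        with contextlib.suppress(Exception):
            torch.cuda.synchronize()
        return False
```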

Test plan

  • Existing MoE tests pass (pytest tests/moe/)
  • Relu2-specific tests: pytest tests/moe/test_trtllm_gen_fused_moe.py -k Relu2
  • Manual: run Nemotron-H MTP on GB10 (SM121) with FlashInfer CUTLASS backend:
    • No "WARNING [Autotuner]: Skipping tactic" log spam
    • No sticky CUDA errors during graph capture
    • Only CtaShape128x128x64B FP4 grouped GEMM tile selected
    • JIT cache at ~/.cache/flashinfer/0.6.7/121f/

…patch

- Introduced `EpilogueOpDefaultRelu2` struct in `epilogue_helpers.h` for Relu2 activation.
- Updated `moe_gemm_template_dispatch.h` to handle `ActivationType::Relu2`, enabling the use of the new Relu2 activation in GEMM operations.
- Enhanced the autotuner to clear pending CUDA errors during profiling, improving robustness in error handling.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the FlashInfer CUTLASS backend by introducing compatibility for models that employ Relu2 as their Mixture-of-Experts (MoE) gate activation, thereby expanding the range of supported neural network architectures. Concurrently, it improves the robustness of the autotuner by ensuring that transient CUDA errors are properly handled and cleared, preventing them from destabilizing later operations.

Highlights

  • Relu2 Activation Support: Added support for Relu2 (squared ReLU) activation within the CUTLASS MoE GEMM path, enabling models like Nemotron-H MTP to utilize this activation type without runtime errors.
  • Autotuner CUDA Error Handling: Fixed an issue in the autotuner where asynchronous CUDA errors from failed tactic probes were not cleared, preventing them from causing subsequent crashes during CUDA graph capture or inference.


@coderabbitai
Contributor

coderabbitai Bot commented Mar 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Adds Relu2 epilogue support to CUTLASS and dispatch in MOE GEMM, tweaks autotuner tactic-failure handling to attempt CUDA synchronize, enables CUTLASS GDC compile flags for additional SM targets, and introduces SM121-specific kernel generation and heuristic selection with SM12.x normalization updates.

Changes

Cohort / File(s) Summary
CUTLASS epilogue extension
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue_helpers.h
Added EpilogueOpDefaultRelu2 tag and Epilogue<..., EpilogueOpDefaultRelu2> specialization mapping to cutlass::epilogue::thread::LinearCombinationGeneric configured with Relu2.
MOE GEMM dispatch
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
Added ActivationType::Relu2 case in MoeGemmRunner::moeGemmBiasAct to dispatch using the new epilogue tag.
Autotuner error handling
flashinfer/autotuner.py
Changed tactic-profiling failure logging from warning to debug and added an attempted torch.cuda.synchronize() (exceptions suppressed) before marking a tactic as failed.
Fused-MoE & FP8 NVCC flags
flashinfer/jit/fused_moe.py, flashinfer/jit/gemm/fp8_blockscale.py, flashinfer/fused_moe/__init__.py, flashinfer/fused_moe/core.py
Added -DCUTLASS_ENABLE_GDC_FOR_SM100=1 to NVCC/CUDA cflags where relevant; exported and dispatch support for new gen_cutlass_fused_moe_sm121_module.
SM12x/SM121 kernel generation
flashinfer/jit/gemm/cutlass/generate_kernels.py
Added is_sm121 flag to grouped-gemm generator, new generate_sm121_operations, adjusted dispatch to generate SM121-only ops when appropriate, and added inl_map entry for (GemmKind.Grouped, 121).
Heuristic candidate generation
csrc/.../cutlass_heuristic.cpp
Added get_candidate_configs_sm121(...), changed SM120 check to exact-match and introduced SM121 branch; restricted SM121 candidates (FAST_BUILD vs non-FAST_BUILD) and FP4-only constraints.
Grid-dependency / GDC enablement
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/arch/grid_dependency_control.h
Extended compile-time detection to enable CUTLASS_GDC_ENABLED for SM100-family variants when macros indicate GDC support.
CUDA arch normalization
flashinfer/compilation_context.py
Distinguished SM12.1 vs SM12.0 when CUDA toolchain >= 12.9, returning compute_121f for minor==1 and preserving prior behavior for minor==0.
Misc. grouped GEMM adjustments
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/... (related)
Small dispatch and compile-flag wiring changes to integrate new epilogue and SM121 support.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • aleozlx
  • cyx-6
  • yongwww
  • jimmyzho
  • nv-yunzheq
  • bkryu
  • kahyunnam

Poem

🐰 I hopped through kernels, stitched Relu2 bright,
Flags fluttered for SM12, a quiet compile night.
I poked a cuda-sync, then danced out of sight —
Binaries hum, and the rabbit naps tight. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the two main changes: adding Relu2 activation support to the CUTLASS MoE backend and fixing autotuner async CUDA error handling.
Description check ✅ Passed The PR description comprehensively covers all required sections from the template and provides clear, detailed information about the changes, rationale, and testing approach.



@askliar askliar changed the title feat: add Relu2 activation support in epilogue helpers and kernel dis… Add Relu2 activation support in CUTLASS MoE backend and fix autotuner async CUDA error handling Mar 26, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new Relu2 activation function by extending the CUTLASS epilogue helpers and integrating it into the Mixture-of-Experts (MoE) GEMM kernel dispatch logic. Additionally, it enhances the flashinfer autotuner by adding a mechanism to clear pending asynchronous CUDA errors after failed profiling runs, which prevents these errors from affecting subsequent CUDA graph capture. I have no feedback to provide as there were no review comments.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

1019-1040: Add a fused-activation invariant guard before dispatch.

If inputs.use_fused_moe is ever true for Relu2 (or other non-gated activations), execution can hit a non-executing path in genericMoeGemmKernelLauncher::call. A fast-fail check here would prevent silent misconfiguration.

Proposed guard
 void MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType, IsMXFPX>::moeGemmBiasAct(
     GroupedGemmInput<T, WeightType, ScaleBiasType, OutputType> inputs,
     TmaWarpSpecializedGroupedGemmInput hopper_inputs) {
+  TLLM_CHECK_WITH_INFO(
+      !inputs.use_fused_moe ||
+          inputs.activation_type == ActivationType::Swiglu ||
+          inputs.activation_type == ActivationType::Geglu,
+      "use_fused_moe is only valid for gated activations (Swiglu/Geglu)");
+
   switch (inputs.activation_type) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h`
around lines 1019 - 1040, Add a fast-fail guard before the activation dispatch
to ensure non-gated activations cannot be used with fused MOE: check
inputs.use_fused_moe and if true for non-gated activations (e.g.,
ActivationType::Relu2, ActivationType::Relu, ActivationType::Gelu,
ActivationType::Silu, ActivationType::Identity, ActivationType::Swiglu,
ActivationType::Geglu) log/throw a clear error and return before calling
runGemm; place this check in the same scope that switches on
inputs.activation_type (the dispatch that calls runGemm) so misconfiguration
never reaches genericMoeGemmKernelLauncher::call. Ensure the error message
mentions use_fused_moe and the offending ActivationType to aid debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef3b076f-ae24-496a-9133-6b2c7468e8fc

📥 Commits

Reviewing files that changed from the base of the PR and between d426b18 and 586ca5f.

📒 Files selected for processing (3)
  • csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue_helpers.h
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • flashinfer/autotuner.py

- Added compilation flag `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` to the `gen_cutlass_fused_moe_sm120`, `gen_cutlass_fused_moe_sm103`, and `gen_cutlass_fused_moe_sm100` functions to support GDC for SM100 architecture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
"-DENABLE_FP8",
"-DENABLE_FP4",
"-DUSING_OSS_CUTLASS_MOE_GEMM",
"-DCUTLASS_ENABLE_GDC_FOR_SM100=1",
Contributor


@askliar What does this flag do?
Is there a way to add it in some common flags area or something?

Contributor

@johnnynunez johnnynunez Mar 31, 2026


needed to activate PDL (#2708)

@amirkl94
Contributor

General question, why is the change in #1954 not enough for your case? It probably goes for a different code-path but I'd like to make sure we're not missing a different bug that causes this.

Comment thread flashinfer/autotuner.py
f"[Autotuner]: Failed when profiling {r} {tac}, shapes={shapes}. Error occurred: {e}"
)

# Clear any pending async CUDA errors (e.g.
Contributor


A bit of nitpicking but consider splitting this into a different PR

johnnynunez and others added 4 commits March 30, 2026 00:46
- Introduced `get_candidate_configs_sm121` function to handle GEMM configurations for the SM121 architecture, which has a reduced shared memory budget.
- Updated `generate_sm120_grouped_gemm_operations` to accommodate the specific tile size constraints for SM121.
- Enhanced `CompilationContext` to differentiate between SM120 and SM121 in the JIT cache.
- Adjusted kernel generation logic to ensure compatibility with the new architecture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp`:
- Around line 646-647: The FAST_BUILD branch in get_candidate_configs_sm121
unconditionally delegates non-GROUPED_GEMM cases to get_candidate_configs_sm120
which (in FAST_BUILD) returns CtaShape128x128x256B that can exceed SM121 SMEM
limits; modify the FAST_BUILD path in get_candidate_configs_sm121 to check the
FP4_ONLY flag before delegating — if config has FP4_ONLY then call
get_candidate_configs_sm120(config), otherwise select SM121-safe candidates
(avoid CtaShape128x128x256B) or fallback to an SM121-specific safe list; update
the logic around GROUPED_GEMM, FAST_BUILD, and FP4_ONLY in
get_candidate_configs_sm121 to mirror the non-FAST_BUILD guard behavior.

In `@flashinfer/jit/gemm/cutlass/generate_kernels.py`:
- Around line 1059-1062: The SM121-only branch is dead because
gen_cutlass_fused_moe_sm120_module() forces device_arch="120" so has_arch(121)
is never true; either remove the conditional and the generate_sm121_operations
call to eliminate dead code, or implement a parallel SM121 generator: add
gen_cutlass_fused_moe_sm121_module() mirroring
gen_cutlass_fused_moe_sm120_module() that passes device_arch="121" (and ensure
calling sites use it), then keep the if has_arch(121) and not has_arch(120)
branch to invoke generate_sm121_operations(True); update or remove references to
has_arch(121) accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 23eb8e22-b855-4109-9630-e4cca23e64b8

📥 Commits

Reviewing files that changed from the base of the PR and between 119743f and ab9b492.

📒 Files selected for processing (4)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • flashinfer/autotuner.py
  • flashinfer/compilation_context.py
  • flashinfer/jit/gemm/cutlass/generate_kernels.py

Comment on lines +646 to +647
return get_candidate_configs_sm120(config);
}
Contributor


⚠️ Potential issue | 🟡 Minor

FAST_BUILD non-GROUPED_GEMM path for SM121 may be incorrect.

In the FAST_BUILD branch (lines 645-647), when config does NOT have GROUPED_GEMM, it falls back to get_candidate_configs_sm120(config). However, get_candidate_configs_sm120 in FAST_BUILD mode (lines 589-599) returns CtaShape128x128x256B for non-grouped GEMM, which may exceed SM121's SMEM budget.

The non-FAST_BUILD path (lines 663-668) correctly delegates to SM120 for FP4_ONLY, but the FAST_BUILD path doesn't check FP4_ONLY before delegating.

Consider adding FP4_ONLY guard in FAST_BUILD path
 #ifdef FAST_BUILD
   if (config & CutlassGemmConfig::GROUPED_GEMM) {
     return {CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x64B,
                               MainloopScheduleType::AUTO, EpilogueScheduleType::AUTO,
                               ClusterShape::ClusterShape_1x1x1}};
   } else {
+    if ((config & CutlassGemmConfig::FP4_ONLY) != 0) {
+      return get_candidate_configs_sm120(config);
+    } else {
+      TLLM_THROW("Not Implemented: SM121 non-group GEMM only supports nvfp4 in FAST_BUILD.");
+    }
-    return get_candidate_configs_sm120(config);
   }
 #else

Comment thread flashinfer/jit/gemm/cutlass/generate_kernels.py Outdated
- Implemented `gen_cutlass_fused_moe_sm121_module` to generate modules for the SM121 architecture, ensuring compatibility with its shared memory constraints.
- Updated the `get_cutlass_fused_moe_module` function to handle the new SM121 backend.
- Refactored `get_candidate_configs_sm121` to streamline GEMM configuration retrieval.

This enhances the framework's capability to leverage the SM121 architecture effectively.
Andrii Skliar added 2 commits March 31, 2026 17:08
- Removed `gen_cutlass_fused_moe_sm121_module` and its references from the codebase, simplifying the architecture support.
- Updated `get_cutlass_fused_moe_module` to handle only SM120 and SM103 backends.
- Adjusted kernel generation logic to ensure compatibility with the remaining architectures.

This change streamlines the code and focuses on maintaining support for the more widely used architectures.
- Introduced a filtering mechanism in `get_candidate_tiles` to exclude tile configurations where both M and N are greater than or equal to 128 for the SM121 architecture, addressing shared memory constraints.
- Updated the return statements for various GEMM types to utilize the new filtering function, ensuring the autotuner does not consider invalid configurations.

This change enhances the efficiency of the autotuner by preventing it from evaluating known-bad configurations for SM121.
@aleozlx
Collaborator

aleozlx commented Mar 31, 2026

/bot run

@aleozlx aleozlx added the run-ci label Mar 31, 2026
@flashinfer-bot
Collaborator

GitLab MR !480 has been created, and the CI pipeline #47388152 is currently running. I'll report back once the pipeline job completes.

@eugr

eugr commented Apr 1, 2026

@askliar - according to CUDA documentation, sm120 also has 99KB of shared memory, which makes sense, because sm120/sm121 are pretty much identical. Not sure where 228KB for sm120 comes from?

Also, I haven't looked at more detail, but I assume it will build correctly for Spark if FLASHINFER_CUDA_ARCH_LIST=12.1a (not 12.1f), right?

@askliar
Contributor Author

askliar commented Apr 1, 2026

@eugr just FYI, I have split this PR into two separate ones:
(Relu2, merged) #2927
(Tactic errors, in review) #2926

I will keep updating the latter with your comments!
Now to your questions:

  1. Your comment on SM120 is correct, my understanding was wrong! Thank you and I will update FlashInfer to reflect that.
  2. 12.1a/f are both okay, yes!

Thank you for your contributions!

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47388152: 11/20 passed

@askliar askliar closed this Apr 1, 2026
@eugr

eugr commented Apr 1, 2026

Thanks!
