Add Relu2 activation support in CUTLASS MoE backend and fix autotuner async CUDA error handling#2897
Conversation
…patch

- Introduced `EpilogueOpDefaultRelu2` struct in `epilogue_helpers.h` for Relu2 activation.
- Updated `moe_gemm_template_dispatch.h` to handle `ActivationType::Relu2`, enabling the new Relu2 activation in GEMM operations.
- Enhanced the autotuner to clear pending CUDA errors during profiling, improving robustness of error handling.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Summary of Changes

This pull request enhances the FlashInfer CUTLASS backend by introducing compatibility for models that use Relu2 as their Mixture-of-Experts (MoE) gate activation, expanding the range of supported architectures. It also improves the robustness of the autotuner by ensuring that transient CUDA errors are properly handled and cleared, preventing them from destabilizing later operations.
📝 Walkthrough

Adds Relu2 epilogue support to CUTLASS and dispatch in MoE GEMM, tweaks autotuner tactic-failure handling to attempt a CUDA synchronize, enables CUTLASS GDC compile flags for additional SM targets, and introduces SM121-specific kernel generation and heuristic selection with SM12.x normalization updates.
Code Review
This pull request introduces a new Relu2 activation function by extending the CUTLASS epilogue helpers and integrating it into the Mixture-of-Experts (MoE) GEMM kernel dispatch logic. Additionally, it enhances the flashinfer autotuner by adding a mechanism to clear pending asynchronous CUDA errors after failed profiling runs, which prevents these errors from affecting subsequent CUDA graph capture. I have no feedback to provide as there were no review comments.
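For reference, Relu2 (squared ReLU) is a simple elementwise activation. A minimal Python sketch of the math — illustrative only; the actual implementation is the CUTLASS `Relu2` functor in `fused_activations.h`:

```python
def relu2(x: float) -> float:
    """Squared ReLU: relu(x)^2. Zero for non-positive inputs, x*x otherwise."""
    r = max(x, 0.0)
    return r * r
```

Unlike gated activations such as Swiglu, Relu2 operates on a single input value, which is why it slots into the default (non-gated) epilogue path.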
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
1019-1040: Add a fused-activation invariant guard before dispatch.

If `inputs.use_fused_moe` is ever true for `Relu2` (or other non-gated activations), execution can hit a non-executing path in `genericMoeGemmKernelLauncher::call`. A fast-fail check here would prevent silent misconfiguration.

Proposed guard:
```diff
 void MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType, IsMXFPX>::moeGemmBiasAct(
     GroupedGemmInput<T, WeightType, ScaleBiasType, OutputType> inputs,
     TmaWarpSpecializedGroupedGemmInput hopper_inputs) {
+  TLLM_CHECK_WITH_INFO(
+      !inputs.use_fused_moe ||
+          inputs.activation_type == ActivationType::Swiglu ||
+          inputs.activation_type == ActivationType::Geglu,
+      "use_fused_moe is only valid for gated activations (Swiglu/Geglu)");
+
   switch (inputs.activation_type) {
```
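The shape of the invariant the comment asks for can also be sketched outside C++. This hypothetical Python mirror (enum member names are borrowed from the C++ `ActivationType` for illustration; this is not FlashInfer's actual Python API) shows the fast-fail check:

```python
from enum import Enum, auto

class ActivationType(Enum):
    # Mirrors the C++ ActivationType values referenced above (illustrative).
    Relu = auto()
    Relu2 = auto()
    Gelu = auto()
    Silu = auto()
    Identity = auto()
    Swiglu = auto()
    Geglu = auto()

# Only gated activations are valid with the fused MoE path.
GATED_ACTIVATIONS = {ActivationType.Swiglu, ActivationType.Geglu}

def check_fused_moe_invariant(use_fused_moe: bool, activation: ActivationType) -> None:
    """Fast-fail before dispatch so misconfiguration never reaches the launcher."""
    if use_fused_moe and activation not in GATED_ACTIVATIONS:
        raise ValueError(
            f"use_fused_moe is only valid for gated activations (Swiglu/Geglu), "
            f"got {activation.name}"
        )
```

Raising before the activation switch means the error names both `use_fused_moe` and the offending activation, which is exactly the debugging aid the review asks for.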
📒 Files selected for processing (3)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue_helpers.h
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- flashinfer/autotuner.py
- Added compilation flag `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` to the `gen_cutlass_fused_moe_sm120`, `gen_cutlass_fused_moe_sm103`, and `gen_cutlass_fused_moe_sm100` functions to support GDC for the SM100 architecture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
```diff
   "-DENABLE_FP8",
   "-DENABLE_FP4",
   "-DUSING_OSS_CUTLASS_MOE_GEMM",
+  "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",
```
General question: why is the change in #1954 not enough for your case? It probably goes through a different code path, but I'd like to make sure we're not missing a different bug that causes this.
```diff
         f"[Autotuner]: Failed when profiling {r} {tac}, shapes={shapes}. Error occurred: {e}"
     )

+    # Clear any pending async CUDA errors (e.g.
```
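The intent of the new comment — draining a sticky asynchronous error so it cannot poison later work — can be illustrated with a toy model. This is pure Python with no CUDA; `FakeCudaContext` and its sticky-error behavior are an assumption for illustration, not torch's real semantics:

```python
class FakeCudaContext:
    """Toy stand-in for a CUDA context whose async errors only surface on sync."""

    def __init__(self):
        self.pending_error = None  # sticky error string, or None

    def launch_bad_kernel(self):
        # An async launch "succeeds" immediately; the failure is reported later.
        self.pending_error = "cudaErrorIllegalInstruction"

    def synchronize(self):
        # Like a synchronize call: raise and drain any pending error.
        if self.pending_error is not None:
            err, self.pending_error = self.pending_error, None
            raise RuntimeError(err)

def profile_tactic(ctx, run_tactic):
    """Probe one tactic; on failure, drain pending errors before returning."""
    try:
        run_tactic(ctx)
        ctx.synchronize()
        return True
    except RuntimeError:
        try:
            ctx.synchronize()  # drain anything still pending
        except RuntimeError:
            pass  # expected and recoverable; log at DEBUG, not WARNING
        return False
```

With the drain in place, a failed probe leaves no pending error behind, so a later operation (in the real autotuner, CUDA graph capture) does not trip over the stale failure.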
A bit of nitpicking but consider splitting this into a different PR
- Introduced `get_candidate_configs_sm121` function to handle GEMM configurations for the SM121 architecture, which has a reduced shared memory budget.
- Updated `generate_sm120_grouped_gemm_operations` to accommodate the specific tile size constraints for SM121.
- Enhanced `CompilationContext` to differentiate between SM120 and SM121 in the JIT cache.
- Adjusted kernel generation logic to ensure compatibility with the new architecture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

In `@flashinfer/jit/gemm/cutlass/generate_kernels.py` around lines 1059-1062: the SM121-only branch is dead because `gen_cutlass_fused_moe_sm120_module()` forces `device_arch="120"`, so `has_arch(121)` is never true. Either remove the conditional and the `generate_sm121_operations` call to eliminate the dead code, or implement a parallel SM121 generator: add a `gen_cutlass_fused_moe_sm121_module()` mirroring `gen_cutlass_fused_moe_sm120_module()` that passes `device_arch="121"` (and ensure call sites use it), then keep the `if has_arch(121) and not has_arch(120)` branch to invoke `generate_sm121_operations(True)`; update or remove references to `has_arch(121)` accordingly.
📒 Files selected for processing (4)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- flashinfer/autotuner.py
- flashinfer/compilation_context.py
- flashinfer/jit/gemm/cutlass/generate_kernels.py
```cpp
    return get_candidate_configs_sm120(config);
  }
```
FAST_BUILD non-GROUPED_GEMM path for SM121 may be incorrect.
In the FAST_BUILD branch (lines 645-647), when the config does NOT have GROUPED_GEMM, it falls back to `get_candidate_configs_sm120(config)`. However, `get_candidate_configs_sm120` in FAST_BUILD mode (lines 589-599) returns `CtaShape128x128x256B` for non-grouped GEMM, which may exceed SM121's SMEM budget.

The non-FAST_BUILD path (lines 663-668) correctly delegates to SM120 for `FP4_ONLY`, but the FAST_BUILD path doesn't check `FP4_ONLY` before delegating.
Consider adding FP4_ONLY guard in FAST_BUILD path
```diff
 #ifdef FAST_BUILD
   if (config & CutlassGemmConfig::GROUPED_GEMM) {
     return {CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x64B,
                               MainloopScheduleType::AUTO, EpilogueScheduleType::AUTO,
                               ClusterShape::ClusterShape_1x1x1}};
   } else {
+    if ((config & CutlassGemmConfig::FP4_ONLY) != 0) {
+      return get_candidate_configs_sm120(config);
+    } else {
+      TLLM_THROW("Not Implemented: SM121 non-group GEMM only supports nvfp4 in FAST_BUILD.");
+    }
-    return get_candidate_configs_sm120(config);
   }
 #else
```
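The control flow of the suggested fix can be modeled with plain flags. This Python sketch (flag names and tile strings are illustrative stand-ins, not the C++ `CutlassGemmConfig` API) mirrors the FAST_BUILD branch with the `FP4_ONLY` guard added:

```python
GROUPED_GEMM = 1 << 0
FP4_ONLY = 1 << 1

# Tile returned by the SM120 FAST_BUILD path; too large for SM121's ~99 KB SMEM.
SM120_FAST_BUILD_TILE = "CtaShape128x128x256B"
# The one SM120 tile that fits SM121's shared-memory budget.
SM121_SAFE_TILE = "CtaShape128x128x64B"

def candidate_configs_sm121_fast_build(config):
    """FAST_BUILD candidates for SM121, guarding the SM120 delegation."""
    if config & GROUPED_GEMM:
        return [SM121_SAFE_TILE]
    if config & FP4_ONLY:
        # Only the FP4 path may safely delegate to the SM120 candidate list,
        # mirroring the non-FAST_BUILD guard behavior.
        return [SM120_FAST_BUILD_TILE]
    raise NotImplementedError(
        "SM121 non-grouped GEMM only supports nvfp4 in FAST_BUILD"
    )
```

The point of the guard is that the non-FP4 case fails loudly at config-selection time instead of silently choosing a tile that exceeds the SMEM budget.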
- Implemented `gen_cutlass_fused_moe_sm121_module` to generate modules for the SM121 architecture, ensuring compatibility with its shared memory constraints.
- Updated the `get_cutlass_fused_moe_module` function to handle the new SM121 backend.
- Refactored `get_candidate_configs_sm121` to streamline GEMM configuration retrieval.

This enhances the framework's capability to leverage the SM121 architecture effectively.

- Removed `gen_cutlass_fused_moe_sm121_module` and its references from the codebase, simplifying the architecture support.
- Updated `get_cutlass_fused_moe_module` to handle only SM120 and SM103 backends.
- Adjusted kernel generation logic to ensure compatibility with the remaining architectures.

This change streamlines the code and focuses on maintaining support for the more widely used architectures.

- Introduced a filtering mechanism in `get_candidate_tiles` to exclude tile configurations where both M and N are greater than or equal to 128 for the SM121 architecture, addressing shared memory constraints.
- Updated the return statements for various GEMM types to utilize the new filtering function, ensuring the autotuner does not consider invalid configurations.

This change enhances the efficiency of the autotuner by preventing it from evaluating known-bad configurations for SM121.
/bot run
@askliar - according to CUDA documentation, sm120 also has 99KB of shared memory, which makes sense, because sm120/sm121 are pretty much identical. Not sure where 228KB for sm120 comes from? Also, I haven't looked at more detail, but I assume it will build correctly for Spark if FLASHINFER_CUDA_ARCH_LIST=12.1a (not 12.1f), right?
@eugr just FYI, I have split this PR into two separate ones: I will keep updating the latter with your comments!
Thank you for your contributions! |
[FAILED] Pipeline #47388152: 11/20 passed |
Thanks! |
Summary

- Relu2 activation: the CUTLASS MoE backend had no epilogue for Relu2, so dispatch fell through to `InvalidType` at runtime.
- Autotuner error handling: when a tactic probe failed with an async CUDA error (e.g. `cudaErrorIllegalInstruction`), the sticky error was not cleared and would surface later during CUDA graph capture or inference. Added a `torch.cuda.synchronize()` drain and demoted the "Skipping tactic" log from WARNING to DEBUG since these failures are expected and recoverable.
- JIT cache: SM12.1 now uses `121f` instead of sharing `120f` with SM12.0, ensuring the JIT cache is isolated per architecture.

Depends on
This PR should be merged after #2913 (enable GDC for CUTLASS fused MoE PDL on SM12x), which adds the `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` compile flags that this PR's SM120 module generator relies on.

Changes
Relu2 activation (`epilogue_helpers.h`, `moe_gemm_template_dispatch.h`)

Two pieces were missing:

- `epilogue_helpers.h` had no `EpilogueOpDefaultRelu2` tag struct or `Epilogue` partial specialization, so there was no CUTLASS epilogue type for Relu2.
- `moeGemmBiasAct()` had no `case ActivationType::Relu2`, causing it to fall through to `InvalidType` and throw.

The `Relu2` functor itself (relu(x)²) already existed in `fused_activations.h`; this PR just wires it into the epilogue dispatch.

SM121 tile filtering (`cutlass_heuristic.cpp`, `compilation_context.py`)

- Added `get_candidate_configs_sm121()`, which returns only `CtaShape128x128x64B` (~73 KB, fits in 99 KB). The other three SM120 tiles (128x128x128B, 256x128x64B, 128x256x64B) all exceed 99 KB.
- Changed `sm >= 120` to `sm == 120` plus a separate `sm == 121` branch.
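The SM121 candidate selection above reduces to a budget check over the SM120 tile set. A hedged Python sketch — the ~73 KB figure is from this PR's description, while the other footprints are placeholders standing in for "exceeds 99 KB":

```python
# Approximate SMEM footprints (KB). Only the first value (~73 KB) is stated in
# the PR; the 100s are placeholders meaning "exceeds the 99 KB budget".
SM120_TILE_SMEM_KB = {
    "CtaShape128x128x64B": 73,
    "CtaShape128x128x128B": 100,
    "CtaShape256x128x64B": 100,
    "CtaShape128x256x64B": 100,
}

SM121_SMEM_BUDGET_KB = 99

def sm121_candidates():
    """Keep only SM120 tiles whose shared-memory footprint fits SM121."""
    return [t for t, kb in SM120_TILE_SMEM_KB.items() if kb <= SM121_SMEM_BUDGET_KB]
```

Under these assumed footprints only `CtaShape128x128x64B` survives, matching the single-tile list the new heuristic returns.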
filter_sm121lambda inget_candidate_tiles()that removes tiles where both M >= 128 and N >= 128, since these also exceed SM121's SMEM budget.JIT cache separation:
compilation_context.py: SM12.1 now returns(12, "1f")instead of(12, "0f"), giving GB10 cache path~/.cache/flashinfer/<version>/121f/separate from GB200's120f/.Autotuner robustness (
autotuner.py)WARNINGtoDEBUG— tile-incompatibility failures are expected on SM121 and should not spam logs.torch.cuda.synchronize()drain after failed tactic probes to clear sticky async CUDA errors (e.g.cudaErrorIllegalInstructionfrom failed TMA WS GEMM probes) before they surface during CUDA graph capture.Test plan
pytest tests/moe/)pytest tests/moe/test_trtllm_gen_fused_moe.py -k Relu2WARNING [Autotuner]: Skipping tacticlog spamCtaShape128x128x64BFP4 grouped GEMM tile selected~/.cache/flashinfer/0.6.7/121f/