
Commit 04f4c0c

Authored by: askliar (Andrii Skliar), gemini-code-assist[bot], samuellees, aleozlx
fix: MXFP4/MXFP8 failures in SM120 FAST_BUILD and expand all_tiles[] (#2994)
**Problem**

MXFP4 and MXFP8 GEMM operations were failing on SM120 because:

- The FAST_BUILD path returned a single hardcoded CtaShape128x128x64B tile regardless of GROUPED_GEMM, and that tile is not valid for all MXFP4/MXFP8 configurations.
- The full-build all_tiles[] table was missing tiles needed by those dtypes (128x128x128B, 128x128x64B, 256x128x64B), leaving the autotuner with no viable candidate in some cases.

**Fix**

- FAST_BUILD: differentiate grouped vs. non-grouped paths with tiles known to work for MXFP4/MXFP8:
  - Grouped: 128x128x128B + 128x128x64B
  - Non-grouped: 128x128x256B + 128x128x64B
- Full-build all_tiles[]: add the three missing tiles so the autotuner has a complete candidate set for MXFP4/MXFP8 workloads.

## Summary by CodeRabbit

* **Performance & Optimizations**
  * More predictable kernel candidate selection and expanded tile/configuration options for SM120-class GPUs to improve tuning and performance.
  * Broadened handling of grouped computation patterns to enable additional configuration choices.
* **Build/Compatibility**
  * Refined CUDA 12.9+ architecture suffixing for more accurate build targeting.
* **Chores**
  * Added type annotations and minor signature clarifications (no runtime behavior changes).
* **Bug Fixes**
  * MoE fusion path now forwards additional tensors/parameters to improve fused operation correctness.

Co-authored-by: samuellees <lsam@nvidia.com>

---------

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Co-authored-by: Andrii Skliar <askliar@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Sam (Kesen Li) <lsam@nvidia.com>
Co-authored-by: Alex Yang <aleyang@nvidia.com>
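The "no viable candidate" failure mode is easiest to see in a small sketch. The Python below is purely illustrative (the real logic is C++ inside the heuristic and `calcMaxWorkspaceSize`; `build_and_measure` and the exception type are hypothetical stand-ins):

```python
# Illustrative sketch of autotuner candidate selection. Tiles that cannot be
# built for the requested dtype raise and are skipped; if the candidate table
# is missing every tile a dtype needs, nothing survives and the GEMM fails.
def pick_best_tile(candidate_tiles, build_and_measure):
    viable = []
    for tile in candidate_tiles:
        try:
            viable.append((build_and_measure(tile), tile))  # (latency, tile)
        except RuntimeError:
            continue  # invalid tile for this dtype/path: skipped gracefully
    if not viable:
        raise RuntimeError("no viable tile candidate for this dtype")
    return min(viable)[1]  # keep the fastest viable tile
```

In these terms, the fix expands `candidate_tiles` so that `viable` is never empty for MXFP4/MXFP8 workloads.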
1 parent 19055a6 commit 04f4c0c

2 files changed: 34 additions & 18 deletions


csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp

Lines changed: 29 additions & 16 deletions
```diff
@@ -587,28 +587,41 @@ std::vector<CutlassGemmConfig> get_candidate_configs_sm110(
 
 std::vector<CutlassGemmConfig> get_candidate_configs_sm120(
     CutlassGemmConfig::CandidateConfigTypeParam const config) {
+#ifdef FAST_BUILD
+  if (config & CutlassGemmConfig::GROUPED_GEMM) {
+    return {
+        CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x128B, MainloopScheduleType::AUTO,
+                          EpilogueScheduleType::AUTO, ClusterShape::ClusterShape_1x1x1},
+        CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x64B, MainloopScheduleType::AUTO,
+                          EpilogueScheduleType::AUTO, ClusterShape::ClusterShape_1x1x1}};
+  } else {
+    return {
+        CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x256B, MainloopScheduleType::AUTO,
+                          EpilogueScheduleType::AUTO, ClusterShape::ClusterShape_1x1x1},
+        CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x64B, MainloopScheduleType::AUTO,
+                          EpilogueScheduleType::AUTO, ClusterShape::ClusterShape_1x1x1}};
+  }
+#else
   if ((config & CutlassGemmConfig::FP4_ONLY) == 0) {
     if (config & CutlassGemmConfig::GROUPED_GEMM) {
       TLLM_THROW("Not Implemented: SM120 group GEMM only supports nvfp4.");
     }
     TLLM_THROW("Not Implemented: SM120 GEMM only supports nvfp4.");
   }
-  // Only tiles that satisfy ALL of:
-  //   1. Present in the dispatch table (SHAPE_CASE in moe_gemm_template_dispatch_tma_ws.h)
-  //   2. Pass are_tile_shapes_supported_sm120() constexpr check
-  //   3. Have compiled kernel templates (generate_sm120_grouped_gemm_operations)
-  //
-  // 128x128x128B is the only tile meeting all three criteria. Its nominal SMEM
-  // (2 stages × (128+128) × 256 bytes = 128 KB) exceeds SM120's 100 KB budget,
-  // but CUTLASS StageCountAutoCarveout reduces the stage count to 1, bringing
-  // actual SMEM to ~64 KB. can_implement() accepts it at runtime.
-  //
-  // K=64 tiles (128x128x64, 128x256x64, 256x128x64) are in the dispatch table
-  // but cannot be compiled for FP4 on SM120 (TMA layout static_assert failure),
-  // so they are intentionally excluded here.
-  return {CutlassGemmConfig{CutlassTileConfigSM120::CtaShape128x128x128B,
-                            MainloopScheduleType::AUTO, EpilogueScheduleType::AUTO,
-                            ClusterShape::ClusterShape_1x1x1}};
+  // All candidate tiles for SM120 FP4. Invalid tiles for a given path are skipped
+  // gracefully by the try-catch in calcMaxWorkspaceSize.
+  static constexpr CutlassTileConfigSM120 all_tiles[] = {
+      CutlassTileConfigSM120::CtaShape128x128x128B, CutlassTileConfigSM120::CtaShape128x128x64B,
+      CutlassTileConfigSM120::CtaShape256x128x64B,  CutlassTileConfigSM120::CtaShape128x256x64B,
+      CutlassTileConfigSM120::CtaShape128x128x256B, CutlassTileConfigSM120::CtaShape256x128x128B,
+  };
+  std::vector<CutlassGemmConfig> result;
+  for (auto tile : all_tiles) {
+    result.push_back(CutlassGemmConfig{tile, MainloopScheduleType::AUTO, EpilogueScheduleType::AUTO,
+                                       ClusterShape::ClusterShape_1x1x1});
+  }
+  return result;
+#endif
 }
 
 std::vector<CutlassGemmConfig> get_candidate_configs(
```
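The shared-memory arithmetic quoted in the removed comment can be checked directly. This snippet only reproduces the numbers stated in that comment; it is not the actual CUTLASS SMEM model:

```python
# SMEM estimate from the removed comment for the CtaShape128x128x128B tile:
# 2 stages x (128 + 128) x 256 bytes = 128 KB, above SM120's ~100 KB budget.
# StageCountAutoCarveout then reduces the stage count to 1 (~64 KB), which
# can_implement() accepts at runtime.
stages, tile_rows, bytes_per_row = 2, 128 + 128, 256
nominal_smem = stages * tile_rows * bytes_per_row
assert nominal_smem == 128 * 1024            # 128 KB nominal, over budget
assert nominal_smem // stages == 64 * 1024   # ~64 KB after carveout to 1 stage
```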

flashinfer/compilation_context.py

Lines changed: 5 additions & 2 deletions
```diff
@@ -36,7 +36,7 @@ def _normalize_cuda_arch(major: int, minor: int) -> tuple[int, str]:
     tuple with the correct architecture suffix for nvcc.
 
     SM 9.x -> 'a' suffix (e.g. compute_90a)
-    SM 12.x -> 'f' suffix with minor version preserved (e.g. compute_120f for SM120, compute_121f for SM121).
+    SM 12.x -> 'f' suffix with minor version preserved (e.g. compute_120f for SM120, compute_121a for SM121).
     Each SM 12.x variant gets its own cubin to avoid running SM120 code on SM121 (DGX Spark) which
     can cause cudaErrorIllegalInstruction. Requires CUDA >= 12.9.
     SM 10+ -> 'a' suffix (e.g. compute_100a)
@@ -48,7 +48,10 @@ def _normalize_cuda_arch(major: int, minor: int) -> tuple[int, str]:
         from flashinfer.jit.cpp_ext import is_cuda_version_at_least
 
         if is_cuda_version_at_least("12.9"):
-            return (major, str(minor) + "f")
+            if minor == 0:
+                return (major, "0f")
+            else:
+                return (major, str(minor) + "a")
         else:
             raise RuntimeError("SM 12.x requires CUDA >= 12.9")
     elif major >= 10:
```
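The suffixing behavior after this change can be sketched as a standalone function. This is a hypothetical reimplementation for illustration only: it hard-codes the CUDA >= 12.9 assumption instead of importing flashinfer, and covers only the cases the docstring excerpt describes:

```python
def normalize_cuda_arch_sketch(major: int, minor: int) -> tuple[int, str]:
    """Illustrative re-statement of the suffix rules; assumes CUDA >= 12.9."""
    if major == 12:
        # SM 12.0 keeps the 'f' suffix; other SM 12.x variants (e.g. SM121 on
        # DGX Spark) now get 'a' so each variant builds its own cubin and
        # SM120 code never runs on SM121 (avoiding cudaErrorIllegalInstruction).
        return (major, "0f") if minor == 0 else (major, str(minor) + "a")
    if major >= 9:
        # SM 9.x and SM 10+ use the 'a' suffix (compute_90a, compute_100a).
        return (major, str(minor) + "a")
    raise NotImplementedError("older SMs are outside this excerpt")

assert normalize_cuda_arch_sketch(12, 0) == (12, "0f")  # compute_120f
assert normalize_cuda_arch_sketch(12, 1) == (12, "1a")  # compute_121a
assert normalize_cuda_arch_sketch(9, 0) == (9, "0a")    # compute_90a
```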
