Commit 20b6496

kahyunnam authored and murphymatt committed
fix: arch 12.1 -> "sm120a" flag for Spark, CUDA 12.9 (#2839)
## 📌 Description

Bug found in nightly [Spark, 12.9] matrix https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/285092631, where Spark compiles to "120a" (see the "/tmp/.cache/flashinfer/0.6.6/120a/" path in the log below).

```
E RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
/tmp/.cache/flashinfer/0.6.6/120a/generated/single_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_head_dim_qk_128_head_dim_vo_128_posenc_2_use_swa_False_use_logits_cap_False/single_decode.cu:100: RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
```

The root cause was flashinfer-ai/flashinfer#2725, where we added logic to compile both Spark and Thor to 120f, but only on the condition that the CUDA version is 13 or higher. Lower versions (12.9) defaulted to the 'a' suffix, 120a.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **Bug Fixes**
  * Strengthened CUDA validation for SM 12.x GPUs: now requires CUDA 12.9 or newer and emits a clear error if unmet, replacing the previous silent fallback behavior.
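
The failure mode is easy to reproduce in isolation. Below is a minimal sketch (not the actual FlashInfer code; the real check is `is_cuda_version_at_least` from `flashinfer.jit.cpp_ext`, stubbed here as a plain `(major, minor)` tuple comparison) of the pre-fix condition from #2725:

```python
def old_suffix_for_sm12(cuda_version: tuple[int, int]) -> str:
    """Pre-fix logic: emit the 'f' suffix only on CUDA >= 13.0."""
    if cuda_version >= (13, 0):
        return "0f"
    # CUDA 12.9 fell through to here, producing the ".../120a/..."
    # cache path and kernels with no image for the Spark device.
    return "0a"

print(old_suffix_for_sm12((12, 9)))  # '0a' -> wrong suffix for Spark
print(old_suffix_for_sm12((13, 0)))  # '0f'
```

On CUDA 12.9 this returns "0a", so SM 12.1 (Spark) kernels were compiled for sm_120a and failed to launch, even though CUDA 12.9 toolchains already accept the 'f' suffix.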
1 parent 34263f1 commit 20b6496

1 file changed: flashinfer/compilation_context.py (7 additions & 15 deletions)
```diff
@@ -36,28 +36,20 @@ def _normalize_cuda_arch(major: int, minor: int) -> tuple[int, str]:
     tuple with the correct architecture suffix for nvcc.
 
     SM 9.x -> 'a' suffix (e.g. compute_90a)
-    SM 12.x -> always normalized to SM 120 with 'f' suffix (e.g. compute_120f)
-    when the installed CUDA toolchain supports it (CUDA >= 13.0),
-    otherwise 'a'. This covers both SM 12.0 and SM 12.1 (DGX Spark).
+    SM 12.x -> always normalized to SM 120 with 'f' suffix (e.g. compute_120f).
+    This covers both SM 12.0 and SM 12.1 (DGX Spark) when the installed CUDA toolchain supports it (CUDA >= 12.9).
     SM 10+ -> 'a' suffix (e.g. compute_100a)
     SM < 9 -> no suffix
     """
     if major == 9:
         return (major, str(minor) + "a")
     elif major == 12:
-        try:
-            from flashinfer.jit.cpp_ext import is_cuda_version_at_least
+        from flashinfer.jit.cpp_ext import is_cuda_version_at_least
 
-            if is_cuda_version_at_least("13.0"):
-                return (major, "0f")
-        except (ImportError, RuntimeError, ValueError):
-            logger.debug(
-                "Could not determine CUDA version; "
-                "falling back to 'a' suffix for SM %d.%d",
-                major,
-                minor,
-            )
-        return (major, "0a")
+        if is_cuda_version_at_least("12.9"):
+            return (major, "0f")
+        else:
+            raise RuntimeError("SM 12.x requires CUDA >= 12.9")
     elif major >= 10:
         return (major, str(minor) + "a")
     return (major, str(minor))
```
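For reference, the post-fix normalization can be sketched as a standalone function. This is an illustration, not the shipped code: the CUDA-version probe is passed in as an assumed `(major, minor)` tuple instead of being imported from `flashinfer.jit.cpp_ext`:

```python
def normalize_cuda_arch(major: int, minor: int,
                        cuda_version: tuple[int, int]) -> tuple[int, str]:
    """Map an SM version to (major, nvcc suffix), mirroring the patched logic."""
    if major == 9:
        # SM 9.x -> 'a' suffix (e.g. compute_90a)
        return (major, str(minor) + "a")
    elif major == 12:
        # SM 12.0 and SM 12.1 (DGX Spark) both normalize to 120f;
        # the 'f' (family) suffix requires a CUDA 12.9+ toolchain,
        # so older toolchains now fail loudly instead of silently
        # emitting sm_120a kernels that cannot launch on Spark.
        if cuda_version >= (12, 9):
            return (major, "0f")
        raise RuntimeError("SM 12.x requires CUDA >= 12.9")
    elif major >= 10:
        # SM 10+ -> 'a' suffix (e.g. compute_100a)
        return (major, str(minor) + "a")
    # SM < 9 -> no suffix
    return (major, str(minor))

print(normalize_cuda_arch(12, 1, (12, 9)))  # Spark on CUDA 12.9 -> (12, '0f')
print(normalize_cuda_arch(12, 1, (13, 0)))  # Spark on CUDA 13   -> (12, '0f')
```

Note the design change relative to the old code: rather than catching import or version-probe errors and falling back to "0a", the patched path lets a missing or too-old toolchain surface as an immediate `RuntimeError` at compile time instead of a "no kernel image" failure at launch time.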
0 commit comments