
Commit b3e3693

[SM121] Enable native block-scaled dot_scaled for DGX Spark (GB10) (#10010)
SM121 (GB10 DGX Spark) supports the same mma.sync block-scaled instructions as SM120 (RTX 5090) but was excluded from the native lowering path by exact compute-capability checks. Without this fix, dot_scaled on SM121 falls through to DecomposeScaledBlocked, which upcasts to bf16: roughly 10 TFLOPS versus roughly 270 TFLOPS with native mma.sync block-scaled FP4. Tested on GB10 with both MXFP4 (scale_vec::2X, ue8m0) and NVFP4 (scale_vec::4X, ue4m3).

# New contributor declaration

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because current test paths already cover the flow; AFAIK there are no GB10s in CI to verify, but it does work on my GB10.
- Select one of the following.
  - [x] I have not added any `lit` tests.
1 parent f7c1d69 commit b3e3693
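
For context, here is a minimal standalone sketch of the new family check. The `isSm12x` helper and the `main` driver are hypothetical and not part of the patch; they only illustrate why integer division by 10 admits SM121 alongside SM120.

```cpp
#include <cassert>

// Hypothetical helper mirroring the patched predicate: integer division by 10
// maps both 120 and 121 to 12, so the whole sm12x family passes the gate.
static bool isSm12x(int computeCapability) {
  return computeCapability / 10 == 12;
}

int main() {
  assert(isSm12x(120));   // SM120 (RTX 5090): already took the native path
  assert(isSm12x(121));   // SM121 (GB10 DGX Spark): now also takes it
  assert(!isSm12x(89));   // SM89 remains outside this gate
  assert(!isSm12x(100));  // SM100 remains outside this gate as well
  return 0;
}
```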

1 file changed: 2 additions & 2 deletions

lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
@@ -677,7 +677,7 @@ class ScaledBlockedToMMA : public mlir::OpRewritePattern<triton::DotScaledOp> {
   mlir::LogicalResult
   matchAndRewrite(triton::DotScaledOp dotOp,
                   mlir::PatternRewriter &rewriter) const override {
-    if (computeCapability != 120)
+    if (computeCapability / 10 != 12)
       return failure();
 
     auto numCTAs = lookupNumCTAs(rewriter);
@@ -924,7 +924,7 @@ static bool mmav2SupportsFp8Operands(int computeCapability) {
   // although PTX instructions for mma v2 w/ fp8 operands exist for sm90 and
   // sm100, they are emulated as fp16 upcasts + fp16 HMMA in SASS. sm120 has
   // hardware support for fp8 operands w/ mmav2.
-  return computeCapability == 89 || computeCapability == 120;
+  return computeCapability == 89 || computeCapability / 10 == 12;
 }
 
 // promote operands of dot op if the existing combination is not natively
