
Commit b69a816

PaulZhang12 authored and facebook-github-bot committed

PartitionK fp32 accumulator + interleave loads
Summary:

Background: [Adding SPLIT_K to Triton Templates](https://docs.google.com/document/d/1K1DwmVkzqoB_uDoWkPvmOa2xra5yko5JmQzy5Jxy-rg/edit?tab=t.0)

This diff makes two changes to the partitionK kernel:

1. Use an FP32 accumulator throughout.
2. Interleave loads along the split-K dimension for coalesced memory accesses.

**FP32 accumulator**: Previously, each kernel instance accumulated in fp32, but the intermediate buffer was fp16, so the reduction over the K partitions happened in fp16. This loss of precision could hurt accuracy. After discussing with sijiac and eellison, and since we want the same correctness as cuBLAS/CUTLASS, which accumulate in fp32 throughout, we made this change in the kernel even though it hurts performance.

**Interleave loads optimization**: Previously, with K = 4 and PK = 2, each kernel instance processed half of the K dimension sequentially: instance 0 handled K = 0, 1 and instance 1 handled K = 2, 3. Now instance 0 handles K = 0, 2 and instance 1 handles K = 1, 3. In the old scheme, as K scales up, concurrent instances load addresses that are far apart, reducing the chance of coalesced memory loads and cache hits. In the new scheme, those chances increase because concurrent loads are closer together in memory; a sketch of the two schemes is shown below.

**Results**: Switching to FP32 accumulation throughout takes a big performance hit, dropping from ~94% of aten performance to ~81% on average across all shapes. With interleaved loads, FP32-accumulation performance recovers to ~86% of aten. Given that traditional SPLIT_K performance in triton_ops_matmul is at ~91%, this is acceptable. DecomposeK has the highest performance at ~96% of aten. This kernel could potentially be improved further with TMA/persistent optimizations.

Reviewed By: sijiac

Differential Revision: D71437375

fbshipit-source-id: b79fe3022770c9b744e54bbbb73905695240c8fa
1 parent 6a8e573 commit b69a816
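
As a rough illustration of the interleaving, here is a minimal standalone Python sketch (not the kernel code; `num_k_blocks` and `num_partitions` are hypothetical stand-ins for K // BLOCK_SIZE_K and partitionK) of which K blocks each split-K program instance touches under the two schemes:

```python
# Minimal sketch: which K-block indices each split-K program instance loads.
# `num_k_blocks` and `num_partitions` are illustrative stand-ins for
# K // BLOCK_SIZE_K and partitionK; this is not the kernel code itself.

def sequential_blocks(pid_pk: int, num_k_blocks: int, num_partitions: int) -> list[int]:
    # Old scheme: each instance owns a contiguous chunk of the K dimension,
    # so concurrent instances read addresses that are far apart.
    chunk = num_k_blocks // num_partitions
    return list(range(pid_pk * chunk, (pid_pk + 1) * chunk))


def interleaved_blocks(pid_pk: int, num_k_blocks: int, num_partitions: int) -> list[int]:
    # New scheme: each instance strides through K, so concurrent instances
    # read neighboring blocks and are more likely to share cache lines.
    return list(range(pid_pk, num_k_blocks, num_partitions))


if __name__ == "__main__":
    # Matches the K = 4, PK = 2 example from the summary above.
    for pid in range(2):
        print("sequential ", pid, sequential_blocks(pid, 4, 2))   # [0, 1] / [2, 3]
        print("interleaved", pid, interleaved_blocks(pid, 4, 2))  # [0, 2] / [1, 3]
```

In the kernel this corresponds to computing `offs_k` from `pid_pk * BLOCK_SIZE_K` and advancing the pointers by `PK_SIZE` each iteration, as in the diff below.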

File tree

1 file changed (+7, -7 lines)

tritonbench/operators/gemm/partition_k.py

@@ -144,7 +144,7 @@ def _matmul_partition_k(
     # See above `Pointer Arithmetic` section for details
     offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
     offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
-    offs_k = (pid_pk * PK_SIZE + tl.arange(0, BLOCK_SIZE_K)) % K
+    offs_k = (pid_pk * BLOCK_SIZE_K + tl.arange(0, BLOCK_SIZE_K)) % K
     a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
     b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)

@@ -162,9 +162,8 @@ def _matmul_partition_k(
         a = tl.load(a_ptrs)
         b = tl.load(b_ptrs)
         accumulator += tl.dot(a, b)
-        a_ptrs += BLOCK_SIZE_K * stride_ak
-        b_ptrs += BLOCK_SIZE_K * stride_bk
-    acc = accumulator.to(tl.float16)
+        a_ptrs += PK_SIZE * stride_ak
+        b_ptrs += PK_SIZE * stride_bk

     offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
     offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
@@ -175,7 +174,7 @@ def _matmul_partition_k(
         + stride_cb_n * offs_cn[None, :, None]
         + stride_cb_k * offs_ck[None, None, :]
     )
-    tl.store(c_buf_ptrs, acc[:, :, None])
+    tl.store(c_buf_ptrs, accumulator[:, :, None])


 @triton.jit
@@ -228,7 +227,8 @@ def matmul_partition_k(a, b, triton_reduce=False):
     # Allocates output.
     partitionK_SIZE = K // partitionK

-    c_buf = torch.empty((M, N, partitionK), device=a.device, dtype=a.dtype)
+    # Enforce accumulation in float32 for accuracy
+    c_buf = torch.empty((M, N, partitionK), device=a.device, dtype=torch.float32)
     c = torch.empty((M, N), device=a.device, dtype=a.dtype)
     # 1D launch kernel where each block gets its own program.

@@ -276,4 +276,4 @@ def matmul_partition_k(a, b, triton_reduce=False):
         )
         return c
     else:
-        return c_buf.sum(dim=2)
+        return c_buf.sum(dim=2).to(a.dtype)
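
To see why the fp32 intermediate buffer matters for the final split-K reduction, here is a minimal sketch that emulates the reduction path with an fp16 versus fp32 `c_buf`; `emulate_partition_k` is a helper invented for illustration, not code from this file:

```python
import torch

# Minimal sketch (hypothetical helper, not the kernel): emulate the
# partition-K reduction path with an fp16 vs. fp32 intermediate buffer.
def emulate_partition_k(a, b, partition_k, buf_dtype):
    M, K = a.shape
    _, N = b.shape
    pk_size = K // partition_k
    # One partial product per K partition, stored in `buf_dtype`,
    # mirroring c_buf of shape (M, N, partitionK).
    c_buf = torch.empty((M, N, partition_k), device=a.device, dtype=buf_dtype)
    for pid in range(partition_k):
        ks = slice(pid * pk_size, (pid + 1) * pk_size)
        partial = a[:, ks].float() @ b[ks, :].float()  # fp32 accumulation in-kernel
        c_buf[:, :, pid] = partial.to(buf_dtype)
    # Final reduction over the split-K dimension, then cast to the input dtype.
    return c_buf.sum(dim=2).to(a.dtype)


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(256, 4096, dtype=torch.float16)
    b = torch.randn(4096, 256, dtype=torch.float16)
    ref = (a.float() @ b.float()).to(torch.float16)
    err_fp16 = (emulate_partition_k(a, b, 8, torch.float16) - ref).abs().max()
    err_fp32 = (emulate_partition_k(a, b, 8, torch.float32) - ref).abs().max()
    print(f"max error with fp16 buffer: {err_fp16.item():.4f}")
    print(f"max error with fp32 buffer: {err_fp32.item():.4f}")
```

Storing and reducing the partials in fp32 keeps the only fp16 rounding at the final cast, which is the same behavior the kernel now enforces via the float32 `c_buf`.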
