Commit f51f3b7
[Gluon] Add 2CTA block-scaled matmul example with cuBLAS comparison (triton-lang#9697)
High-performance 2CTA warp-specialized block-scaled MMA. Two CTAs
cooperate per output tile, sharing operands to increase arithmetic
intensity and reduce the per-CTA SMEM footprint.
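A rough sketch of why sharing operands helps, under assumptions not taken from this commit: each CTA in the pair stages its own A tile but only half of the shared B tile, and the tile sizes (`block_m`, `block_n`, `block_k`) and element width are hypothetical. Scale-factor storage is ignored. This is a back-of-the-envelope model, not the kernel's actual layout.

```python
# Illustrative model only: tile sizes and the "half of B per CTA" split are
# assumptions for this sketch, not values read from the kernel in this commit.

def smem_bytes_per_cta(block_m, block_n, block_k, elem_bytes, two_cta):
    """Bytes of A and B staged in one CTA's shared memory per pipeline stage."""
    a_bytes = block_m * block_k * elem_bytes
    # Assumption: in 2CTA mode each CTA holds half of the B tile and the
    # MMA reads the other half from its peer CTA.
    b_cols = block_n // 2 if two_cta else block_n
    b_bytes = block_k * b_cols * elem_bytes
    return a_bytes + b_bytes


def flops_per_cta(block_m, block_n, block_k):
    """Multiply-add counted as 2 FLOPs for a block_m x block_n x block_k step."""
    return 2 * block_m * block_n * block_k


def arithmetic_intensity(block_m, block_n, block_k, elem_bytes, two_cta):
    """FLOPs per byte staged in one CTA's shared memory."""
    flops = flops_per_cta(block_m, block_n, block_k)
    staged = smem_bytes_per_cta(block_m, block_n, block_k, elem_bytes, two_cta)
    return flops / staged


if __name__ == "__main__":
    # Hypothetical 128 x 256 x 128 tile with 1-byte (fp8) elements.
    for mode in (False, True):
        label = "2cta" if mode else "1cta"
        print(label,
              smem_bytes_per_cta(128, 256, 128, 1, mode),
              arithmetic_intensity(128, 256, 128, 1, mode))
```

In this model each CTA does the same FLOPs in both modes but stages fewer B bytes in 2CTA mode, so its FLOPs-per-byte ratio rises and its SMEM footprint shrinks.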
```
block-scale-matmul-mxfp8-mxfp8:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta  cublas (TFLOPS)  2cta/cublas
 8192         2525.9         2895.0       1.15           2894.5         1.00
16384         2409.3         2755.0       1.14           2647.7         1.04
32768         2468.7         2632.9       1.07           2587.0         1.02

block-scale-matmul-nvfp4-nvfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta  cublas (TFLOPS)  2cta/cublas
 8192         4781.9         5730.0       1.20           5589.8         1.03
16384         4837.0         5562.0       1.15           4313.6         1.29
32768         4723.3         5362.4       1.14           4933.4         1.09

block-scale-matmul-mxfp8-mxfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta
 8192         2738.9         3149.8       1.15
16384         2735.8         2930.4       1.07
32768         2632.8         2773.1       1.05

block-scale-matmul-mxfp4-mxfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta
 8192         4956.5         5819.2       1.17
16384         5196.6         5581.2       1.07
32768         4862.2         5511.0       1.13
```
1 parent 93f05e1 commit f51f3b7
1 file changed
Lines changed: 990 additions & 0 deletions