Commit f51f3b7

[Gluon] Add 2CTA block-scaled matmul example with cuBLAS comparison (triton-lang#9697)
High-performance 2CTA warp-specialized block-scaled MMA. Two CTAs cooperate per output tile, sharing operands to increase arithmetic intensity and reduce the per-CTA SMEM footprint.

```
block-scale-matmul-mxfp8-mxfp8:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta  cublas (TFLOPS)  2cta/cublas
 8192         2525.9         2895.0       1.15           2894.5         1.00
16384         2409.3         2755.0       1.14           2647.7         1.04
32768         2468.7         2632.9       1.07           2587.0         1.02

block-scale-matmul-nvfp4-nvfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta  cublas (TFLOPS)  2cta/cublas
 8192         4781.9         5730.0       1.20           5589.8         1.03
16384         4837.0         5562.0       1.15           4313.6         1.29
32768         4723.3         5362.4       1.14           4933.4         1.09

block-scale-matmul-mxfp8-mxfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta
 8192         2738.9         3149.8       1.15
16384         2735.8         2930.4       1.07
32768         2632.8         2773.1       1.05

block-scale-matmul-mxfp4-mxfp4:
  MNK  1cta (TFLOPS)  2cta (TFLOPS)  2cta/1cta
 8192         4956.5         5819.2       1.17
16384         5196.6         5581.2       1.07
32768         4862.2         5511.0       1.13
```
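The arithmetic-intensity claim can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the example file. It assumes a hypothetical per-CTA tile of 128x256x128 with 1-byte (fp8) operands, ignores scale factors and pipelining stages, and assumes that in 2CTA mode each CTA of the pair stages its own A slice but only half of the shared B tile. The function names are illustrative only.

```python
def smem_bytes_per_cta(block_m, block_n, block_k, bytes_per_elem, two_cta):
    """Operand bytes one CTA stages in SMEM for a single K-step.

    Assumption (not from the commit): in 2CTA mode the pair shares B,
    so each CTA holds only half of the B tile's rows.
    """
    a_bytes = block_m * block_k * bytes_per_elem
    b_rows = block_n // 2 if two_cta else block_n
    b_bytes = b_rows * block_k * bytes_per_elem
    return a_bytes + b_bytes


def arithmetic_intensity(block_m, block_n, block_k, bytes_per_elem, two_cta):
    """FLOPs per staged operand byte, per CTA, for one K-step."""
    # Each CTA still produces a block_m x block_n slice of the output.
    flops = 2 * block_m * block_n * block_k
    return flops / smem_bytes_per_cta(
        block_m, block_n, block_k, bytes_per_elem, two_cta
    )


if __name__ == "__main__":
    # Hypothetical tile shape and element size, chosen for illustration.
    shape = dict(block_m=128, block_n=256, block_k=128, bytes_per_elem=1)
    for two_cta in (False, True):
        label = "2cta" if two_cta else "1cta"
        print(
            label,
            smem_bytes_per_cta(two_cta=two_cta, **shape),
            arithmetic_intensity(two_cta=two_cta, **shape),
        )
```

Under these assumptions the per-CTA operand footprint drops by a third and the intensity rises by 1.5x. The measured 2cta/1cta ratios above are smaller, which is expected because the model ignores everything except operand staging.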
1 parent 93f05e1 commit f51f3b7

1 file changed

Lines changed: 990 additions & 0 deletions
