Commit b07e76a
[Bench][Blackwell] Fix warp specialization for fp8 x mxfp4 bench (#6537)
This PR chain brings the performance of the mixed fp8 x mxfp4 MoE kernel on par with the fp8 x fp8 kernel:
* About 10% slower in the dense benchmarks
* About 10% faster in the llama4 benchmarks
Applies a bug fix for padded scale loads in fp8 x mxfp4 mode, ensuring TMA load requirements are met when using the unpacked (padded) fp4 layout. The issue only occurs after enabling warp specialization.

1 parent: a0e3e78
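The patch body was not captured in this view, but the constraint the fix addresses can be sketched. Below is a minimal, hypothetical illustration, assuming uint8 mxfp4 block scales and a 16-byte TMA alignment requirement on the innermost dimension; `pad_scales_for_tma` and `TMA_ALIGN_BYTES` are made-up names for illustration, not identifiers from this commit.

```python
# Hypothetical illustration only -- not the code from this commit.
# mxfp4 stores one uint8 scale per block of elements; TMA descriptors
# require the innermost dimension to satisfy a byte-alignment rule
# (assumed here to be 16 bytes). Padding the scale tensor's last dim
# up to that multiple keeps the TMA load legal in the unpacked
# (padded) fp4 layout.
import torch

TMA_ALIGN_BYTES = 16  # assumed TMA alignment requirement

def pad_scales_for_tma(scales: torch.Tensor) -> torch.Tensor:
    # scales: uint8, shape (..., k_scales); 1 byte per element, so the
    # element count equals the byte count along the last dimension.
    pad = (-scales.shape[-1]) % TMA_ALIGN_BYTES
    if pad == 0:
        return scales
    # Zero-pad the innermost dimension up to the aligned size.
    return torch.nn.functional.pad(scales, (0, pad))
```

For example, under these assumptions a `(8, 40)` uint8 scale tensor would be padded to `(8, 48)`, making the innermost byte extent a multiple of 16.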
2 files changed: 6 additions & 2 deletions
File tree
- bench/triton_bench/matmul_ogs_details
- lib/Dialect/TritonGPU/Transforms/WarpSpecialization
(First file, one diff hunk around lines 293–299: line 296 replaced, i.e. 1 addition and 1 deletion. The code content was not captured in this view.)
Lines changed: 5 additions & 1 deletion
(Second file, two diff hunks: one line inserted after line 14 as new line 15; and, around lines 139–150, three lines inserted at new lines 143–145 with old line 143 replaced by new line 147. The code content was not captured in this view.)