Goal
Bring the inner-tiled path without ukernels to performance parity with the mmt4d path with ukernels.
So this is more ambitious than merely parity between inner_tiled and mmt4d. This is about making inner_tiled codegen so good that switching to inner_tiled removes much of the need for ukernels.
PRs
[ ] [Codegen] Cap hoisted statically-bound allocations at the stack budget.
Benchmark
Square matmul 4096x4096x4096, dynamic shapes, AMD Zen 4
(Ryzen 9 7950X3D), iree-benchmark-module --device=local-task:
| element types | mmt4d + ukernel | inner_tiled (before PRs) | inner_tiled (after PRs) | inner_tiled vs ukernel |
|---|---|---|---|---|
| f32 × f32 → f32 | 115 ms | ~102 ms | 101 ms | 1.14× faster |
| f16 × f16 → f32 | 132 ms | ~126 ms | 124 ms | 1.06× faster |
| bf16 × bf16 → f32 | 51.1 ms | ~52 ms | 51.9 ms | parity (~2%) |
| i16 × i16 → i32 | 52.0 ms | ~53 ms | 52.6 ms | parity (~1%) |
| i8 × i8 → i32 | 43.6 ms | 72.5 ms | 46.9 ms | 1.07× slower due to relayout dispatch; kernel itself is just as fast. See #24514 |
The "before PRs" column is inner_tiled prior to this PR series. f32, f16,
bf16 and i16 are essentially unchanged (their small before/after differences
are run-to-run noise) — the PRs target i8. i8 improves from 65% slower than the
ukernel to ~7% slower.
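For readers who want to re-derive the relative figures, here is a minimal sketch that recomputes them from the timings quoted in this section. The dictionary layout and the `relative_time` helper are illustrative names only, not part of any IREE tooling.

```python
# Timings in milliseconds, copied from the benchmark figures in this section.
# "before" is only listed for i8, the one case the PRs target.
timings_ms = {
    "f32": {"ukernel": 115.0, "after": 101.0},
    "f16": {"ukernel": 132.0, "after": 124.0},
    "bf16": {"ukernel": 51.1, "after": 51.9},
    "i16": {"ukernel": 52.0, "after": 52.6},
    "i8": {"ukernel": 43.6, "before": 72.5, "after": 46.9},
}


def relative_time(inner_tiled_ms, ukernel_ms):
    """inner_tiled time as a multiple of ukernel time (below 1.0 means faster)."""
    return inner_tiled_ms / ukernel_ms


for name, t in timings_ms.items():
    ratio = relative_time(t["after"], t["ukernel"])
    direction = "slower" if ratio > 1.0 else "faster"
    print(f"{name}: {abs(ratio - 1.0) * 100:.1f}% {direction} than ukernel")
```

A ratio below 1.0 corresponds to the "faster" rows and a ratio slightly above 1.0 to the "parity" rows; the percentages in the prose are rounded versions of these ratios.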
i8: residual ~7% gap — parked for now
Associated issue: #24514
The remaining i8 gap is entirely the encoding-relayout dispatch (the pack of
the C matrix), which currently scalarizes instead of vectorizing. Closing it
requires new vectorization/lowering infrastructure for iree_linalg_ext
relayout ops on CPU — it is not a wiring change. The precise root cause and the
proposed fix (map_load rooted on the padded side + a MapLoadOpVectorization
model + lowering) are written up separately: #24514.
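To make the scalarized-versus-vectorized distinction concrete, the following is a toy sketch of a padded pack relayout in plain Python. It is not IREE's implementation: the function names, the 2x4 tile sizes, and the list-of-lists representation are all assumptions for illustration. The first function moves one element per iteration, which is the shape of a scalarized loop nest; the second moves a whole contiguous row chunk per iteration, which is only an analogy for what a vectorized copy would do.

```python
def ceil_div(a, b):
    return -(-a // b)


def empty_packed(m, n, tile_m, tile_n, pad):
    """Allocate a [ceil(m/tile_m)][ceil(n/tile_n)][tile_m][tile_n] buffer filled with pad."""
    return [
        [[[pad] * tile_n for _ in range(tile_m)] for _ in range(ceil_div(n, tile_n))]
        for _ in range(ceil_div(m, tile_m))
    ]


def pack_scalar(matrix, tile_m, tile_n, pad=0):
    """Relayout one element at a time (scalarized form)."""
    m, n = len(matrix), len(matrix[0])
    out = empty_packed(m, n, tile_m, tile_n, pad)
    for i in range(m):
        for j in range(n):
            out[i // tile_m][j // tile_n][i % tile_m][j % tile_n] = matrix[i][j]
    return out


def pack_row_chunks(matrix, tile_m, tile_n, pad=0):
    """Relayout one contiguous row chunk at a time (stand-in for a vector copy)."""
    m, n = len(matrix), len(matrix[0])
    out = empty_packed(m, n, tile_m, tile_n, pad)
    for i in range(m):
        for jo in range(ceil_div(n, tile_n)):
            chunk = matrix[i][jo * tile_n:(jo + 1) * tile_n]
            # Pad the last chunk of a row when n is not a multiple of tile_n.
            chunk = chunk + [pad] * (tile_n - len(chunk))
            out[i // tile_m][jo][i % tile_m] = chunk
    return out


# Hypothetical 3x5 input with 2x4 tiles, so both dimensions need padding.
c = [[r * 5 + col for col in range(5)] for r in range(3)]
assert pack_scalar(c, 2, 4) == pack_row_chunks(c, 2, 4)
```

Both functions produce the same packed buffer; the difference is only how many elements each loop iteration moves, which is where the cost discussed above comes from.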
We are parking the i8 parity goal at ~7% pending that infrastructure work. f32,
f16, bf16 and i16 are at parity today.