Tracking: `inner_tiled` CPU performance parity with `mmt4d` + ukernels #24515

## Goal

Bring the `inner_tiled` path *without* ukernels to performance parity with the `mmt4d` path *with* ukernels.

This is more ambitious than parity between `inner_tiled` and `mmt4d` alone: the aim is to make `inner_tiled` codegen good enough that switching to it removes much of the need for ukernels.

## PRs

- [X] [[Codegen][CPU] Route inner_tiled broadcast into m_bcst-foldable slot.](https://github.com/iree-org/iree/pull/24516#top)
- [X] [[Codegen][CPU] Flatten contiguous trailing dims of transfers before unrolling.](https://github.com/iree-org/iree/pull/24517#top)
- ~~[ ] [[Codegen] Cap hoisted statically-bound allocations at the stack budget.](https://github.com/iree-org/iree/pull/24532#top)~~
- [X] [[Codegen][CPU] Fold reshape-containing encoding relayouts to map_store.](https://github.com/iree-org/iree/pull/24533#top)
- [X] [[Codegen][CPU] Add x86 AVX-512 VNNI 16x16x2 i8 MMA intrinsic.](https://github.com/iree-org/iree/pull/24534#top)
- [ ] [[Codegen][CPU] Fold reshapes into map_store per reassociation group.](https://github.com/iree-org/iree/pull/24535#top)

## Benchmark

Square matmul `4096x4096x4096`, dynamic shapes, AMD Zen 4
(Ryzen 9 7950X3D), `iree-benchmark-module --device=local-task`:

| element types | `mmt4d` + ukernel | `inner_tiled` (before PRs) | `inner_tiled` (after PRs) | `inner_tiled` vs ukernel |
|---|---|---|---|---|
| f32 × f32 → f32 | 115 ms | ~102 ms | 101 ms | **1.14× faster** |
| f16 × f16 → f32 | 132 ms | ~126 ms | 124 ms | **1.06× faster** |
| bf16 × bf16 → f32 | 51.1 ms | ~52 ms | 51.9 ms | parity (~2%) |
| i16 × i16 → i32 | 52.0 ms | ~53 ms | 52.6 ms | parity (~1%) |
| i8 × i8 → i32 | 43.6 ms | 72.5 ms | 46.9 ms | 1.07× slower due to the relayout dispatch; the kernel itself is just as fast. See https://github.com/iree-org/iree/issues/24514 |

The "before PRs" column is `inner_tiled` prior to this PR series. f32, f16,
bf16 and i16 are essentially unchanged (their small before/after differences
are run-to-run noise) because the PRs target i8. i8 improves from about 65%
slower than the ukernel to ~7% slower.
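
For readers who want to re-derive the ratio column, here is a minimal sketch that recomputes the `inner_tiled` / ukernel ratios and an approximate throughput from the timings in the table. The helper names (`ratio_vs_ukernel`, `throughput_gops`, `summarize`) are hypothetical and not part of IREE. The "before" values marked `~` in the table are treated as exact here, and the operation count assumes the usual `2*M*N*K` for a matmul. Because the inputs are the rounded table entries, the recomputed ratios can differ in the last digit from the figures quoted in the text.

```python
# Timings in milliseconds, copied from the benchmark table.
# Tuple layout: (mmt4d + ukernel, inner_tiled before PRs, inner_tiled after PRs).
# Assumption: the "~" entries in the "before" column are taken at face value.
TIMINGS_MS = {
    "f32": (115.0, 102.0, 101.0),
    "f16": (132.0, 126.0, 124.0),
    "bf16": (51.1, 52.0, 51.9),
    "i16": (52.0, 53.0, 52.6),
    "i8": (43.6, 72.5, 46.9),
}

M = N = K = 4096
# Assumption: count one multiply and one add per (m, n, k) triple.
OPS = 2 * M * N * K


def ratio_vs_ukernel(inner_ms, ukernel_ms):
    """Return inner_tiled time / ukernel time; > 1 means inner_tiled is slower."""
    return inner_ms / ukernel_ms


def throughput_gops(time_ms):
    """Return giga-operations per second for the 4096x4096x4096 matmul."""
    return OPS / (time_ms * 1e-3) / 1e9


def summarize():
    """Build per-type ratios (before/after the PRs) and after-PR throughput."""
    rows = {}
    for name, (ukernel, before, after) in TIMINGS_MS.items():
        rows[name] = {
            "before_ratio": ratio_vs_ukernel(before, ukernel),
            "after_ratio": ratio_vs_ukernel(after, ukernel),
            "after_gops": throughput_gops(after),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(
            f"{name:>4}: before {row['before_ratio']:.3f}x, "
            f"after {row['after_ratio']:.3f}x, "
            f"{row['after_gops']:.0f} Gop/s after PRs"
        )
```

A ratio below 1 corresponds to the "faster" rows in the table, and a ratio slightly above 1 to the "parity" rows.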

## i8: residual ~7% gap — parked for now

*Associated issue:* https://github.com/iree-org/iree/issues/24514

The remaining i8 gap is entirely the encoding-relayout dispatch (the `pack` of
the C matrix), which currently scalarizes instead of vectorizing. Closing it is
not a wiring change: it requires new vectorization/lowering infrastructure for
`iree_linalg_ext` relayout ops on CPU. The precise root cause and the proposed
fix (`map_load` rooted on the padded side + a `MapLoadOpVectorization` model +
lowering) are written up separately: https://github.com/iree-org/iree/issues/24514.

We are parking the i8 parity goal at ~7% pending that infrastructure work. f32,
f16, bf16 and i16 are at parity today.

