Tracking: inner_tiled CPU performance parity with mmt4d + ukernels #24515

@bjacob

Goal

Bring the inner-tiled path without ukernels to performance parity with the mmt4d path with ukernels.

This is more ambitious than parity between inner_tiled and mmt4d alone: the aim is to make inner_tiled codegen good enough that switching to inner_tiled removes much of the need for ukernels.

PRs

Benchmark

Square matmul 4096x4096x4096, dynamic shapes, AMD Zen 4
(Ryzen 9 7950X3D), iree-benchmark-module --device=local-task:

| element types | mmt4d + ukernel | inner_tiled (before PRs) | inner_tiled (after PRs) | inner_tiled vs ukernel |
|---|---|---|---|---|
| f32 × f32 → f32 | 115 ms | ~102 ms | 101 ms | 1.14× faster |
| f16 × f16 → f32 | 132 ms | ~126 ms | 124 ms | 1.06× faster |
| bf16 × bf16 → f32 | 51.1 ms | ~52 ms | 51.9 ms | parity (~2%) |
| i16 × i16 → i32 | 52.0 ms | ~53 ms | 52.6 ms | parity (~1%) |
| i8 × i8 → i32 | 43.6 ms | 72.5 ms | 46.9 ms | 1.07× slower due to the relayout dispatch; the kernel itself is just as fast. See #24514 |

The "before PRs" column is inner_tiled prior to this PR series. The PRs target
i8: f32, f16, bf16 and i16 are essentially unchanged, and their small
before/after differences are run-to-run noise. i8 improves from about 66% slower
than the ukernel (72.5 ms vs 43.6 ms) to ~7% slower.
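
As a cross-check on the figures above, the ratios can be recomputed from the table. This is only a sketch: the inputs are the rounded timings from the table (the `~` values are taken at face value), and the names `timings_ms` and `relative_time` are made up for illustration.

```python
# Timings in milliseconds, copied from the benchmark table.
# "before"/"after" are inner_tiled before and after the PR series.
timings_ms = {
    "f32":  {"ukernel": 115.0, "before": 102.0, "after": 101.0},
    "f16":  {"ukernel": 132.0, "before": 126.0, "after": 124.0},
    "bf16": {"ukernel": 51.1,  "before": 52.0,  "after": 51.9},
    "i16":  {"ukernel": 52.0,  "before": 53.0,  "after": 52.6},
    "i8":   {"ukernel": 43.6,  "before": 72.5,  "after": 46.9},
}


def relative_time(row, column):
    """inner_tiled time divided by mmt4d + ukernel time (> 1 means slower)."""
    return row[column] / row["ukernel"]


for name, row in timings_ms.items():
    before = relative_time(row, "before")
    after = relative_time(row, "after")
    print(f"{name:>5}: before {before:.3f}x, after {after:.3f}x of ukernel time")
```

On these rounded inputs, i8 comes out at roughly 1.66× the ukernel time before the PRs and roughly 1.08× after. Because the table values are rounded, the last digit can differ slightly from the figures quoted in the text.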

i8: residual ~7% gap — parked for now

Associated issue: #24514

The remaining i8 gap is entirely the encoding-relayout dispatch (the pack of
the C matrix), which currently scalarizes instead of vectorizing. Closing it
requires new vectorization/lowering infrastructure for iree_linalg_ext
relayout ops on CPU; it is not a wiring change. The precise root cause and the
proposed fix (map_load rooted on the padded side + a MapLoadOpVectorization
model + lowering) are written up separately in #24514.
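
To make the scalar-versus-vector distinction concrete, here is a minimal pure-Python sketch of a pack-style relayout. It is not IREE's implementation and does not reflect the actual encoding layout: the tile sizes `m0` and `n0`, the padding value, and both function names are assumptions for illustration. The first function moves one element at a time, which is the access pattern of scalarized code. The second copies each contiguous run of up to `n0` elements in one step, which is the shape a vectorized lowering would want.

```python
def _empty_tiles(rows, cols, m0, n0, pad):
    """Allocate a [M1][N1][m0][n0] nested list filled with the padding value."""
    m1 = -(-rows // m0)  # ceiling division
    n1 = -(-cols // n0)
    return [[[[pad] * n0 for _ in range(m0)] for _ in range(n1)] for _ in range(m1)]


def pack_scalar(matrix, m0, n0, pad=0):
    """Pack a row-major 2D list into tiles, one element per step."""
    rows, cols = len(matrix), len(matrix[0])
    packed = _empty_tiles(rows, cols, m0, n0, pad)
    for i in range(rows):
        for j in range(cols):
            packed[i // m0][j // n0][i % m0][j % n0] = matrix[i][j]
    return packed


def pack_by_rows(matrix, m0, n0, pad=0):
    """Same result, but each contiguous run of up to n0 elements is copied at once."""
    rows, cols = len(matrix), len(matrix[0])
    packed = _empty_tiles(rows, cols, m0, n0, pad)
    for i in range(rows):
        for tile_j in range(len(packed[0])):
            chunk = matrix[i][tile_j * n0:(tile_j + 1) * n0]
            # A short chunk at the right edge leaves the padding in place.
            packed[i // m0][tile_j][i % m0][:len(chunk)] = chunk
    return packed


# Hypothetical 3x5 input with 2x4 tiles, so both edges need padding.
example = [[r * 5 + c for c in range(5)] for r in range(3)]
assert pack_scalar(example, 2, 4, pad=-1) == pack_by_rows(example, 2, 4, pad=-1)
```

Both functions produce the same packed result; only the granularity of the copy differs, which is the property the issue text is about.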

We are parking the i8 parity goal at ~7% pending that infrastructure work. f32,
f16, bf16 and i16 are at parity today.

Labels: codegen/llvm (LLVM code generation compiler backend), performance ⚡ (Performance/optimization related work across the compiler and runtime)