Skip to content

Replace matmul transform-dialect scripts with C++ pass pipeline#1593

Open
erwei-xilinx wants to merge 43 commits into
Xilinx:mainfrom
erwei-xilinx:matmul-codegen-cpp-pipeline
Open

Replace matmul transform-dialect scripts with C++ pass pipeline#1593
erwei-xilinx wants to merge 43 commits into
Xilinx:mainfrom
erwei-xilinx:matmul-codegen-cpp-pipeline

Conversation

@erwei-xilinx
Copy link
Copy Markdown
Collaborator

@erwei-xilinx erwei-xilinx commented May 7, 2026

Summary

Replace the per-test transform-dialect scripts driving matmul tiling/bufferization/vectorization with a single C++ MLIR pass taking a flat options struct. Goal: parametric, generally-applicable, debuggable, individually testable; supersedes the per-test transform_aie2*.mlir scripts.

Status: draft / WIP — pipeline is functional and HW-validated end-to-end on NPU2 for 9 matmul examples (7 NPU2 HW-validated, 1 NPU1 partial, 1 NPU1 deferred).

Scope of this PR: the air-matmul-codegen orchestrator + carrier-attribute infrastructure as the external API. All configuration is specified externally via the orchestrator's options. Automatic shape→config heuristics are deferred to a follow-up PR.

What's in this PR

Single public pass surface

After several rounds of consolidation, the matmul codegen pipeline is now exposed as one registered pass (air-matmul-codegen) with ~38 flat options, plus the general-utility air-fold-unit-extent-dims (used standalone outside the matmul flow). All internal phases — launch tile, packs, K-tiles, core tile, bufferization to L1/L2 allocs, prologue/epilogue, one-shot-bufferize, tile-for-vectorize, vec-prep — are free-function bodies invoked in fixed order from the orchestrator. Markers between phases are orchestrator-internal constants, not configurable.

Pipeline pass count (matmul-codegen surface): 21 (initial) → 9 (after stage groupings) → 1.

Phase order (each gated by its config)

A  tile-launch-tile (launch-tile)
B  pack-and-transpose (l2-pack-sizes + perms)
C  bufferize-output-l2 — runs before B in single-pack flows (so the
   tile-l3-l2-copies / fuse-output-truncf-first pre-steps see un-packed
   IR), and after B in two-pack flows (so the L2 alloc takes the packed
   shape expected by the L1 pack)
D  pack-and-transpose (l1-pack-sizes + perms; L1-output bufferize)
E  tile-k-and-fuse-packs (outer-k-tile-factor)
F  bufferize-l1-inputs into L2 (auto when D ran)
H  tile-cores (core-tile)
I  tile-k-and-fuse-packs (inner-k-tile-factor)
J  bufferize-l1-inputs into L1 (auto when H ran)
K  prologue-epilogue (prologue-tile / epilogue-tile)
L  one-shot-bufferize (one-shot-bufferize)
M  tile-for-vectorize (matmul-vec-tile)
N  vec-prep composite (do-vec-prep)

The orchestrator runs at ModuleOp scope so it can schedule one-shot-bufferize internally. Empty list option / 0 / false skips the corresponding phase. Pack-output bufferization to L1 auto-targets the LAST pack (L1 in two-pack flows, L2 in single-pack flows); bufferize-last-pack-output=false is available to inspect raw pack semantics (used by the pack_basic lit test).

Driver pipeline rewrites (9 matmul examples + 2 lit tests)

Example Target Pre-PR (legacy) Post-PR (cpp) Status
programming_examples/matrix_multiplication/bf16 iron-built vec stages only
programming_examples/matrix_multiplication/i8 iron-built vec stages only
test/xrt/37_matmul_transform_4x4_bf16 NPU2 (4×4, bf16, two-pack) transform_aie2p.mlir full A–N
test/xrt/44_triton_matmul_ver4_vector_ptr_opt NPU2 (4×4, bf16-out f32) transform_aie2p.mlir full A–N ✅ 4.72 → 4.03 ms
test/xrt/45_triton_matmul_ver4_strix_8x4 NPU2 (Strix, bf16-out f32) transform_aie2p.mlir full A–N ✅ 3.85 / 3.92 ms parity
test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32 NPU2 (Strix, i8) transform_aie2p.mlir full A–N ✅ 2.23 → 2.06 ms
test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_output NPU2 (4×4, bf16-out bf16) transform_aie2p.mlir full A–N
test/xrt/53_matmul_padding_bf16 NPU2 (4×4, bf16 padding) transform_aie2p.mlir full A–N
test/xrt/54_matmul_padding_f32_bf16_emulation NPU2 (4×4, f32 bf16-emul) transform_aie2p.mlir full A–N
test/xrt/39_triton_matmul_ver3_vectorized NPU1 (Phoenix, single-pack) transform.mlir partial ⚠️ orchestrator parses + IR is structurally valid; downstream aiecc hangs on Strix machine. Needs NPU1 HW to validate.
test/xrt/12_matmul_transform_1x4_bf16 NPU1 (Phoenix, 1×4, bf16) transform.mlir not migrated ⏭️ uses transform.structured.pad (padding, not packing) + custom test.exe harness; padding-mode orchestrator support is a separate PR

Per-test driver pipeline shape:

  • Test 37 (two-pack): one invocation runs A–N end-to-end (orchestrator wraps one-shot-bufferize + vec-prep).
  • Tests 44/45/46/48/53/54 (single-pack with herd materialization): two invocations bracketing scf-forall-to-parallel + air-par-to-herd + air-herd-vectorize — first runs A–M (do-vec-prep=false), second runs only N.
  • prog_ex bf16/i8: iron-built; orchestrator runs only M (vec tile) + N (vec prep).
  • Test 39 (NPU1 single-pack): one invocation A–M + post-orchestrator herd materialization.
  • pack_basic.mlir + tile_copies_basic.mlir RUN lines updated; expected output adjusted for post-pack canonicalize.

Other code cleanup

  • Consolidated duplicate findMarkedOp / findMarkedForLoop into air::findOpWithAttr / air::findOpOfTypeWithAttr<OpTy> in air/Util/Util.h.
  • Replaced local ConvertMemrefCopyToLinalgCopyPattern with the existing runConvertMemrefCopyToLinalgCopy helper.
  • Replaced Option<std::string> + custom int-list parser with tablegen ListOption<int64_t>.
  • Extracted tileAsForallResult helper for the scf::tileUsingSCF{ForallOp} + replaceOp sequence shared across three internal phases.
  • Replaced ad-hoc topological worklist in hoistInterveningDeps with upstream mlir::computeTopologicalSorting.

Carrier attribute (air.matmul_codegen_config)

Kept in the codebase for future heuristic use. The orchestrator currently does NOT read it (per-phase options are passed explicitly by the caller). External producers (autotuners, tools, future heuristic) can set the attribute without changing the pipeline; reintroducing read-paths is a one-line change in each phase impl.

Test results

  • All 7 NPU2 HW tests PASS (37 / 44 / 45 / 46 / 48 / 53 / 54), per-iter timings as in the table above.
  • prog_ex bf16 + i8 PASS on NPU2.
  • check-air-mlir: 381/388 effective passing — one long-standing failure (Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir) unrelated to this PR.

What's NOT in this PR (follow-ups)

  • NPU1 HW validation for tests 12 & 39. Test 39 has a --use-cpp-pipeline path; orchestrator pipeline runs cleanly but aiecc hangs on the local Strix when targeting NPU1 IR. Needs Phoenix HW + investigation. Test 12 not yet migrated (padding flow).
  • Padding-mode orchestrator support. Test 12 uses transform.structured.pad instead of pack. Adding a padding option + params to air-matmul-codegen is a separate PR.
  • Heuristic that derives air.matmul_codegen_config from matmul shape + target — separate PR.
  • Derivation-driven tile_cores (needs modelling air-collapse-herd).
  • Heuristic for two-pack-level auto-detection.
  • Triton-XDNA driver.py invocation of the cpp pipeline (lives in the Triton-XDNA repo).
  • Replace runHoistLoopInvariantTransfers (~150 LOC) with upstream linalg::hoistRedundantVectorTransfers — investigated locally; semantics diverge for the multi-pair-aliasing-same-memref case where upstream conservatively refuses while our areEquivalentIndices-driven check correctly hoists. Needs upstream extension or a hybrid wrapper.
  • Optional runEliminateRedundantVectorTransfers / runFuseTruncfLinalg upstream-replacement micro-trims (~60 LOC combined, marginal).

erwei-xilinx and others added 30 commits May 7, 2026 14:53
Replaces the per-test transform-dialect scripts driving matmul
tiling/bufferization/vectorization with a sequence of focused C++
MLIR passes. Goal: parametric, generally-applicable, debuggable,
individually testable; eventually supersede the per-test
transform_aie2*.mlir scripts. See MATMUL_CODEGEN_PIPELINE_PLAN.md
for the full design and status table.

Milestones landed:
- M0: air-matmul-pack-and-transpose, air-matmul-tile-l3-to-l2-copies
  (byte-identical IR vs transform-script Phases 1+3).
- M1: Group B (vectorization-prep) passes for prog_ex matmul/{bf16,i8}
  (HW-validated on NPU2).
- M2: Group A passes for tests 53/54 single-pack flow (HW-validated).
- M3a/M3b: air.matmul_codegen_config carrier attr + heuristic pass +
  L1-fit guardrail; --use-cpp-pipeline implies M3 (HW-validated;
  shape-sweep 5/6 PASS).
- M4a: two-pack-level (test 37) infrastructure incl. two marker-flow
  fixes in tile-k-and-fuse-packs (HW-validated; perf within noise of
  legacy across all three tests).

Tests:
- 390/391 lit tests pass (the 1 failure is pre-existing, unrelated).
- Tests 37/53/54 cpp paths PASS on NPU2 hardware.
- prog_ex matmul/{bf16,i8} cpp paths PASS on NPU2.

Backup commit; not yet a PR. Outstanding: M2d (delete obsolete
per-test transform scripts), M3c (real derivation for tile_cores),
M4b (heuristic for two-pack), M5 (Triton-XDNA backend integration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eline

Test 48's transform_aie2p.mlir maps phase-by-phase to existing
M0/M1/M2 passes — no new infrastructure needed; the test 53 cpp
pipeline string was reusable verbatim. The heuristic produces the
same tile/pack/vector parameters that the legacy script hand-wires
(pack [8,8,8], tile_cores [8,8,0], K-tile 8, prologue [8,8],
epilogue [64,64], vector [2,2,1,...]).

Hardware-validated on NPU2: PASS for both legacy and cpp paths.
Perf parity confirmed (3-run min times: legacy 0.211ms vs
cpp 0.208ms — within noise).

Open question on `air-hoist-cast-pairs` resolved: fixed-point
converges to STRUCTURALLY identical IR vs the 4 sequential
`transform.air.hoist_cast_pair` calls in the legacy script (same
op counts, alloc shapes, nesting; only diffs are SSA renumbering
and missing `prologue_herd`/`compute_herd`/`epilogue_herd`
annotations on cpp herds — purely cosmetic).

Adds AIR_DUMP_FINAL_IR env var for debugging.

Phase 2 (Triton-XDNA's driver.py invokes the cpp pipeline directly,
removing the obsolete tile-and-promote default + transform-script
loading) lives in the Triton-XDNA repo, not mlir-air.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes air-matmul-set-codegen-config (the M3a/M3b pass that wrote
the air.matmul_codegen_config carrier attribute via a hardcoded
target+dtype lookup). Heuristics belong in a separate PR; this PR
should only expose the consumer pipeline with all configuration
specified externally.

What stays:
- The carrier-attribute infrastructure (MatmulCodegenConfig.{h,cpp})
  is kept as the external API. Each consumer pass reads
  air.matmul_codegen_config when present and falls back to its
  pass-options otherwise — so external producers (a future heuristic
  PR, autotuners, end-user tooling) can specify configuration without
  changing the pipeline.
- All consumer passes are unchanged.

What changes in tests:
- Tests 53, 54, 48 each pass tile/pack/vector parameters explicitly
  via per-pass options strings (M2 style). Test 37 was already
  written this way.
- --use-codegen-config flag removed; --use-cpp-pipeline is the only
  toggle.

HW-validated on NPU2 (all five paths PASS):
- prog_ex matmul/{i8, bf16}
- test/xrt/{37, 48, 53, 54} cpp pipelines
390/391 lit tests pass (the 1 failure is pre-existing, unrelated).

Also scrubs all iree / iree-amd-aie references from PR-scope files
(MATMUL_CODEGEN_PIPELINE_PLAN.md only — no source code referenced
those projects).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This planning doc should not be committed. PR description carries the
summary; in-tree planning docs are not wanted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit-driven cleanup of the new matmul-codegen passes:

1. Promote findMarkedOp/findMarkedForLoop to air/Util/Util.h as
   findOpWithAttr / findOpOfTypeWithAttr<OpTy>. Replaces 3 local
   redefinitions plus 4 inlined walks across the new files.

2. Drop the local ConvertMemrefCopyToLinalgCopyPattern in
   AIRMatmulTileL3ToL2Copies.cpp; call the existing
   runConvertMemrefCopyToLinalgCopy helper instead (already exposed
   in AIRMatmulCodegenHelpers.h).

3. Extract tileAsForallResult helper for the
   scf::tileUsingSCF{LoopType=ForallOp} + replaceOp sequence.
   AIRMatmulTileCores and AIRMatmulTileLaunchTile now share it
   instead of open-coding the same 8 lines.

4. Replace Option<std::string> + parseIntList with tablegen
   ListOption<int64_t> for tile-sizes / prologue-tile-sizes /
   epilogue-tile-sizes / fill-iterator-interchange. Deletes the
   parseIntList helper outright.

Net ~30 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The intervening-deps move-above-anchor logic in
hoistInterveningDeps had its own iterative "find ops with all-resolved
operands and move them" loop. Upstream provides
mlir::computeTopologicalSorting in mlir/Analysis/TopologicalSortUtils.h
that does exactly this — sorts an arbitrary op set in dependency order,
treating ops outside the set as already-ready (incomplete-chain
semantics). After sorting, a single moveBefore pass yields the same
result.

Net 13 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged
at 390/391.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four of the registered passes had no parametric external config — they
walked for a marker and called a 1-2 line helper. They were noise in
the public pass registry without adding an external API surface.

Dropped from tablegen registration:
- air-matmul-cleanup-bufferize
- air-matmul-fuse-pingpong-loops
- air-matmul-fuse-output-truncf
- air-hoist-static-alloc

Replaced by:
- One new combined pass air-matmul-post-bufferize-cleanup (cleanup +
  pingpong-fuse run back-to-back in all tests using them).
- Boolean option `fuse-output-truncf-first` on
  air-matmul-bufferize-output-l2 (fuse-truncf must precede the L2 fill
  bufferize so the fill's element type matches the post-fuse matmul).
- Boolean option `hoist-static-alloc-first` on
  air-matmul-prologue-epilogue (used by the M4 K-peel flow).

The pass bodies survive as plain C++ functions in
AIRMatmulBufferizationPasses.h (`runFusePingpongLoopsImpl`,
`runFuseOutputTruncfImpl`, `runHoistStaticAllocImpl`), called either
from the combined pass or from the option-driven steps in the
parametric passes.

Net 69 LOC removed; public pass count drops by 3 (4 dropped, 1 added).
All 5 HW tests still PASS on NPU2; lit unchanged at 390/391.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six of the M1 vec-prep passes (eliminate-redundant-vector-transfers,
flatten-for-iter-args, hoist-loop-invariant-transfers,
hoist-vector-transfer-pointers, vector-cast-for-emulation, hoist-cast-pairs)
were registered separately but in practice ALWAYS fire as a fixed-order
sequence inside the matmul codegen pipeline. Nobody invokes them
individually outside this workflow.

Replaced with one composite air-matmul-codegen-vec-prep that runs them
in fixed order, with boolean options to enable/disable each step and
two parallel option groups (cast1-* / cast2-*) for the 0-2
vector-cast-for-emulation invocations (test 54 needs two with different
target element types; others need one or zero).

air-fold-unit-extent-dims kept registered separately because the
prog_ex pipelines invoke it standalone outside the vec-prep block.

Per-step bodies survive as static helper functions inside the same TU,
called by the composite. The runFoo helpers in
AIRMatmulCodegenHelpers.cpp are unchanged.

Net 207 LOC removed; 5 fewer registered passes (6 dropped, 1 added).
All 5 HW tests + 2 prog_ex paths still PASS on NPU2; lit unchanged at
390/391.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continued the matmul-named pass-registry cleanup:

- air-matmul-tile-l3-to-l2-copies → option `do-tile-l3-to-l2-copies` on
  bufferize-output-l2 (with k-l2-tile + copy-marker options moved over).
  Body extracted as `runTileL3ToL2CopiesImpl` for direct call.
- air-matmul-bufferize-l1-output → option `do-bufferize-l1-output` on
  pack-and-transpose (tail step).
  Body extracted as `runBufferizeL1OutputImpl`.
- air-matmul-post-bufferize-cleanup → option
  `do-post-bufferize-cleanup-first` on tile-for-vectorize (pre-step).
  Body extracted as `runPostBufferizeCleanupImpl`.

Net 3 fewer registered passes. From 12 → 9.

Public matmul-codegen pass count progression in this PR:
  21 (initial) → 17 → 14 → 12 → 9.

The 9 remaining passes are all genuinely parametric:
  air-matmul-tile-launch-tile, air-matmul-pack-and-transpose,
  air-matmul-bufferize-output-l2, air-matmul-tile-k-and-fuse-packs,
  air-matmul-tile-cores, air-matmul-bufferize-l1-inputs,
  air-matmul-prologue-epilogue, air-matmul-tile-for-vectorize,
  air-matmul-codegen-vec-prep + air-fold-unit-extent-dims (general
  utility). All have substantive options (tile sizes, perms, markers,
  memory spaces) that justify standalone exposure.

All 5 HW tests + 2 prog_ex paths still PASS on NPU2. Lit unchanged
at 390/391.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e orchestrator

Replaces the 9 internal tablegen entries with a single `air-matmul-codegen`
pass taking a flat options struct (~38 fields).  Internal phases become free
functions invoked in fixed order; markers between phases are
orchestrator-internal constants.

Phase placement (each gated by its config; canonicalize/cse runs between):
  A  tile-launch-tile (launch-tile)
  B  pack-and-transpose (l2-pack-sizes + perms)
  C  bufferize-output-l2 — runs before B in single-pack flows (so the
     tile-l3-l2-copies / fuse-output-truncf-first pre-steps see un-packed
     IR), and after B in two-pack flows (so the L2 alloc takes the packed
     shape expected by the L1 pack)
  D  pack-and-transpose (l1-pack-sizes + perms; L1-output bufferize)
  E  tile-k-and-fuse-packs (outer-k-tile-factor)
  F  bufferize-l1-inputs into L2 (auto when D ran)
  H  tile-cores (core-tile)
  I  tile-k-and-fuse-packs (inner-k-tile-factor)
  J  bufferize-l1-inputs into L1 (auto when H ran)
  K  prologue-epilogue (prologue-tile / epilogue-tile)
  L  one-shot-bufferize (one-shot-bufferize, module-scope sub-pipeline)
  M  tile-for-vectorize (matmul-vec-tile)
  N  vec-prep composite (do-vec-prep)

Pipeline pass count (matmul-codegen surface):
  21 (initial PR) → 9 (after stage groupings) → **1 + general-utility
  air-fold-unit-extent-dims**.  The orchestrator runs at ModuleOp scope so
  it can schedule one-shot-bufferize internally.

Pack output bufferization auto-detect:
  - single-pack flow (l1-pack-sizes empty): the L2 pack is the LAST pack;
    its output bufferizes to L1 by default.
  - two-pack flow: the L1 pack is the LAST pack; its output bufferizes to
    L1 by default.
  - bufferize-last-pack-output=false to inspect raw pack semantics
    (used by the pack_basic lit test).

Driver pipeline rewrites (all 5 HW tests + 2 lit tests):
  - tests 37/48/53/54 + prog_ex bf16/i8 each collapse the prior 12-line
    phase list into one or two air-matmul-codegen invocations.
  - tests 48/53/54 use two invocations bracketing
    scf-forall-to-parallel + air-par-to-herd + air-herd-vectorize: first
    invocation runs A-K + L + M; second invocation runs only N.
  - prog_ex bf16/i8 set everything before M to skip and enable only M+N.
  - pack_basic.mlir + tile_copies_basic.mlir RUN lines updated to the new
    surface; expected output adjusted for the post-pack canonicalize.

All 5 HW tests PASS on NPU2.  check-air-mlir 381/388 effective passing
(unchanged from prior commit; the 1 long-standing air_transform_payload.mlir
failure is unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI black check flagged the multi-line tuple in `output_shapes_dtypes` as
unformatted.  Apply black formatting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs `git clang-format origin/main` on the diff vs main, which catches
formatting on all changed lines (not just freshly-touched ones).  My local
`clang-format-17 -i` only formatted the latest edits; this commit applies
clang-format to the cumulative diff against main, normalizing:

- Passes.td (matmul-codegen entry layout)
- AIRMatmulCodegen.cpp (multi-line call argument grouping)
- AIRMatmulCodegenHelpers.{h,cpp} (lambda + multi-line call grouping)
- MatmulCodegenConfig.{h,cpp} (function signature wrapping)
- AIRLinalgCodegen.cpp (signature wrapping)

Verified `ninja air-opt` still builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds --use-cpp-pipeline + --profile-iters to four legacy transform-script
matmul tests; replaces the per-test transform_aie2{,p}.mlir with calls to
the air-matmul-codegen orchestrator.

Migrations and HW perf (NPU2 / Strix, 30 iters median, lower is better):

| Test                                          | Legacy   | CPP      | Status      |
|-----------------------------------------------|----------|----------|-------------|
| 44 triton_matmul_ver4_vector_ptr_opt (aie2p) | 4.715 ms | 4.032 ms | PASS, faster|
| 45 triton_matmul_ver4_strix_8x4              | 3.853 ms | 3.922 ms | PASS, parity|
| 46 triton_matmul_ver4_strix_8x4_i8_i8_i32    | 2.231 ms | 2.064 ms | PASS, faster|
| 39 triton_matmul_ver3_vectorized (NPU1)      | n/a      | n/a      | partial*    |

* Test 39 targets aie2 / NPU1 (Phoenix). The orchestrator pipeline parses
  cleanly and produces valid IR, but downstream aiecc compilation hangs on
  the local Strix machine. Needs NPU1 hardware to fully validate.

Test 12 (matmul_transform_1x4_bf16) NOT migrated: it uses
transform.structured.pad (padding flow) rather than packing, plus a hand-
rolled test.exe harness instead of XRTRunner. Padding-mode orchestrator
support is a separate PR.

Orchestrator extensions:

* Phase C placement: in single-pack flows (l1-pack-sizes empty) Phase C
  now runs BEFORE Phase A (launch-tile) so the L2 alloc lands at LAUNCH
  scope outside any per-core forall. Two-pack flows still run Phase C
  after Phase B (matches existing test 37 behavior).
* Phase F': in single-pack flows with NO tile-cores (e.g. launch-tile-
  only flows like test 39), bufferize the K-fused L1 packs to L1 here
  using the lhs_pack_in_k / rhs_pack_in_k markers (Phase J's
  fused_*_l1_pack markers don't fire without tile-cores).
* runTileLaunchTileImpl now tags the inner per-launch-tile matmul with
  matmul_compute, so downstream tile-for-vectorize finds it in
  launch-tile-only flows where there's no separate tile-cores step.

All 7 NPU2 tests (37, 44, 45, 46, 48, 53, 54) PASS on hardware.
check-air-mlir 381/388 effective passing (unchanged baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…umElements

The MATMUL_CODEGEN_PIPELINE_PLAN.md doc was never checked in, and the
"M0/M1a/M2/M3/M4/M5" + "Group A" milestone numbers it referenced are
meaningless without it. Sweep all source comments and docstrings to drop
both the plan-doc citations and the milestone tags, replacing them with
direct prose about what each phase does relative to the
air-matmul-codegen orchestrator. No behavioural change.

Also dedupe `xilinx::air::getVectorNumElements(VectorType)` against
upstream `mlir::VectorType::getNumElements()` (inherited from
ShapedTypeInterface as part of `extraSharedClassDeclaration`). The local
helper was a literal product-of-shape-dims wrapper; vectors always have
static shape so the upstream `hasStaticShape()` assert is a no-op.
Replaces 5 callsites and deletes the local helper + its decl.

Verified: ninja air-opt builds clean; check-air-mlir 381/388 effective
passing (unchanged baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alentTo

The local helper hand-rolled three cases (SSA equality, AffineApplyOp +
map+operands, ConstantIndexOp + value), all of which reduce to upstream
`OperationEquivalence::isEquivalentTo` semantics: same op kind + same
attributes + operand SSA-equality. The upstream version is also more
general — it correctly recognizes equivalent index expressions built
from arith ops (e.g. `arith.muli %iv, %c4`) which the hardcoded
AffineApply/ConstantIndex check would miss.

Validated:
- check-air-mlir 381/388 effective passing (unchanged baseline).
- All 7 NPU2 HW tests PASS (37, 44, 45, 46, 48, 53, 54).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tching

`air::isEquivalentTo` (Util.cpp) is the air-side wrapper that mirrors
upstream `OperationEquivalence::isEquivalentTo` but adds operand
const-int equivalence: two operand SSAs that are distinct but both fold
to the same constant int are treated as equivalent. For comparing memref
indices this is the right semantics — `affine.apply (%iv, %c0)` and
`affine.apply (%iv, %const_0)` should match when both `%c0`/`%const_0`
fold to 0, even if they are distinct SSA values produced by separate
`arith.constant 0 : index` ops in different scopes.

Validated:
- check-air-mlir 381/388 effective passing (unchanged baseline).
- All 7 NPU2 HW tests PASS.
- test 45 perf: 3.957 ms median (within noise of previous 3.92 ms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper has only two callsites, both in AIRMatmulCodegenHelpers.cpp
(areIdenticalReads + the hoist-loop-invariant transfer scan).  Drop the
header decl and mark the function `static`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntTo

The hand-rolled helper checked op kind + source memref + index SSA-or-
constant equivalence + result vector type. Upstream
`OperationEquivalence::isEquivalentTo(read1, read2, IgnoreLocations)`:

- checks op kind ✓
- checks attributes (permutation_map, in_bounds, padding) — strictly
  better than the hand-rolled version, which silently merged reads with
  different in_bounds / permutation_map
- checks operand SSA equality (indices + source memref) — strict, but
  `eliminate-redundant-vector-transfers` runs after canonicalize so
  duplicate constants are CSE'd into the same SSA value before this fires
- checks result types ✓

Single callsite inlined; helper + header decl deleted.

Validated:
- check-air-mlir 381/388 effective passing (unchanged baseline).
- All 7 NPU2 HW tests PASS.
- test 45 perf: 3.863 ms median (legacy was 3.853 ms; within noise).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled transitive operand walk in dependsOnLoopIV duplicates
air::traceDependentInductionVar, which already traces scf.for IVs,
scf.parallel IVs, herd IDs, scf.for iter_args, peels air.execute, and
gathers operands from parent affine.if guards. The wrapper preserves the
existing predicate semantics (returns true iff loopIV appears in the
collected dependency history) and gains iter_arg + air.execute coverage
for cloneOpAndOperands and the hoist helpers for free.
All callsites are inside AIRMatmulCodegenHelpers.cpp. With the body now
delegating to air::traceDependentInductionVar, the helper has no other
consumers and can become a static file-local.
Replace the hand-rolled per-operand recursion with a top-level operand
walk that calls air::cloneOpAndOperands (Util.cpp) per operand chain.
The slice traversal is now handled by upstream getBackwardSlice with a
per-op filter that rejects ops outside the loop, already mapped, or
IV-dependent (matching the prior per-operand-edge semantics in aggregate).

The local helper retains the entry-level early-return on already-mapped
target results so repeated calls with shared mapping (e.g., the index
loop in runHoistVectorTransferPointers) still de-dup. All callsites are
in this .cpp; helper is now file-static and the header decl is removed.
…ne pass

The standalone AIRFoldUnitExtentDims pass is a thin wrapper around
runFoldUnitExtentDimsOnFunc, only ever co-invoked with air-matmul-codegen
in the 11 known pipelines. Move all responsibility into the orchestrator:

- Add Phase-0 option do-pre-fold-unit-extent-dims (default false). Runs
  the fold once before any other phase. Used as initial IR cleanup when
  invoking the orchestrator on IRON-emitted IR with stray unit dims.
- Phase-N vec-prep already calls the fold (gated vec-prep-fold-unit-
  extent-dims, default true) and covers the post-vec-prep cleanup.
- Drop AIRFoldUnitExtentDims pass class, factory, registration, and
  tablegen entry.
- Update 11 run.py callsites:
  * Tests 39/44/45/46/48/53/54: drop the trailing air-fold-unit-extent-
    dims (vec-prep's internal fold now handles it). Test 39 has no second
    orchestrator, so add a minimal air-matmul-codegen{do-pre-fold-...}
    invocation in its place.
  * prog_examples i8/bf16: replace pre-orchestrator and inter-orchestrator
    folds with do-pre-fold-unit-extent-dims=true on the first orchestrator
    and re-enabled vec-prep-fold-unit-extent-dims (default) on the second.
The local hoistStaticallyBoundAllocationsInFunc<T> template was only
ever instantiated with memref::AllocOp, with the std::is_same branches
gating dealloc insertion to that one type anyway. The xilinx::air::
hoistStaticAllocsInFunc wrapper existed solely to expose a non-template,
namespaced symbol for cross-TU linking.

De-template both helpers, fold them into xilinx::air::, and route the
transform op directly through the public free function. One symbol
instead of two; ~10 lines of branch logic removed.
runHoistLoopInvariantTransfersStep and runHoistVectorTransferPointersStep
both walked herds for innermost scf.for ops with the same ~15-line
hasInnerFor probe. Lift into findInnermostForsInHerds with an optional
herd filter; the second caller passes herdHasVectorContract to skip
fill/epilogue herds.
Python-side timing of XRT kernel invocations is too noisy for
performance evaluation (Python overhead, GIL, GC, scheduling). The
benchmark method and its _make_xrt_backend factory in xrt_runner.py
were a mistake; remove them entirely.

Also drop the corresponding --profile-iters argparse option and
runner.benchmark callsite from 8 test/xrt/*/run.py drivers (37, 39,
44, 45, 46, 48, 53, 54).

Drive-by: rewrite the leftover "Iron-built flow" comment in i8/run.py
and bf16/run.py to "Direct-codegen flow" since it just describes what
--direct-codegen does, not the IR origin.
i8/run.py and bf16/run.py each carried a ~125-line inline transform
script under `if False:`. Permanently unreachable, kept only as
reference. Delete; the transform-dialect path is preserved in git
history if anyone needs to revive it.

Verified i8 --direct-codegen --arch aie2p --compile-mode compile-and-run
still PASSes after the deletion.
Each NPU2 transform-script lit gets a sibling *_cpp.lit that exercises
the C++ orchestrator path via --use-cpp-pipeline. Until now the cpp
pipeline had zero CI coverage (manual --use-cpp-pipeline only); the
existing transform-script lits were the sole exercised path.

Tests covered: 37, 44, 45 (each: standard + elf), 46, 48, 53, 54.

For 46 and 54 the existing lit is Makefile-driven; the new cpp lit
invokes run.py directly with FileCheck since the Makefile run target
doesn't pipe through --use-cpp-pipeline. Test 39 is intentionally
skipped: its cpp pipeline currently targets NPU1 (aie2) only and was
flagged as a partial migration in the prior PR session.

Each new lit was validated locally on NPU2 (PASS).
The Phase-0 fold is idempotent and a no-op on already-clean IR, so
there's no reason to make callers opt in. Flip the default and drop
the explicit `do-pre-fold-unit-extent-dims=true` in the i8/bf16
prog_examples and the test 39 fold-only invocation.
The orchestrator had 52 options. 7 of them were pure structural
toggles or always-on bools that no caller varied:

- do-pre-fold-unit-extent-dims (default true; phase always runs)
- do-vec-prep (default true; pre-vectorize the 7 of 8 vec-prep walks
  find no IR to operate on; the 8th is the same fold Phase 0 already
  runs)
- vec-prep-fold-unit-extent-dims
- vec-prep-eliminate-redundant-vector-transfers
- vec-prep-hoist-loop-invariant-transfers
- vec-prep-flatten-for-iter-args
- vec-prep-hoist-vector-transfer-pointers

All seven are removed from Passes.td, the orchestrator runOnFunc, and
runCodegenVecPrepImpl. The 9 run.py callers stop having to pass
do-vec-prep=false on the first orchestrator invocation; first-call
vec-prep is now a no-op walk.

Verified on NPU2 HW:
- Test 46 cpp profile: 6182 -> 6130 gflops (-0.84%, within ±5% noise)
- Test 54 cpp profile: 116.7 -> 116.7 gflops (-0.04%)
- Correctness sweep tests 37/44/45/46/48/53/54 cpp pipeline: all PASS
erwei-xilinx and others added 11 commits May 8, 2026 18:27
Replace test 45's two mutually exclusive paths (--use-cpp-pipeline opt-in)
with a single transform-script path that delegates the matmul-specific
work to air-matmul-codegen via transform.apply_registered_pass. The
script keeps the non-matmul plumbing (scf.forall->herd, herd-vectorize,
canon/cse) but no longer hand-rolls the tile/pack/bufferize/vectorize
sequence -- those are orchestrator options now.

- transform_aie2p.mlir: 355 -> 76 lines. All transform.air.* matmul ops
  replaced by two apply_registered_pass "air-matmul-codegen" calls
  (pre-vectorize half + vec-prep half).
- run.py: drop --use-cpp-pipeline argparse + cpp_pipeline branch.
- Drop the run_npu2_peano_cpp.lit + run_npu2_peano_elf_cpp.lit drivers
  (they exercised the now-removed path).

Verified on NPU2: existing run_npu2_peano.lit + run_npu2_peano_elf.lit
both PASS through the rewritten script (3 runs each).

This is the validation lap for the unified-path proposal. The remaining
6 tests (37, 44, 46, 48, 53, 54) follow the same pattern and will be
migrated in subsequent commits.
Same shape as test 45: both transform_aie2{,p}.mlir scripts (NPU1, NPU2)
become thin wrappers around two air-matmul-codegen invocations with
intermediate par-to-herd + herd-vectorize. Drop --use-cpp-pipeline +
--arch plumbing in run.py. Drop run_npu2_peano_cpp.lit + elf variant.

NPU2 validated: existing run_npu2_peano.lit + run_npu2_peano_elf.lit
both PASS through the rewritten script. NPU1 not validated locally; CI
will exercise via run_npu1_peano.lit.
…_pass

NPU2 transform_aie2p.mlir becomes a thin wrapper around one
air-matmul-codegen invocation (two-pack-level flow). Drop the cpp
pipeline branch and the cpp-pipeline lits. NPU1 transform_aie2.mlir is
left as the legacy hand-rolled path (the cpp pipeline string only
covered NPU2; no NPU1 orchestrator options were exercised before).

Verified on NPU2: run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS.
Same shape as test 45/44; vec-prep cast type is i32 (i8 acc -> i32) instead
of f32. Drop --use-cpp-pipeline plumbing in run.py + run_npu2_peano_cpp.lit.
NPU2 validated: PASS!
bf16-out variant: sets fuse-output-truncf-first=true (pre-step) and
vec-prep-hoist-cast-pairs=true (cleans up f32-to-bf16 trunc/extf pairs
around iter_arg). Drop --use-cpp-pipeline plumbing + cpp lit. NPU2 PASS.
bf16-out non-tile-aligned variant: same shape as test 48; transform
script defaults to k-l2-tile=16. run.py keeps a small re.sub block to
rewrite k-l2-tile + outer-k-tile-factor when --k-l2-tile differs.

Drop --use-cpp-pipeline plumbing + the cpp lit. NPU2 PASS.
…5/44/46/48/53

Test 54 (f32-in/out + BFP16 emulation) joins tests 45/44/46/48/53 with a
unified transform-script path that delegates matmul codegen to the C++
air-matmul-codegen orchestrator via transform.apply_registered_pass.
Drop --use-cpp-pipeline branch + run_npu2_peano_cpp.lit.

Profiling test 54 surfaced a -32% perf regression that affected ALL the
unified scripts in this PR: replacing `func.func(canonicalize,cse,fold-
memref-alias-ops)` (full passes that iterate to fixed point + DCE) with
transform.apply_patterns (one-shot pattern application) lost critical
canonicalization. Fix: switch the cleanup blocks in tests 45/44/46/48/
53/54 to chain three transform.apply_registered_pass invocations
("canonicalize", "cse", "fold-memref-alias-ops") on a fresh func handle
each, matching the original pass semantics.

Verified on NPU2:
- Test 46 cpp profile: 6182 -> 6119 gflops (-1.02%, within +/-5% noise)
- Test 54 cpp profile: 116.7 -> 115.8 gflops (-0.79%)
- Correctness sweep tests 44, 45, 46, 48, 53 all PASS
The two-line rc-then-exit shape was a leftover from the prior
--profile-iters plumbing (which checked rc==0 before benchmarking).
With --profile-iters gone, rc was single-use and the indirection
served no purpose. Collapse across all 8 test/xrt run.py drivers.
Finish unifying the NPU1 paths that were left as legacy in the prior
pass: test 37's transform_aie2.mlir (two-pack mmul=4x4x8) and test 39's
transform.mlir (single-pack mmul=4x4x8). Drop test 39's now-dead
--use-cpp-pipeline branch.

Both target NPU1 (aie2/Phoenix) which I cannot validate locally
(aiecc compile of NPU1 IR hangs on Strix host, known issue). Scripts
parse cleanly; CI run_npu1_peano.lit drivers will exercise the actual
HW path. Options derived from the legacy script's tile/pack/permute
sequence (test 37) and from the previously-existing cpp pipeline string
in run.py (test 39, which I deleted in the same commit).
OperationEquivalence::isEquivalentTo (introduced in 773953c) is strict
on operand SSA equality, so it fails to merge two vector.transfer_read
ops whose indices are computed by distinct-but-identical affine.apply
ops, or come from two scf.for iter_args initialized to the same value.
The transform-op path in AIRLinalgCodegen.cpp::areIdenticalReads still
uses the value-aware matcher and correctly catches both cases.

Concrete impact: the orchestrator's vec-prep left an extra B-tile
vector.transfer_read in the inner loop of the bf16 llama matmul (M=128,
K=2048, N=512/2048/8192). After hoist-vector-transfer-pointers lifted
the index into a redundant iter_arg, Peano's MC encoder hit
'getmBMsOpValue: Register not in mBMs' UNREACHABLE on the resulting
VLD_x_pstm_nrm_imm_pseudo. mlir-aie@b37dc33's placer changes exposed
the regression in CI; older mlir-aie scheduled around it.

Inline the AIRLinalgCodegen.cpp::areIdenticalReads body
(base + index-by-index areEquivalentIndices + vector type) and extend
areEquivalentIndices to recognise affine.apply ops with identical map
and operands. With the fix, core_7_5.opt.ll for run_npu2_llama_8x4_kv
is byte-identical to the transform-script output on origin/main, and
all four llama lit tests (qo, kv, gate_up, down) compile cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx marked this pull request as ready for review May 12, 2026 17:52
@erwei-xilinx erwei-xilinx requested a review from fifield as a code owner May 12, 2026 17:52
Copilot AI review requested due to automatic review settings May 12, 2026 17:52
@erwei-xilinx erwei-xilinx requested a review from jgmelber as a code owner May 12, 2026 17:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces per-test transform-dialect matmul scripts with a single registered MLIR pass (air-matmul-codegen) that orchestrates tiling/packing/bufferization/vectorization as a fixed-order C++ pipeline driven by flat pass options, and updates several XRT tests/examples to invoke the new orchestrator.

Changes:

  • Add the air-matmul-codegen orchestrator pass (+ supporting phase implementations and headers) and register/build it.
  • Add lit tests covering copy-tiling and pack/transpose behavior.
  • Update multiple matmul XRT tests/examples to use transform.apply_registered_pass (or a PassManager pipeline) to drive codegen via the C++ orchestrator.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/xrt/54_matmul_padding_f32_bf16_emulation/transform_aie2p.mlir Replace legacy transform script with orchestrator pass invocations + cleanup passes.
test/xrt/54_matmul_padding_f32_bf16_emulation/run.py Rewrite transform script options (k-l2-tile / outer-k factor / epilogue-tile) before running transform.
test/xrt/53_matmul_padding_bf16/transform_aie2p.mlir Replace legacy transform script with orchestrator pass invocations + cleanup passes.
test/xrt/53_matmul_padding_bf16/run.py Rewrite orchestrator options in script; add debug print-and-exit flag.
test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_output/run.py Clarify script role; add optional final-IR dump env hook.
test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32/run.py Update CLI help and comments to reflect orchestrator-backed script.
test/xrt/45_triton_matmul_ver4_strix_8x4/run.py Update comments/help to reflect orchestrator-backed script.
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2p.mlir Replace large annotated script with orchestrator-based transform script (AIE2P).
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2.mlir Replace large annotated script with orchestrator-based transform script (AIE2).
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/run.py Update CLI help and comments to reflect orchestrator-backed script.
test/xrt/39_triton_matmul_ver3_vectorized/run.py Add --use-cpp-pipeline to run the C++ orchestrator pipeline directly.
test/xrt/37_matmul_transform_4x4_bf16/transform_aie2p.mlir Replace legacy transform script with a single orchestrator invocation (two-pack flow).
test/xrt/37_matmul_transform_4x4_bf16/run.py Update comments to describe orchestrator-backed vs legacy script roles.
programming_examples/matrix_multiplication/i8/run.py Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline.
programming_examples/matrix_multiplication/bf16/run.py Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline.
mlir/test/Transform/AIRMatmulTileL3ToL2Copies/tile_copies_basic.mlir New lit test for memref.copy→linalg.copy conversion + K-tiling + loop markers.
mlir/test/Transform/AIRMatmulPackAndTranspose/pack_basic.mlir New lit test for pack + transpose permutations and canonicalization behavior.
mlir/lib/Util/Util.cpp Add air::findOpWithAttr implementation.
mlir/lib/Util/MatmulCodegenConfig.cpp Add carrier-attribute reader/writer utilities for matmul codegen configs.
mlir/lib/Util/CMakeLists.txt Build MatmulCodegenConfig.cpp into AIRUtil.
mlir/lib/Transform/Passes.cpp Register the new air-matmul-codegen pass.
mlir/lib/Transform/CMakeLists.txt Add new matmul codegen source files to the Transform library.
mlir/lib/Transform/AIRMatmulVectorizePasses.cpp Implement vectorization/vec-prep phases as free functions for the orchestrator.
mlir/lib/Transform/AIRMatmulTilePasses.cpp Implement tiling phases (launch/K/core/prologue+epilogue) as free functions.
mlir/lib/Transform/AIRMatmulTileL3ToL2Copies.cpp Implement L3→L2 copy tiling as a free function invoked by orchestrator.
mlir/lib/Transform/AIRMatmulPackAndTranspose.cpp Implement pack+transpose phase and optional output bufferization.
mlir/lib/Transform/AIRMatmulCodegen.cpp New orchestrator pass wiring phases and canonicalize/CSE/one-shot-bufferize.
mlir/lib/Transform/AIRMatmulBufferizationPasses.cpp Implement output/L1 bufferization and post-bufferize cleanup steps.
mlir/lib/Transform/AIRLinalgBufferize.cpp Refactor hoist-static-alloc helper into a non-templated utility function.
mlir/include/air/Util/Util.h Declare findOpWithAttr and templated findOpOfTypeWithAttr.
mlir/include/air/Util/MatmulCodegenConfig.h Declare carrier-attribute schema and helper APIs.
mlir/include/air/Transform/Passes.td Add TableGen definition/options/docs for air-matmul-codegen.
mlir/include/air/Transform/Passes.h Export new matmul codegen/phase headers from pass umbrella header.
mlir/include/air/Transform/PassDetail.h Add generated pass defs for new passes.
mlir/include/air/Transform/AIRMatmulVectorizePasses.h Header for orchestrator vectorization/vec-prep free functions.
mlir/include/air/Transform/AIRMatmulTilePasses.h Header for orchestrator tiling free functions.
mlir/include/air/Transform/AIRMatmulTileL3ToL2Copies.h Header for orchestrator L3→L2 copy tiling free function.
mlir/include/air/Transform/AIRMatmulPackAndTranspose.h Header for orchestrator pack+transpose free function.
mlir/include/air/Transform/AIRMatmulCodegenHelpers.h New shared helper entry points used by both transform ops and C++ passes.
mlir/include/air/Transform/AIRMatmulCodegen.h Header declaring the orchestrator pass factory functions.
mlir/include/air/Transform/AIRMatmulBufferizationPasses.h Header for orchestrator bufferization/cleanup free functions.
mlir/include/air/Transform/AIRLinalgBufferize.h Export hoistStaticAllocsInFunc utility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlir/lib/Transform/AIRMatmulCodegen.cpp
Comment thread mlir/lib/Transform/AIRMatmulTilePasses.cpp Outdated
Comment thread mlir/lib/Transform/AIRMatmulVectorizePasses.cpp
Comment thread mlir/include/air/Transform/Passes.td
Comment thread mlir/lib/Util/MatmulCodegenConfig.cpp Outdated
Comment thread mlir/lib/Util/MatmulCodegenConfig.cpp
Comment thread mlir/lib/Transform/AIRMatmulPackAndTranspose.cpp Outdated
Comment thread mlir/include/air/Util/Util.h Outdated
Comment thread test/xrt/53_matmul_padding_bf16/run.py
erwei-xilinx and others added 2 commits May 12, 2026 11:11
Eight Copilot review comments:

* MatmulCodegenConfig.cpp: switch the carrier dictionary to the
  setDiscardableAttr / getDiscardableAttr APIs the header documented.
  setAttr / getAttrOfType happened to route the namespaced attr to
  discardable storage, but the explicit APIs make intent obvious and
  match getDiscardableAttrs() snapshots elsewhere.

* AIRMatmulPackAndTranspose.cpp: the snapshot/restore pair around
  linalg::pack already used getDiscardableAttrs() to read; the restore
  now uses setDiscardableAttr to symmetrically write so carrier
  metadata survives discardable-attr snapshotting downstream.

* Util.h findOpWithAttr / findOpOfTypeWithAttr: comment said the
  helpers searched for a discardable attribute, but the implementation
  uses Operation::hasAttr (matches both inherent and discardable). Fix
  the comment to describe actual behavior; both helpers are used for
  unit-attr markers (regular) and discardable carrier configs alike.

* AIRMatmulCodegen Phase L: phase L runs module-scoped one-shot
  bufferize inside a per-func loop, so on a multi-func module the first
  call would bufferize all funcs and subsequent calls would see
  already-memref IR for their tensor-based phases. All current callers
  build a single-func kernel; emit a clear error when the module has
  more than one func.func and one-shot-bufferize is enabled, instead of
  silently misbehaving.

* AIRMatmulTilePasses runTileKAndFusePacksImpl: stop silently clamping
  k-iter-index with std::min; emit an error for an out-of-range value.

* AIRMatmulVectorizePasses tileWithScfFor: pad with zeros only when the
  caller passes fewer sizes than the iteration domain rank. If the
  caller passes more, emit an error rather than truncating or letting
  setTileSizes misbehave.

* Passes.td air-matmul-codegen description: clarify that phase N
  (vec-prep) always runs and is a cheap no-op on pre-vectorize IR (the
  individual steps walk for vector ops). The previous text suggested an
  "N=false" toggle that does not exist.

* test/xrt/53_matmul_padding_bf16/run.py: drop unused `import os`.

Validation: incremental ninja build + ninja install OK; check-air-mlir
381 passing, 7 expected-fail, 1 pre-existing failure
(Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir,
also fails on origin/main); local NPU2 smoke test of bf16 llama_8x4_kv
and i8 run4x4 with --compile-mode compile-only both pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hapes

The bf16/i8/i16 matmul generators previously hard-asserted M % tile_m == 0
and N % tile_n == 0, so any unaligned GEMM shape (e.g. typical model
layers) had to use a separate hand-written example (test/xrt/53,
test/xrt/54).  Move that capability into the standard generator so a
single canonical path handles aligned and padded shapes alike.

Per example, in __main__:
  * Compute M_padded = ceil(args.m / (tile_m * herd_m)) * (tile_m * herd_m)
    and likewise for N_padded.
  * Pass the padded shape to build_module so the IR is constructed for a
    full-launch-tile grid.
  * When padding is needed, walk the resulting IR for the single
    air.launch op and attach
        air.actual_sizes = array<i64: M_actual, N_actual, 1>.
    The aircc pipeline already runs air-split-launch-for-padding (see
    tools/aircc/aircc.cpp:1028); when the attribute is absent the pass
    is a no-op, so aligned shapes keep producing byte-identical IR to
    before this change.
  * Allocate input_a / input_b at the padded shape, fill the
    [0:M, :] / [:, 0:N] interior with random data, leave the tail zero.
    XRTRunner sees a (M_padded, N_padded) output buffer; sampled-output
    indices stay in [0, M) x [0, N) so we only validate the interior.

Inside build_module:
  * Replace the affine.apply that computed launch-iv * (tile_m * herd_m)
    with arith.muli on the launch block ID.  inferTileSize in
    mlir/lib/Transform/AIRSplitLaunchForPadding.cpp only recognises
    arith.muli of the launch ID; with affine.apply the pass would emit
    "could not infer tile sizes from launch body offset computations"
    and fail.
  * Replace the M %% tile_m / N %% tile_n asserts with the padded
    invariants M %% (tile_m * herd_m) == 0 / N %% (tile_n * herd_n) == 0,
    which the new __main__ guarantees.

Validation: bf16/i8/i16 each compile cleanly for both an aligned shape
(M=N=K=256) and an unaligned shape (M=N=200, K=256) on aie2p with
direct codegen.  bf16 also compiles M=100, N=500, K=2048 on the 8x4
herd (padded to 128x512), confirming padding works at LLAMA-class
scales through the same generator.  The existing run_npu2_*.lit driver
for the bf16 llama_8x4_kv shape (M=128, N=512, K=2048, all aligned)
still compiles to the same payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants