Replace matmul transform-dialect scripts with C++ pass pipeline by erwei-xilinx · Pull Request #1593 · Xilinx/mlir-air

erwei-xilinx · 2026-05-07T22:03:10Z

Summary

Replace the per-test transform-dialect scripts driving matmul tiling/bufferization/vectorization with a single C++ MLIR pass taking a flat options struct. Goal: parametric, generally-applicable, debuggable, individually testable; supersedes the per-test transform_aie2*.mlir scripts.

Status: draft / WIP — pipeline is functional and HW-validated end-to-end on NPU2 for 9 matmul examples (7 NPU2 HW-validated, 1 NPU1 partial, 1 NPU1 deferred).

Scope of this PR: the air-matmul-codegen orchestrator + carrier-attribute infrastructure as the external API. All configuration is specified externally via the orchestrator's options. Automatic shape→config heuristics are deferred to a follow-up PR.

What's in this PR

Single public pass surface

After several rounds of consolidation, the matmul codegen pipeline is now exposed as one registered pass (air-matmul-codegen) with ~38 flat options, plus the general-utility air-fold-unit-extent-dims (used standalone outside the matmul flow). All internal phases — launch tile, packs, K-tiles, core tile, bufferization to L1/L2 allocs, prologue/epilogue, one-shot-bufferize, tile-for-vectorize, vec-prep — are free-function bodies invoked in fixed order from the orchestrator. Markers between phases are orchestrator-internal constants, not configurable.

Pipeline pass count (matmul-codegen surface): 21 (initial) → 9 (after stage groupings) → 1.

Phase order (each gated by its config)

A  tile-launch-tile (launch-tile)
B  pack-and-transpose (l2-pack-sizes + perms)
C  bufferize-output-l2 — runs before B in single-pack flows (so the
   tile-l3-l2-copies / fuse-output-truncf-first pre-steps see un-packed
   IR), and after B in two-pack flows (so the L2 alloc takes the packed
   shape expected by the L1 pack)
D  pack-and-transpose (l1-pack-sizes + perms; L1-output bufferize)
E  tile-k-and-fuse-packs (outer-k-tile-factor)
F  bufferize-l1-inputs into L2 (auto when D ran)
H  tile-cores (core-tile)
I  tile-k-and-fuse-packs (inner-k-tile-factor)
J  bufferize-l1-inputs into L1 (auto when H ran)
K  prologue-epilogue (prologue-tile / epilogue-tile)
L  one-shot-bufferize (one-shot-bufferize)
M  tile-for-vectorize (matmul-vec-tile)
N  vec-prep composite (do-vec-prep)

The orchestrator runs at ModuleOp scope so it can schedule one-shot-bufferize internally. Empty list option / 0 / false skips the corresponding phase. Pack-output bufferization to L1 auto-targets the LAST pack (L1 in two-pack flows, L2 in single-pack flows); bufferize-last-pack-output=false is available to inspect raw pack semantics (used by the pack_basic lit test).

Driver pipeline rewrites (9 matmul examples + 2 lit tests)

Example	Target	Pre-PR (legacy)	Post-PR (cpp)	Status
`programming_examples/matrix_multiplication/bf16`	iron-built	—	vec stages only	✅
`programming_examples/matrix_multiplication/i8`	iron-built	—	vec stages only	✅
`test/xrt/37_matmul_transform_4x4_bf16`	NPU2 (4×4, bf16, two-pack)	transform_aie2p.mlir	full A–N	✅
`test/xrt/44_triton_matmul_ver4_vector_ptr_opt`	NPU2 (4×4, bf16-out f32)	transform_aie2p.mlir	full A–N	✅ 4.72 → 4.03 ms
`test/xrt/45_triton_matmul_ver4_strix_8x4`	NPU2 (Strix, bf16-out f32)	transform_aie2p.mlir	full A–N	✅ 3.85 / 3.92 ms parity
`test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32`	NPU2 (Strix, i8)	transform_aie2p.mlir	full A–N	✅ 2.23 → 2.06 ms
`test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_output`	NPU2 (4×4, bf16-out bf16)	transform_aie2p.mlir	full A–N	✅
`test/xrt/53_matmul_padding_bf16`	NPU2 (4×4, bf16 padding)	transform_aie2p.mlir	full A–N	✅
`test/xrt/54_matmul_padding_f32_bf16_emulation`	NPU2 (4×4, f32 bf16-emul)	transform_aie2p.mlir	full A–N	✅
`test/xrt/39_triton_matmul_ver3_vectorized`	NPU1 (Phoenix, single-pack)	transform.mlir	partial	⚠️ orchestrator parses + IR is structurally valid; downstream `aiecc` hangs on Strix machine. Needs NPU1 HW to validate.
`test/xrt/12_matmul_transform_1x4_bf16`	NPU1 (Phoenix, 1×4, bf16)	transform.mlir	not migrated	⏭️ uses `transform.structured.pad` (padding, not packing) + custom `test.exe` harness; padding-mode orchestrator support is a separate PR

Per-test driver pipeline shape:

Test 37 (two-pack): one invocation runs A–N end-to-end (orchestrator wraps one-shot-bufferize + vec-prep).
Tests 44/45/46/48/53/54 (single-pack with herd materialization): two invocations bracketing scf-forall-to-parallel + air-par-to-herd + air-herd-vectorize — first runs A–M (do-vec-prep=false), second runs only N.
prog_ex bf16/i8: iron-built; orchestrator runs only M (vec tile) + N (vec prep).
Test 39 (NPU1 single-pack): one invocation A–M + post-orchestrator herd materialization.
pack_basic.mlir + tile_copies_basic.mlir RUN lines updated; expected output adjusted for post-pack canonicalize.

Other code cleanup

Consolidated duplicate findMarkedOp / findMarkedForLoop into air::findOpWithAttr / air::findOpOfTypeWithAttr<OpTy> in air/Util/Util.h.
Replaced local ConvertMemrefCopyToLinalgCopyPattern with the existing runConvertMemrefCopyToLinalgCopy helper.
Replaced Option<std::string> + custom int-list parser with tablegen ListOption<int64_t>.
Extracted tileAsForallResult helper for the scf::tileUsingSCF{ForallOp} + replaceOp sequence shared across three internal phases.
Replaced ad-hoc topological worklist in hoistInterveningDeps with upstream mlir::computeTopologicalSorting.

Carrier attribute (`air.matmul_codegen_config`)

Kept in the codebase for future heuristic use. The orchestrator currently does NOT read it (per-phase options are passed explicitly by the caller). External producers (autotuners, tools, future heuristic) can set the attribute without changing the pipeline; reintroducing read-paths is a one-line change in each phase impl.

Test results

All 7 NPU2 HW tests PASS (37 / 44 / 45 / 46 / 48 / 53 / 54), per-iter timings as in the table above.
prog_ex bf16 + i8 PASS on NPU2.
check-air-mlir: 381/388 effective passing — one long-standing failure (Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir) unrelated to this PR.

What's NOT in this PR (follow-ups)

NPU1 HW validation for tests 12 & 39. Test 39 has a --use-cpp-pipeline path; orchestrator pipeline runs cleanly but aiecc hangs on the local Strix when targeting NPU1 IR. Needs Phoenix HW + investigation. Test 12 not yet migrated (padding flow).
Padding-mode orchestrator support. Test 12 uses transform.structured.pad instead of pack. Adding a padding option + params to air-matmul-codegen is a separate PR.
Heuristic that derives air.matmul_codegen_config from matmul shape + target — separate PR.
Derivation-driven tile_cores (needs modelling air-collapse-herd).
Heuristic for two-pack-level auto-detection.
Triton-XDNA driver.py invocation of the cpp pipeline (lives in the Triton-XDNA repo).
Replace runHoistLoopInvariantTransfers (~150 LOC) with upstream linalg::hoistRedundantVectorTransfers — investigated locally; semantics diverge for the multi-pair-aliasing-same-memref case where upstream conservatively refuses while our areEquivalentIndices-driven check correctly hoists. Needs upstream extension or a hybrid wrapper.
Optional runEliminateRedundantVectorTransfers / runFuseTruncfLinalg upstream-replacement micro-trims (~60 LOC combined, marginal).

Replaces the per-test transform-dialect scripts driving matmul tiling/bufferization/vectorization with a sequence of focused C++ MLIR passes. Goal: parametric, generally-applicable, debuggable, individually testable; eventually supersede the per-test transform_aie2*.mlir scripts. See MATMUL_CODEGEN_PIPELINE_PLAN.md for the full design and status table. Milestones landed: - M0: air-matmul-pack-and-transpose, air-matmul-tile-l3-to-l2-copies (byte-identical IR vs transform-script Phases 1+3). - M1: Group B (vectorization-prep) passes for prog_ex matmul/{bf16,i8} (HW-validated on NPU2). - M2: Group A passes for tests 53/54 single-pack flow (HW-validated). - M3a/M3b: air.matmul_codegen_config carrier attr + heuristic pass + L1-fit guardrail; --use-cpp-pipeline implies M3 (HW-validated; shape-sweep 5/6 PASS). - M4a: two-pack-level (test 37) infrastructure incl. two marker-flow fixes in tile-k-and-fuse-packs (HW-validated; perf within noise of legacy across all three tests). Tests: - 390/391 lit tests pass (the 1 failure is pre-existing, unrelated). - Tests 37/53/54 cpp paths PASS on NPU2 hardware. - prog_ex matmul/{bf16,i8} cpp paths PASS on NPU2. Backup commit; not yet a PR. Outstanding: M2d (delete obsolete per-test transform scripts), M3c (real derivation for tile_cores), M4b (heuristic for two-pack), M5 (Triton-XDNA backend integration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eline Test 48's transform_aie2p.mlir maps phase-by-phase to existing M0/M1/M2 passes — no new infrastructure needed; the test 53 cpp pipeline string was reusable verbatim. The heuristic produces the same tile/pack/vector parameters that the legacy script hand-wires (pack [8,8,8], tile_cores [8,8,0], K-tile 8, prologue [8,8], epilogue [64,64], vector [2,2,1,...]). Hardware-validated on NPU2: PASS for both legacy and cpp paths. Perf parity confirmed (3-run min times: legacy 0.211ms vs cpp 0.208ms — within noise). Open question on `air-hoist-cast-pairs` resolved: fixed-point converges to STRUCTURALLY identical IR vs the 4 sequential `transform.air.hoist_cast_pair` calls in the legacy script (same op counts, alloc shapes, nesting; only diffs are SSA renumbering and missing `prologue_herd`/`compute_herd`/`epilogue_herd` annotations on cpp herds — purely cosmetic). Adds AIR_DUMP_FINAL_IR env var for debugging. Phase 2 (Triton-XDNA's driver.py invokes the cpp pipeline directly, removing the obsolete tile-and-promote default + transform-script loading) lives in the Triton-XDNA repo, not mlir-air. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Removes air-matmul-set-codegen-config (the M3a/M3b pass that wrote the air.matmul_codegen_config carrier attribute via a hardcoded target+dtype lookup). Heuristics belong in a separate PR; this PR should only expose the consumer pipeline with all configuration specified externally. What stays: - The carrier-attribute infrastructure (MatmulCodegenConfig.{h,cpp}) is kept as the external API. Each consumer pass reads air.matmul_codegen_config when present and falls back to its pass-options otherwise — so external producers (a future heuristic PR, autotuners, end-user tooling) can specify configuration without changing the pipeline. - All consumer passes are unchanged. What changes in tests: - Tests 53, 54, 48 each pass tile/pack/vector parameters explicitly via per-pass options strings (M2 style). Test 37 was already written this way. - --use-codegen-config flag removed; --use-cpp-pipeline is the only toggle. HW-validated on NPU2 (all five paths PASS): - prog_ex matmul/{i8, bf16} - test/xrt/{37, 48, 53, 54} cpp pipelines 390/391 lit tests pass (the 1 failure is pre-existing, unrelated). Also scrubs all iree / iree-amd-aie references from PR-scope files (MATMUL_CODEGEN_PIPELINE_PLAN.md only — no source code referenced those projects). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This planning doc should not be committed. PR description carries the summary; in-tree planning docs are not wanted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit-driven cleanup of the new matmul-codegen passes: 1. Promote findMarkedOp/findMarkedForLoop to air/Util/Util.h as findOpWithAttr / findOpOfTypeWithAttr<OpTy>. Replaces 3 local redefinitions plus 4 inlined walks across the new files. 2. Drop the local ConvertMemrefCopyToLinalgCopyPattern in AIRMatmulTileL3ToL2Copies.cpp; call the existing runConvertMemrefCopyToLinalgCopy helper instead (already exposed in AIRMatmulCodegenHelpers.h). 3. Extract tileAsForallResult helper for the scf::tileUsingSCF{LoopType=ForallOp} + replaceOp sequence. AIRMatmulTileCores and AIRMatmulTileLaunchTile now share it instead of open-coding the same 8 lines. 4. Replace Option<std::string> + parseIntList with tablegen ListOption<int64_t> for tile-sizes / prologue-tile-sizes / epilogue-tile-sizes / fill-iterator-interchange. Deletes the parseIntList helper outright. Net ~30 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The intervening-deps move-above-anchor logic in hoistInterveningDeps had its own iterative "find ops with all-resolved operands and move them" loop. Upstream provides mlir::computeTopologicalSorting in mlir/Analysis/TopologicalSortUtils.h that does exactly this — sorts an arbitrary op set in dependency order, treating ops outside the set as already-ready (incomplete-chain semantics). After sorting, a single moveBefore pass yields the same result. Net 13 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four of the registered passes had no parametric external config — they walked for a marker and called a 1-2 line helper. They were noise in the public pass registry without adding an external API surface. Dropped from tablegen registration: - air-matmul-cleanup-bufferize - air-matmul-fuse-pingpong-loops - air-matmul-fuse-output-truncf - air-hoist-static-alloc Replaced by: - One new combined pass air-matmul-post-bufferize-cleanup (cleanup + pingpong-fuse run back-to-back in all tests using them). - Boolean option `fuse-output-truncf-first` on air-matmul-bufferize-output-l2 (fuse-truncf must precede the L2 fill bufferize so the fill's element type matches the post-fuse matmul). - Boolean option `hoist-static-alloc-first` on air-matmul-prologue-epilogue (used by the M4 K-peel flow). The pass bodies survive as plain C++ functions in AIRMatmulBufferizationPasses.h (`runFusePingpongLoopsImpl`, `runFuseOutputTruncfImpl`, `runHoistStaticAllocImpl`), called either from the combined pass or from the option-driven steps in the parametric passes. Net 69 LOC removed; public pass count drops by 3 (4 dropped, 1 added). All 5 HW tests still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Six of the M1 vec-prep passes (eliminate-redundant-vector-transfers, flatten-for-iter-args, hoist-loop-invariant-transfers, hoist-vector-transfer-pointers, vector-cast-for-emulation, hoist-cast-pairs) were registered separately but in practice ALWAYS fire as a fixed-order sequence inside the matmul codegen pipeline. Nobody invokes them individually outside this workflow. Replaced with one composite air-matmul-codegen-vec-prep that runs them in fixed order, with boolean options to enable/disable each step and two parallel option groups (cast1-* / cast2-*) for the 0-2 vector-cast-for-emulation invocations (test 54 needs two with different target element types; others need one or zero). air-fold-unit-extent-dims kept registered separately because the prog_ex pipelines invoke it standalone outside the vec-prep block. Per-step bodies survive as static helper functions inside the same TU, called by the composite. The runFoo helpers in AIRMatmulCodegenHelpers.cpp are unchanged. Net 207 LOC removed; 5 fewer registered passes (6 dropped, 1 added). All 5 HW tests + 2 prog_ex paths still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Continued the matmul-named pass-registry cleanup: - air-matmul-tile-l3-to-l2-copies → option `do-tile-l3-to-l2-copies` on bufferize-output-l2 (with k-l2-tile + copy-marker options moved over). Body extracted as `runTileL3ToL2CopiesImpl` for direct call. - air-matmul-bufferize-l1-output → option `do-bufferize-l1-output` on pack-and-transpose (tail step). Body extracted as `runBufferizeL1OutputImpl`. - air-matmul-post-bufferize-cleanup → option `do-post-bufferize-cleanup-first` on tile-for-vectorize (pre-step). Body extracted as `runPostBufferizeCleanupImpl`. Net 3 fewer registered passes. From 12 → 9. Public matmul-codegen pass count progression in this PR: 21 (initial) → 17 → 14 → 12 → 9. The 9 remaining passes are all genuinely parametric: air-matmul-tile-launch-tile, air-matmul-pack-and-transpose, air-matmul-bufferize-output-l2, air-matmul-tile-k-and-fuse-packs, air-matmul-tile-cores, air-matmul-bufferize-l1-inputs, air-matmul-prologue-epilogue, air-matmul-tile-for-vectorize, air-matmul-codegen-vec-prep + air-fold-unit-extent-dims (general utility). All have substantive options (tile sizes, perms, markers, memory spaces) that justify standalone exposure. All 5 HW tests + 2 prog_ex paths still PASS on NPU2. Lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e orchestrator Replaces the 9 internal tablegen entries with a single `air-matmul-codegen` pass taking a flat options struct (~38 fields). Internal phases become free functions invoked in fixed order; markers between phases are orchestrator-internal constants. Phase placement (each gated by its config; canonicalize/cse runs between): A tile-launch-tile (launch-tile) B pack-and-transpose (l2-pack-sizes + perms) C bufferize-output-l2 — runs before B in single-pack flows (so the tile-l3-l2-copies / fuse-output-truncf-first pre-steps see un-packed IR), and after B in two-pack flows (so the L2 alloc takes the packed shape expected by the L1 pack) D pack-and-transpose (l1-pack-sizes + perms; L1-output bufferize) E tile-k-and-fuse-packs (outer-k-tile-factor) F bufferize-l1-inputs into L2 (auto when D ran) H tile-cores (core-tile) I tile-k-and-fuse-packs (inner-k-tile-factor) J bufferize-l1-inputs into L1 (auto when H ran) K prologue-epilogue (prologue-tile / epilogue-tile) L one-shot-bufferize (one-shot-bufferize, module-scope sub-pipeline) M tile-for-vectorize (matmul-vec-tile) N vec-prep composite (do-vec-prep) Pipeline pass count (matmul-codegen surface): 21 (initial PR) → 9 (after stage groupings) → **1 + general-utility air-fold-unit-extent-dims**. The orchestrator runs at ModuleOp scope so it can schedule one-shot-bufferize internally. Pack output bufferization auto-detect: - single-pack flow (l1-pack-sizes empty): the L2 pack is the LAST pack; its output bufferizes to L1 by default. - two-pack flow: the L1 pack is the LAST pack; its output bufferizes to L1 by default. - bufferize-last-pack-output=false to inspect raw pack semantics (used by the pack_basic lit test). Driver pipeline rewrites (all 5 HW tests + 2 lit tests): - tests 37/48/53/54 + prog_ex bf16/i8 each collapse the prior 12-line phase list into one or two air-matmul-codegen invocations. - tests 48/53/54 use two invocations bracketing scf-forall-to-parallel + air-par-to-herd + air-herd-vectorize: first invocation runs A-K + L + M; second invocation runs only N. - prog_ex bf16/i8 set everything before M to skip and enable only M+N. - pack_basic.mlir + tile_copies_basic.mlir RUN lines updated to the new surface; expected output adjusted for the post-pack canonicalize. All 5 HW tests PASS on NPU2. check-air-mlir 381/388 effective passing (unchanged from prior commit; the 1 long-standing air_transform_payload.mlir failure is unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI black check flagged the multi-line tuple in `output_shapes_dtypes` as unformatted. Apply black formatting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI runs `git clang-format origin/main` on the diff vs main, which catches formatting on all changed lines (not just freshly-touched ones). My local `clang-format-17 -i` only formatted the latest edits; this commit applies clang-format to the cumulative diff against main, normalizing: - Passes.td (matmul-codegen entry layout) - AIRMatmulCodegen.cpp (multi-line call argument grouping) - AIRMatmulCodegenHelpers.{h,cpp} (lambda + multi-line call grouping) - MatmulCodegenConfig.{h,cpp} (function signature wrapping) - AIRLinalgCodegen.cpp (signature wrapping) Verified `ninja air-opt` still builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds --use-cpp-pipeline + --profile-iters to four legacy transform-script matmul tests; replaces the per-test transform_aie2{,p}.mlir with calls to the air-matmul-codegen orchestrator. Migrations and HW perf (NPU2 / Strix, 30 iters median, lower is better): | Test | Legacy | CPP | Status | |-----------------------------------------------|----------|----------|-------------| | 44 triton_matmul_ver4_vector_ptr_opt (aie2p) | 4.715 ms | 4.032 ms | PASS, faster| | 45 triton_matmul_ver4_strix_8x4 | 3.853 ms | 3.922 ms | PASS, parity| | 46 triton_matmul_ver4_strix_8x4_i8_i8_i32 | 2.231 ms | 2.064 ms | PASS, faster| | 39 triton_matmul_ver3_vectorized (NPU1) | n/a | n/a | partial* | * Test 39 targets aie2 / NPU1 (Phoenix). The orchestrator pipeline parses cleanly and produces valid IR, but downstream aiecc compilation hangs on the local Strix machine. Needs NPU1 hardware to fully validate. Test 12 (matmul_transform_1x4_bf16) NOT migrated: it uses transform.structured.pad (padding flow) rather than packing, plus a hand- rolled test.exe harness instead of XRTRunner. Padding-mode orchestrator support is a separate PR. Orchestrator extensions: * Phase C placement: in single-pack flows (l1-pack-sizes empty) Phase C now runs BEFORE Phase A (launch-tile) so the L2 alloc lands at LAUNCH scope outside any per-core forall. Two-pack flows still run Phase C after Phase B (matches existing test 37 behavior). * Phase F': in single-pack flows with NO tile-cores (e.g. launch-tile- only flows like test 39), bufferize the K-fused L1 packs to L1 here using the lhs_pack_in_k / rhs_pack_in_k markers (Phase J's fused_*_l1_pack markers don't fire without tile-cores). * runTileLaunchTileImpl now tags the inner per-launch-tile matmul with matmul_compute, so downstream tile-for-vectorize finds it in launch-tile-only flows where there's no separate tile-cores step. All 7 NPU2 tests (37, 44, 45, 46, 48, 53, 54) PASS on hardware. check-air-mlir 381/388 effective passing (unchanged baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…umElements The MATMUL_CODEGEN_PIPELINE_PLAN.md doc was never checked in, and the "M0/M1a/M2/M3/M4/M5" + "Group A" milestone numbers it referenced are meaningless without it. Sweep all source comments and docstrings to drop both the plan-doc citations and the milestone tags, replacing them with direct prose about what each phase does relative to the air-matmul-codegen orchestrator. No behavioural change. Also dedupe `xilinx::air::getVectorNumElements(VectorType)` against upstream `mlir::VectorType::getNumElements()` (inherited from ShapedTypeInterface as part of `extraSharedClassDeclaration`). The local helper was a literal product-of-shape-dims wrapper; vectors always have static shape so the upstream `hasStaticShape()` assert is a no-op. Replaces 5 callsites and deletes the local helper + its decl. Verified: ninja air-opt builds clean; check-air-mlir 381/388 effective passing (unchanged baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…alentTo The local helper hand-rolled three cases (SSA equality, AffineApplyOp + map+operands, ConstantIndexOp + value), all of which reduce to upstream `OperationEquivalence::isEquivalentTo` semantics: same op kind + same attributes + operand SSA-equality. The upstream version is also more general — it correctly recognizes equivalent index expressions built from arith ops (e.g. `arith.muli %iv, %c4`) which the hardcoded AffineApply/ConstantIndex check would miss. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS (37, 44, 45, 46, 48, 53, 54). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tching `air::isEquivalentTo` (Util.cpp) is the air-side wrapper that mirrors upstream `OperationEquivalence::isEquivalentTo` but adds operand const-int equivalence: two operand SSAs that are distinct but both fold to the same constant int are treated as equivalent. For comparing memref indices this is the right semantics — `affine.apply (%iv, %c0)` and `affine.apply (%iv, %const_0)` should match when both `%c0`/`%const_0` fold to 0, even if they are distinct SSA values produced by separate `arith.constant 0 : index` ops in different scopes. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS. - test 45 perf: 3.957 ms median (within noise of previous 3.92 ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrapper has only two callsites, both in AIRMatmulCodegenHelpers.cpp (areIdenticalReads + the hoist-loop-invariant transfer scan). Drop the header decl and mark the function `static`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntTo The hand-rolled helper checked op kind + source memref + index SSA-or- constant equivalence + result vector type. Upstream `OperationEquivalence::isEquivalentTo(read1, read2, IgnoreLocations)`: - checks op kind ✓ - checks attributes (permutation_map, in_bounds, padding) — strictly better than the hand-rolled version, which silently merged reads with different in_bounds / permutation_map - checks operand SSA equality (indices + source memref) — strict, but `eliminate-redundant-vector-transfers` runs after canonicalize so duplicate constants are CSE'd into the same SSA value before this fires - checks result types ✓ Single callsite inlined; helper + header decl deleted. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS. - test 45 perf: 3.863 ms median (legacy was 3.853 ms; within noise). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The hand-rolled transitive operand walk in dependsOnLoopIV duplicates air::traceDependentInductionVar, which already traces scf.for IVs, scf.parallel IVs, herd IDs, scf.for iter_args, peels air.execute, and gathers operands from parent affine.if guards. The wrapper preserves the existing predicate semantics (returns true iff loopIV appears in the collected dependency history) and gains iter_arg + air.execute coverage for cloneOpAndOperands and the hoist helpers for free.

All callsites are inside AIRMatmulCodegenHelpers.cpp. With the body now delegating to air::traceDependentInductionVar, the helper has no other consumers and can become a static file-local.

Replace the hand-rolled per-operand recursion with a top-level operand walk that calls air::cloneOpAndOperands (Util.cpp) per operand chain. The slice traversal is now handled by upstream getBackwardSlice with a per-op filter that rejects ops outside the loop, already mapped, or IV-dependent (matching the prior per-operand-edge semantics in aggregate). The local helper retains the entry-level early-return on already-mapped target results so repeated calls with shared mapping (e.g., the index loop in runHoistVectorTransferPointers) still de-dup. All callsites are in this .cpp; helper is now file-static and the header decl is removed.

…ne pass The standalone AIRFoldUnitExtentDims pass is a thin wrapper around runFoldUnitExtentDimsOnFunc, only ever co-invoked with air-matmul-codegen in the 11 known pipelines. Move all responsibility into the orchestrator: - Add Phase-0 option do-pre-fold-unit-extent-dims (default false). Runs the fold once before any other phase. Used as initial IR cleanup when invoking the orchestrator on IRON-emitted IR with stray unit dims. - Phase-N vec-prep already calls the fold (gated vec-prep-fold-unit- extent-dims, default true) and covers the post-vec-prep cleanup. - Drop AIRFoldUnitExtentDims pass class, factory, registration, and tablegen entry. - Update 11 run.py callsites: * Tests 39/44/45/46/48/53/54: drop the trailing air-fold-unit-extent- dims (vec-prep's internal fold now handles it). Test 39 has no second orchestrator, so add a minimal air-matmul-codegen{do-pre-fold-...} invocation in its place. * prog_examples i8/bf16: replace pre-orchestrator and inter-orchestrator folds with do-pre-fold-unit-extent-dims=true on the first orchestrator and re-enabled vec-prep-fold-unit-extent-dims (default) on the second.

The local hoistStaticallyBoundAllocationsInFunc<T> template was only ever instantiated with memref::AllocOp, with the std::is_same branches gating dealloc insertion to that one type anyway. The xilinx::air:: hoistStaticAllocsInFunc wrapper existed solely to expose a non-template, namespaced symbol for cross-TU linking. De-template both helpers, fold them into xilinx::air::, and route the transform op directly through the public free function. One symbol instead of two; ~10 lines of branch logic removed.

runHoistLoopInvariantTransfersStep and runHoistVectorTransferPointersStep both walked herds for innermost scf.for ops with the same ~15-line hasInnerFor probe. Lift into findInnermostForsInHerds with an optional herd filter; the second caller passes herdHasVectorContract to skip fill/epilogue herds.

Python-side timing of XRT kernel invocations is too noisy for performance evaluation (Python overhead, GIL, GC, scheduling). The benchmark method and its _make_xrt_backend factory in xrt_runner.py were a mistake; remove them entirely. Also drop the corresponding --profile-iters argparse option and runner.benchmark callsite from 8 test/xrt/*/run.py drivers (37, 39, 44, 45, 46, 48, 53, 54). Drive-by: rewrite the leftover "Iron-built flow" comment in i8/run.py and bf16/run.py to "Direct-codegen flow" since it just describes what --direct-codegen does, not the IR origin.

i8/run.py and bf16/run.py each carried a ~125-line inline transform script under `if False:`. Permanently unreachable, kept only as reference. Delete; the transform-dialect path is preserved in git history if anyone needs to revive it. Verified i8 --direct-codegen --arch aie2p --compile-mode compile-and-run still PASSes after the deletion.

Each NPU2 transform-script lit gets a sibling *_cpp.lit that exercises the C++ orchestrator path via --use-cpp-pipeline. Until now the cpp pipeline had zero CI coverage (manual --use-cpp-pipeline only); the existing transform-script lits were the sole exercised path. Tests covered: 37, 44, 45 (each: standard + elf), 46, 48, 53, 54. For 46 and 54 the existing lit is Makefile-driven; the new cpp lit invokes run.py directly with FileCheck since the Makefile run target doesn't pipe through --use-cpp-pipeline. Test 39 is intentionally skipped: its cpp pipeline currently targets NPU1 (aie2) only and was flagged as a partial migration in the prior PR session. Each new lit was validated locally on NPU2 (PASS).

The Phase-0 fold is idempotent and a no-op on already-clean IR, so there's no reason to make callers opt in. Flip the default and drop the explicit `do-pre-fold-unit-extent-dims=true` in the i8/bf16 prog_examples and the test 39 fold-only invocation.

The orchestrator had 52 options. 7 of them were pure structural toggles or always-on bools that no caller varied: - do-pre-fold-unit-extent-dims (default true; phase always runs) - do-vec-prep (default true; pre-vectorize the 7 of 8 vec-prep walks find no IR to operate on; the 8th is the same fold Phase 0 already runs) - vec-prep-fold-unit-extent-dims - vec-prep-eliminate-redundant-vector-transfers - vec-prep-hoist-loop-invariant-transfers - vec-prep-flatten-for-iter-args - vec-prep-hoist-vector-transfer-pointers All seven are removed from Passes.td, the orchestrator runOnFunc, and runCodegenVecPrepImpl. The 9 run.py callers stop having to pass do-vec-prep=false on the first orchestrator invocation; first-call vec-prep is now a no-op walk. Verified on NPU2 HW: - Test 46 cpp profile: 6182 -> 6130 gflops (-0.84%, within ±5% noise) - Test 54 cpp profile: 116.7 -> 116.7 gflops (-0.04%) - Correctness sweep tests 37/44/45/46/48/53/54 cpp pipeline: all PASS

Replace test 45's two mutually exclusive paths (--use-cpp-pipeline opt-in) with a single transform-script path that delegates the matmul-specific work to air-matmul-codegen via transform.apply_registered_pass. The script keeps the non-matmul plumbing (scf.forall->herd, herd-vectorize, canon/cse) but no longer hand-rolls the tile/pack/bufferize/vectorize sequence -- those are orchestrator options now. - transform_aie2p.mlir: 355 -> 76 lines. All transform.air.* matmul ops replaced by two apply_registered_pass "air-matmul-codegen" calls (pre-vectorize half + vec-prep half). - run.py: drop --use-cpp-pipeline argparse + cpp_pipeline branch. - Drop the run_npu2_peano_cpp.lit + run_npu2_peano_elf_cpp.lit drivers (they exercised the now-removed path). Verified on NPU2: existing run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS through the rewritten script (3 runs each). This is the validation lap for the unified-path proposal. The remaining 6 tests (37, 44, 46, 48, 53, 54) follow the same pattern and will be migrated in subsequent commits.

Same shape as test 45: both transform_aie2{,p}.mlir scripts (NPU1, NPU2) become thin wrappers around two air-matmul-codegen invocations with intermediate par-to-herd + herd-vectorize. Drop --use-cpp-pipeline + --arch plumbing in run.py. Drop run_npu2_peano_cpp.lit + elf variant. NPU2 validated: existing run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS through the rewritten script. NPU1 not validated locally; CI will exercise via run_npu1_peano.lit.

…_pass NPU2 transform_aie2p.mlir becomes a thin wrapper around one air-matmul-codegen invocation (two-pack-level flow). Drop the cpp pipeline branch and the cpp-pipeline lits. NPU1 transform_aie2.mlir is left as the legacy hand-rolled path (the cpp pipeline string only covered NPU2; no NPU1 orchestrator options were exercised before). Verified on NPU2: run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS.

Same shape as test 45/44; vec-prep cast type is i32 (i8 acc -> i32) instead of f32. Drop --use-cpp-pipeline plumbing in run.py + run_npu2_peano_cpp.lit. NPU2 validated: PASS!

bf16-out variant: sets fuse-output-truncf-first=true (pre-step) and vec-prep-hoist-cast-pairs=true (cleans up f32-to-bf16 trunc/extf pairs around iter_arg). Drop --use-cpp-pipeline plumbing + cpp lit. NPU2 PASS.

bf16-out non-tile-aligned variant: same shape as test 48; transform script defaults to k-l2-tile=16. run.py keeps a small re.sub block to rewrite k-l2-tile + outer-k-tile-factor when --k-l2-tile differs. Drop --use-cpp-pipeline plumbing + the cpp lit. NPU2 PASS.

…5/44/46/48/53 Test 54 (f32-in/out + BFP16 emulation) joins tests 45/44/46/48/53 with a unified transform-script path that delegates matmul codegen to the C++ air-matmul-codegen orchestrator via transform.apply_registered_pass. Drop --use-cpp-pipeline branch + run_npu2_peano_cpp.lit. Profiling test 54 surfaced a -32% perf regression that affected ALL the unified scripts in this PR: replacing `func.func(canonicalize,cse,fold- memref-alias-ops)` (full passes that iterate to fixed point + DCE) with transform.apply_patterns (one-shot pattern application) lost critical canonicalization. Fix: switch the cleanup blocks in tests 45/44/46/48/ 53/54 to chain three transform.apply_registered_pass invocations ("canonicalize", "cse", "fold-memref-alias-ops") on a fresh func handle each, matching the original pass semantics. Verified on NPU2: - Test 46 cpp profile: 6182 -> 6119 gflops (-1.02%, within +/-5% noise) - Test 54 cpp profile: 116.7 -> 115.8 gflops (-0.79%) - Correctness sweep tests 44, 45, 46, 48, 53 all PASS

The two-line rc-then-exit shape was a leftover from the prior --profile-iters plumbing (which checked rc==0 before benchmarking). With --profile-iters gone, rc was single-use and the indirection served no purpose. Collapse across all 8 test/xrt run.py drivers.

Finish unifying the NPU1 paths that were left as legacy in the prior pass: test 37's transform_aie2.mlir (two-pack mmul=4x4x8) and test 39's transform.mlir (single-pack mmul=4x4x8). Drop test 39's now-dead --use-cpp-pipeline branch. Both target NPU1 (aie2/Phoenix) which I cannot validate locally (aiecc compile of NPU1 IR hangs on Strix host, known issue). Scripts parse cleanly; CI run_npu1_peano.lit drivers will exercise the actual HW path. Options derived from the legacy script's tile/pack/permute sequence (test 37) and from the previously-existing cpp pipeline string in run.py (test 39, which I deleted in the same commit).

…_pass" This reverts commit 494fec7.

OperationEquivalence::isEquivalentTo (introduced in 773953c) is strict on operand SSA equality, so it fails to merge two vector.transfer_read ops whose indices are computed by distinct-but-identical affine.apply ops, or come from two scf.for iter_args initialized to the same value. The transform-op path in AIRLinalgCodegen.cpp::areIdenticalReads still uses the value-aware matcher and correctly catches both cases. Concrete impact: the orchestrator's vec-prep left an extra B-tile vector.transfer_read in the inner loop of the bf16 llama matmul (M=128, K=2048, N=512/2048/8192). After hoist-vector-transfer-pointers lifted the index into a redundant iter_arg, Peano's MC encoder hit 'getmBMsOpValue: Register not in mBMs' UNREACHABLE on the resulting VLD_x_pstm_nrm_imm_pseudo. mlir-aie@b37dc33's placer changes exposed the regression in CI; older mlir-aie scheduled around it. Inline the AIRLinalgCodegen.cpp::areIdenticalReads body (base + index-by-index areEquivalentIndices + vector type) and extend areEquivalentIndices to recognise affine.apply ops with identical map and operands. With the fix, core_7_5.opt.ll for run_npu2_llama_8x4_kv is byte-identical to the transform-script output on origin/main, and all four llama lit tests (qo, kv, gate_up, down) compile cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR replaces per-test transform-dialect matmul scripts with a single registered MLIR pass (air-matmul-codegen) that orchestrates tiling/packing/bufferization/vectorization as a fixed-order C++ pipeline driven by flat pass options, and updates several XRT tests/examples to invoke the new orchestrator.

Changes:

Add the air-matmul-codegen orchestrator pass (+ supporting phase implementations and headers) and register/build it.
Add lit tests covering copy-tiling and pack/transpose behavior.
Update multiple matmul XRT tests/examples to use transform.apply_registered_pass (or a PassManager pipeline) to drive codegen via the C++ orchestrator.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
test/xrt/54_matmul_padding_f32_bf16_emulation/transform_aie2p.mlir	Replace legacy transform script with orchestrator pass invocations + cleanup passes.
test/xrt/54_matmul_padding_f32_bf16_emulation/run.py	Rewrite transform script options (k-l2-tile / outer-k factor / epilogue-tile) before running transform.
test/xrt/53_matmul_padding_bf16/transform_aie2p.mlir	Replace legacy transform script with orchestrator pass invocations + cleanup passes.
test/xrt/53_matmul_padding_bf16/run.py	Rewrite orchestrator options in script; add debug print-and-exit flag.
test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_output/run.py	Clarify script role; add optional final-IR dump env hook.
test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32/run.py	Update CLI help and comments to reflect orchestrator-backed script.
test/xrt/45_triton_matmul_ver4_strix_8x4/run.py	Update comments/help to reflect orchestrator-backed script.
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2p.mlir	Replace large annotated script with orchestrator-based transform script (AIE2P).
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2.mlir	Replace large annotated script with orchestrator-based transform script (AIE2).
test/xrt/44_triton_matmul_ver4_vector_ptr_opt/run.py	Update CLI help and comments to reflect orchestrator-backed script.
test/xrt/39_triton_matmul_ver3_vectorized/run.py	Add `--use-cpp-pipeline` to run the C++ orchestrator pipeline directly.
test/xrt/37_matmul_transform_4x4_bf16/transform_aie2p.mlir	Replace legacy transform script with a single orchestrator invocation (two-pack flow).
test/xrt/37_matmul_transform_4x4_bf16/run.py	Update comments to describe orchestrator-backed vs legacy script roles.
programming_examples/matrix_multiplication/i8/run.py	Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline.
programming_examples/matrix_multiplication/bf16/run.py	Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline.
mlir/test/Transform/AIRMatmulTileL3ToL2Copies/tile_copies_basic.mlir	New lit test for memref.copy→linalg.copy conversion + K-tiling + loop markers.
mlir/test/Transform/AIRMatmulPackAndTranspose/pack_basic.mlir	New lit test for pack + transpose permutations and canonicalization behavior.
mlir/lib/Util/Util.cpp	Add `air::findOpWithAttr` implementation.
mlir/lib/Util/MatmulCodegenConfig.cpp	Add carrier-attribute reader/writer utilities for matmul codegen configs.
mlir/lib/Util/CMakeLists.txt	Build `MatmulCodegenConfig.cpp` into AIRUtil.
mlir/lib/Transform/Passes.cpp	Register the new `air-matmul-codegen` pass.
mlir/lib/Transform/CMakeLists.txt	Add new matmul codegen source files to the Transform library.
mlir/lib/Transform/AIRMatmulVectorizePasses.cpp	Implement vectorization/vec-prep phases as free functions for the orchestrator.
mlir/lib/Transform/AIRMatmulTilePasses.cpp	Implement tiling phases (launch/K/core/prologue+epilogue) as free functions.
mlir/lib/Transform/AIRMatmulTileL3ToL2Copies.cpp	Implement L3→L2 copy tiling as a free function invoked by orchestrator.
mlir/lib/Transform/AIRMatmulPackAndTranspose.cpp	Implement pack+transpose phase and optional output bufferization.
mlir/lib/Transform/AIRMatmulCodegen.cpp	New orchestrator pass wiring phases and canonicalize/CSE/one-shot-bufferize.
mlir/lib/Transform/AIRMatmulBufferizationPasses.cpp	Implement output/L1 bufferization and post-bufferize cleanup steps.
mlir/lib/Transform/AIRLinalgBufferize.cpp	Refactor hoist-static-alloc helper into a non-templated utility function.
mlir/include/air/Util/Util.h	Declare `findOpWithAttr` and templated `findOpOfTypeWithAttr`.
mlir/include/air/Util/MatmulCodegenConfig.h	Declare carrier-attribute schema and helper APIs.
mlir/include/air/Transform/Passes.td	Add TableGen definition/options/docs for `air-matmul-codegen`.
mlir/include/air/Transform/Passes.h	Export new matmul codegen/phase headers from pass umbrella header.
mlir/include/air/Transform/PassDetail.h	Add generated pass defs for new passes.
mlir/include/air/Transform/AIRMatmulVectorizePasses.h	Header for orchestrator vectorization/vec-prep free functions.
mlir/include/air/Transform/AIRMatmulTilePasses.h	Header for orchestrator tiling free functions.
mlir/include/air/Transform/AIRMatmulTileL3ToL2Copies.h	Header for orchestrator L3→L2 copy tiling free function.
mlir/include/air/Transform/AIRMatmulPackAndTranspose.h	Header for orchestrator pack+transpose free function.
mlir/include/air/Transform/AIRMatmulCodegenHelpers.h	New shared helper entry points used by both transform ops and C++ passes.
mlir/include/air/Transform/AIRMatmulCodegen.h	Header declaring the orchestrator pass factory functions.
mlir/include/air/Transform/AIRMatmulBufferizationPasses.h	Header for orchestrator bufferization/cleanup free functions.
mlir/include/air/Transform/AIRLinalgBufferize.h	Export `hoistStaticAllocsInFunc` utility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Eight Copilot review comments: * MatmulCodegenConfig.cpp: switch the carrier dictionary to the setDiscardableAttr / getDiscardableAttr APIs the header documented. setAttr / getAttrOfType happened to route the namespaced attr to discardable storage, but the explicit APIs make intent obvious and match getDiscardableAttrs() snapshots elsewhere. * AIRMatmulPackAndTranspose.cpp: the snapshot/restore pair around linalg::pack already used getDiscardableAttrs() to read; the restore now uses setDiscardableAttr to symmetrically write so carrier metadata survives discardable-attr snapshotting downstream. * Util.h findOpWithAttr / findOpOfTypeWithAttr: comment said the helpers searched for a discardable attribute, but the implementation uses Operation::hasAttr (matches both inherent and discardable). Fix the comment to describe actual behavior; both helpers are used for unit-attr markers (regular) and discardable carrier configs alike. * AIRMatmulCodegen Phase L: phase L runs module-scoped one-shot bufferize inside a per-func loop, so on a multi-func module the first call would bufferize all funcs and subsequent calls would see already-memref IR for their tensor-based phases. All current callers build a single-func kernel; emit a clear error when the module has more than one func.func and one-shot-bufferize is enabled, instead of silently misbehaving. * AIRMatmulTilePasses runTileKAndFusePacksImpl: stop silently clamping k-iter-index with std::min; emit an error for an out-of-range value. * AIRMatmulVectorizePasses tileWithScfFor: pad with zeros only when the caller passes fewer sizes than the iteration domain rank. If the caller passes more, emit an error rather than truncating or letting setTileSizes misbehave. * Passes.td air-matmul-codegen description: clarify that phase N (vec-prep) always runs and is a cheap no-op on pre-vectorize IR (the individual steps walk for vector ops). The previous text suggested an "N=false" toggle that does not exist. * test/xrt/53_matmul_padding_bf16/run.py: drop unused `import os`. Validation: incremental ninja build + ninja install OK; check-air-mlir 381 passing, 7 expected-fail, 1 pre-existing failure (Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir, also fails on origin/main); local NPU2 smoke test of bf16 llama_8x4_kv and i8 run4x4 with --compile-mode compile-only both pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hapes The bf16/i8/i16 matmul generators previously hard-asserted M % tile_m == 0 and N % tile_n == 0, so any unaligned GEMM shape (e.g. typical model layers) had to use a separate hand-written example (test/xrt/53, test/xrt/54). Move that capability into the standard generator so a single canonical path handles aligned and padded shapes alike. Per example, in __main__: * Compute M_padded = ceil(args.m / (tile_m * herd_m)) * (tile_m * herd_m) and likewise for N_padded. * Pass the padded shape to build_module so the IR is constructed for a full-launch-tile grid. * When padding is needed, walk the resulting IR for the single air.launch op and attach air.actual_sizes = array<i64: M_actual, N_actual, 1>. The aircc pipeline already runs air-split-launch-for-padding (see tools/aircc/aircc.cpp:1028); when the attribute is absent the pass is a no-op, so aligned shapes keep producing byte-identical IR to before this change. * Allocate input_a / input_b at the padded shape, fill the [0:M, :] / [:, 0:N] interior with random data, leave the tail zero. XRTRunner sees a (M_padded, N_padded) output buffer; sampled-output indices stay in [0, M) x [0, N) so we only validate the interior. Inside build_module: * Replace the affine.apply that computed launch-iv * (tile_m * herd_m) with arith.muli on the launch block ID. inferTileSize in mlir/lib/Transform/AIRSplitLaunchForPadding.cpp only recognises arith.muli of the launch ID; with affine.apply the pass would emit "could not infer tile sizes from launch body offset computations" and fail. * Replace the M %% tile_m / N %% tile_n asserts with the padded invariants M %% (tile_m * herd_m) == 0 / N %% (tile_n * herd_n) == 0, which the new __main__ guarantees. Validation: bf16/i8/i16 each compile cleanly for both an aligned shape (M=N=K=256) and an unaligned shape (M=N=200, K=256) on aie2p with direct codegen. bf16 also compiles M=100, N=500, K=2048 on the 8x4 herd (padded to 128x512), confirming padding works at LLAMA-class scales through the same generator. The existing run_npu2_*.lit driver for the bf16 llama_8x4_kv shape (M=128, N=512, K=2048, all aligned) still compiles to the same payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

erwei-xilinx and others added 30 commits May 7, 2026 14:53

Remove MATMUL_CODEGEN_PIPELINE_PLAN.md from the PR

fc98ec9

This planning doc should not be committed. PR description carries the summary; in-tree planning docs are not wanted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Format python/air/backend/xrt_runner.py with black

b1a1c89

CI black check flagged the multi-line tuple in `output_shapes_dtypes` as unformatted. Apply black formatting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop verbose comment on areEquivalentIndices

c6b57af

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Make dependsOnLoopIV file-local; drop header decl

5884a7c

All callsites are inside AIRMatmulCodegenHelpers.cpp. With the body now delegating to air::traceDependentInductionVar, the helper has no other consumers and can become a static file-local.

erwei-xilinx and others added 11 commits May 8, 2026 18:27

test 46: unify transform-script + cpp paths via apply_registered_pass

9d27104

Same shape as test 45/44; vec-prep cast type is i32 (i8 acc -> i32) instead of f32. Drop --use-cpp-pipeline plumbing in run.py + run_npu2_peano_cpp.lit. NPU2 validated: PASS!

test 48: unify transform-script + cpp paths via apply_registered_pass

0fd49e0

bf16-out variant: sets fuse-output-truncf-first=true (pre-step) and vec-prep-hoist-cast-pairs=true (cleans up f32-to-bf16 trunc/extf pairs around iter_arg). Drop --use-cpp-pipeline plumbing + cpp lit. NPU2 PASS.

Revert "test 37/aie2 + test 39/transform: migrate to apply_registered…

afa19b0

…_pass" This reverts commit 494fec7.

erwei-xilinx marked this pull request as ready for review May 12, 2026 17:52

erwei-xilinx requested a review from fifield as a code owner May 12, 2026 17:52

Copilot AI review requested due to automatic review settings May 12, 2026 17:52

erwei-xilinx requested a review from jgmelber as a code owner May 12, 2026 17:52

Copilot started reviewing on behalf of erwei-xilinx May 12, 2026 17:54 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

erwei-xilinx and others added 2 commits May 12, 2026 11:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace matmul transform-dialect scripts with C++ pass pipeline#1593

Replace matmul transform-dialect scripts with C++ pass pipeline#1593
erwei-xilinx wants to merge 43 commits into
Xilinx:mainfrom
erwei-xilinx:matmul-codegen-cpp-pipeline

erwei-xilinx commented May 7, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erwei-xilinx commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

Single public pass surface

Phase order (each gated by its config)

Driver pipeline rewrites (9 matmul examples + 2 lit tests)

Other code cleanup

Carrier attribute (air.matmul_codegen_config)

Test results

What's NOT in this PR (follow-ups)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erwei-xilinx commented May 7, 2026 •

edited

Loading

Carrier attribute (`air.matmul_codegen_config`)