Replace matmul transform-dialect scripts with C++ pass pipeline#1593
Open
erwei-xilinx wants to merge 43 commits into
Open
Replace matmul transform-dialect scripts with C++ pass pipeline#1593erwei-xilinx wants to merge 43 commits into
erwei-xilinx wants to merge 43 commits into
Conversation
Replaces the per-test transform-dialect scripts driving matmul
tiling/bufferization/vectorization with a sequence of focused C++
MLIR passes. Goal: parametric, generally-applicable, debuggable,
individually testable; eventually supersede the per-test
transform_aie2*.mlir scripts. See MATMUL_CODEGEN_PIPELINE_PLAN.md
for the full design and status table.
Milestones landed:
- M0: air-matmul-pack-and-transpose, air-matmul-tile-l3-to-l2-copies
(byte-identical IR vs transform-script Phases 1+3).
- M1: Group B (vectorization-prep) passes for prog_ex matmul/{bf16,i8}
(HW-validated on NPU2).
- M2: Group A passes for tests 53/54 single-pack flow (HW-validated).
- M3a/M3b: air.matmul_codegen_config carrier attr + heuristic pass +
L1-fit guardrail; --use-cpp-pipeline implies M3 (HW-validated;
shape-sweep 5/6 PASS).
- M4a: two-pack-level (test 37) infrastructure incl. two marker-flow
fixes in tile-k-and-fuse-packs (HW-validated; perf within noise of
legacy across all three tests).
Tests:
- 390/391 lit tests pass (the 1 failure is pre-existing, unrelated).
- Tests 37/53/54 cpp paths PASS on NPU2 hardware.
- prog_ex matmul/{bf16,i8} cpp paths PASS on NPU2.
Backup commit; not yet a PR. Outstanding: M2d (delete obsolete
per-test transform scripts), M3c (real derivation for tile_cores),
M4b (heuristic for two-pack), M5 (Triton-XDNA backend integration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eline Test 48's transform_aie2p.mlir maps phase-by-phase to existing M0/M1/M2 passes — no new infrastructure needed; the test 53 cpp pipeline string was reusable verbatim. The heuristic produces the same tile/pack/vector parameters that the legacy script hand-wires (pack [8,8,8], tile_cores [8,8,0], K-tile 8, prologue [8,8], epilogue [64,64], vector [2,2,1,...]). Hardware-validated on NPU2: PASS for both legacy and cpp paths. Perf parity confirmed (3-run min times: legacy 0.211ms vs cpp 0.208ms — within noise). Open question on `air-hoist-cast-pairs` resolved: fixed-point converges to STRUCTURALLY identical IR vs the 4 sequential `transform.air.hoist_cast_pair` calls in the legacy script (same op counts, alloc shapes, nesting; only diffs are SSA renumbering and missing `prologue_herd`/`compute_herd`/`epilogue_herd` annotations on cpp herds — purely cosmetic). Adds AIR_DUMP_FINAL_IR env var for debugging. Phase 2 (Triton-XDNA's driver.py invokes the cpp pipeline directly, removing the obsolete tile-and-promote default + transform-script loading) lives in the Triton-XDNA repo, not mlir-air. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes air-matmul-set-codegen-config (the M3a/M3b pass that wrote
the air.matmul_codegen_config carrier attribute via a hardcoded
target+dtype lookup). Heuristics belong in a separate PR; this PR
should only expose the consumer pipeline with all configuration
specified externally.
What stays:
- The carrier-attribute infrastructure (MatmulCodegenConfig.{h,cpp})
is kept as the external API. Each consumer pass reads
air.matmul_codegen_config when present and falls back to its
pass-options otherwise — so external producers (a future heuristic
PR, autotuners, end-user tooling) can specify configuration without
changing the pipeline.
- All consumer passes are unchanged.
What changes in tests:
- Tests 53, 54, 48 each pass tile/pack/vector parameters explicitly
via per-pass options strings (M2 style). Test 37 was already
written this way.
- --use-codegen-config flag removed; --use-cpp-pipeline is the only
toggle.
HW-validated on NPU2 (all five paths PASS):
- prog_ex matmul/{i8, bf16}
- test/xrt/{37, 48, 53, 54} cpp pipelines
390/391 lit tests pass (the 1 failure is pre-existing, unrelated).
Also scrubs all iree / iree-amd-aie references from PR-scope files
(MATMUL_CODEGEN_PIPELINE_PLAN.md only — no source code referenced
those projects).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This planning doc should not be committed. PR description carries the summary; in-tree planning docs are not wanted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit-driven cleanup of the new matmul-codegen passes:
1. Promote findMarkedOp/findMarkedForLoop to air/Util/Util.h as
findOpWithAttr / findOpOfTypeWithAttr<OpTy>. Replaces 3 local
redefinitions plus 4 inlined walks across the new files.
2. Drop the local ConvertMemrefCopyToLinalgCopyPattern in
AIRMatmulTileL3ToL2Copies.cpp; call the existing
runConvertMemrefCopyToLinalgCopy helper instead (already exposed
in AIRMatmulCodegenHelpers.h).
3. Extract tileAsForallResult helper for the
scf::tileUsingSCF{LoopType=ForallOp} + replaceOp sequence.
AIRMatmulTileCores and AIRMatmulTileLaunchTile now share it
instead of open-coding the same 8 lines.
4. Replace Option<std::string> + parseIntList with tablegen
ListOption<int64_t> for tile-sizes / prologue-tile-sizes /
epilogue-tile-sizes / fill-iterator-interchange. Deletes the
parseIntList helper outright.
Net ~30 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The intervening-deps move-above-anchor logic in hoistInterveningDeps had its own iterative "find ops with all-resolved operands and move them" loop. Upstream provides mlir::computeTopologicalSorting in mlir/Analysis/TopologicalSortUtils.h that does exactly this — sorts an arbitrary op set in dependency order, treating ops outside the set as already-ready (incomplete-chain semantics). After sorting, a single moveBefore pass yields the same result. Net 13 LOC removed. All 5 HW tests still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four of the registered passes had no parametric external config — they walked for a marker and called a 1-2 line helper. They were noise in the public pass registry without adding an external API surface. Dropped from tablegen registration: - air-matmul-cleanup-bufferize - air-matmul-fuse-pingpong-loops - air-matmul-fuse-output-truncf - air-hoist-static-alloc Replaced by: - One new combined pass air-matmul-post-bufferize-cleanup (cleanup + pingpong-fuse run back-to-back in all tests using them). - Boolean option `fuse-output-truncf-first` on air-matmul-bufferize-output-l2 (fuse-truncf must precede the L2 fill bufferize so the fill's element type matches the post-fuse matmul). - Boolean option `hoist-static-alloc-first` on air-matmul-prologue-epilogue (used by the M4 K-peel flow). The pass bodies survive as plain C++ functions in AIRMatmulBufferizationPasses.h (`runFusePingpongLoopsImpl`, `runFuseOutputTruncfImpl`, `runHoistStaticAllocImpl`), called either from the combined pass or from the option-driven steps in the parametric passes. Net 69 LOC removed; public pass count drops by 3 (4 dropped, 1 added). All 5 HW tests still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six of the M1 vec-prep passes (eliminate-redundant-vector-transfers, flatten-for-iter-args, hoist-loop-invariant-transfers, hoist-vector-transfer-pointers, vector-cast-for-emulation, hoist-cast-pairs) were registered separately but in practice ALWAYS fire as a fixed-order sequence inside the matmul codegen pipeline. Nobody invokes them individually outside this workflow. Replaced with one composite air-matmul-codegen-vec-prep that runs them in fixed order, with boolean options to enable/disable each step and two parallel option groups (cast1-* / cast2-*) for the 0-2 vector-cast-for-emulation invocations (test 54 needs two with different target element types; others need one or zero). air-fold-unit-extent-dims kept registered separately because the prog_ex pipelines invoke it standalone outside the vec-prep block. Per-step bodies survive as static helper functions inside the same TU, called by the composite. The runFoo helpers in AIRMatmulCodegenHelpers.cpp are unchanged. Net 207 LOC removed; 5 fewer registered passes (6 dropped, 1 added). All 5 HW tests + 2 prog_ex paths still PASS on NPU2; lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continued the matmul-named pass-registry cleanup: - air-matmul-tile-l3-to-l2-copies → option `do-tile-l3-to-l2-copies` on bufferize-output-l2 (with k-l2-tile + copy-marker options moved over). Body extracted as `runTileL3ToL2CopiesImpl` for direct call. - air-matmul-bufferize-l1-output → option `do-bufferize-l1-output` on pack-and-transpose (tail step). Body extracted as `runBufferizeL1OutputImpl`. - air-matmul-post-bufferize-cleanup → option `do-post-bufferize-cleanup-first` on tile-for-vectorize (pre-step). Body extracted as `runPostBufferizeCleanupImpl`. Net 3 fewer registered passes. From 12 → 9. Public matmul-codegen pass count progression in this PR: 21 (initial) → 17 → 14 → 12 → 9. The 9 remaining passes are all genuinely parametric: air-matmul-tile-launch-tile, air-matmul-pack-and-transpose, air-matmul-bufferize-output-l2, air-matmul-tile-k-and-fuse-packs, air-matmul-tile-cores, air-matmul-bufferize-l1-inputs, air-matmul-prologue-epilogue, air-matmul-tile-for-vectorize, air-matmul-codegen-vec-prep + air-fold-unit-extent-dims (general utility). All have substantive options (tile sizes, perms, markers, memory spaces) that justify standalone exposure. All 5 HW tests + 2 prog_ex paths still PASS on NPU2. Lit unchanged at 390/391. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e orchestrator
Replaces the 9 internal tablegen entries with a single `air-matmul-codegen`
pass taking a flat options struct (~38 fields). Internal phases become free
functions invoked in fixed order; markers between phases are
orchestrator-internal constants.
Phase placement (each gated by its config; canonicalize/cse runs between):
A tile-launch-tile (launch-tile)
B pack-and-transpose (l2-pack-sizes + perms)
C bufferize-output-l2 — runs before B in single-pack flows (so the
tile-l3-l2-copies / fuse-output-truncf-first pre-steps see un-packed
IR), and after B in two-pack flows (so the L2 alloc takes the packed
shape expected by the L1 pack)
D pack-and-transpose (l1-pack-sizes + perms; L1-output bufferize)
E tile-k-and-fuse-packs (outer-k-tile-factor)
F bufferize-l1-inputs into L2 (auto when D ran)
H tile-cores (core-tile)
I tile-k-and-fuse-packs (inner-k-tile-factor)
J bufferize-l1-inputs into L1 (auto when H ran)
K prologue-epilogue (prologue-tile / epilogue-tile)
L one-shot-bufferize (one-shot-bufferize, module-scope sub-pipeline)
M tile-for-vectorize (matmul-vec-tile)
N vec-prep composite (do-vec-prep)
Pipeline pass count (matmul-codegen surface):
21 (initial PR) → 9 (after stage groupings) → **1 + general-utility
air-fold-unit-extent-dims**. The orchestrator runs at ModuleOp scope so
it can schedule one-shot-bufferize internally.
Pack output bufferization auto-detect:
- single-pack flow (l1-pack-sizes empty): the L2 pack is the LAST pack;
its output bufferizes to L1 by default.
- two-pack flow: the L1 pack is the LAST pack; its output bufferizes to
L1 by default.
- bufferize-last-pack-output=false to inspect raw pack semantics
(used by the pack_basic lit test).
Driver pipeline rewrites (all 5 HW tests + 2 lit tests):
- tests 37/48/53/54 + prog_ex bf16/i8 each collapse the prior 12-line
phase list into one or two air-matmul-codegen invocations.
- tests 48/53/54 use two invocations bracketing
scf-forall-to-parallel + air-par-to-herd + air-herd-vectorize: first
invocation runs A-K + L + M; second invocation runs only N.
- prog_ex bf16/i8 set everything before M to skip and enable only M+N.
- pack_basic.mlir + tile_copies_basic.mlir RUN lines updated to the new
surface; expected output adjusted for the post-pack canonicalize.
All 5 HW tests PASS on NPU2. check-air-mlir 381/388 effective passing
(unchanged from prior commit; the 1 long-standing air_transform_payload.mlir
failure is unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI black check flagged the multi-line tuple in `output_shapes_dtypes` as unformatted. Apply black formatting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs `git clang-format origin/main` on the diff vs main, which catches
formatting on all changed lines (not just freshly-touched ones). My local
`clang-format-17 -i` only formatted the latest edits; this commit applies
clang-format to the cumulative diff against main, normalizing:
- Passes.td (matmul-codegen entry layout)
- AIRMatmulCodegen.cpp (multi-line call argument grouping)
- AIRMatmulCodegenHelpers.{h,cpp} (lambda + multi-line call grouping)
- MatmulCodegenConfig.{h,cpp} (function signature wrapping)
- AIRLinalgCodegen.cpp (signature wrapping)
Verified `ninja air-opt` still builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds --use-cpp-pipeline + --profile-iters to four legacy transform-script
matmul tests; replaces the per-test transform_aie2{,p}.mlir with calls to
the air-matmul-codegen orchestrator.
Migrations and HW perf (NPU2 / Strix, 30 iters median, lower is better):
| Test | Legacy | CPP | Status |
|-----------------------------------------------|----------|----------|-------------|
| 44 triton_matmul_ver4_vector_ptr_opt (aie2p) | 4.715 ms | 4.032 ms | PASS, faster|
| 45 triton_matmul_ver4_strix_8x4 | 3.853 ms | 3.922 ms | PASS, parity|
| 46 triton_matmul_ver4_strix_8x4_i8_i8_i32 | 2.231 ms | 2.064 ms | PASS, faster|
| 39 triton_matmul_ver3_vectorized (NPU1) | n/a | n/a | partial* |
* Test 39 targets aie2 / NPU1 (Phoenix). The orchestrator pipeline parses
cleanly and produces valid IR, but downstream aiecc compilation hangs on
the local Strix machine. Needs NPU1 hardware to fully validate.
Test 12 (matmul_transform_1x4_bf16) NOT migrated: it uses
transform.structured.pad (padding flow) rather than packing, plus a hand-
rolled test.exe harness instead of XRTRunner. Padding-mode orchestrator
support is a separate PR.
Orchestrator extensions:
* Phase C placement: in single-pack flows (l1-pack-sizes empty) Phase C
now runs BEFORE Phase A (launch-tile) so the L2 alloc lands at LAUNCH
scope outside any per-core forall. Two-pack flows still run Phase C
after Phase B (matches existing test 37 behavior).
* Phase F': in single-pack flows with NO tile-cores (e.g. launch-tile-
only flows like test 39), bufferize the K-fused L1 packs to L1 here
using the lhs_pack_in_k / rhs_pack_in_k markers (Phase J's
fused_*_l1_pack markers don't fire without tile-cores).
* runTileLaunchTileImpl now tags the inner per-launch-tile matmul with
matmul_compute, so downstream tile-for-vectorize finds it in
launch-tile-only flows where there's no separate tile-cores step.
All 7 NPU2 tests (37, 44, 45, 46, 48, 53, 54) PASS on hardware.
check-air-mlir 381/388 effective passing (unchanged baseline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…umElements The MATMUL_CODEGEN_PIPELINE_PLAN.md doc was never checked in, and the "M0/M1a/M2/M3/M4/M5" + "Group A" milestone numbers it referenced are meaningless without it. Sweep all source comments and docstrings to drop both the plan-doc citations and the milestone tags, replacing them with direct prose about what each phase does relative to the air-matmul-codegen orchestrator. No behavioural change. Also dedupe `xilinx::air::getVectorNumElements(VectorType)` against upstream `mlir::VectorType::getNumElements()` (inherited from ShapedTypeInterface as part of `extraSharedClassDeclaration`). The local helper was a literal product-of-shape-dims wrapper; vectors always have static shape so the upstream `hasStaticShape()` assert is a no-op. Replaces 5 callsites and deletes the local helper + its decl. Verified: ninja air-opt builds clean; check-air-mlir 381/388 effective passing (unchanged baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alentTo The local helper hand-rolled three cases (SSA equality, AffineApplyOp + map+operands, ConstantIndexOp + value), all of which reduce to upstream `OperationEquivalence::isEquivalentTo` semantics: same op kind + same attributes + operand SSA-equality. The upstream version is also more general — it correctly recognizes equivalent index expressions built from arith ops (e.g. `arith.muli %iv, %c4`) which the hardcoded AffineApply/ConstantIndex check would miss. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS (37, 44, 45, 46, 48, 53, 54). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tching `air::isEquivalentTo` (Util.cpp) is the air-side wrapper that mirrors upstream `OperationEquivalence::isEquivalentTo` but adds operand const-int equivalence: two operand SSAs that are distinct but both fold to the same constant int are treated as equivalent. For comparing memref indices this is the right semantics — `affine.apply (%iv, %c0)` and `affine.apply (%iv, %const_0)` should match when both `%c0`/`%const_0` fold to 0, even if they are distinct SSA values produced by separate `arith.constant 0 : index` ops in different scopes. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS. - test 45 perf: 3.957 ms median (within noise of previous 3.92 ms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper has only two callsites, both in AIRMatmulCodegenHelpers.cpp (areIdenticalReads + the hoist-loop-invariant transfer scan). Drop the header decl and mark the function `static`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntTo The hand-rolled helper checked op kind + source memref + index SSA-or- constant equivalence + result vector type. Upstream `OperationEquivalence::isEquivalentTo(read1, read2, IgnoreLocations)`: - checks op kind ✓ - checks attributes (permutation_map, in_bounds, padding) — strictly better than the hand-rolled version, which silently merged reads with different in_bounds / permutation_map - checks operand SSA equality (indices + source memref) — strict, but `eliminate-redundant-vector-transfers` runs after canonicalize so duplicate constants are CSE'd into the same SSA value before this fires - checks result types ✓ Single callsite inlined; helper + header decl deleted. Validated: - check-air-mlir 381/388 effective passing (unchanged baseline). - All 7 NPU2 HW tests PASS. - test 45 perf: 3.863 ms median (legacy was 3.853 ms; within noise). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled transitive operand walk in dependsOnLoopIV duplicates air::traceDependentInductionVar, which already traces scf.for IVs, scf.parallel IVs, herd IDs, scf.for iter_args, peels air.execute, and gathers operands from parent affine.if guards. The wrapper preserves the existing predicate semantics (returns true iff loopIV appears in the collected dependency history) and gains iter_arg + air.execute coverage for cloneOpAndOperands and the hoist helpers for free.
All callsites are inside AIRMatmulCodegenHelpers.cpp. With the body now delegating to air::traceDependentInductionVar, the helper has no other consumers and can become a static file-local.
Replace the hand-rolled per-operand recursion with a top-level operand walk that calls air::cloneOpAndOperands (Util.cpp) per operand chain. The slice traversal is now handled by upstream getBackwardSlice with a per-op filter that rejects ops outside the loop, already mapped, or IV-dependent (matching the prior per-operand-edge semantics in aggregate). The local helper retains the entry-level early-return on already-mapped target results so repeated calls with shared mapping (e.g., the index loop in runHoistVectorTransferPointers) still de-dup. All callsites are in this .cpp; helper is now file-static and the header decl is removed.
…ne pass
The standalone AIRFoldUnitExtentDims pass is a thin wrapper around
runFoldUnitExtentDimsOnFunc, only ever co-invoked with air-matmul-codegen
in the 11 known pipelines. Move all responsibility into the orchestrator:
- Add Phase-0 option do-pre-fold-unit-extent-dims (default false). Runs
the fold once before any other phase. Used as initial IR cleanup when
invoking the orchestrator on IRON-emitted IR with stray unit dims.
- Phase-N vec-prep already calls the fold (gated vec-prep-fold-unit-
extent-dims, default true) and covers the post-vec-prep cleanup.
- Drop AIRFoldUnitExtentDims pass class, factory, registration, and
tablegen entry.
- Update 11 run.py callsites:
* Tests 39/44/45/46/48/53/54: drop the trailing air-fold-unit-extent-
dims (vec-prep's internal fold now handles it). Test 39 has no second
orchestrator, so add a minimal air-matmul-codegen{do-pre-fold-...}
invocation in its place.
* prog_examples i8/bf16: replace pre-orchestrator and inter-orchestrator
folds with do-pre-fold-unit-extent-dims=true on the first orchestrator
and re-enabled vec-prep-fold-unit-extent-dims (default) on the second.
The local hoistStaticallyBoundAllocationsInFunc<T> template was only ever instantiated with memref::AllocOp, with the std::is_same branches gating dealloc insertion to that one type anyway. The xilinx::air:: hoistStaticAllocsInFunc wrapper existed solely to expose a non-template, namespaced symbol for cross-TU linking. De-template both helpers, fold them into xilinx::air::, and route the transform op directly through the public free function. One symbol instead of two; ~10 lines of branch logic removed.
runHoistLoopInvariantTransfersStep and runHoistVectorTransferPointersStep both walked herds for innermost scf.for ops with the same ~15-line hasInnerFor probe. Lift into findInnermostForsInHerds with an optional herd filter; the second caller passes herdHasVectorContract to skip fill/epilogue herds.
Python-side timing of XRT kernel invocations is too noisy for performance evaluation (Python overhead, GIL, GC, scheduling). The benchmark method and its _make_xrt_backend factory in xrt_runner.py were a mistake; remove them entirely. Also drop the corresponding --profile-iters argparse option and runner.benchmark callsite from 8 test/xrt/*/run.py drivers (37, 39, 44, 45, 46, 48, 53, 54). Drive-by: rewrite the leftover "Iron-built flow" comment in i8/run.py and bf16/run.py to "Direct-codegen flow" since it just describes what --direct-codegen does, not the IR origin.
i8/run.py and bf16/run.py each carried a ~125-line inline transform script under `if False:`. Permanently unreachable, kept only as reference. Delete; the transform-dialect path is preserved in git history if anyone needs to revive it. Verified i8 --direct-codegen --arch aie2p --compile-mode compile-and-run still PASSes after the deletion.
Each NPU2 transform-script lit gets a sibling *_cpp.lit that exercises the C++ orchestrator path via --use-cpp-pipeline. Until now the cpp pipeline had zero CI coverage (manual --use-cpp-pipeline only); the existing transform-script lits were the sole exercised path. Tests covered: 37, 44, 45 (each: standard + elf), 46, 48, 53, 54. For 46 and 54 the existing lit is Makefile-driven; the new cpp lit invokes run.py directly with FileCheck since the Makefile run target doesn't pipe through --use-cpp-pipeline. Test 39 is intentionally skipped: its cpp pipeline currently targets NPU1 (aie2) only and was flagged as a partial migration in the prior PR session. Each new lit was validated locally on NPU2 (PASS).
The Phase-0 fold is idempotent and a no-op on already-clean IR, so there's no reason to make callers opt in. Flip the default and drop the explicit `do-pre-fold-unit-extent-dims=true` in the i8/bf16 prog_examples and the test 39 fold-only invocation.
The orchestrator had 52 options. 7 of them were pure structural toggles or always-on bools that no caller varied: - do-pre-fold-unit-extent-dims (default true; phase always runs) - do-vec-prep (default true; pre-vectorize the 7 of 8 vec-prep walks find no IR to operate on; the 8th is the same fold Phase 0 already runs) - vec-prep-fold-unit-extent-dims - vec-prep-eliminate-redundant-vector-transfers - vec-prep-hoist-loop-invariant-transfers - vec-prep-flatten-for-iter-args - vec-prep-hoist-vector-transfer-pointers All seven are removed from Passes.td, the orchestrator runOnFunc, and runCodegenVecPrepImpl. The 9 run.py callers stop having to pass do-vec-prep=false on the first orchestrator invocation; first-call vec-prep is now a no-op walk. Verified on NPU2 HW: - Test 46 cpp profile: 6182 -> 6130 gflops (-0.84%, within ±5% noise) - Test 54 cpp profile: 116.7 -> 116.7 gflops (-0.04%) - Correctness sweep tests 37/44/45/46/48/53/54 cpp pipeline: all PASS
Replace test 45's two mutually exclusive paths (--use-cpp-pipeline opt-in) with a single transform-script path that delegates the matmul-specific work to air-matmul-codegen via transform.apply_registered_pass. The script keeps the non-matmul plumbing (scf.forall->herd, herd-vectorize, canon/cse) but no longer hand-rolls the tile/pack/bufferize/vectorize sequence -- those are orchestrator options now. - transform_aie2p.mlir: 355 -> 76 lines. All transform.air.* matmul ops replaced by two apply_registered_pass "air-matmul-codegen" calls (pre-vectorize half + vec-prep half). - run.py: drop --use-cpp-pipeline argparse + cpp_pipeline branch. - Drop the run_npu2_peano_cpp.lit + run_npu2_peano_elf_cpp.lit drivers (they exercised the now-removed path). Verified on NPU2: existing run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS through the rewritten script (3 runs each). This is the validation lap for the unified-path proposal. The remaining 6 tests (37, 44, 46, 48, 53, 54) follow the same pattern and will be migrated in subsequent commits.
Same shape as test 45: both transform_aie2{,p}.mlir scripts (NPU1, NPU2)
become thin wrappers around two air-matmul-codegen invocations with
intermediate par-to-herd + herd-vectorize. Drop --use-cpp-pipeline +
--arch plumbing in run.py. Drop run_npu2_peano_cpp.lit + elf variant.
NPU2 validated: existing run_npu2_peano.lit + run_npu2_peano_elf.lit
both PASS through the rewritten script. NPU1 not validated locally; CI
will exercise via run_npu1_peano.lit.
…_pass NPU2 transform_aie2p.mlir becomes a thin wrapper around one air-matmul-codegen invocation (two-pack-level flow). Drop the cpp pipeline branch and the cpp-pipeline lits. NPU1 transform_aie2.mlir is left as the legacy hand-rolled path (the cpp pipeline string only covered NPU2; no NPU1 orchestrator options were exercised before). Verified on NPU2: run_npu2_peano.lit + run_npu2_peano_elf.lit both PASS.
Same shape as test 45/44; vec-prep cast type is i32 (i8 acc -> i32) instead of f32. Drop --use-cpp-pipeline plumbing in run.py + run_npu2_peano_cpp.lit. NPU2 validated: PASS!
bf16-out variant: sets fuse-output-truncf-first=true (pre-step) and vec-prep-hoist-cast-pairs=true (cleans up f32-to-bf16 trunc/extf pairs around iter_arg). Drop --use-cpp-pipeline plumbing + cpp lit. NPU2 PASS.
bf16-out non-tile-aligned variant: same shape as test 48; transform script defaults to k-l2-tile=16. run.py keeps a small re.sub block to rewrite k-l2-tile + outer-k-tile-factor when --k-l2-tile differs. Drop --use-cpp-pipeline plumbing + the cpp lit. NPU2 PASS.
…5/44/46/48/53
Test 54 (f32-in/out + BFP16 emulation) joins tests 45/44/46/48/53 with a
unified transform-script path that delegates matmul codegen to the C++
air-matmul-codegen orchestrator via transform.apply_registered_pass.
Drop --use-cpp-pipeline branch + run_npu2_peano_cpp.lit.
Profiling test 54 surfaced a -32% perf regression that affected ALL the
unified scripts in this PR: replacing `func.func(canonicalize,cse,fold-
memref-alias-ops)` (full passes that iterate to fixed point + DCE) with
transform.apply_patterns (one-shot pattern application) lost critical
canonicalization. Fix: switch the cleanup blocks in tests 45/44/46/48/
53/54 to chain three transform.apply_registered_pass invocations
("canonicalize", "cse", "fold-memref-alias-ops") on a fresh func handle
each, matching the original pass semantics.
Verified on NPU2:
- Test 46 cpp profile: 6182 -> 6119 gflops (-1.02%, within +/-5% noise)
- Test 54 cpp profile: 116.7 -> 115.8 gflops (-0.79%)
- Correctness sweep tests 44, 45, 46, 48, 53 all PASS
The two-line rc-then-exit shape was a leftover from the prior --profile-iters plumbing (which checked rc==0 before benchmarking). With --profile-iters gone, rc was single-use and the indirection served no purpose. Collapse across all 8 test/xrt run.py drivers.
Finish unifying the NPU1 paths that were left as legacy in the prior pass: test 37's transform_aie2.mlir (two-pack mmul=4x4x8) and test 39's transform.mlir (single-pack mmul=4x4x8). Drop test 39's now-dead --use-cpp-pipeline branch. Both target NPU1 (aie2/Phoenix) which I cannot validate locally (aiecc compile of NPU1 IR hangs on Strix host, known issue). Scripts parse cleanly; CI run_npu1_peano.lit drivers will exercise the actual HW path. Options derived from the legacy script's tile/pack/permute sequence (test 37) and from the previously-existing cpp pipeline string in run.py (test 39, which I deleted in the same commit).
…_pass" This reverts commit 494fec7.
OperationEquivalence::isEquivalentTo (introduced in 773953c) is strict on operand SSA equality, so it fails to merge two vector.transfer_read ops whose indices are computed by distinct-but-identical affine.apply ops, or come from two scf.for iter_args initialized to the same value. The transform-op path in AIRLinalgCodegen.cpp::areIdenticalReads still uses the value-aware matcher and correctly catches both cases. Concrete impact: the orchestrator's vec-prep left an extra B-tile vector.transfer_read in the inner loop of the bf16 llama matmul (M=128, K=2048, N=512/2048/8192). After hoist-vector-transfer-pointers lifted the index into a redundant iter_arg, Peano's MC encoder hit 'getmBMsOpValue: Register not in mBMs' UNREACHABLE on the resulting VLD_x_pstm_nrm_imm_pseudo. mlir-aie@b37dc33's placer changes exposed the regression in CI; older mlir-aie scheduled around it. Inline the AIRLinalgCodegen.cpp::areIdenticalReads body (base + index-by-index areEquivalentIndices + vector type) and extend areEquivalentIndices to recognise affine.apply ops with identical map and operands. With the fix, core_7_5.opt.ll for run_npu2_llama_8x4_kv is byte-identical to the transform-script output on origin/main, and all four llama lit tests (qo, kv, gate_up, down) compile cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR replaces per-test transform-dialect matmul scripts with a single registered MLIR pass (air-matmul-codegen) that orchestrates tiling/packing/bufferization/vectorization as a fixed-order C++ pipeline driven by flat pass options, and updates several XRT tests/examples to invoke the new orchestrator.
Changes:
- Add the
air-matmul-codegenorchestrator pass (+ supporting phase implementations and headers) and register/build it. - Add lit tests covering copy-tiling and pack/transpose behavior.
- Update multiple matmul XRT tests/examples to use
transform.apply_registered_pass(or a PassManager pipeline) to drive codegen via the C++ orchestrator.
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test/xrt/54_matmul_padding_f32_bf16_emulation/transform_aie2p.mlir | Replace legacy transform script with orchestrator pass invocations + cleanup passes. |
| test/xrt/54_matmul_padding_f32_bf16_emulation/run.py | Rewrite transform script options (k-l2-tile / outer-k factor / epilogue-tile) before running transform. |
| test/xrt/53_matmul_padding_bf16/transform_aie2p.mlir | Replace legacy transform script with orchestrator pass invocations + cleanup passes. |
| test/xrt/53_matmul_padding_bf16/run.py | Rewrite orchestrator options in script; add debug print-and-exit flag. |
| test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_output/run.py | Clarify script role; add optional final-IR dump env hook. |
| test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32/run.py | Update CLI help and comments to reflect orchestrator-backed script. |
| test/xrt/45_triton_matmul_ver4_strix_8x4/run.py | Update comments/help to reflect orchestrator-backed script. |
| test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2p.mlir | Replace large annotated script with orchestrator-based transform script (AIE2P). |
| test/xrt/44_triton_matmul_ver4_vector_ptr_opt/transform_aie2.mlir | Replace large annotated script with orchestrator-based transform script (AIE2). |
| test/xrt/44_triton_matmul_ver4_vector_ptr_opt/run.py | Update CLI help and comments to reflect orchestrator-backed script. |
| test/xrt/39_triton_matmul_ver3_vectorized/run.py | Add --use-cpp-pipeline to run the C++ orchestrator pipeline directly. |
| test/xrt/37_matmul_transform_4x4_bf16/transform_aie2p.mlir | Replace legacy transform script with a single orchestrator invocation (two-pack flow). |
| test/xrt/37_matmul_transform_4x4_bf16/run.py | Update comments to describe orchestrator-backed vs legacy script roles. |
| programming_examples/matrix_multiplication/i8/run.py | Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline. |
| programming_examples/matrix_multiplication/bf16/run.py | Replace hand-written vectorization transform IR with orchestrator vectorize/vec-prep pipeline. |
| mlir/test/Transform/AIRMatmulTileL3ToL2Copies/tile_copies_basic.mlir | New lit test for memref.copy→linalg.copy conversion + K-tiling + loop markers. |
| mlir/test/Transform/AIRMatmulPackAndTranspose/pack_basic.mlir | New lit test for pack + transpose permutations and canonicalization behavior. |
| mlir/lib/Util/Util.cpp | Add air::findOpWithAttr implementation. |
| mlir/lib/Util/MatmulCodegenConfig.cpp | Add carrier-attribute reader/writer utilities for matmul codegen configs. |
| mlir/lib/Util/CMakeLists.txt | Build MatmulCodegenConfig.cpp into AIRUtil. |
| mlir/lib/Transform/Passes.cpp | Register the new air-matmul-codegen pass. |
| mlir/lib/Transform/CMakeLists.txt | Add new matmul codegen source files to the Transform library. |
| mlir/lib/Transform/AIRMatmulVectorizePasses.cpp | Implement vectorization/vec-prep phases as free functions for the orchestrator. |
| mlir/lib/Transform/AIRMatmulTilePasses.cpp | Implement tiling phases (launch/K/core/prologue+epilogue) as free functions. |
| mlir/lib/Transform/AIRMatmulTileL3ToL2Copies.cpp | Implement L3→L2 copy tiling as a free function invoked by orchestrator. |
| mlir/lib/Transform/AIRMatmulPackAndTranspose.cpp | Implement pack+transpose phase and optional output bufferization. |
| mlir/lib/Transform/AIRMatmulCodegen.cpp | New orchestrator pass wiring phases and canonicalize/CSE/one-shot-bufferize. |
| mlir/lib/Transform/AIRMatmulBufferizationPasses.cpp | Implement output/L1 bufferization and post-bufferize cleanup steps. |
| mlir/lib/Transform/AIRLinalgBufferize.cpp | Refactor hoist-static-alloc helper into a non-templated utility function. |
| mlir/include/air/Util/Util.h | Declare findOpWithAttr and templated findOpOfTypeWithAttr. |
| mlir/include/air/Util/MatmulCodegenConfig.h | Declare carrier-attribute schema and helper APIs. |
| mlir/include/air/Transform/Passes.td | Add TableGen definition/options/docs for air-matmul-codegen. |
| mlir/include/air/Transform/Passes.h | Export new matmul codegen/phase headers from pass umbrella header. |
| mlir/include/air/Transform/PassDetail.h | Add generated pass defs for new passes. |
| mlir/include/air/Transform/AIRMatmulVectorizePasses.h | Header for orchestrator vectorization/vec-prep free functions. |
| mlir/include/air/Transform/AIRMatmulTilePasses.h | Header for orchestrator tiling free functions. |
| mlir/include/air/Transform/AIRMatmulTileL3ToL2Copies.h | Header for orchestrator L3→L2 copy tiling free function. |
| mlir/include/air/Transform/AIRMatmulPackAndTranspose.h | Header for orchestrator pack+transpose free function. |
| mlir/include/air/Transform/AIRMatmulCodegenHelpers.h | New shared helper entry points used by both transform ops and C++ passes. |
| mlir/include/air/Transform/AIRMatmulCodegen.h | Header declaring the orchestrator pass factory functions. |
| mlir/include/air/Transform/AIRMatmulBufferizationPasses.h | Header for orchestrator bufferization/cleanup free functions. |
| mlir/include/air/Transform/AIRLinalgBufferize.h | Export hoistStaticAllocsInFunc utility. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Eight Copilot review comments: * MatmulCodegenConfig.cpp: switch the carrier dictionary to the setDiscardableAttr / getDiscardableAttr APIs the header documented. setAttr / getAttrOfType happened to route the namespaced attr to discardable storage, but the explicit APIs make intent obvious and match getDiscardableAttrs() snapshots elsewhere. * AIRMatmulPackAndTranspose.cpp: the snapshot/restore pair around linalg::pack already used getDiscardableAttrs() to read; the restore now uses setDiscardableAttr to symmetrically write so carrier metadata survives discardable-attr snapshotting downstream. * Util.h findOpWithAttr / findOpOfTypeWithAttr: comment said the helpers searched for a discardable attribute, but the implementation uses Operation::hasAttr (matches both inherent and discardable). Fix the comment to describe actual behavior; both helpers are used for unit-attr markers (regular) and discardable carrier configs alike. * AIRMatmulCodegen Phase L: phase L runs module-scoped one-shot bufferize inside a per-func loop, so on a multi-func module the first call would bufferize all funcs and subsequent calls would see already-memref IR for their tensor-based phases. All current callers build a single-func kernel; emit a clear error when the module has more than one func.func and one-shot-bufferize is enabled, instead of silently misbehaving. * AIRMatmulTilePasses runTileKAndFusePacksImpl: stop silently clamping k-iter-index with std::min; emit an error for an out-of-range value. * AIRMatmulVectorizePasses tileWithScfFor: pad with zeros only when the caller passes fewer sizes than the iteration domain rank. If the caller passes more, emit an error rather than truncating or letting setTileSizes misbehave. * Passes.td air-matmul-codegen description: clarify that phase N (vec-prep) always runs and is a cheap no-op on pre-vectorize IR (the individual steps walk for vector ops). The previous text suggested an "N=false" toggle that does not exist. * test/xrt/53_matmul_padding_bf16/run.py: drop unused `import os`. Validation: incremental ninja build + ninja install OK; check-air-mlir 381 passing, 7 expected-fail, 1 pre-existing failure (Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir, also fails on origin/main); local NPU2 smoke test of bf16 llama_8x4_kv and i8 run4x4 with --compile-mode compile-only both pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hapes
The bf16/i8/i16 matmul generators previously hard-asserted M % tile_m == 0
and N % tile_n == 0, so any unaligned GEMM shape (e.g. typical model
layers) had to use a separate hand-written example (test/xrt/53,
test/xrt/54). Move that capability into the standard generator so a
single canonical path handles aligned and padded shapes alike.
Per example, in __main__:
* Compute M_padded = ceil(args.m / (tile_m * herd_m)) * (tile_m * herd_m)
and likewise for N_padded.
* Pass the padded shape to build_module so the IR is constructed for a
full-launch-tile grid.
* When padding is needed, walk the resulting IR for the single
air.launch op and attach
air.actual_sizes = array<i64: M_actual, N_actual, 1>.
The aircc pipeline already runs air-split-launch-for-padding (see
tools/aircc/aircc.cpp:1028); when the attribute is absent the pass
is a no-op, so aligned shapes keep producing byte-identical IR to
before this change.
* Allocate input_a / input_b at the padded shape, fill the
[0:M, :] / [:, 0:N] interior with random data, leave the tail zero.
XRTRunner sees a (M_padded, N_padded) output buffer; sampled-output
indices stay in [0, M) x [0, N) so we only validate the interior.
Inside build_module:
* Replace the affine.apply that computed launch-iv * (tile_m * herd_m)
with arith.muli on the launch block ID. inferTileSize in
mlir/lib/Transform/AIRSplitLaunchForPadding.cpp only recognises
arith.muli of the launch ID; with affine.apply the pass would emit
"could not infer tile sizes from launch body offset computations"
and fail.
* Replace the M %% tile_m / N %% tile_n asserts with the padded
invariants M %% (tile_m * herd_m) == 0 / N %% (tile_n * herd_n) == 0,
which the new __main__ guarantees.
Validation: bf16/i8/i16 each compile cleanly for both an aligned shape
(M=N=K=256) and an unaligned shape (M=N=200, K=256) on aie2p with
direct codegen. bf16 also compiles M=100, N=500, K=2048 on the 8x4
herd (padded to 128x512), confirming padding works at LLAMA-class
scales through the same generator. The existing run_npu2_*.lit driver
for the bf16 llama_8x4_kv shape (M=128, N=512, K=2048, all aligned)
still compiles to the same payload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the per-test transform-dialect scripts driving matmul tiling/bufferization/vectorization with a single C++ MLIR pass taking a flat options struct. Goal: parametric, generally-applicable, debuggable, individually testable; supersedes the per-test
transform_aie2*.mlirscripts.Status: draft / WIP — pipeline is functional and HW-validated end-to-end on NPU2 for 9 matmul examples (7 NPU2 HW-validated, 1 NPU1 partial, 1 NPU1 deferred).
Scope of this PR: the
air-matmul-codegenorchestrator + carrier-attribute infrastructure as the external API. All configuration is specified externally via the orchestrator's options. Automatic shape→config heuristics are deferred to a follow-up PR.What's in this PR
Single public pass surface
After several rounds of consolidation, the matmul codegen pipeline is now exposed as one registered pass (
air-matmul-codegen) with ~38 flat options, plus the general-utilityair-fold-unit-extent-dims(used standalone outside the matmul flow). All internal phases — launch tile, packs, K-tiles, core tile, bufferization to L1/L2 allocs, prologue/epilogue, one-shot-bufferize, tile-for-vectorize, vec-prep — are free-function bodies invoked in fixed order from the orchestrator. Markers between phases are orchestrator-internal constants, not configurable.Pipeline pass count (matmul-codegen surface): 21 (initial) → 9 (after stage groupings) → 1.
Phase order (each gated by its config)
The orchestrator runs at ModuleOp scope so it can schedule one-shot-bufferize internally. Empty list option / 0 /
falseskips the corresponding phase. Pack-output bufferization to L1 auto-targets the LAST pack (L1 in two-pack flows, L2 in single-pack flows);bufferize-last-pack-output=falseis available to inspect raw pack semantics (used by the pack_basic lit test).Driver pipeline rewrites (9 matmul examples + 2 lit tests)
programming_examples/matrix_multiplication/bf16programming_examples/matrix_multiplication/i8test/xrt/37_matmul_transform_4x4_bf16test/xrt/44_triton_matmul_ver4_vector_ptr_opttest/xrt/45_triton_matmul_ver4_strix_8x4test/xrt/46_triton_matmul_ver4_strix_8x4_i8_i8_i32test/xrt/48_triton_matmul_ver4_strix_4x4_bf16_outputtest/xrt/53_matmul_padding_bf16test/xrt/54_matmul_padding_f32_bf16_emulationtest/xrt/39_triton_matmul_ver3_vectorizedaiecchangs on Strix machine. Needs NPU1 HW to validate.test/xrt/12_matmul_transform_1x4_bf16transform.structured.pad(padding, not packing) + customtest.exeharness; padding-mode orchestrator support is a separate PRPer-test driver pipeline shape:
scf-forall-to-parallel+air-par-to-herd+air-herd-vectorize— first runs A–M (do-vec-prep=false), second runs only N.Other code cleanup
findMarkedOp/findMarkedForLoopintoair::findOpWithAttr/air::findOpOfTypeWithAttr<OpTy>inair/Util/Util.h.ConvertMemrefCopyToLinalgCopyPatternwith the existingrunConvertMemrefCopyToLinalgCopyhelper.Option<std::string>+ custom int-list parser with tablegenListOption<int64_t>.tileAsForallResulthelper for thescf::tileUsingSCF{ForallOp}+replaceOpsequence shared across three internal phases.hoistInterveningDepswith upstreammlir::computeTopologicalSorting.Carrier attribute (
air.matmul_codegen_config)Kept in the codebase for future heuristic use. The orchestrator currently does NOT read it (per-phase options are passed explicitly by the caller). External producers (autotuners, tools, future heuristic) can set the attribute without changing the pipeline; reintroducing read-paths is a one-line change in each phase impl.
Test results
check-air-mlir: 381/388 effective passing — one long-standing failure (Transform/AIRTransform/AIRBufferize/air_transform_payload.mlir) unrelated to this PR.What's NOT in this PR (follow-ups)
--use-cpp-pipelinepath; orchestrator pipeline runs cleanly butaiecchangs on the local Strix when targeting NPU1 IR. Needs Phoenix HW + investigation. Test 12 not yet migrated (padding flow).transform.structured.padinstead of pack. Adding a padding option + params toair-matmul-codegenis a separate PR.air.matmul_codegen_configfrom matmul shape + target — separate PR.tile_cores(needs modellingair-collapse-herd).driver.pyinvocation of the cpp pipeline (lives in the Triton-XDNA repo).runHoistLoopInvariantTransfers(~150 LOC) with upstreamlinalg::hoistRedundantVectorTransfers— investigated locally; semantics diverge for the multi-pair-aliasing-same-memref case where upstream conservatively refuses while ourareEquivalentIndices-driven check correctly hoists. Needs upstream extension or a hybrid wrapper.runEliminateRedundantVectorTransfers/runFuseTruncfLinalgupstream-replacement micro-trims (~60 LOC combined, marginal).