Context
The PR landing alongside this issue (Stage C #4 first PR) deletes the column-affinity third-stage swap optimization in L2MemrefToMemTileMap (mlir/lib/Conversion/AIRToAIEPass.cpp). The deletion is correct but causes a performance regression: workloads with strong column-affinity patterns now get round-robin-only memtile placement, which produces cross-column DMA routing.
The proper replacement is to defer L2 memtile placement to mlir-aie's SequentialPlacer, which is now flow-aware via Xilinx/mlir-aie#3055 (buildFlowAdjacency + placeNonCoreTileByCentroid). The placer can pick memtile columns at the centroid of consumer-core columns — equivalent to AIR's column-affinity heuristic but driven by the placer's full topology view.
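As a rough illustration of what centroid-based placement means here, the sketch below picks the candidate memtile column closest to the mean column of the consumer cores. It is a standalone model, not the mlir-aie implementation: pickCentroidColumn and its parameters are hypothetical names, and the real placeNonCoreTileByCentroid may weight flows or break ties differently.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not an mlir-aie API): choose the memtile column that is
// closest to the mean column of the consumer cores. Candidates are the columns
// the segment owns; ties go to the earliest candidate in the list.
int pickCentroidColumn(const std::vector<int> &consumerCols,
                       const std::vector<int> &candidateCols) {
  assert(!consumerCols.empty() && !candidateCols.empty());

  // Mean column of all consumer cores.
  double sum = 0.0;
  for (int col : consumerCols)
    sum += col;
  const double centroid = sum / static_cast<double>(consumerCols.size());

  // Nearest candidate column to that mean.
  int best = candidateCols.front();
  for (int col : candidateCols)
    if (std::abs(col - centroid) < std::abs(best - centroid))
      best = col;
  return best;
}
```

With consumer cores in columns 2, 2 and 3 and candidate columns 0 through 3, this model returns column 2, whereas a round-robin assignment ignores the consumers entirely.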
What's blocking the proper fix
Pipeline ordering:
1. outlineAIEMemtiles runs first: it emits memtile aie.tile ops with col fully constrained from the segment's owned columns.
2. L2MemrefToMemTileMap + AllocL2BuffersPattern run mid-pipeline: they attach aie.buffer ops to those memtiles.
3. aie.flow ops materialize LATER (after memcpy lowering).
- The placer's placeNonCoreTileByCentroid requires aie.flow ops to exist.
To use the placer's flow-adjacency, AIR has to either:
- (a) Use unresolved aie.logical_tile<MemTile>(?, ?) throughout the whole pipeline (every downstream tile.getCol() call needs to handle TileLike, not TileOp), OR
- (b) Run a deferred placement pass at the END of the per-device pipeline that resolves all unplaced memtiles.
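Option (b) can be pictured as a late pass that walks the still-unplaced memtiles and resolves each one from the flows that reach it. The sketch below is a standalone model under stated assumptions: LogicalMemTile, Flow and resolveUnplacedMemtiles are hypothetical names, not AIR or mlir-aie types, and the rounded-mean rule only stands in for whatever the placer actually computes. It also shows why ordering matters: a memtile with no flows cannot be resolved.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an unresolved memtile: col is empty until placed.
struct LogicalMemTile {
  std::optional<int> col;
};

// Hypothetical stand-in for a flow: the index of the memtile on one end and
// the column of the already-placed core on the other end.
struct Flow {
  std::size_t memtile;
  int coreCol;
};

// Assign each unplaced memtile the rounded mean column of its consumer cores.
// Returns how many memtiles stay unplaced because no flow reaches them.
std::size_t resolveUnplacedMemtiles(std::vector<LogicalMemTile> &tiles,
                                    const std::vector<Flow> &flows) {
  std::size_t unresolved = 0;
  for (std::size_t i = 0; i < tiles.size(); ++i) {
    if (tiles[i].col)
      continue; // already placed, leave untouched

    long sum = 0;
    long count = 0;
    for (const Flow &flow : flows) {
      assert(flow.memtile < tiles.size());
      if (flow.memtile == i) {
        sum += flow.coreCol;
        ++count;
      }
    }

    if (count == 0) {
      ++unresolved; // no flows yet: nothing to compute a centroid from
      continue;
    }
    tiles[i].col =
        static_cast<int>(std::lround(static_cast<double>(sum) / count));
  }
  return unresolved;
}
```

Calling this model with an empty flow list leaves every memtile unplaced, which mirrors the blocking problem: the resolution step is only meaningful once the flows exist.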
Proposed multi-PR plan
PR 1: outlineAIEMemtiles emits unconstrained memtiles
- Emit aie.logical_tile<MemTile>(?, ?) instead of (col, ?).
- Don't run the placer in outlineAIEMemtiles; leave memtiles unresolved.
- The __L2_tmp anchor buffer (currently used to keep the memtile alive against DCE) needs reworking; possibly attach to the LogicalTileOp directly.
PR 2: AllocL2BuffersPattern works with TileLike
- aie.buffer ops attach to LogicalTileOp results (the placer's buildFlowAdjacency accepts TileLike endpoints).
- Bucket-grouping in L2MemrefToMemTileMap stays; it emits one logical memtile per bucket instead of pre-deciding which physical memtile.
PR 3: Defer placer invocation; downstream cleanups
- Add a placer pass call after memcpy lowering generates aie.flow ops, before any tile.getCol() consumer runs on memtiles.
- Audit downstream code that calls tile.getCol() on memtiles; make it TileLike-safe or move it after the deferred placement.
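The two audit outcomes for a tile.getCol() consumer can be sketched as follows: either the consumer tolerates a missing column, or it runs after deferred placement and may insist on one. This is a standalone model: PlacedTile, LogicalTile, the TileLike variant, tryGetCol and requireCol are hypothetical stand-ins chosen for illustration and are not claimed to match the mlir-aie interface.

```cpp
#include <cassert>
#include <optional>
#include <variant>

// Hypothetical stand-ins; the real tile ops are MLIR operations.
struct PlacedTile {
  int col;
  int row;
};
struct LogicalTile {
  std::optional<int> col; // empty models the "?" in (?, ?)
  std::optional<int> row;
};
using TileLike = std::variant<PlacedTile, LogicalTile>;

// TileLike-safe accessor: returns an empty optional while the column is
// still unresolved, so callers must handle that case explicitly.
std::optional<int> tryGetCol(const TileLike &tile) {
  if (const auto *placed = std::get_if<PlacedTile>(&tile))
    return placed->col;
  return std::get<LogicalTile>(tile).col;
}

// For consumers moved after the deferred placement: the column must exist.
int requireCol(const TileLike &tile) {
  std::optional<int> col = tryGetCol(tile);
  assert(col.has_value() && "memtile queried before deferred placement");
  return *col;
}
```

Consumers that can run early would use the optional-returning form and skip or postpone work on unresolved memtiles; consumers that cannot would be scheduled after placement and use the asserting form.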
Open questions
- How does this interact with the M5 design rejection (compute tiles keep explicit (col, row))? Compute tiles stay constrained; only memtiles become unconstrained. Should be compatible.
- Multi-herd-per-segment: existing l2_memtile_column_affinity.mlir test has a 4x1 herd with per-column channels. The placer's flow-adjacency should handle the placement correctly given consumer cores are placed at (col, row=3) etc.
- Test for proper placer-driven placement: revert l2_memtile_column_affinity.mlir to the pre-deletion expected output (column-affinity placement). After PR 3, the placer should produce that output naturally.
References