[RFC #1567] Stage C #4 follow-up: defer L2 memtile placement to placer (path B) #1602

## Context

The PR landing alongside this issue (Stage C #4 first PR) deletes the column-affinity third-stage swap optimization in `L2MemrefToMemTileMap` (`mlir/lib/Conversion/AIRToAIEPass.cpp`). The deletion is correct but causes a perf regression: workloads with strong column-affinity patterns now get round-robin-only memtile placement, which produces cross-column DMA routing.

The proper replacement is to defer L2 memtile placement to mlir-aie's `SequentialPlacer`, which is now flow-aware via Xilinx/mlir-aie#3055 (`buildFlowAdjacency` + `placeNonCoreTileByCentroid`). The placer can pick memtile columns at the centroid of consumer-core columns — equivalent to AIR's column-affinity heuristic but driven by the placer's full topology view.
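
As a rough illustration of why centroid placement recovers column affinity, the toy model below compares round-robin assignment against picking the memtile column nearest the mean of the consumer-core columns. It is a sketch only: the function names, the tie-break rule, and the hop-count metric are assumptions made for illustration, not the behavior of `placeNonCoreTileByCentroid`, and memtile capacity is ignored.

```python
from statistics import mean


def round_robin_columns(num_buffers, memtile_cols):
    """Assign buffer i to memtile column i mod N, ignoring who consumes it."""
    return [memtile_cols[i % len(memtile_cols)] for i in range(num_buffers)]


def centroid_column(consumer_cols, memtile_cols):
    """Pick the memtile column closest to the mean consumer-core column.

    Ties go to the lower column (an assumed rule, chosen for determinism).
    """
    target = mean(consumer_cols)
    return min(memtile_cols, key=lambda c: (abs(c - target), c))


def column_hops(placement, consumers):
    """Total column distance between each buffer's memtile and its consumers."""
    return sum(abs(col - c) for col, cons in zip(placement, consumers) for c in cons)


# Hypothetical workload: four L2 buffers, each read by cores in one column.
memtile_cols = [0, 1, 2, 3]
consumers = [[2], [0], [3], [1]]

rr = round_robin_columns(len(consumers), memtile_cols)
centroid = [centroid_column(c, memtile_cols) for c in consumers]

print(rr, column_hops(rr, consumers))
print(centroid, column_hops(centroid, consumers))
```

In this made-up workload, round-robin costs 6 column hops in total while centroid placement costs 0, which is the kind of gap the regression above describes.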

## What's blocking the proper fix

Pipeline ordering:

1. `outlineAIEMemtiles` runs first and emits memtile `aie.tile` ops whose column is fully constrained by the segment's owned columns.
2. `L2MemrefToMemTileMap` + `AllocL2BuffersPattern` run mid-pipeline and attach `aie.buffer` ops to those memtiles.
3. `aie.flow` ops materialize only later, after memcpy lowering.
4. The placer's `placeNonCoreTileByCentroid` requires `aie.flow` ops to exist.

To use the placer's flow-adjacency, AIR has to either:
- (a) Use unresolved `aie.logical_tile<MemTile>(?, ?)` throughout the whole pipeline (every downstream `tile.getCol()` call needs to handle `TileLike`, not `TileOp`), OR
- (b) Run a deferred placement pass at the END of the per-device pipeline that resolves all unplaced memtiles.
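
The ordering constraint, and roughly what option (b) would look like, can be sketched with a small model in which an unresolved column is `None`. Everything here (the `Tile` and `Device` classes, `place_unresolved_memtiles`, the tie-break) is hypothetical and stands in for the real ops and passes. It only shows that centroid placement has nothing to work with until flows exist.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tile:
    name: str
    col: Optional[int] = None  # None models the unresolved `?` column


@dataclass
class Device:
    tiles: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)  # (source name, dest name) pairs


def place_unresolved_memtiles(device, memtile_cols):
    """Toy deferred placement: resolve every tile whose column is still None.

    Assumes the tiles on the other end of each flow are already placed.
    """
    for tile in device.tiles.values():
        if tile.col is not None:
            continue  # already constrained, e.g. a compute tile
        # Collect the columns of tiles on the other end of this tile's flows.
        peers = []
        for src, dst in device.flows:
            if src == tile.name:
                peers.append(device.tiles[dst].col)
            elif dst == tile.name:
                peers.append(device.tiles[src].col)
        if not peers:
            raise ValueError(f"{tile.name}: no flows yet, cannot place by centroid")
        target = sum(peers) / len(peers)
        tile.col = min(memtile_cols, key=lambda c: (abs(c - target), c))


dev = Device()
dev.tiles = {
    "mem0": Tile("mem0"),         # memtile left unresolved
    "core_a": Tile("core_a", 2),  # compute tiles keep explicit columns
    "core_b": Tile("core_b", 3),
}

try:
    place_unresolved_memtiles(dev, [0, 1, 2, 3])  # too early: no flows exist
    placed_early = True
except ValueError:
    placed_early = False

# Once flows exist (after memcpy lowering in the real pipeline), placement works.
dev.flows = [("mem0", "core_a"), ("mem0", "core_b")]
place_unresolved_memtiles(dev, [0, 1, 2, 3])
print(placed_early, dev.tiles["mem0"].col)
```

The early call fails because no flows reference `mem0`. The late call succeeds, which is the argument for running placement at the end of the per-device pipeline.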

## Proposed multi-PR plan

**PR 1: outlineAIEMemtiles emits unconstrained memtiles**
- Emit `aie.logical_tile<MemTile>(?, ?)` instead of `(col, ?)`.
- Don't run the placer in `outlineAIEMemtiles`; leave memtiles unresolved.
- The `__L2_tmp` anchor buffer (currently used to keep the memtile alive against DCE) needs reworking, possibly by attaching it to the `LogicalTileOp` directly.

**PR 2: AllocL2BuffersPattern works with TileLike**
- `aie.buffer` ops attach to `LogicalTileOp` results (the placer's `buildFlowAdjacency` accepts `TileLike` endpoints).
- Bucket-grouping in `L2MemrefToMemTileMap` stays, but it emits one logical memtile per bucket instead of pre-deciding the physical memtile.
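
One way to picture this change: the bucketing logic still decides which L2 buffers share a memtile, but each bucket produces a placeholder with no column instead of a physical tile. The sketch below assumes round-robin bucketing and invents `LogicalMemTile` and `bucket_l2_buffers` purely for illustration. The actual grouping criteria in `L2MemrefToMemTileMap` are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogicalMemTile:
    bucket: int
    col: Optional[int] = None  # left as None until the placer resolves it
    buffers: List[str] = field(default_factory=list)


def bucket_l2_buffers(buffer_names, num_buckets):
    """Group buffers into buckets (round-robin here, as an assumption) and
    return one unplaced logical memtile per non-empty bucket."""
    tiles = {}
    for i, name in enumerate(buffer_names):
        b = i % num_buckets
        if b not in tiles:
            tiles[b] = LogicalMemTile(bucket=b)
        tiles[b].buffers.append(name)
    return [tiles[b] for b in sorted(tiles)]


logical = bucket_l2_buffers(["A", "B", "C", "D", "E"], num_buckets=2)
for t in logical:
    print(t.bucket, t.col, t.buffers)
```

Every returned tile still has `col` set to `None`; choosing the physical column is left entirely to the later placement step.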

**PR 3: Defer placer invocation; downstream cleanups**
- Add a placer pass call after memcpy lowering generates `aie.flow` ops, before any `tile.getCol()` consumer runs on memtiles.
- Audit downstream code that calls `tile.getCol()` on memtiles; make it `TileLike`-safe or move it after the deferred placement.
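
For the audit, one possible shape is a pair of accessors that make the unresolved case explicit, so callers either handle it or fail loudly when they run before the deferred placement. This is a sketch with hypothetical names (`TileLike` as a plain class, `try_get_col`, `require_col`, `unplaced`); it does not reflect the actual `TileLike` interface in mlir-aie.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TileLike:
    name: str
    col: Optional[int] = None  # None while the memtile is still unplaced


def try_get_col(tile):
    """TileLike-safe read: returns None instead of assuming a placed tile."""
    return tile.col


def require_col(tile):
    """For consumers that must only run after deferred placement."""
    if tile.col is None:
        raise RuntimeError(f"{tile.name} read before deferred placement")
    return tile.col


def unplaced(tiles):
    """Audit helper: names of tiles a late-stage consumer could trip over."""
    return [t.name for t in tiles if t.col is None]


tiles = [TileLike("mem0"), TileLike("mem1", col=1)]
print(unplaced(tiles))
```

Call sites that can tolerate an unplaced memtile would use the `try_get_col` style, while those that cannot would use the `require_col` style and be moved after the deferred placement.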

## Open questions

- How does this interact with the M5 design rejection (compute tiles keep explicit `(col, row)`)? Compute tiles stay constrained; only memtiles become unconstrained, so the two should be compatible.
- Multi-herd-per-segment: the existing `l2_memtile_column_affinity.mlir` test has a 4x1 herd with per-column channels. The placer's flow-adjacency should place the memtiles correctly, given that the consumer cores are placed at explicit coordinates such as `(col, row=3)`.
- Test for proper placer-driven placement: revert `l2_memtile_column_affinity.mlir` to the pre-deletion expected output (column-affinity placement). After PR 3, the placer should produce that output naturally.

## References

- RFC: #1567 (Stage C #4)
- Stage C #4 deletion PR: (this PR — link once opened)
- mlir-aie flow-adjacency: Xilinx/mlir-aie#3055 (merged)
