[RFC #1567] Stage C #4 follow-up: defer L2 memtile placement to placer (path B) #1602

@erwei-xilinx

Context

The PR landing alongside this issue (the first Stage C #4 PR) deletes the column-affinity third-stage swap optimization in L2MemrefToMemTileMap (mlir/lib/Conversion/AIRToAIEPass.cpp). The deletion is correct but causes a performance regression: workloads with strong column-affinity patterns now get round-robin-only memtile placement, which produces cross-column DMA routing.

The proper replacement is to defer L2 memtile placement to mlir-aie's SequentialPlacer, which is now flow-aware via Xilinx/mlir-aie#3055 (buildFlowAdjacency + placeNonCoreTileByCentroid). The placer can pick memtile columns at the centroid of consumer-core columns — equivalent to AIR's column-affinity heuristic but driven by the placer's full topology view.
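To make the regression concrete, here is a minimal toy model in Python comparing round-robin placement with centroid placement. It is not the mlir-aie implementation of placeNonCoreTileByCentroid: the function names, the tie-breaking rule (lowest column wins), and the hop-count cost are all assumptions made for illustration, and memtile capacity is ignored.

```python
from statistics import mean

def round_robin_columns(num_buffers, memtile_cols):
    """Assign buffer i to memtile column i mod N, ignoring its consumers."""
    return [memtile_cols[i % len(memtile_cols)] for i in range(num_buffers)]

def centroid_column(consumer_cols, memtile_cols):
    """Pick the memtile column closest to the mean consumer column.

    Ties go to the lower column (an assumption of this sketch).
    """
    target = mean(consumer_cols)
    return min(memtile_cols, key=lambda c: (abs(c - target), c))

def cross_column_hops(placement, consumers):
    """Total column distance between each memtile and its consumer cores."""
    return sum(abs(col - c)
               for col, cols in zip(placement, consumers)
               for c in cols)

# consumers[i] lists the columns of the cores that read L2 buffer i.
consumers = [[2], [3], [0], [1]]
memtile_cols = [0, 1, 2, 3]

rr = round_robin_columns(len(consumers), memtile_cols)
centroid = [centroid_column(cols, memtile_cols) for cols in consumers]
```

In this example round-robin places every buffer two columns away from its consumer, while the centroid rule places each buffer in its consumer's column, so `cross_column_hops` drops to zero.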

What's blocking the proper fix

Pipeline ordering:

  1. outlineAIEMemtiles runs first — emits memtile aie.tile ops with col fully constrained from the segment's owned columns.
  2. L2MemrefToMemTileMap + AllocL2BuffersPattern run mid-pipeline — attach aie.buffer ops to those memtiles.
  3. aie.flow ops materialize LATER (after memcpy lowering).
  4. The placer's placeNonCoreTileByCentroid requires aie.flow ops to exist.
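The ordering problem above can be sketched as a small dependency check. This is a toy model only: the artifact labels ("memtile", "buffer", "flow") and the helper `earliest_position` are hypothetical names, not part of AIR or mlir-aie.

```python
# Each entry: (pass name, artifacts it produces). Order mirrors steps 1-3.
PIPELINE = [
    ("outlineAIEMemtiles", {"memtile"}),
    ("L2MemrefToMemTileMap + AllocL2BuffersPattern", {"buffer"}),
    ("memcpy lowering", {"flow"}),
]

def earliest_position(pipeline, requires):
    """Return how many passes must run before `requires` is available."""
    available = set()
    if requires <= available:
        return 0
    for count, (_name, produces) in enumerate(pipeline, start=1):
        available |= produces
        if requires <= available:
            return count
    raise ValueError(f"nothing in the pipeline produces {requires - available}")

# A flow-aware placer needs aie.flow ops, so it cannot run before slot 3,
# yet memtile columns are already fixed by the pass in slot 1.
placer_slot = earliest_position(PIPELINE, {"flow"})
columns_fixed_slot = earliest_position(PIPELINE, {"memtile"})
```

The gap between `columns_fixed_slot` and `placer_slot` is what options (a) and (b) below are meant to close.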

To use the placer's flow-adjacency, AIR has to either:

  • (a) Use unresolved aie.logical_tile<MemTile>(?, ?) throughout the whole pipeline (every downstream tile.getCol() call needs to handle TileLike, not TileOp), OR
  • (b) Run a deferred placement pass at the END of the per-device pipeline that resolves all unplaced memtiles.

Proposed multi-PR plan

PR 1: outlineAIEMemtiles emits unconstrained memtiles

  • Emit aie.logical_tile<MemTile>(?, ?) instead of (col, ?).
  • Don't run the placer in outlineAIEMemtiles; leave memtiles unresolved.
  • The __L2_tmp anchor buffer (currently used to keep the memtile alive against DCE) needs reworking — possibly attach to the LogicalTileOp directly.

PR 2: AllocL2BuffersPattern works with TileLike

  • aie.buffer ops attach to LogicalTileOp results (the placer's buildFlowAdjacency accepts TileLike endpoints).
  • Bucket-grouping in L2MemrefToMemTileMap stays — emits one logical memtile per bucket instead of pre-deciding which physical memtile.
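As a rough illustration of "one logical memtile per bucket", the following Python sketch models a logical memtile whose column is still unresolved. `LogicalMemTile`, `emit_logical_memtiles`, and the buffer names are hypothetical stand-ins for the real ops; `None` plays the role of the `?` column.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogicalMemTile:
    name: str
    col: Optional[int] = None  # None models the unresolved "?" column
    buffers: List[str] = field(default_factory=list)

def emit_logical_memtiles(buckets):
    """Create one unplaced logical memtile per bucket of L2 buffers."""
    return [
        LogicalMemTile(name=f"memtile_{index}", buffers=list(bucket))
        for index, bucket in enumerate(buckets)
    ]

# Two buckets of L2 buffers, as the bucket-grouping step might produce.
buckets = [["A0", "B0"], ["A1", "B1"]]
tiles = emit_logical_memtiles(buckets)
```

The grouping decision (which buffers share a memtile) stays in AIR, while the column choice is left open for the placer.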

PR 3: Defer placer invocation; downstream cleanups

  • Add a placer pass call after memcpy lowering generates aie.flow ops, before any tile.getCol() consumer runs on memtiles.
  • Audit downstream code that calls tile.getCol() on memtiles; make it TileLike-safe or move it after the deferred placement.
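The two PR 3 items can be sketched together in a toy Python model: a guarded column accessor that fails loudly on unplaced tiles, and a deferred pass that resolves memtile columns from flows. All names here (`Tile`, `require_col`, `place_deferred`) are hypothetical, and the centroid rule with lowest-column tie-breaking is an assumption, not the actual mlir-aie placer logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tile:
    name: str
    col: Optional[int] = None  # None means "not placed yet"

def require_col(tile):
    """Column accessor that refuses to read an unplaced tile."""
    if tile.col is None:
        raise ValueError(f"{tile.name} is not placed yet")
    return tile.col

def place_deferred(memtiles, flows, memtile_cols):
    """Resolve unplaced memtiles from the columns of their flow peers.

    flows is a list of (memtile, core) pairs; cores must already be placed.
    """
    for memtile in memtiles:
        if memtile.col is not None:
            continue  # already constrained, leave it alone
        peer_cols = [require_col(core) for src, core in flows if src is memtile]
        if not peer_cols:
            continue  # no flows yet, so nothing to base a decision on
        target = sum(peer_cols) / len(peer_cols)
        memtile.col = min(memtile_cols, key=lambda c: (abs(c - target), c))

cores = [Tile(f"core_{c}", col=c) for c in range(4)]
m0, m1 = Tile("memtile_0"), Tile("memtile_1")
flows = [(m0, cores[2]), (m0, cores[3]), (m1, cores[0]), (m1, cores[1])]
place_deferred([m0, m1], flows, memtile_cols=[0, 1, 2, 3])
```

Any consumer that runs before `place_deferred` would hit the `ValueError` in `require_col`, which is the kind of failure the audit is meant to surface early.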

Open questions

  • How does this interact with the M5 design rejection (compute tiles keep explicit (col, row))? Compute tiles stay constrained; only memtiles become unconstrained. Should be compatible.
  • Multi-herd-per-segment: the existing l2_memtile_column_affinity.mlir test has a 4x1 herd with per-column channels. The placer's flow-adjacency should place the memtiles correctly, given that the consumer cores are already placed at explicit coordinates such as (col, row=3).
  • Test for proper placer-driven placement: revert l2_memtile_column_affinity.mlir to the pre-deletion expected output (column-affinity placement). After PR 3, the placer should produce that output naturally.
