Commit 3aa1c92

[Path B] AIR emits unhinted LTOs; defer placement to aie-place-tiles
DEPENDS ON: mlir-aie #3068, which adds the merge-logical-tiles pass option to aie-place-tiles. Until #3068 lands and the mlir-aie pin is bumped, this commit will fail in aircc with "failed to parse pass pipeline" because aie-place-tiles won't recognize merge-logical-tiles=false.

Replaces the AIR-side placement-equivalent logic that PR Xilinx#1609 had been carrying with one mlir-aie pass option:

ShimDMAAllocator::allocNewDmaChannel
  Before: walked (col, channel) pairs starting at the herd's compute col, picked the first unused pair, and emitted a hinted aie.logical_tile<ShimNOCTile>(col, ?). This mirrored what aie-place-tiles would compute on its own; the col hint existed both to communicate placement to airrt-to-npu and to forbid the placer from merging LTOs at different cols.
  After: buckets memcpy ops by compute col (allocation_info_t.col) and emits an unhinted aie.logical_tile<ShimNOCTile>(?, ?) per bucket, packing up to shim_dma_channels per direction into one LTO. The placer assigns the physical col; merge-logical-tiles=false (set by aircc, see below) prevents the placer from collapsing AIR's pre-aggregated LTOs.
  Drops: dma_columns field, (col, channel) rotation, findLTOAtCol, same-col scoping in packet-flow reuse.

AIRToAIEPass.cpp memtile emission
  Before: aie.logical_tile<MemTile>(col, ?) per segment col.
  After: aie.logical_tile<MemTile>(?, ?) per segment col. The placer assigns cols based on flow connectivity to placed cores; merge-logical-tiles=false keeps each memtile slot on its own physical memtile.

allocateLockOp
  Before: walked all locks owned by any LTO of the same TileLike type (or same-col after the late fix in 0e9e3a8) and unioned their IDs to avoid post-collapse collisions.
  After: walks only locks owned by THIS tile. Since merge-logical-tiles=false guarantees distinct LTOs never collapse, each LTO's lock-ID space is independent.

aircc airToAiePipeline
  Adds aie.device(aie-place-tiles{merge-logical-tiles=false}) after air-merge-unrolled-devices. The saved aieModule is already placed, so aiecc's runPlacementPipeline no-ops via its hasLogicalTileOps guard, and place-tiles runs once in total.

Net diff vs prior PR HEAD: ~105 ins / 177 del in AIR (-72 LoC).
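The bucketing rule described for allocNewDmaChannel can be illustrated with a small standalone model. This is a sketch under stated assumptions, not AIR's implementation: Flow, ShimLTO, and packFlows are hypothetical names, channels are counted per direction rather than tracked as a set, and channelsPerDir is assumed to be at least 1.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not AIR or mlir-aie types.
enum class Dir { MM2S, S2MM };

struct Flow {
  int computeCol; // herd column on the compute side of the flow (bucket key)
  Dir dir;
};

struct ShimLTO {
  int bucketCol = -1; // compute col this unhinted LTO was opened for
  int usedMM2S = 0;   // channels already handed out, per direction
  int usedS2MM = 0;
};

// Assigns every flow to a logical shim tile. A flow reuses the first LTO in
// its bucket that still has a free channel in its direction; otherwise a new
// LTO is opened. Returns (LTO index, channel index) per flow, in input order.
std::vector<std::pair<int, int>> packFlows(const std::vector<Flow> &flows,
                                           int channelsPerDir,
                                           std::vector<ShimLTO> &ltos) {
  std::vector<std::pair<int, int>> result;
  for (const Flow &f : flows) {
    int chosen = -1;
    for (int i = 0; i < static_cast<int>(ltos.size()); ++i) {
      if (ltos[i].bucketCol != f.computeCol)
        continue;
      int used = (f.dir == Dir::MM2S) ? ltos[i].usedMM2S : ltos[i].usedS2MM;
      if (used < channelsPerDir) {
        chosen = i;
        break;
      }
    }
    if (chosen < 0) {
      // No LTO in this bucket has room: open a new one for this compute col.
      ShimLTO fresh;
      fresh.bucketCol = f.computeCol;
      ltos.push_back(fresh);
      chosen = static_cast<int>(ltos.size()) - 1;
    }
    int &used =
        (f.dir == Dir::MM2S) ? ltos[chosen].usedMM2S : ltos[chosen].usedS2MM;
    result.emplace_back(chosen, used); // next free channel index
    ++used;
  }
  return result;
}
```

With two channels per direction, a third MM2S flow from the same compute col opens a second LTO, while an S2MM flow from that col still fits on the first one. In this model nothing records a physical column; in the commit that choice is left to aie-place-tiles.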
1 parent 0e9e3a8 commit 3aa1c92

4 files changed

Lines changed: 105 additions & 177 deletions

mlir/include/air/Conversion/AIRToAIESchedulingUtils.h

Lines changed: 7 additions & 21 deletions
@@ -184,32 +184,18 @@ class ShimDMAAllocator : public DMAAllocator {
 
 public:
   // Per-shim DMA channel count (2 MM2S + 2 S2MM on all current targets).
-  // Used by allocNewDmaChannel for round-robin channel-index assignment;
-  // the placer's per-tile DMA channel budget then spreads logical shim
-  // tiles across physical shim columns so channel demand per column is
-  // honored.
+  // Caps how many channels AIR may pack onto one shim LTO before opening
+  // a new LTO; aie-place-tiles (with merge-ltos=false) then maps each LTO
+  // to its own physical shim col.
   int shim_dma_channels;
 
-  // ShimNOC-capable physical cols on this device, in increasing order.
-  // allocNewDmaChannel uses this for capacity-aware col rotation: when the
-  // current candidate col already has its DMA channels exhausted, the next
-  // col in the list is tried. This pre-Path-B behavior keeps AIR's col hint
-  // in agreement with the placement aie-place-tiles will pick (the placer
-  // respects the hint, but only insofar as channel capacity permits).
-  std::vector<int> dma_columns;
-
   ShimDMAAllocator(AIE::DeviceOp device);
 
   // Allocate a new shim DMA channel. The shim tile is emitted as an
-  // unconstrained aie.logical_tile<ShimNOCTile>(?, ?); mlir-aie's
-  // aie-place-tiles pass picks the physical column from flow adjacency to
-  // placed core peers and respects per-shim DMA channel capacity. The col
-  // and row int args record the OTHER side (compute side) of the flow
-  // for airrt metadata; they have nothing to do with the shim's eventual
-  // physical placement. (RFC #1567: subsumes the deletion of the
-  // `colAllocConstraint == "same_column"` heuristic, formerly attempted
-  // standalone in #1605 — that PR couldn't compile multi-column workloads
-  // because shim tiles were still pre-pinned via createTileViaPlacer.)
+  // unconstrained aie.logical_tile<ShimNOCTile>(?, ?). aie-place-tiles
+  // assigns the physical column from flow adjacency to placed core peers.
+  // The col and row int args record the OTHER side (compute side) of the
+  // flow for airrt metadata.
   FailureOr<allocation_info_t>
   allocNewDmaChannel(air::MemcpyInterface &memcpyOp, int col, int row,
                      std::vector<Operation *> &dma_ops);

mlir/lib/Conversion/AIRToAIEPass.cpp

Lines changed: 5 additions & 3 deletions
@@ -842,15 +842,17 @@ LogicalResult outlineAIEMemtiles(OpBuilder &builder, AIE::DeviceOp aie_device,
     return false;
   };
 
+  // Emit one unhinted memtile LTO per logical memtile slot the segment
+  // needs; aie-place-tiles assigns the col. The merge-ltos=false pass
+  // option (set by aircc) keeps each LTO on its own physical memtile.
   SmallVector<AIE::LogicalTileOp> logicalMemTiles;
-  auto *ctx = builder.getContext();
   for (auto x = 0; x < seg_size_x; x++) {
     auto phys_x = x + col_offset;
     if (!colHasMemTile(phys_x))
       continue;
-    auto colAttr = IntegerAttr::get(IntegerType::get(ctx, 32), phys_x);
     logicalMemTiles.push_back(AIE::LogicalTileOp::create(
-        builder, aie_device.getLoc(), AIE::AIETileType::MemTile, colAttr,
+        builder, aie_device.getLoc(), AIE::AIETileType::MemTile,
+        /*col=*/IntegerAttr(),
         /*row=*/IntegerAttr(),
         /*allocation_scheme=*/StringAttr()));
   }

mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp

Lines changed: 84 additions & 153 deletions
@@ -88,40 +88,17 @@ AIE::LockOp air::allocateLockOp(AIE::DeviceOp aie_device, AIE::TileLike tile,
   AIE::LockOp lock = nullptr;
   std::set<int> ids;
   Operation *tileOp = tile.getOperation();
-  bool tileIsLogical = isa<AIE::LogicalTileOp>(tileOp);
-  // For logical tiles, multiple distinct LTOs can collapse onto the same
-  // physical aie.tile during aie-place-tiles only when they share the same
-  // (col, tile_type) — different cols always resolve to different physical
-  // tiles. Reserve IDs across same-col same-type LTOs so post-collapse
-  // assignments don't collide. Reserving across ALL same-type LTOs (across
-  // every col) blows the per-tile lock budget in workloads like
-  // bf16_cascade where 8 memtile LTOs each need 10 locks: union'd IDs
-  // become 0..79, but the per-tile max is 63.
-  AIE::AIETileType tileType = tile.getTileType();
-  std::optional<int32_t> tileCol;
-  if (tileIsLogical)
-    tileCol = cast<AIE::LogicalTileOp>(tileOp).getCol();
+  // Each (logical or physical) tile owns its own lock-ID space. The
+  // aie-place-tiles pass is invoked with merge-ltos=false from aircc, so
+  // distinct LTOs never collapse onto a shared physical tile — no need
+  // to reserve IDs across other LTOs.
   aie_device.walk([&](AIE::LockOp l) {
-    auto lockTileOp = l.getTile().getDefiningOp();
-    bool ownerMatches = (lockTileOp == tileOp);
-    if (!ownerMatches && tileIsLogical) {
-      auto otherLT = dyn_cast_if_present<AIE::LogicalTileOp>(lockTileOp);
-      if (otherLT && otherLT.getTileType() == tileType) {
-        // Only reserve across LTOs that COULD share a physical tile post-
-        // collapse: same col hint (or both unhinted, since aie-place-tiles
-        // may put both at the same col). Differently-hinted LTOs always
-        // resolve to different cols.
-        auto otherCol = otherLT.getCol();
-        if (tileCol == otherCol)
-          ownerMatches = true;
-      }
-    }
-    if (!ownerMatches)
+    if (l.getTile().getDefiningOp() != tileOp)
       return;
     if (!l.getLockID().has_value())
       return;
     auto i = l.getLockIDValue();
-    if (lockTileOp == tileOp && i == id)
+    if (i == id)
       lock = l;
     ids.insert(i);
   });
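The lock-budget arithmetic quoted in the removed comment (8 memtile LTOs with 10 locks each) can be checked with a small standalone model. This is a sketch only: Lock, nextLockId, and highestLockId are hypothetical names, and the lowest-free-ID policy is an assumption, since the ID-selection part of allocateLockOp is not shown in this hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: a lock is just (owning tile index, lock ID).
struct Lock {
  int tile;
  int id;
};

// Lowest ID not yet reserved for `tile`. With unionAcrossTiles set, IDs held
// by every tile count as reserved (the union-across-LTOs behaviour the removed
// comment warns about); otherwise only IDs held by `tile` itself do.
int nextLockId(const std::vector<Lock> &locks, int tile,
               bool unionAcrossTiles) {
  std::set<int> reserved;
  for (const Lock &l : locks)
    if (unionAcrossTiles || l.tile == tile)
      reserved.insert(l.id);
  int id = 0;
  while (reserved.count(id))
    ++id;
  return id;
}

// Allocates locksPerTile locks on each of numTiles tiles and returns the
// highest lock ID that was handed out.
int highestLockId(int numTiles, int locksPerTile, bool unionAcrossTiles) {
  std::vector<Lock> locks;
  int highest = -1;
  for (int t = 0; t < numTiles; ++t) {
    for (int k = 0; k < locksPerTile; ++k) {
      int id = nextLockId(locks, t, unionAcrossTiles);
      locks.push_back({t, id});
      highest = std::max(highest, id);
    }
  }
  return highest;
}
```

Under these assumptions, highestLockId(8, 10, true) reaches 79, past the per-tile maximum of 63 cited in the removed comment, while highestLockId(8, 10, false) stays at 9 because each tile numbers its locks independently.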
@@ -980,10 +957,6 @@ air::TileDMAAllocator::getBuffer(uint64_t, AIE::TileOp tile,
 air::ShimDMAAllocator::ShimDMAAllocator(AIE::DeviceOp device)
     : air::DMAAllocator(device, air::MemorySpace::L3) {
   shim_dma_channels = 2;
-  const auto &tm = device.getTargetModel();
-  for (int i = 0, e = tm.columns(); i < e; i++)
-    if (tm.isShimNOCTile(i, 0))
-      dma_columns.push_back(i);
 }
 
 FailureOr<air::allocation_info_t>
@@ -1023,135 +996,88 @@ air::ShimDMAAllocator::allocNewDmaChannel(air::MemcpyInterface &memcpyOp,
     dma_ops_get_id.push_back(-1);
   }
 
-  // For packet-flow ops, reuse an existing packet-flow allocation (in the
-  // same direction AND on a shim LTO whose col hint matches the compute
-  // col) to multiplex via packet IDs at the shim DMA level. Each new entry
-  // shares the same logical tile and channel; downstream shim_dma_allocation
-  // metadata is generated per-entry. Reusing across compute cols would
-  // funnel every herd's packet flows onto a single shim — the packet
-  // routing pipeline can't disambiguate that many IDs on one port.
-  if (isPacketFlowOp) {
-    for (auto &t : *allocs) {
-      bool isPacketAlloc = false;
-      for (auto o : t.memcpyOps) {
-        auto mc = dyn_cast_if_present<air::MemcpyInterface>(o);
-        if (!mc)
+  // Bucket key: compute col. All flows from the same herd col share an
+  // unhinted shim LTO. aie-place-tiles assigns the physical col; the
+  // merge-ltos=false pass option (set by aircc) keeps each LTO on its
+  // own physical tile.
+  auto walkBucketLTOs = [&](auto fn) {
+    llvm::SmallPtrSet<Operation *, 8> seen;
+    for (auto *side : {&mm2s_allocs, &s2mm_allocs}) {
+      for (auto &t : *side) {
+        if (t.col != col)
           continue;
-        auto ct = air::getChannelType(mc);
-        if (succeeded(ct) && ct.value() == "npu_dma_packet") {
-          isPacketAlloc = true;
-          break;
-        }
-      }
-      if (!isPacketAlloc)
-        continue;
-      // Restrict reuse to allocs whose tile is the LTO at this compute
-      // col. Without this guard, a second compute col's packet flow would
-      // glom onto the first col's shim alloc (because we accept any
-      // packet alloc), producing one shim with N packet IDs instead of
-      // N shims with 1 packet ID each — which the routing pass rejects
-      // with "false packet id match".
-      if (col >= 0) {
         auto lt = dyn_cast<AIE::LogicalTileOp>(t.dma_tile.getOperation());
-        if (!lt)
+        if (!lt || lt.getTileType() != AIE::AIETileType::ShimNOCTile)
           continue;
-        auto ltCol = lt.getCol();
-        if (!ltCol || (int)*ltCol != col)
+        if (!seen.insert(lt.getOperation()).second)
           continue;
+        if (fn(lt))
+          return;
       }
-      AIE::DMAChannel aie_chan = {dir, t.dma_channel.channel};
-      allocs->push_back({t.dma_tile,
+    }
+  };
+
+  auto channelsUsedOn = [&](AIE::LogicalTileOp lt) {
+    std::set<int> used;
+    for (auto *side : {&mm2s_allocs, &s2mm_allocs})
+      for (auto &t : *side)
+        if (t.dma_tile.getOperation() == lt.getOperation() &&
+            t.dma_channel.direction == dir)
+          used.insert((int)t.dma_channel.channel);
+    return used;
+  };
+
+  // For packet flows: reuse the bucket's existing packet channel if any.
+  if (isPacketFlowOp) {
+    AIE::LogicalTileOp packetLT = nullptr;
+    int packetCh = -1;
+    walkBucketLTOs([&](AIE::LogicalTileOp lt) {
+      for (auto *side : {&mm2s_allocs, &s2mm_allocs}) {
+        for (auto &t : *side) {
+          if (t.dma_tile.getOperation() != lt.getOperation())
+            continue;
+          if (t.dma_channel.direction != dir)
+            continue;
+          for (auto o : t.memcpyOps) {
+            auto mc = dyn_cast_if_present<air::MemcpyInterface>(o);
+            if (!mc)
+              continue;
+            auto ct = air::getChannelType(mc);
+            if (succeeded(ct) && ct.value() == "npu_dma_packet") {
+              packetLT = lt;
+              packetCh = (int)t.dma_channel.channel;
+              return true;
+            }
+          }
+        }
+      }
+      return false;
+    });
+    if (packetLT) {
+      AIE::DMAChannel aie_chan = {dir, packetCh};
+      allocs->push_back({packetLT,
                          col,
                          row,
                          aie_chan,
-                         t.dma_channel.channel,
+                         packetCh,
                          /*packet_flow_id=*/-1,
                          dma_ops_get_id,
                          {memcpyOp.getOperation()}});
       return allocs->back();
     }
   }
 
-  // Capacity-aware (col, channel) selection — restored to the pre-Path-B
-  // semantics. The original allocNewDmaChannel walked
-  // (compute_col, ch=0) -> (compute_col, ch=1) -> (next_col, ch=0) -> ...
-  // and stopped at the first unused (col, channel) pair. With Path B the
-  // tile is now an aie.logical_tile<ShimNOCTile>(col, ?) (the placer picks
-  // the row), but the col hint must match what the placer will satisfy:
-  // otherwise downstream airrt-to-npu reads a hint that disagrees with the
-  // placer's eventual physical col, and NPU instructions target the wrong
-  // shim. We mirror the original loop so each LTO's col hint is the col
-  // a capacity-aware placer would pick on its own.
-  AIE::TileLike tileLT = nullptr;
-  int dma_channel = -1;
-
-  auto isUsedAtColCh = [&](int candidateCol, int ch) -> bool {
-    for (auto *side : {&mm2s_allocs, &s2mm_allocs}) {
-      for (auto &t : *side) {
-        if (t.dma_channel.direction != dir)
-          continue;
-        if ((int)t.dma_channel.channel != ch)
-          continue;
-        auto cand = dyn_cast<AIE::LogicalTileOp>(t.dma_tile.getOperation());
-        if (!cand)
-          continue;
-        if (cand.getTileType() != AIE::AIETileType::ShimNOCTile)
-          continue;
-        auto candCol = cand.getCol();
-        if (candCol && (int)*candCol == candidateCol)
-          return true;
-      }
+  // Find a bucket LTO with a free channel in this direction; else open
+  // a new unhinted shim LTO.
+  AIE::LogicalTileOp tileLT = nullptr;
+  walkBucketLTOs([&](AIE::LogicalTileOp lt) {
+    if ((int)channelsUsedOn(lt).size() < shim_dma_channels) {
+      tileLT = lt;
+      return true;
     }
     return false;
-  };
-  auto findLTOAtCol = [&](int candidateCol) -> AIE::LogicalTileOp {
-    for (auto *side : {&mm2s_allocs, &s2mm_allocs}) {
-      for (auto &t : *side) {
-        auto cand = dyn_cast<AIE::LogicalTileOp>(t.dma_tile.getOperation());
-        if (!cand)
-          continue;
-        if (cand.getTileType() != AIE::AIETileType::ShimNOCTile)
-          continue;
-        auto candCol = cand.getCol();
-        if (candCol && (int)*candCol == candidateCol)
-          return cand;
-      }
-    }
-    return nullptr;
-  };
-
-  // Find the first (col, channel) pair not yet used. Start at compute col
-  // (so shim sits near its core) and rotate through ShimNOC cols.
-  int chosenCol = -1;
-  int chosenCh = -1;
-  if (!dma_columns.empty()) {
-    int startIdx = 0;
-    if (col >= 0) {
-      auto it = std::find(dma_columns.begin(), dma_columns.end(), col);
-      if (it != dma_columns.end())
-        startIdx = it - dma_columns.begin();
-    }
-    for (int hops = 0; hops < (int)dma_columns.size() && chosenCol < 0;
-         hops++) {
-      int c = dma_columns[(startIdx + hops) % dma_columns.size()];
-      for (int ch = 0; ch < shim_dma_channels; ch++) {
-        if (!isUsedAtColCh(c, ch)) {
-          chosenCol = c;
-          chosenCh = ch;
-          break;
-        }
-      }
-    }
-  }
-  if (chosenCol < 0)
-    return memcpyOp.emitOpError("out of shim DMA channels");
-
-  // Reuse the existing LTO at chosenCol if one is there; otherwise create
-  // a new LTO. Reusing keeps the per-physical-shim aie.shim_dma op
-  // aggregated (one shim_dma per tile rather than several).
-  if (auto existing = findLTOAtCol(chosenCol)) {
-    tileLT = existing;
-  } else {
+  });
+  if (!tileLT) {
     OpBuilder b(device);
     b.setInsertionPointToStart(device.getBody());
     for (auto &op : device.getBody()->getOperations()) {
@@ -1160,19 +1086,24 @@ air::ShimDMAAllocator::allocNewDmaChannel(air::MemcpyInterface &memcpyOp,
       else
         break;
     }
-    auto *ctx = b.getContext();
-    IntegerAttr colAttr =
-        IntegerAttr::get(IntegerType::get(ctx, 32), chosenCol);
     tileLT = AIE::LogicalTileOp::create(b, device.getLoc(),
-                                        AIE::AIETileType::ShimNOCTile, colAttr,
+                                        AIE::AIETileType::ShimNOCTile,
+                                        /*col=*/IntegerAttr(),
                                         /*row=*/IntegerAttr(),
                                         /*allocation_scheme=*/StringAttr());
   }
-  dma_channel = chosenCh;
 
-  // The col/row int args here record the other side (compute side) of the
-  // flow for airrt metadata; they have nothing to do with the shim's
-  // eventual physical placement.
+  auto usedChans = channelsUsedOn(tileLT);
+  int dma_channel = -1;
+  for (int ch = 0; ch < shim_dma_channels; ch++) {
+    if (!usedChans.count(ch)) {
+      dma_channel = ch;
+      break;
+    }
+  }
+  if (dma_channel < 0)
+    return memcpyOp.emitOpError("out of shim DMA channels");
+
   return air::DMAAllocator::allocNewDmaChannel(memcpyOp, tileLT, dma_channel,
                                                col, row, dma_ops_get_id);
 }
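The packet-flow reuse step in this function can be modelled in isolation. The sketch below is not the AIR code: Alloc and findPacketAlloc are hypothetical names, and it assumes every entry on a given shim LTO carries the same compute col, so that matching on the entry's own col is equivalent to walking the bucket's LTOs.

```cpp
#include <cassert>
#include <vector>

enum class Dir { MM2S, S2MM };

// Hypothetical flattened view of one allocation entry.
struct Alloc {
  int lto;        // index of the logical shim tile holding the channel
  int computeCol; // bucket key recorded on the entry
  Dir dir;
  int channel;
  bool isPacket;  // true when the entry carries an "npu_dma_packet" memcpy
};

// Index of the first entry in the same bucket and direction that already
// carries a packet flow, or -1 if the bucket has none. A new packet flow
// would share that entry's LTO and channel instead of taking a fresh channel.
int findPacketAlloc(const std::vector<Alloc> &allocs, int computeCol,
                    Dir dir) {
  for (int i = 0; i < static_cast<int>(allocs.size()); ++i) {
    const Alloc &a = allocs[i];
    if (a.computeCol == computeCol && a.dir == dir && a.isPacket)
      return i;
  }
  return -1;
}
```

Because the search is keyed on the compute col, a packet flow from another herd col never lands on this bucket's shim channel, which is the scoping that the removed same-col guard used to enforce through the col hint.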

tools/aircc/aircc.cpp

Lines changed: 9 additions & 0 deletions
@@ -1083,6 +1083,15 @@ static LogicalResult runAieCompilation() {
       os << " stack-size=" << stackSize.getValue();
     os << "}";
     os << ",air-merge-unrolled-devices";
+#if AIR_ENABLE_AIE
+    // AIR emits unhinted shim/memtile aie.logical_tile ops. Run
+    // aie-place-tiles here so the saved aieModule already has physical
+    // aie.tile ops; aiecc's runPlacementPipeline will see no logical
+    // tiles and no-op via its hasLogicalTileOps guard.
+    // merge-logical-tiles=false keeps the placer from collapsing AIR's
+    // pre-aggregated logical tiles onto shared physical tiles.
+    os << ",aie.device(aie-place-tiles{merge-logical-tiles=false})";
+#endif
     os << ")";
   }
 

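The effect of the aircc change on the pipeline string can be seen in a reduced form. This is a sketch only: pipelineTail is a hypothetical helper, it builds just the tail of the string shown in the hunk, and a runtime flag stands in for the compile-time AIR_ENABLE_AIE guard.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds only the tail of the pass-pipeline string that
// the hunk touches. enableAie stands in for the AIR_ENABLE_AIE macro.
std::string pipelineTail(bool enableAie) {
  std::ostringstream os;
  os << ",air-merge-unrolled-devices";
  if (enableAie) {
    // Place tiles inside each aie.device, with merging of logical tiles off.
    os << ",aie.device(aie-place-tiles{merge-logical-tiles=false})";
  }
  os << ")";
  return os.str();
}
```

The option string is copied from the commit. Per the commit message, an mlir-aie build without #3068 does not recognize merge-logical-tiles and fails with "failed to parse pass pipeline".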
