From b6c08d925354f2e92758f197be28f3c827cdbc58 Mon Sep 17 00:00:00 2001
From: Erwei Wang
Date: Sun, 3 May 2026 17:10:16 +0000
Subject: [PATCH 1/6] [multi-gpu] Phase 1: namespace channel_type, add cross-rank attrs, doc plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Step toward multi-GPU messaging support per docs/MultiGPUPlan.md. Pure
IR/dialect changes; no lowering yet.

## channel_type namespace rename (Option 1)

Existing channel_type values gain a `npu_` prefix to make backend scope
explicit:

- `dma_stream` → `npu_dma_stream` (default)
- `dma_packet` → `npu_dma_packet`
- `cascade` → `npu_cascade`
- `mmio` → `npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes,
all .mlir tests, Python programming examples).

## New channel_type for GPU multi-rank messaging

- `gpu_symmetric_heap`: cross-rank channel through the symmetric heap
  runtime (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires
  put/get sites to be inside an `air.rank` scope.

## air.dma_memcpy_nd cross-rank addressing

- New optional integer attributes `src_rank` / `dst_rank` name a peer rank
  in the enclosing `air.rank` scope.
- Verifier requires:
  - an enclosing `air.rank` scope
  - the peer-side memref's `memref.alloc` (when directly available) to
    carry the `air.symmetric` attribute
- Backward-compatible builder so existing call sites compile unchanged.

## air.symmetric memref attribute

A unit attribute on `memref.alloc` indicating the allocation is backed by
the symmetric heap. Documented in docs/AIRComputeModel.md §2.7.
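
## Migrating out-of-tree IR (sketch)

The rename table above is mechanical, so IR or Python sources outside this
tree that still use the old spellings could be updated with a small script.
The following is only a sketch and is not part of this patch: the helper
name `rename_channel_types` and the regular expression are assumptions made
for illustration, and the script only handles the literal
`channel_type = "<value>"` spelling.

```python
import re

# Old -> new channel_type spellings, taken from the rename list above.
RENAMES = {
    "dma_stream": "npu_dma_stream",
    "dma_packet": "npu_dma_packet",
    "cascade": "npu_cascade",
    "mmio": "npu_mmio",
}

# Matches channel_type = "<old>" with optional whitespace around '='.
# The old name must directly follow the opening quote, so values that
# already carry a prefix (npu_*, gpu_*) are left untouched.
_PATTERN = re.compile(
    r'(channel_type\s*=\s*")(' + "|".join(map(re.escape, RENAMES)) + r')(")'
)


def rename_channel_types(text: str) -> str:
    """Hypothetical helper: rewrite legacy channel_type values in source text."""
    return _PATTERN.sub(
        lambda m: m.group(1) + RENAMES[m.group(2)] + m.group(3), text
    )
```

Because already-prefixed values do not match, running the helper twice
gives the same result as running it once.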
## Documentation

- New docs/MultiGPUPlan.md: full design and 7-phase implementation plan
- docs/AIRComputeModel.md: §2.4 cross-rank addressing, §2.7 air.symmetric,
  §2.5 channel_type table updated, §5 summary table updated

## Tests

- mlir/test/Dialect/AIR/air_cross_rank_dma.mlir (new): positive round-trip
  for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel
  put/get inside air.rank
- mlir/test/Dialect/AIR/air_channel_invalid.mlir: gpu_symmetric_heap put/get
  outside air.rank rejected; updated unsupported channel_type error message
- mlir/test/Dialect/AIR/air_memcpy_invalid.mlir: src_rank/dst_rank outside
  air.rank rejected; missing air.symmetric on alloc rejected

All 21 mlir/test/Dialect/AIR/ tests pass; GPU dma_copy and 4k_4k_mul e2e
tests pass on MI300A.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/AIRComputeModel.md                       | 65 +++++++++++--
 mlir/include/air/Dialect/AIR/AIR.td           | 93 ++++++++++++++-----
 mlir/lib/Conversion/AIRToAIEPass.cpp          | 46 ++++-----
 .../Conversion/AIRToAIESchedulingUtils.cpp    | 50 +++++-----
 mlir/lib/Conversion/ConvertToAIRPass.cpp      |  6 +-
 mlir/lib/Dialect/AIR/IR/AIRDialect.cpp        | 91 ++++++++++++++----
 mlir/lib/Transform/AIRDmaToChannel.cpp        | 12 +--
 mlir/lib/Transform/AIRHerdPlacementPass.cpp   |  2 +-
 mlir/lib/Transform/AIRLinalgCodegen.cpp       |  2 +-
 mlir/lib/Transform/AIRMiscPasses.cpp          |  2 +-
 mlir/lib/Util/Util.cpp                        |  2 +-
 .../Conversion/AIRToAIE/air_channel_mmio.mlir | 18 ++--
 .../AIRToAIE/air_channel_mmio_invalid.mlir    | 26 +++---
 .../air_channel_to_locks_core_to_core.mlir    | 12 +--
 .../AIRToAIE/air_shimcpy_to_npu.mlir          | 18 ++--
 .../bad_shim_packet_flow_npu_1col.mlir        |  2 +-
 .../good_shim_packet_flow_npu_4col.mlir       |  2 +-
 .../segment_unroll_packet_flow_ids.mlir       |  4 +-
 .../shared_shim_channel_packet_ids.mlir       |  4 +-
 .../AIRToAIE/shim_packet_flow_npu.mlir        |  4 +-
 .../AIRToAIE/shim_pkt_channel_sharing.mlir    |  6 +-
 .../ConvertToAIR/scf_parallel_to_herd.mlir    |  2 +-
 mlir/test/Dialect/AIR/air_canonicalize.mlir   |  2 +-
 mlir/test/Dialect/AIR/air_channel.mlir        | 25 +++--
 .../test/Dialect/AIR/air_channel_invalid.mlir | 38 ++++++--
 mlir/test/Dialect/AIR/air_cross_rank_dma.mlir | 75 +++++++++++++++
 mlir/test/Dialect/AIR/air_memcpy_invalid.mlir | 50 ++++++++++
 .../fuse_channels.mlir                        |  6 +-
 .../dma_to_channel_auto_packet.mlir           |  4 +-
 .../dma_to_channel_auto_packet_broadcast.mlir | 32 +++----
 ...ma_to_channel_auto_packet_single_herd.mlir |  4 +-
 .../dma_to_channel_no_auto_packet.mlir        |  2 +-
 .../AIRHerdPlacement/cascade_placement.mlir   |  4 +-
 .../AIRMiscPasses/air_collapse_herd.mlir      |  6 +-
 .../AIRMiscPasses/air_split_l2_memref.mlir    | 12 +--
 .../cascade_reduction/cascade_reduction.py    |  2 +-
 .../channel_3d_segment_unroll.py              |  2 +-
 .../dual_herd_packet_switch.py                |  2 +-
 .../channel_examples/mmio/mmio.py             |  6 +-
 .../flash_attention/dataflow_based/attn.py    |  2 +-
 .../kernel_fusion_based/attn_npu1.py          |  6 +-
 .../kernel_fusion_based/attn_npu2.py          |  6 +-
 programming_examples/herd_dataflow/air.mlir   |  4 +-
 programming_examples/herd_dataflow/run.py     |  6 +-
 .../bf16_cascade/matvec_cascade.py            |  2 +-
 45 files changed, 536 insertions(+), 231 deletions(-)
 create mode 100644 mlir/test/Dialect/AIR/air_cross_rank_dma.mlir

diff --git a/docs/AIRComputeModel.md b/docs/AIRComputeModel.md
index 0c45c3ff4..a9688776e 100644
--- a/docs/AIRComputeModel.md
+++ b/docs/AIRComputeModel.md
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
 The compiler may **reshape** the iteration space (e.g., collapse a 2D herd into
 a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
 automatically when the herd body uses cascade channels (`channel_type =
- "cascade"`), because cascade connections are topology-dependent and cannot
+ "npu_cascade"`), because cascade connections are topology-dependent and cannot
 survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`, `x_size`,
 `y_size`) on the enclosing segment also constrain the legal shapes by fixing the
 tile footprint.
The pass accepts a `max-col-size` option to @@ -670,13 +670,29 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref is addressed with unit strides. +#### Cross-rank addressing (multi-GPU) + +Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the +enclosing `air.rank` scope. When present, the corresponding memref is +interpreted as living on rank R's symmetric heap rather than on the local +process. The verifier requires the op to be enclosed by an `air.rank` and the +referenced memref to be `air.symmetric`-tagged (see §2.7). The GPU backend +(`air-to-rocdl`) lowers cross-rank DMAs through `mgpuGetHeapBases()`-based +peer addressing; the NPU backend does not support these attributes. + +``` +// Read 1024 floats from rank 0's symmetric buffer into local L1. +air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64} + : (memref<1024xf32, 2>, memref<1024xf32, 0>) +``` + --- ### 2.5 `air.channel`, `air.channel.put`, `air.channel.get` ``` // Channel declaration — at module scope -air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = } +air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = } // Synchronous put/get — block until the transfer completes air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src) @@ -696,13 +712,17 @@ them independently and to introduce double-buffering. A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on `put`/`get` selects the specific channel within the array. -The `channel_type` attribute controls the underlying mechanism: +The `channel_type` attribute controls the underlying mechanism. Values are +namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels +use the `gpu_` prefix. 
| Value | Mechanism | |-------|-----------| -| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect | -| `"dma_packet"` | DMA engines with packet-switched interconnect | -| `"cascade"` | Core-to-core cascade connections between adjacent tiles | +| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect | +| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect | +| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles | +| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer | +| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. | The `broadcast_shape` attribute enables one-to-many communication following NumPy broadcasting rules. @@ -796,6 +816,27 @@ in the async dependency graph. --- +### 2.7 `air.symmetric` memref attribute (multi-GPU) + +A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate +that the allocation should be backed by the **symmetric heap** runtime. Every +rank in the enclosing `air.rank` scope performs the same allocation in lockstep, +so each rank has a memref of the same size at the same offset within the heap. +Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes +or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer +ranks' symmetric memrefs at the same logical offset. + +``` +%buf = memref.alloc() {air.symmetric} : memref<1024xf32> +``` + +The GPU lowering routes such allocations through `mgpuSymmetricAlloc` +(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`. +Peer ranks' base pointers are obtained via `mgpuGetHeapBases()`. The NPU +backend does not interpret this attribute. + +--- + ## 3. 
NPU (AIE) Backend Mapping On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps @@ -999,7 +1040,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete | L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch | | L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) | | L3 (space 0) | DDR via NOC | HBM via global memory | -| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops | -| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) | -| Synchronization | AIE locks | `gpu.barrier` | +| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops | +| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) | — | Symmetric heap peer addressing (planned) | +| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a | +| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a | +| `channel` (`npu_cascade`) | Core cascade interface | n/a | +| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a | +| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) | +| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) | +| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) | | `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies | diff --git a/mlir/include/air/Dialect/AIR/AIR.td b/mlir/include/air/Dialect/AIR/AIR.td index 4832355c4..475ab8bf7 100644 --- a/mlir/include/air/Dialect/AIR/AIR.td +++ b/mlir/include/air/Dialect/AIR/AIR.td @@ -477,7 +477,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd", Variadic:$src_sizes, Variadic:$src_strides, OptionalAttr:$pad_before, - OptionalAttr:$pad_after + OptionalAttr:$pad_after, + OptionalAttr:$src_rank, + OptionalAttr:$dst_rank ); let results = (outs Optional:$async_token); let assemblyFormat = [{ @@ -487,7 +489,14 @@ def air_DmaMemcpyNdOp: 
air_Op<"dma_memcpy_nd", `(` type($dst) `,` type($src) `)` }]; let description = [{ - dma operator + N-dimensional strided bulk copy between two memrefs. + + Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the + enclosing `air.rank` scope. When present, the corresponding memref is + interpreted as living on rank R's symmetric heap rather than on the local + process. These attributes are only valid for `air.symmetric`-tagged memref + allocations and require an enclosing `air.rank`. They are currently only + supported by the GPU lowering (`air-to-rocdl`). }]; let extraClassDeclaration = [{ Value getSrcMemref() { return getSrc(); } @@ -501,7 +510,31 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd", bool hasPadding() { return getPadBefore().has_value(); } + bool hasCrossRank() { + return getSrcRank().has_value() || getDstRank().has_value(); + } }]; + let builders = [ + // Backward-compatible builder: defaults src_rank/dst_rank to absent. + OpBuilder<(ins "::mlir::TypeRange":$resultTypes, + "::mlir::ValueRange":$async_dependencies, + "::mlir::Value":$dst, + "::mlir::ValueRange":$dst_offsets, + "::mlir::ValueRange":$dst_sizes, + "::mlir::ValueRange":$dst_strides, + "::mlir::Value":$src, + "::mlir::ValueRange":$src_offsets, + "::mlir::ValueRange":$src_sizes, + "::mlir::ValueRange":$src_strides, + "::mlir::DenseI32ArrayAttr":$pad_before, + "::mlir::DenseI32ArrayAttr":$pad_after), [{ + build($_builder, $_state, resultTypes, async_dependencies, dst, + dst_offsets, dst_sizes, dst_strides, src, + src_offsets, src_sizes, src_strides, pad_before, pad_after, + /*src_rank=*/IntegerAttr(), + /*dst_rank=*/IntegerAttr()); + }]> + ]; let hasCanonicalizer = 1; let hasVerifier = 1; } @@ -535,7 +568,7 @@ def air_WaitAllOp: air_Op<"wait_all", [air_AsyncOpInterface]> { def air_ChannelOp : air_Op<"channel", [Symbol]>, Arguments<(ins SymbolNameAttr:$sym_name, DefaultValuedAttr:$size, - DefaultValuedAttr:$channel_type)> { + DefaultValuedAttr:$channel_type)> { let 
assemblyFormat = [{ $sym_name $size attr-dict }]; @@ -543,18 +576,22 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>, let description = [{ Operation to represent a communication channel as a point-to-point connection between two memrefs. The array following the channel name symbol represents the channel's dimensional sizes. Default - size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled + size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled by the `channel_type` attribute. ### Channel Types - The `channel_type` attribute is a string that determines the mechanism used for data movement: - - **"dma_stream"** (default): + The `channel_type` attribute is a string that determines the mechanism used for data movement. + Values are namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels use + the `gpu_` prefix. + + NPU (AIE) channel types: + - **"npu_dma_stream"** (default): Use DMA engines to send and receive data, with routing performed over a streaming interconnect. - - **"dma_packet"**: + - **"npu_dma_packet"**: Use DMA engines to send and receive data, with routing performed over a packet-switched network. - - **"cascade"**: + - **"npu_cascade"**: Use processor cores to send and receive data via cascade connections between adjacent tiles. - - **"mmio"**: + - **"npu_mmio"**: Use host-side MMIO writes (e.g. `aiex.npu.blockwrite`) issued from the runtime sequence to deliver a constant payload directly into a tile-local L1 buffer. No DMA channel, no shim allocation, no flow is reserved. @@ -565,32 +602,44 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>, `memref.get_global`. The consumer-side `get` lowers to a no-op because the L1 buffer is already populated when the core begins executing. + GPU channel types: + - **"gpu_symmetric_heap"**: + Cross-GPU messaging through the symmetric heap runtime + (`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). 
The channel must be enclosed + by an `air.rank` op; the put/get sites use rank indices to address peer + heaps. Lowered by `air-to-rocdl` to thread-cooperative loops over peer-mapped + VMem buffers, with synchronization via in-heap notify flags or `mgpuBarrier`. + ### Broadcasting - If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute + If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute annotates the output sizes after broadcasting. Broadcasting follows NumPy's broadcasting rules. Example: ```mlir - // An array of 4 x 4 streaming DMA channels - air.channel @channel_0 [4, 4] {channel_type = "dma_stream"} + // An array of 4 x 4 streaming DMA channels (NPU) + air.channel @channel_0 [4, 4] {channel_type = "npu_dma_stream"} - // A streaming DMA channel broadcasting to 4 destinations - air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_stream"} + // A streaming DMA channel broadcasting to 4 destinations (NPU) + air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_stream"} - // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations. + // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations (NPU). // Broadcasting follows NumPy's rules. 
- air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "dma_stream"} + air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "npu_dma_stream"} - // A packet-switched DMA channel - air.channel @channel_3 [] {channel_type = "dma_packet"} + // A packet-switched DMA channel (NPU) + air.channel @channel_3 [] {channel_type = "npu_dma_packet"} - // A cascade channel using core-to-core cascade connections - air.channel @channel_4 [] {channel_type = "cascade"} + // A cascade channel using core-to-core cascade connections (NPU) + air.channel @channel_4 [] {channel_type = "npu_cascade"} // An MMIO channel: the put writes a constant from host into L1 of each - // get's destination tile via runtime-sequence blockwrites - air.channel @channel_5 [] {channel_type = "mmio"} + // get's destination tile via runtime-sequence blockwrites (NPU) + air.channel @channel_5 [] {channel_type = "npu_mmio"} + + // A cross-GPU channel through the symmetric heap (GPU). Must appear inside + // an air.rank scope; the indices on put/get encode the peer rank. + air.channel @channel_6 [] {channel_type = "gpu_symmetric_heap"} ``` }]; let extraClassDeclaration = [{ diff --git a/mlir/lib/Conversion/AIRToAIEPass.cpp b/mlir/lib/Conversion/AIRToAIEPass.cpp index 71842b7f2..6591e63f6 100644 --- a/mlir/lib/Conversion/AIRToAIEPass.cpp +++ b/mlir/lib/Conversion/AIRToAIEPass.cpp @@ -2542,7 +2542,7 @@ struct SpecializeChannelBundlePattern // host-side puts (they sit outside the device, where this pattern's // rewrites don't reach), leaving them to fail later as // "no matching device-side air.channel.get". 
- if (channel.getChannelType() == "mmio") + if (channel.getChannelType() == "npu_mmio") return failure(); std::vector channelPuts = @@ -4017,7 +4017,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { bool isShimFlow = f.MM2S_alloc.getDmaTile().isShimNOCorPLTile() || f.S2MM_alloc[i].getDmaTile().isShimNOCorPLTile(); - if (f.memcpyResourceType == "dma_packet") { + if (f.memcpyResourceType == "npu_dma_packet") { // Use appropriate flow map based on whether flow involves shim tiles if (isShimFlow) { // Device-host flows use global shim flow ID @@ -4059,12 +4059,12 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { // assignment. intraDeviceFlowID = std::max(intraDeviceFlowID, flowID); } - } else if (f.memcpyResourceType == "dma_stream") + } else if (f.memcpyResourceType == "npu_dma_stream") getFlowOp(aie_device, f.MM2S_alloc.getDmaTile(), AIE::WireBundle::DMA, (uint32_t)f.MM2S_alloc.dma_channel.channel, f.S2MM_alloc[i].getDmaTile(), AIE::WireBundle::DMA, (uint32_t)f.S2MM_alloc[i].dma_channel.channel); - else if (f.memcpyResourceType == "cascade") { + else if (f.memcpyResourceType == "npu_cascade") { getCascadeFlowOp(aie_device, f.MM2S_alloc.getDmaTile(), AIE::WireBundle::DMA, (uint32_t)f.MM2S_alloc.dma_channel.channel, @@ -5408,19 +5408,19 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { return aieDmaBdOp; } - // Converts an air.channel.put/get operation with channel_type = "cascade" + // Converts an air.channel.put/get operation with channel_type = "npu_cascade" // into aie.get/put_cascade + vector.transfer_read/write sequence. // The conversion flattens the entire memref into a 1-D vector to match // the cascade data format expected by the AIE put/get_cascade ops. LogicalResult ConvertCascadeChannelIfToAIE(RewriterBase &rewriter, air::ChannelInterface op) { - // Match only if the associated channel has channel_type = "cascade". + // Match only if the associated channel has channel_type = "npu_cascade". 
auto chan = air::getChannelDeclarationThroughSymbol(op); if (!chan) return op->emitOpError("cannot resolve channel symbol"); - if (chan.getChannelType().str() != "cascade") - return op->emitOpError("channel_type is not cascade"); + if (chan.getChannelType().str() != "npu_cascade") + return op->emitOpError("channel_type is not npu_cascade"); Location loc = op.getLoc(); Value memref = op.getMemref(); @@ -5492,13 +5492,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { FailureOr TileCascadeChannelIfUsingScfFor( RewriterBase &rewriter, air::ChannelInterface op, unsigned cascadeWidth) { - // Match only if the associated channel has channel_type = "cascade". + // Match only if the associated channel has channel_type = "npu_cascade". auto chan = air::getChannelDeclarationThroughSymbol(op); if (!chan) return op->emitOpError("cannot resolve channel symbol"); - if (chan.getChannelType().str() != "cascade") - return op->emitOpError("channel_type is not cascade"); + if (chan.getChannelType().str() != "npu_cascade") + return op->emitOpError("channel_type is not npu_cascade"); Location loc = op.getLoc(); Value memref = op.getMemref(); @@ -5610,7 +5610,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { // Lower mmio-typed channels into runtime-sequence MMIO writes. // - // For each `air.channel @c [...] {channel_type = "mmio"}`: + // For each `air.channel @c [...] 
{channel_type = "npu_mmio"}`: // * each `air.channel.get @c` inside an `aie.core` is replaced by an // erase — the destination L1 `aie.buffer` is populated by the host // before the core runs, so the get is a no-op; @@ -5636,7 +5636,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { SmallVector mmioChannels; auto collectMMIO = [&](Operation *root) { root->walk([&](air::ChannelOp chan) { - if (chan.getChannelType() == "mmio") + if (chan.getChannelType() == "npu_mmio") if (!llvm::is_contained(mmioChannels, chan)) mmioChannels.push_back(chan); }); @@ -5802,7 +5802,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { StringRef kind) -> LogicalResult { if (constIndices(indices)) return success(); - return op->emitOpError("channel_type=\"mmio\" non-broadcast ") + return op->emitOpError("channel_type=\"npu_mmio\" non-broadcast ") << kind << " requires compile-time constant indices"; }; for (auto put : hostPuts) @@ -5832,7 +5832,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { memref::GetGlobalOp getGlobalOp = getSourceGlobal(src); if (!getGlobalOp) return put.emitOpError( - "channel_type=\"mmio\" put requires source memref defined by " + "channel_type=\"npu_mmio\" put requires source memref defined by " "memref.get_global of a constant memref.global"); StringAttr origName = getGlobalOp.getNameAttr().getAttr(); @@ -5844,7 +5844,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { : nullptr); if (!moduleGlobal) return getGlobalOp.emitOpError( - "channel_type=\"mmio\" lowering: cannot find memref.global " + "channel_type=\"npu_mmio\" lowering: cannot find memref.global " "for the put source at module scope"); auto initOpt = moduleGlobal.getInitialValue(); @@ -5852,7 +5852,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { initOpt ? 
dyn_cast(*initOpt) : nullptr; if (!initDense) return put.emitOpError( - "channel_type=\"mmio\" source memref.global must have a " + "channel_type=\"npu_mmio\" source memref.global must have a " "DenseElementsAttr initializer"); unsigned matchCount = 0; @@ -5862,7 +5862,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { AIE::BufferOp bufferOp = getDefiningBuffer(get.getMemref()); if (!bufferOp) return get.emitOpError( - "channel_type=\"mmio\" get destination does not resolve to " + "channel_type=\"npu_mmio\" get destination does not resolve to " "an aie.buffer (must be an L1 allocation)"); // Element type and total element count must match between source @@ -5872,13 +5872,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { auto srcMemTy = cast(getGlobalOp.getType()); if (bufMemTy.getElementType() != srcMemTy.getElementType()) return get.emitOpError( - "channel_type=\"mmio\" source/destination element type " + "channel_type=\"npu_mmio\" source/destination element type " "mismatch (source: ") << srcMemTy.getElementType() << ", destination: " << bufMemTy.getElementType() << ")"; if (bufMemTy.getNumElements() != srcMemTy.getNumElements()) return get.emitOpError( - "channel_type=\"mmio\" source/destination element count " + "channel_type=\"npu_mmio\" source/destination element count " "mismatch (source: ") << srcMemTy.getNumElements() << ", destination: " << bufMemTy.getNumElements() << ")"; @@ -5891,13 +5891,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { if (auto existing = bufferOp.getInitialValue()) return bufferOp.emitOpError( - "channel_type=\"mmio\" destination aie.buffer already has an " + "channel_type=\"npu_mmio\" destination aie.buffer already has an " "initial_value; cannot stamp two sources into one buffer"); bufferOp.setInitialValueAttr(reshapedInit); ++matchCount; } if (matchCount == 0) - return put.emitOpError("channel_type=\"mmio\" put has no matching " + return put.emitOpError("channel_type=\"npu_mmio\" put has no matching " 
"device-side air.channel.get"); } @@ -6237,7 +6237,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { for (auto &alloc : tileDmaAlloc.s2mm_allocs) alloc.memcpyOps.clear(); - // Lower channel_type="mmio" puts/gets into runtime-sequence blockwrites + // Lower channel_type="npu_mmio" puts/gets into runtime-sequence blockwrites // before the generic erase loop below removes the underlying air ops. // Only meaningful for the ChannelInterface specialization; for the // DmaMemcpyNd specialization there are no air.channel ops to convert. diff --git a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp index a7da49eef..8d429249b 100644 --- a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp +++ b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp @@ -599,7 +599,7 @@ bool xilinx::air::allocation_info_t::foundPacketFlowAllocInTile(int32_t col, continue; auto chanTypeRes = air::getChannelType(memcpy_op); if (succeeded(chanTypeRes)) - return chanTypeRes.value().str() == "dma_packet"; + return chanTypeRes.value().str() == "npu_dma_packet"; } return false; } @@ -865,7 +865,7 @@ air::TileDMAAllocator::simpleDmaChannelAlloc(air::MemcpyInterface &memcpyOp, bool isPacketFlowOp = false; auto chanTypeRes = getChannelType(memcpyOp); if (succeeded(chanTypeRes)) { - isPacketFlowOp = chanTypeRes.value().str() == "dma_packet"; + isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet"; } // Search for existing dma channel allocation @@ -942,7 +942,7 @@ FailureOr air::ShimDMAAllocator::allocNewDmaChannel( bool isPacketFlowOp = false; auto chanTypeRes = getChannelType(memcpyOp); if (succeeded(chanTypeRes)) { - isPacketFlowOp = chanTypeRes.value().str() == "dma_packet"; + isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet"; } // Search for existing dma channel allocation @@ -1158,7 +1158,7 @@ air::MemTileDMAAllocator::simpleDmaChannelAlloc(air::MemcpyInterface &memcpyOp, bool isPacketFlowOp = false; auto chanTypeRes = 
getChannelType(memcpyOp); if (succeeded(chanTypeRes)) { - isPacketFlowOp = chanTypeRes.value().str() == "dma_packet"; + isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet"; } // Search for existing dma channel allocation @@ -1379,7 +1379,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::DmaMemcpyNdOp memcpyOp) { S2MM_memspace = *dstMS; MM2S.push_back(memcpyOp.getOperation()); MM2S_memspace = *srcMS; - memcpyResourceType = "dma_stream"; + memcpyResourceType = "npu_dma_stream"; return success(); } @@ -1391,7 +1391,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::ChannelGetOp memcpyOp) { // broadcast/index-matching logic below, which assumes hardware fanout. // Record the resource type (so downstream code can skip mmio bundles) // and return — the dedicated mmio lowering pass handles the rest. - if (chan.getChannelType() == "mmio") { + if (chan.getChannelType() == "npu_mmio") { air_flow_op = chan.getOperation(); S2MM[alloc_id].push_back(memcpyOp.getOperation()); auto getMS = air::getMemorySpace( @@ -1399,7 +1399,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::ChannelGetOp memcpyOp) { if (!getMS) return memcpyOp->emitOpError("unrecognized memory space on memref"); S2MM_memspace = *getMS; - memcpyResourceType = "mmio"; + memcpyResourceType = "npu_mmio"; return success(); } if (chan->hasAttr("broadcast_shape")) { @@ -1470,7 +1470,7 @@ air::MemcpyBundleAsFlow::MemcpyBundleAsFlow(air::DmaMemcpyNdOp dmaMemcpyOp) { std::vector()); S2MM = v1; S2MM_alloc = std::vector(numS2MMAllocs); - memcpyResourceType = "dma_stream"; + memcpyResourceType = "npu_dma_stream"; } air::MemcpyBundleAsFlow::MemcpyBundleAsFlow(air::ChannelOp chan) { @@ -1509,7 +1509,7 @@ LogicalResult air::simpleDMAChannelAllocation( // not DMA. They consume no DMA channel, BD, or routing resource and // bypass allocation entirely. Their put/get pairs are converted by a // dedicated late pass (see lowerAIRMMIOChannelOps). 
- if (f.memcpyResourceType == "mmio") + if (f.memcpyResourceType == "npu_mmio") continue; if (f.MM2S_memspace == air::MemorySpace::L1) { for (auto o : f.MM2S) { @@ -1524,13 +1524,13 @@ LogicalResult air::simpleDMAChannelAllocation( int y = tile.getRow(); FailureOr alloc_res; - if (f.memcpyResourceType == "dma_stream" || - f.memcpyResourceType == "dma_packet") { + if (f.memcpyResourceType == "npu_dma_stream" || + f.memcpyResourceType == "npu_dma_packet") { alloc_res = tile_dma_alloc.simpleDmaChannelAlloc( memcpyOpIf, x, y, f.MM2S_alloc.dma_channel.channel); if (failed(alloc_res)) return failure(); - } else if (f.memcpyResourceType == "cascade") { + } else if (f.memcpyResourceType == "npu_cascade") { alloc_res = core_cascade_alloc.coreCascadeAlloc(memcpyOpIf); if (failed(alloc_res)) return failure(); @@ -1555,13 +1555,13 @@ LogicalResult air::simpleDMAChannelAllocation( int y = tile.getRow(); FailureOr alloc_res; - if (f.memcpyResourceType == "dma_stream" || - f.memcpyResourceType == "dma_packet") { + if (f.memcpyResourceType == "npu_dma_stream" || + f.memcpyResourceType == "npu_dma_packet") { alloc_res = tile_dma_alloc.simpleDmaChannelAlloc( memcpyOpIf, x, y, f.S2MM_alloc[i].dma_channel.channel); if (failed(alloc_res)) return failure(); - } else if (f.memcpyResourceType == "cascade") { + } else if (f.memcpyResourceType == "npu_cascade") { alloc_res = core_cascade_alloc.coreCascadeAlloc(memcpyOpIf); if (failed(alloc_res)) return failure(); @@ -1576,15 +1576,15 @@ LogicalResult air::simpleDMAChannelAllocation( } for (auto &f : memcpy_flows) { // MMIO channels are not allocated to any DMA resource at L2 either. - if (f.memcpyResourceType == "mmio") + if (f.memcpyResourceType == "npu_mmio") continue; if (f.MM2S_memspace == air::MemorySpace::L2) { for (auto o : f.MM2S) { auto memcpyOpIf = cast(o); // Report error if the data movement lowers to neither dma stream // (aie.flow) nor dma packet flow (aie.packet_flow). 
- if (f.memcpyResourceType != "dma_stream" && - f.memcpyResourceType != "dma_packet") + if (f.memcpyResourceType != "npu_dma_stream" && + f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError("only supports dma_stream or " "dma_packet connections at L2 memory"); auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf); @@ -1599,8 +1599,8 @@ LogicalResult air::simpleDMAChannelAllocation( auto memcpyOpIf = cast(o); // Report error if the data movement lowers to neither dma stream // (aie.flow) nor dma packet flow (aie.packet_flow). - if (f.memcpyResourceType != "dma_stream" && - f.memcpyResourceType != "dma_packet") + if (f.memcpyResourceType != "npu_dma_stream" && + f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError( "only supports dma_stream or dma_packet connections at L2 " "memory"); @@ -1614,7 +1614,7 @@ LogicalResult air::simpleDMAChannelAllocation( } for (auto &f : memcpy_flows) { // MMIO channels are not allocated to any shim DMA resource. - if (f.memcpyResourceType == "mmio") + if (f.memcpyResourceType == "npu_mmio") continue; if (f.MM2S_memspace == air::MemorySpace::L3) { for (size_t i = 0; i < f.S2MM.size(); i++) { @@ -1622,8 +1622,8 @@ LogicalResult air::simpleDMAChannelAllocation( auto memcpyOpIf = cast(o); // Report error if the data movement lowers to neither dma stream // (aie.flow) nor dma packet flow (aie.packet_flow). - if (f.memcpyResourceType != "dma_stream" && - f.memcpyResourceType != "dma_packet") + if (f.memcpyResourceType != "npu_dma_stream" && + f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError( "only supports dma_stream or dma_packet connections at L3 " "memory"); @@ -1650,8 +1650,8 @@ LogicalResult air::simpleDMAChannelAllocation( auto memcpyOpIf = cast(o); // Report error if the data movement lowers to neither dma stream // (aie.flow) nor dma packet flow (aie.packet_flow). 
- if (f.memcpyResourceType != "dma_stream" && - f.memcpyResourceType != "dma_packet") + if (f.memcpyResourceType != "npu_dma_stream" && + f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError("only supports dma_stream or " "dma_packet connections at L3 memory"); if (!f.MM2S_alloc.getDmaTile()) diff --git a/mlir/lib/Conversion/ConvertToAIRPass.cpp b/mlir/lib/Conversion/ConvertToAIRPass.cpp index ed41b3d38..ed6f9b593 100644 --- a/mlir/lib/Conversion/ConvertToAIRPass.cpp +++ b/mlir/lib/Conversion/ConvertToAIRPass.cpp @@ -781,7 +781,7 @@ separateScfParallel(scf::ParallelOp op, unsigned innerNumLoops, // Create a new air.channel symbol in the module for the cascade pipeline. // The symbol name is unique in the module, and the channel is tagged with -// the "cascade" attribute. +// the "npu_cascade" attribute. air::ChannelOp createCascadeChannelOp(OpBuilder &builder, ModuleOp module, Location loc, SmallVector channel_bundle_sizes) { @@ -797,10 +797,10 @@ createCascadeChannelOp(OpBuilder &builder, ModuleOp module, Location loc, o = o->getNextNode(); builder.setInsertionPoint(o); - // Create the channel op with the given bundle sizes and "cascade" tag. + // Create the channel op with the given bundle sizes and "npu_cascade" tag. auto channel_op = air::ChannelOp::create( builder, loc, cname, builder.getI64ArrayAttr(channel_bundle_sizes), - builder.getStringAttr("cascade")); + builder.getStringAttr("npu_cascade")); return channel_op; } diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp index b0fe068a1..2fc448a0a 100644 --- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp +++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp @@ -2811,6 +2811,36 @@ LogicalResult air::DmaMemcpyNdOp::verify() { return emitOpError("padding values must be <= 65535"); } } + + // Cross-rank addressing requires an enclosing air.rank scope, and the + // peer-side memref (when it is a direct memref.alloc result) must carry + // the air.symmetric attribute. 
+ if (hasCrossRank()) { + Operation *p = (*this)->getParentOp(); + while (p && !isa(p)) + p = p->getParentOp(); + if (!p) + return emitOpError("src_rank/dst_rank attributes require an enclosing " + "air.rank scope"); + + auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult { + auto alloc = v.getDefiningOp(); + if (!alloc) + return success(); // Block args / non-alloc sources are user-trusted. + if (!alloc->hasAttr("air.symmetric")) + return emitOpError() << side + << " memref is referenced cross-rank but its " + "memref.alloc lacks the \"air.symmetric\" " + "attribute"; + return success(); + }; + if (getSrcRank().has_value() && + failed(requireSymmetricAlloc(getSrc(), "src"))) + return failure(); + if (getDstRank().has_value() && + failed(requireSymmetricAlloc(getDst(), "dst"))) + return failure(); + } return success(); } @@ -3075,9 +3105,9 @@ static LogicalResult ComposeMemrefOpOnChannelOp(OpT op, if (!chan) // If the channel declaration cannot be resolved, signal a failure. return failure(); - // If the channel is of type "cascade", try to fold memref.cast but skip full + // If the channel is of type "npu_cascade", try to fold memref.cast but skip full // composition - if (chan.getChannelType() == "cascade") + if (chan.getChannelType() == "npu_cascade") return FoldMemrefCastOnChannelOp(op, rewriter); // Init. memref type and offsets from memref's defining op's input type @@ -3157,18 +3187,29 @@ LogicalResult air::ChannelPutOp::verify() { "must not be temporal scf.for induction variables"; } - // For channel_type="mmio", the put runs from the host runtime sequence + // For channel_type="npu_mmio", the put runs from the host runtime sequence // and writes into a tile-local L1 buffer. Its source memref must // therefore live in L3 (host memory). Allow lookup-failure to silently // pass — that's a separate diagnostic surface. 
if (auto chan = resolveChannelDecl(*this)) { - if (chan.getChannelType() == "mmio") { + if (chan.getChannelType() == "npu_mmio") { auto memrefTy = dyn_cast(getMemref().getType()); if (memrefTy && memrefTy.getMemorySpaceAsInt() != 0) - return emitOpError() << "channel_type=\"mmio\" put source must be " + return emitOpError() << "channel_type=\"npu_mmio\" put source must be " "in L3 (memory_space=0), got memory_space=" << memrefTy.getMemorySpaceAsInt(); } + // For channel_type="gpu_symmetric_heap", the put must be inside an + // air.rank scope (cross-rank addressing requires a multi-rank world). + if (chan.getChannelType() == "gpu_symmetric_heap") { + Operation *p = (*this)->getParentOp(); + while (p && !isa(p)) + p = p->getParentOp(); + if (!p) + return emitOpError() + << "channel_type=\"gpu_symmetric_heap\" put requires an " + "enclosing air.rank scope"; + } } auto padBefore = getPadBefore(); @@ -3246,17 +3287,29 @@ LogicalResult air::ChannelGetOp::verify() { "must not be temporal scf.for induction variables"; } - // For channel_type="mmio", the destination must be a tile-local L1 buffer - // (memory_space=2): the host writes into it via runtime-sequence + // For channel_type="npu_mmio", the destination must be a tile-local L1 + // buffer (memory_space=2): the host writes into it via runtime-sequence // blockwrites before the consuming core starts. L2/L3 destinations have // no representation in the lowered IR. 
if (auto chan = resolveChannelDecl(*this)) { - if (chan.getChannelType() == "mmio") { + if (chan.getChannelType() == "npu_mmio") { auto memrefTy = dyn_cast(getMemref().getType()); if (memrefTy && memrefTy.getMemorySpaceAsInt() != 2) - return emitOpError() << "channel_type=\"mmio\" get destination must be " - "in L1 (memory_space=2), got memory_space=" - << memrefTy.getMemorySpaceAsInt(); + return emitOpError() + << "channel_type=\"npu_mmio\" get destination must be " + "in L1 (memory_space=2), got memory_space=" + << memrefTy.getMemorySpaceAsInt(); + } + // For channel_type="gpu_symmetric_heap", the get must be inside an + // air.rank scope. + if (chan.getChannelType() == "gpu_symmetric_heap") { + Operation *p = (*this)->getParentOp(); + while (p && !isa(p)) + p = p->getParentOp(); + if (!p) + return emitOpError() + << "channel_type=\"gpu_symmetric_heap\" get requires an " + "enclosing air.rank scope"; } } @@ -3290,14 +3343,18 @@ void air::ChannelGetOp::getCanonicalizationPatterns(RewritePatternSet &patterns, // LogicalResult air::ChannelOp::verify() { - // Allow-list of supported channel_type values. Adding a new transport - // requires both an enum entry here and a lowering branch in air-to-aie. + // Allow-list of supported channel_type values. Values are namespaced by + // backend: NPU (AIE) channels use the "npu_" prefix, GPU channels use the + // "gpu_" prefix. Adding a new transport requires both an entry here and a + // lowering branch in the appropriate conversion pass. 
StringRef chanType = getChannelType(); - if (chanType != "dma_stream" && chanType != "dma_packet" && - chanType != "cascade" && chanType != "mmio") + if (chanType != "npu_dma_stream" && chanType != "npu_dma_packet" && + chanType != "npu_cascade" && chanType != "npu_mmio" && + chanType != "gpu_symmetric_heap") return emitOpError() << "unsupported channel_type \"" << chanType - << "\"; expected one of \"dma_stream\", " - "\"dma_packet\", \"cascade\", or \"mmio\""; + << "\"; expected one of \"npu_dma_stream\", " + "\"npu_dma_packet\", \"npu_cascade\", " + "\"npu_mmio\", or \"gpu_symmetric_heap\""; if (isBroadcast()) { auto bundle_size = getSize(); diff --git a/mlir/lib/Transform/AIRDmaToChannel.cpp b/mlir/lib/Transform/AIRDmaToChannel.cpp index 522d9809a..a68705a3d 100644 --- a/mlir/lib/Transform/AIRDmaToChannel.cpp +++ b/mlir/lib/Transform/AIRDmaToChannel.cpp @@ -495,7 +495,7 @@ createChannelOp(OpBuilder builder, ModuleOp module, std::string cname, auto channel_op = air::ChannelOp::create( builder, loc, cname, builder.getI64ArrayAttr(channel_bundle_sizes), - builder.getStringAttr("dma_stream")); + builder.getStringAttr("npu_dma_stream")); builder.restoreInsertionPoint(insertionCheckpoint); @@ -1606,10 +1606,10 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase { // mmio channels are runtime-sequence MMIO writes, not shim DMA, so // they neither contribute to per-column shim pressure nor are // eligible for dma_packet upgrade. - if (chanOp.getChannelType() == "mmio") + if (chanOp.getChannelType() == "npu_mmio") continue; - bool isAlreadyPacket = (chanOp.getChannelType() == "dma_packet"); + bool isAlreadyPacket = (chanOp.getChannelType() == "npu_dma_packet"); auto channelName = chanOp.getSymName(); // Check if this channel has a herd-side endpoint in this segment. 
@@ -1722,7 +1722,7 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase { << pressure << " exceeds shim DMA limit of " << shimChannelsPerCol << ")"; for (auto chanOp : channels) { - chanOp.setChannelType(StringAttr::get(context, "dma_packet")); + chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet")); } }; @@ -1733,9 +1733,9 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase { << inputChannels.size() + outputChannels.size() << " shim-bound channels to dma_packet"; for (auto chanOp : inputChannels) - chanOp.setChannelType(StringAttr::get(context, "dma_packet")); + chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet")); for (auto chanOp : outputChannels) - chanOp.setChannelType(StringAttr::get(context, "dma_packet")); + chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet")); return; } diff --git a/mlir/lib/Transform/AIRHerdPlacementPass.cpp b/mlir/lib/Transform/AIRHerdPlacementPass.cpp index 87b181483..ecc30eefa 100644 --- a/mlir/lib/Transform/AIRHerdPlacementPass.cpp +++ b/mlir/lib/Transform/AIRHerdPlacementPass.cpp @@ -374,7 +374,7 @@ class AIRHerdPlacementPass // Collect cascade channel declarations std::map cascadeChannels; module.walk([&](air::ChannelOp channelOp) { - if (channelOp.getChannelType() == "cascade") { + if (channelOp.getChannelType() == "npu_cascade") { cascadeChannels[channelOp.getSymName()] = channelOp; } }); diff --git a/mlir/lib/Transform/AIRLinalgCodegen.cpp b/mlir/lib/Transform/AIRLinalgCodegen.cpp index e9d8d5b45..3236d8f17 100644 --- a/mlir/lib/Transform/AIRLinalgCodegen.cpp +++ b/mlir/lib/Transform/AIRLinalgCodegen.cpp @@ -1205,7 +1205,7 @@ FailureOr static pipelineReduceLinalgOp( auto cname = createChannelName(module); b.setInsertionPointToStart(module.getBody()); auto channel_op = air::ChannelOp::create( - b, loc, cname, b.getI64ArrayAttr({1}), b.getStringAttr("dma_stream")); + b, loc, cname, b.getI64ArrayAttr({1}), b.getStringAttr("npu_dma_stream")); 
b.setInsertionPoint(stageBlock->getTerminator()); SmallVector src_offsets; SmallVector src_sizes; diff --git a/mlir/lib/Transform/AIRMiscPasses.cpp b/mlir/lib/Transform/AIRMiscPasses.cpp index 814927de6..77f26ac98 100644 --- a/mlir/lib/Transform/AIRMiscPasses.cpp +++ b/mlir/lib/Transform/AIRMiscPasses.cpp @@ -1147,7 +1147,7 @@ static bool segmentUsesCascade(air::HerdOp herd) { auto result = container->walk([&](air::ChannelInterface chanOp) { auto channelDecl = air::getChannelDeclarationThroughSymbol(chanOp); - if (channelDecl && channelDecl.getChannelType() == "cascade") + if (channelDecl && channelDecl.getChannelType() == "npu_cascade") return WalkResult::interrupt(); return WalkResult::advance(); }); diff --git a/mlir/lib/Util/Util.cpp b/mlir/lib/Util/Util.cpp index ac690ffcc..07aed55da 100644 --- a/mlir/lib/Util/Util.cpp +++ b/mlir/lib/Util/Util.cpp @@ -522,7 +522,7 @@ FailureOr air::getChannelType(air::MemcpyInterface memcpyIfOp) { auto chanIfOp = dyn_cast_if_present(memcpyIfOp.getOperation()); if (!chanIfOp) - return StringRef("dma_stream"); + return StringRef("npu_dma_stream"); auto chanOp = getChannelDeclarationThroughSymbol(chanIfOp); if (chanOp) { return chanOp.getChannelType(); diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir index 040456b5e..cc0b248e9 100644 --- a/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir +++ b/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir @@ -5,7 +5,7 @@ // //===----------------------------------------------------------------------===// -// Positive tests for channel_type="mmio" in air-to-aie. Each split has +// Positive tests for channel_type="npu_mmio" in air-to-aie. Each split has // its own CHECK prefix so directives don't leak across boundaries. // Negative cases live in `air_channel_mmio_invalid.mlir`. 
// @@ -34,7 +34,7 @@ // CHECK-SIMPLE-NOT: aiex.npu.blockwrite memref.global "private" @const_data : memref<8xi32> = dense<42> -air.channel @mmio_chan [] {channel_type = "mmio"} +air.channel @mmio_chan [] {channel_type = "npu_mmio"} func.func @mmio_simple() { %src = memref.get_global @const_data : memref<8xi32> %c1 = arith.constant 1 : index @@ -66,8 +66,8 @@ func.func @mmio_simple() { // CHECK-MIXED-NOT: air.channel.get @mmio_chan2 memref.global "private" @mmio_const : memref<8xi32> = dense<7> -air.channel @mmio_chan2 [] {channel_type = "mmio"} -air.channel @dma_chan [] {channel_type = "dma_stream"} +air.channel @mmio_chan2 [] {channel_type = "npu_mmio"} +air.channel @dma_chan [] {channel_type = "npu_dma_stream"} func.func @mixed(%dma_src: memref<16xi32>) { %src = memref.get_global @mmio_const : memref<8xi32> %c1 = arith.constant 1 : index @@ -106,7 +106,7 @@ func.func @mixed(%dma_src: memref<16xi32>) { // CHECK-BCAST-NOT: aiex.npu.blockwrite memref.global "private" @const_q : memref<8xi32> = dense<5> -air.channel @bcast_mmio [1] {channel_type = "mmio", broadcast_shape = [2]} +air.channel @bcast_mmio [1] {channel_type = "npu_mmio", broadcast_shape = [2]} func.func @bcast() { %src = memref.get_global @const_q : memref<8xi32> %c1 = arith.constant 1 : index @@ -146,7 +146,7 @@ func.func @bcast() { memref.global "private" @c0 : memref<8xi32> = dense<10> memref.global "private" @c1 : memref<8xi32> = dense<20> -air.channel @qm [2] {channel_type = "mmio"} +air.channel @qm [2] {channel_type = "npu_mmio"} func.func @indexed() { %g0 = memref.get_global @c0 : memref<8xi32> %g1 = memref.get_global @c1 : memref<8xi32> @@ -180,7 +180,7 @@ func.func @indexed() { // CHECK-BF16-NOT: aiex.npu.blockwrite memref.global "private" @qbf16 : memref<2x2xbf16> = dense<1.5> -air.channel @qbf16_chan [] {channel_type = "mmio"} +air.channel @qbf16_chan [] {channel_type = "npu_mmio"} func.func @bf16_payload() { %src = memref.get_global @qbf16 : memref<2x2xbf16> %c1 = arith.constant 1 : index @@ 
-208,7 +208,7 @@ func.func @bf16_payload() { // CHECK-BF16NS-NOT: aiex.npu.blockwrite memref.global "private" @qbf16ns : memref<2x2xbf16> = dense<[[1.5, 2.5], [3.5, 4.5]]> -air.channel @qbf16ns_chan [] {channel_type = "mmio"} +air.channel @qbf16ns_chan [] {channel_type = "npu_mmio"} func.func @bf16_nonsplat() { %src = memref.get_global @qbf16ns : memref<2x2xbf16> %c1 = arith.constant 1 : index @@ -237,7 +237,7 @@ func.func @bf16_nonsplat() { // CHECK-I8-NOT: aiex.npu.blockwrite memref.global "private" @c8s : memref<4xi8> = dense<66> -air.channel @c8s_chan [] {channel_type = "mmio"} +air.channel @c8s_chan [] {channel_type = "npu_mmio"} func.func @i8_splat() { %src = memref.get_global @c8s : memref<4xi8> %c1 = arith.constant 1 : index diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir index 8a372db9d..df5decf6a 100644 --- a/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir +++ b/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir @@ -5,7 +5,7 @@ // //===----------------------------------------------------------------------===// -// Negative tests for channel_type="mmio". Each split runs under `not` +// Negative tests for channel_type="npu_mmio". Each split runs under `not` // so FileCheck sees only that split's diagnostic. // RUN: not air-opt %s -split-input-file -air-to-aie="row-offset=2 col-offset=0 device=npu1" 2>&1 | FileCheck %s @@ -13,8 +13,8 @@ // The source data is stamped onto the destination L1 buffer's // initial_value, so the put source must be a compile-time constant // memref.global. 
-// CHECK: channel_type="mmio" put requires source memref defined by memref.get_global -air.channel @mmio_nc [] {channel_type = "mmio"} +// CHECK: channel_type="npu_mmio" put requires source memref defined by memref.get_global +air.channel @mmio_nc [] {channel_type = "npu_mmio"} func.func @mmio_nonconst(%h: memref<8xi32>) { %c1 = arith.constant 1 : index air.launch (%lx) in (%sx = %c1) args(%a = %h) : memref<8xi32> { @@ -35,9 +35,9 @@ func.func @mmio_nonconst(%h: memref<8xi32>) { // Non-broadcast mmio with non-constant index can't match any get; // would silently erase the put. Reject up front. -// CHECK: channel_type="mmio" non-broadcast put requires compile-time constant indices +// CHECK: channel_type="npu_mmio" non-broadcast put requires compile-time constant indices memref.global "private" @nci_const : memref<8xi32> = dense<1> -air.channel @nci_chan [1] {channel_type = "mmio"} +air.channel @nci_chan [1] {channel_type = "npu_mmio"} func.func @mmio_nonconst_index(%n: index) { %src = memref.get_global @nci_const : memref<8xi32> %c1 = arith.constant 1 : index @@ -59,9 +59,9 @@ func.func @mmio_nonconst_index(%n: index) { // ----- // Constant-index put with no matching get would be silently erased. -// CHECK: channel_type="mmio" put has no matching device-side air.channel.get +// CHECK: channel_type="npu_mmio" put has no matching device-side air.channel.get memref.global "private" @nm_const : memref<8xi32> = dense<2> -air.channel @nm_chan [2] {channel_type = "mmio"} +air.channel @nm_chan [2] {channel_type = "npu_mmio"} func.func @mmio_no_match() { %src = memref.get_global @nm_const : memref<8xi32> %c1 = arith.constant 1 : index @@ -85,9 +85,9 @@ func.func @mmio_no_match() { // The destination L1 buffer's element type must match the source so the // initializer is type-compatible. 
-// CHECK: channel_type="mmio" source/destination element type mismatch +// CHECK: channel_type="npu_mmio" source/destination element type mismatch memref.global "private" @i32_src : memref<4xi32> = dense<7> -air.channel @typemis_chan [] {channel_type = "mmio"} +air.channel @typemis_chan [] {channel_type = "npu_mmio"} func.func @mmio_type_mismatch() { %src = memref.get_global @i32_src : memref<4xi32> %c1 = arith.constant 1 : index @@ -108,9 +108,9 @@ func.func @mmio_type_mismatch() { // ----- // Source/destination must agree on total element count. -// CHECK: channel_type="mmio" source/destination element count mismatch +// CHECK: channel_type="npu_mmio" source/destination element count mismatch memref.global "private" @short_src : memref<4xi32> = dense<7> -air.channel @sizemis_chan [] {channel_type = "mmio"} +air.channel @sizemis_chan [] {channel_type = "npu_mmio"} func.func @mmio_size_mismatch() { %src = memref.get_global @short_src : memref<4xi32> %c1 = arith.constant 1 : index @@ -132,9 +132,9 @@ func.func @mmio_size_mismatch() { // initial_value is set by the lowering, so the source memref.global // needs a DenseElementsAttr initializer to copy from. 
-// CHECK: channel_type="mmio" source memref.global must have a DenseElementsAttr initializer +// CHECK: channel_type="npu_mmio" source memref.global must have a DenseElementsAttr initializer memref.global "private" @uninit_bf16 : memref<2x2xbf16> -air.channel @uninit_chan [] {channel_type = "mmio"} +air.channel @uninit_chan [] {channel_type = "npu_mmio"} func.func @mmio_uninitialized_global() { %src = memref.get_global @uninit_bf16 : memref<2x2xbf16> %c1 = arith.constant 1 : index diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir index 2a1820874..6fcf2d20e 100644 --- a/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir +++ b/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir @@ -241,7 +241,7 @@ func.func @one_to_two() { #set = affine_set<()[s0] : (s0 - 3 == 0)> #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)> -air.channel @channel_0 [3] {channel_type = "cascade"} +air.channel @channel_0 [3] {channel_type = "npu_cascade"} air.channel @channel_1 [1] air.channel @channel_2 [1] func.func @cascade(%arg0: memref<2048xi32>, %arg1: memref<2048xi32>) { @@ -393,7 +393,7 @@ func.func @cascade(%arg0: memref<2048xi32>, %arg1: memref<2048xi32>) { #set = affine_set<()[s0] : (s0 - 3 == 0)> #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)> module { - air.channel @channel_0 [3] {channel_type = "cascade"} + air.channel @channel_0 [3] {channel_type = "npu_cascade"} air.channel @channel_1 [1] air.channel @channel_2 [1] func.func @cascade2(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) { @@ -496,7 +496,7 @@ module { // Test 2D memref flattening for cascade #set_2d = affine_set<()[s0] : (s0 - 1 == 0)> module { - air.channel @cascade_2d [1] {channel_type = "cascade"} + air.channel @cascade_2d [1] {channel_type = "npu_cascade"} func.func @cascade_2d_flatten(%arg0: memref<32x64xi32>) { %c1 = arith.constant 1 : index %0 = 
air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<32x64xi32> attributes {id = 1 : i32} { @@ -543,7 +543,7 @@ module { // Test 4D memref flattening for cascade #set_4d = affine_set<()[s0] : (s0 - 1 == 0)> module { - air.channel @cascade_4d [1] {channel_type = "cascade"} + air.channel @cascade_4d [1] {channel_type = "npu_cascade"} func.func @cascade_4d_flatten(%arg0: memref<2x4x8x32xi32>) { %c1 = arith.constant 1 : index %0 = air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<2x4x8x32xi32> attributes {id = 1 : i32} { @@ -589,7 +589,7 @@ module { // Test bf16 cascade flattening (different tile size due to element width) #set_bf16 = affine_set<()[s0] : (s0 - 1 == 0)> module { - air.channel @cascade_bf16 [1] {channel_type = "cascade"} + air.channel @cascade_bf16 [1] {channel_type = "npu_cascade"} func.func @cascade_bf16_flatten(%arg0: memref<32x32xbf16>) { %c1 = arith.constant 1 : index %0 = air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<32x32xbf16> attributes {id = 1 : i32} { @@ -684,7 +684,7 @@ module { #set = affine_set<()[s0] : (s0 - 3 == 0)> #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)> module { - air.channel @channel_0 [3] {channel_type = "cascade"} + air.channel @channel_0 [3] {channel_type = "npu_cascade"} air.channel @channel_1 [1] air.channel @channel_2 [1] func.func @cascade3(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) { diff --git a/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir b/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir index b4813e7b4..e5c723abb 100644 --- a/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir +++ b/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir @@ -1596,14 +1596,14 @@ module { #set5 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 3 >= 0, s1 - 2 == 0)> module { air.channel @L3ToL2Chan1 [1, 4] - air.channel @L2ToL1Chan1_0 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"} - 
air.channel @L2ToL1Chan1_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"} - air.channel @L2ToL1Chan1_2 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"} - air.channel @L2ToL1Chan1_3 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"} - air.channel @L2ToL1Chan2_0 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"} - air.channel @L2ToL1Chan2_1 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"} - air.channel @L2ToL1Chan2_2 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"} - air.channel @L2ToL1Chan2_3 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"} + air.channel @L2ToL1Chan1_0 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan1_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan1_2 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan1_3 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan2_0 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan2_1 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan2_2 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} + air.channel @L2ToL1Chan2_3 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} func.func @func20(%arg0: memref<128x64xbf16>, %arg1: memref<64x3072xbf16>, %arg2: memref<3072x64xbf16>, %arg3: memref<128x3072xbf16>, %arg4: memref<128x64xbf16>) { %c1 = arith.constant 1 : index %0 = air.launch async (%arg5, %arg6) in (%arg7=%c1, %arg8=%c1) args(%arg9=%arg0, %arg10=%arg1) : memref<128x64xbf16>, memref<64x3072xbf16> attributes {id = 1 : i32} { @@ -1781,7 +1781,7 @@ module { // RACECONDFIX: @func21 module { - air.channel @L1ToL3Pkt [1, 1] {channel_type = "dma_packet"} + air.channel @L1ToL3Pkt [1, 1] {channel_type = "npu_dma_packet"} func.func @func21(%arg0: 
memref<64xbf16>) { %c1 = arith.constant 1 : index %c0 = arith.constant 0 : index diff --git a/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir b/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir index a34ede4d5..d6c87875e 100644 --- a/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir +++ b/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir @@ -22,7 +22,7 @@ module { air.channel @channel_13 [1, 1] {broadcast_shape = [1, 4]} air.channel @channel_14 [1, 1] {broadcast_shape = [1, 4]} air.channel @channel_15 [1, 1] {broadcast_shape = [1, 4]} - air.channel @channel_2 [4, 1] {channel_type = "dma_packet"} + air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"} func.func @func2(%arg0: memref<512x512xbf16>) { %c2 = arith.constant 2 : index %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg0) : memref<512x512xbf16> attributes {id = 1 : i32} { diff --git a/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir b/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir index c4d24f0e1..ac6af7d8a 100644 --- a/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir +++ b/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir @@ -26,7 +26,7 @@ module { air.channel @channel_13 [1, 1] {broadcast_shape = [1, 4]} air.channel @channel_14 [1, 1] {broadcast_shape = [1, 4]} air.channel @channel_15 [1, 1] {broadcast_shape = [1, 4]} - air.channel @channel_2 [4, 1] {channel_type = "dma_packet"} + air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"} func.func @func2(%arg0: memref<512x512xbf16>) { %c2 = arith.constant 2 : index %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg0) : memref<512x512xbf16> attributes {id = 1 : i32} { diff --git a/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir b/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir index 19e53f61c..abad216f8 100644 --- 
a/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir +++ b/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir @@ -34,8 +34,8 @@ module { // Intra-device channels for L1-L2 communication (packet flow type) - air.channel @chan_intra_a [2, 1] {channel_type = "dma_packet"} - air.channel @chan_intra_b [2, 1] {channel_type = "dma_packet"} + air.channel @chan_intra_a [2, 1] {channel_type = "npu_dma_packet"} + air.channel @chan_intra_b [2, 1] {channel_type = "npu_dma_packet"} func.func @test_packet_flow_id_reset(%arg0: memref<128xbf16>) { %0 = air.launch async () in () args(%input=%arg0) : memref<128xbf16> attributes {id = 1 : i32} { diff --git a/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir b/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir index c6d7870af..e4a78c488 100644 --- a/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir +++ b/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir @@ -25,8 +25,8 @@ module { // Two dma_packet channels from L3 to L1, sharing the same shim column. 
- air.channel @chan_a [1, 1] {channel_type = "dma_packet"} - air.channel @chan_b [1, 1] {channel_type = "dma_packet"} + air.channel @chan_a [1, 1] {channel_type = "npu_dma_packet"} + air.channel @chan_b [1, 1] {channel_type = "npu_dma_packet"} func.func @test_shared_shim_packet_ids(%arg0: memref<64xbf16>, %arg1: memref<64xbf16>) { %0 = air.launch async () in () args(%in0=%arg0, %in1=%arg1) : memref<64xbf16>, memref<64xbf16> attributes {id = 1 : i32} { diff --git a/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir b/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir index 772133841..8e92bb45f 100644 --- a/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir +++ b/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir @@ -23,7 +23,7 @@ // CHECK: @func0 // CHECK: air.channel.put @channel_0[] {{.*}} metadataArray = [{base = "air_channel_0", index = 0 : i32}], packet = #aie.packet_info #map2 = affine_map<(d0) -> (d0)> -air.channel @channel_0 [1, 1] {channel_type = "dma_packet"} +air.channel @channel_0 [1, 1] {channel_type = "npu_dma_packet"} air.channel @channel_1 [1, 1] air.channel @channel_2 [1, 1] air.channel @channel_3 [1, 1] @@ -83,7 +83,7 @@ func.func @func0(%arg0 : memref<64xi32>, %arg1 : memref<64xi32>) -> () { // CHECK: air.channel.put async @channel_0[] {{.*}} metadataArray = [{base = "air_channel_0", index = 0 : i32}], packet = #aie.packet_info #map = affine_map<(d0) -> (d0)> module { - air.channel @channel_0 [1, 1] {channel_type = "dma_packet"} + air.channel @channel_0 [1, 1] {channel_type = "npu_dma_packet"} air.channel @channel_1 [1, 1] air.channel @channel_2 [1, 1] air.channel @channel_3 [1, 1] diff --git a/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir b/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir index 8cd6485ec..cf950a44b 100644 --- a/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir +++ b/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir @@ -24,9 +24,9 @@ // CHECK: aie.shim_dma_allocation 
@air_pkt_in_2({{.*}}, MM2S, 0) module { - air.channel @pkt_in_0 [1, 1] {channel_type = "dma_packet"} - air.channel @pkt_in_1 [1, 1] {channel_type = "dma_packet"} - air.channel @pkt_in_2 [1, 1] {channel_type = "dma_packet"} + air.channel @pkt_in_0 [1, 1] {channel_type = "npu_dma_packet"} + air.channel @pkt_in_1 [1, 1] {channel_type = "npu_dma_packet"} + air.channel @pkt_in_2 [1, 1] {channel_type = "npu_dma_packet"} air.channel @to_core [1, 1] air.channel @from_core [1, 1] air.channel @out [1, 1] diff --git a/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir b/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir index 7fbabd53a..41eb0f9df 100644 --- a/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir +++ b/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir @@ -388,7 +388,7 @@ module { // CHECK: [[$SET0:#set[0-9]*]] = affine_set<()[s0] : (s0 - 3 == 0)> // CHECK: [[$SET1:#set[0-9]+]] = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)> -// CHECK: air.channel @channel_0 [3] {channel_type = "cascade"} +// CHECK: air.channel @channel_0 [3] {channel_type = "npu_cascade"} // CHECK-LABEL: scf_reduce // CHECK: air.herd @herd_0 tile (%[[arg0:.*]], %[[arg1:.*]]) in (%{{.*}}=%c4{{.*}}, %{{.*}}=%c1{{.*}}) // CHECK: %[[alloc_4:.*]] = memref.alloc() : memref<32xi32, 2 : i32> diff --git a/mlir/test/Dialect/AIR/air_canonicalize.mlir b/mlir/test/Dialect/AIR/air_canonicalize.mlir index 723f174e1..9cff8ba7d 100644 --- a/mlir/test/Dialect/AIR/air_canonicalize.mlir +++ b/mlir/test/Dialect/AIR/air_canonicalize.mlir @@ -1094,7 +1094,7 @@ func.func @channel_fold_reinterpret_cast_empty_access(%arg0: memref<*xbf16>) { // Test cascade channel type (should only fold memref.cast, not full composition) -air.channel @channel_cascade [3] {channel_type = "cascade"} +air.channel @channel_cascade [3] {channel_type = "npu_cascade"} // CHECK-LABEL: func.func @channel_cascade_fold_cast_only // CHECK-NOT: %[[CAST:.*]] = memref.cast %{{.*}} : memref<256x256xbf16> to 
memref<256x256xbf16, strided<[256, 1], offset: ?>> diff --git a/mlir/test/Dialect/AIR/air_channel.mlir b/mlir/test/Dialect/AIR/air_channel.mlir index e148282da..cb7a035a7 100644 --- a/mlir/test/Dialect/AIR/air_channel.mlir +++ b/mlir/test/Dialect/AIR/air_channel.mlir @@ -12,7 +12,7 @@ // CHECK: func.func @channel // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @channel_1[{{.*}}, {{.*}}] // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @channel_1[{{.*}}, {{.*}}] -air.channel @channel_1 [2,2] {channel_type = "dma_stream"} +air.channel @channel_1 [2,2] {channel_type = "npu_dma_stream"} func.func @channel() { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -35,7 +35,7 @@ func.func @channel() { // CHECK: func.func @fork // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @bcast[] ({{.*}}[{{.*}},{{.*}}] // CHECK: air.channel.get @bcast[{{.*}}, {{.*}}] ({{.*}}[] [] []) -air.channel @bcast [2,1] {channel_type = "dma_stream"} +air.channel @bcast [2,1] {channel_type = "npu_dma_stream"} func.func @fork() { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -55,7 +55,7 @@ func.func @fork() { // CHECK: func.func @distribute // CHECK: air.channel.put @merge[{{.*}}, {{.*}}] ({{.*}}[ // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @merge[] -air.channel @merge[2,2] {channel_type = "dma_stream"} +air.channel @merge[2,2] {channel_type = "npu_dma_stream"} func.func @distribute() { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index @@ -74,11 +74,11 @@ func.func @distribute() { return } -// CHECK: air.channel @packet_flow [2, 2] {channel_type = "dma_packet"} +// CHECK: air.channel @packet_flow [2, 2] {channel_type = "npu_dma_packet"} // CHECK: func.func @packet_flow_func // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @packet_flow[{{.*}}, {{.*}}] // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @packet_flow[{{.*}}, {{.*}}] -air.channel @packet_flow[2,2] {channel_type = "dma_packet"} +air.channel @packet_flow[2,2] 
{channel_type = "npu_dma_packet"}
 func.func @packet_flow_func() {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
@@ -98,7 +98,7 @@ func.func @packet_flow_func() {
   return
 }

-// CHECK: air.channel @cascade [3] {channel_type = "cascade"}
+// CHECK: air.channel @cascade [3] {channel_type = "npu_cascade"}
 // CHECK: func.func @cascade_func
 // CHECK: affine.if
 // CHECK: air.channel.put @cascade[%{{.*}}]
@@ -110,7 +110,7 @@ func.func @packet_flow_func() {
 // CHECK: air.channel.get @cascade[%{{.*}}]
 #set = affine_set<()[s0] : (s0 == 0)>
 #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
-air.channel @cascade [3] {channel_type = "cascade"}
+air.channel @cascade [3] {channel_type = "npu_cascade"}
 func.func @cascade_func() {
   %c4 = arith.constant 4 : index
   %c1_0 = arith.constant 1 : index
@@ -141,5 +141,12 @@ func.func @cascade_func() {
 // -----

 // CHECK: air.channel @mmio_chan
-// CHECK-SAME: channel_type = "mmio"
-air.channel @mmio_chan [] {channel_type = "mmio"}
+// CHECK-SAME: channel_type = "npu_mmio"
+air.channel @mmio_chan [] {channel_type = "npu_mmio"}
+
+// -----
+
+// Round-trip: gpu_symmetric_heap channel parses and prints correctly.
+// CHECK: air.channel @sym_heap_chan
+// CHECK-SAME: channel_type = "gpu_symmetric_heap"
+air.channel @sym_heap_chan [] {channel_type = "gpu_symmetric_heap"}
diff --git a/mlir/test/Dialect/AIR/air_channel_invalid.mlir b/mlir/test/Dialect/AIR/air_channel_invalid.mlir
index 812af3462..51fb7d041 100644
--- a/mlir/test/Dialect/AIR/air_channel_invalid.mlir
+++ b/mlir/test/Dialect/AIR/air_channel_invalid.mlir
@@ -85,15 +85,15 @@ func.func @channel_get_temporal_for_iv(%m: memref<64xi32>) {
 // -----

 // Test: unsupported channel_type string is rejected by the verifier.
-// expected-error @+1 {{'air.channel' op unsupported channel_type "ddr_stream"; expected one of "dma_stream", "dma_packet", "cascade", or "mmio"}}
+// expected-error @+1 {{'air.channel' op unsupported channel_type "ddr_stream"; expected one of "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or "gpu_symmetric_heap"}}
 air.channel @bad_chan_type [] {channel_type = "ddr_stream"}

 // -----

 // Test: mmio channel put source must be in L3 (memory_space=0).
-air.channel @mmio_bad_put [] {channel_type = "mmio"}
+air.channel @mmio_bad_put [] {channel_type = "npu_mmio"}
 func.func @mmio_put_wrong_memspace(%m: memref<8xi32, 2>) {
-  // expected-error @+1 {{'air.channel.put' op channel_type="mmio" put source must be in L3 (memory_space=0), got memory_space=2}}
+  // expected-error @+1 {{'air.channel.put' op channel_type="npu_mmio" put source must be in L3 (memory_space=0), got memory_space=2}}
   air.channel.put @mmio_bad_put[] (%m[] [] []) : (memref<8xi32, 2>)
   return
 }
@@ -101,9 +101,9 @@ func.func @mmio_put_wrong_memspace(%m: memref<8xi32, 2>) {
 // -----

 // Test: mmio channel get destination must be in L1 (memory_space=2).
-air.channel @mmio_bad_get [] {channel_type = "mmio"}
+air.channel @mmio_bad_get [] {channel_type = "npu_mmio"}
 func.func @mmio_get_wrong_memspace(%m: memref<8xi32, 1>) {
-  // expected-error @+1 {{'air.channel.get' op channel_type="mmio" get destination must be in L1 (memory_space=2), got memory_space=1}}
+  // expected-error @+1 {{'air.channel.get' op channel_type="npu_mmio" get destination must be in L1 (memory_space=2), got memory_space=1}}
   air.channel.get @mmio_bad_get[] (%m[] [] []) : (memref<8xi32, 1>)
   return
 }
@@ -111,9 +111,9 @@ func.func @mmio_get_wrong_memspace(%m: memref<8xi32, 1>) {
 // -----

 // Test: mmio put with L2 source is also rejected (only L3 is allowed).
-air.channel @mmio_bad_put_l2 [] {channel_type = "mmio"}
+air.channel @mmio_bad_put_l2 [] {channel_type = "npu_mmio"}
 func.func @mmio_put_l2(%m: memref<8xi32, 1>) {
-  // expected-error @+1 {{'air.channel.put' op channel_type="mmio" put source must be in L3 (memory_space=0), got memory_space=1}}
+  // expected-error @+1 {{'air.channel.put' op channel_type="npu_mmio" put source must be in L3 (memory_space=0), got memory_space=1}}
   air.channel.put @mmio_bad_put_l2[] (%m[] [] []) : (memref<8xi32, 1>)
   return
 }
@@ -121,9 +121,29 @@ func.func @mmio_put_l2(%m: memref<8xi32, 1>) {
 // -----

 // Test: mmio get with L3 destination is rejected (only L1 is allowed).
-air.channel @mmio_bad_get_l3 [] {channel_type = "mmio"}
+air.channel @mmio_bad_get_l3 [] {channel_type = "npu_mmio"}
 func.func @mmio_get_l3(%m: memref<8xi32>) {
-  // expected-error @+1 {{'air.channel.get' op channel_type="mmio" get destination must be in L1 (memory_space=2), got memory_space=0}}
+  // expected-error @+1 {{'air.channel.get' op channel_type="npu_mmio" get destination must be in L1 (memory_space=2), got memory_space=0}}
   air.channel.get @mmio_bad_get_l3[] (%m[] [] []) : (memref<8xi32>)
   return
 }
+
+// -----
+
+// Test: gpu_symmetric_heap put outside an air.rank scope is rejected.
+air.channel @sym_chan_put [] {channel_type = "gpu_symmetric_heap"}
+func.func @sym_put_no_rank(%m: memref<128xf32>) {
+  // expected-error @+1 {{'air.channel.put' op channel_type="gpu_symmetric_heap" put requires an enclosing air.rank scope}}
+  air.channel.put @sym_chan_put[] (%m[] [] []) : (memref<128xf32>)
+  return
+}
+
+// -----
+
+// Test: gpu_symmetric_heap get outside an air.rank scope is rejected.
+air.channel @sym_chan_get [] {channel_type = "gpu_symmetric_heap"}
+func.func @sym_get_no_rank(%m: memref<128xf32>) {
+  // expected-error @+1 {{'air.channel.get' op channel_type="gpu_symmetric_heap" get requires an enclosing air.rank scope}}
+  air.channel.get @sym_chan_get[] (%m[] [] []) : (memref<128xf32>)
+  return
+}
diff --git a/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir b/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir
new file mode 100644
index 000000000..3ad0138ac
--- /dev/null
+++ b/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir
@@ -0,0 +1,75 @@
+//===- air_cross_rank_dma.mlir ----------------------------------*- MLIR -*-===//
+//
+// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+// SPDX-License-Identifier: MIT
+//
+//===----------------------------------------------------------------------===//
+
+// Round-trip tests for air.dma_memcpy_nd with src_rank/dst_rank attributes
+// and for memref.alloc with the air.symmetric attribute. The cross-rank
+// attributes require an enclosing air.rank scope.
+//
+// RUN: air-opt %s | FileCheck %s
+
+// CHECK-LABEL: func.func @test_dma_with_src_rank
+func.func @test_dma_with_src_rank() {
+  %c2 = arith.constant 2 : index
+  // CHECK: air.rank
+  air.rank (%rx) in (%sx = %c2) {
+    // CHECK: %[[BUF:.*]] = memref.alloc() {air.symmetric} : memref<128xf32>
+    %buf = memref.alloc() {air.symmetric} : memref<128xf32>
+    %local = memref.alloc() : memref<128xf32, 2>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: src_rank = 0
+    air.dma_memcpy_nd (%local[] [] [], %buf[] [] []) {src_rank = 0 : i64}
+      : (memref<128xf32, 2>, memref<128xf32>)
+  }
+  return
+}
+
+// CHECK-LABEL: func.func @test_dma_with_dst_rank
+func.func @test_dma_with_dst_rank() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %buf = memref.alloc() {air.symmetric} : memref<128xf32>
+    %local = memref.alloc() : memref<128xf32, 2>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: dst_rank = 1
+    air.dma_memcpy_nd (%buf[] [] [], %local[] [] []) {dst_rank = 1 : i64}
+      : (memref<128xf32>, memref<128xf32, 2>)
+  }
+  return
+}
+
+// CHECK-LABEL: func.func @test_dma_with_both_ranks
+func.func @test_dma_with_both_ranks() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %src = memref.alloc() {air.symmetric} : memref<128xf32>
+    %dst = memref.alloc() {air.symmetric} : memref<128xf32>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: dst_rank = 1
+    // CHECK-SAME: src_rank = 0
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] [])
+      {src_rank = 0 : i64, dst_rank = 1 : i64}
+      : (memref<128xf32>, memref<128xf32>)
+  }
+  return
+}
+
+// CHECK: air.channel @sym_chan
+// CHECK-SAME: channel_type = "gpu_symmetric_heap"
+air.channel @sym_chan [] {channel_type = "gpu_symmetric_heap"}
+
+// CHECK-LABEL: func.func @test_sym_channel_put_get_in_rank
+func.func @test_sym_channel_put_get_in_rank() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %buf = memref.alloc() : memref<128xf32>
+    // CHECK: air.channel.put @sym_chan
+    air.channel.put @sym_chan[] (%buf[] [] []) : (memref<128xf32>)
+    // CHECK: air.channel.get @sym_chan
+    air.channel.get @sym_chan[] (%buf[] [] []) : (memref<128xf32>)
+  }
+  return
+}
diff --git a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
index 2fba20890..f82d215d3 100644
--- a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
+++ b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
@@ -56,3 +56,53 @@ func.func @channel_get_dst_mismatch(%m: memref<64xi32>) {
   air.channel.get @channel_get_test[] (%m[%c0] [%c64, %c64] [%c1]) : (memref<64xi32>)
   return
 }
+
+// -----
+
+// Test: src_rank requires an enclosing air.rank scope.
+func.func @dma_src_rank_no_enclosing_rank(%dst: memref<128xf32, 2>, %src: memref<128xf32>) {
+  // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank/dst_rank attributes require an enclosing air.rank scope}}
+  air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = 0 : i64}
+    : (memref<128xf32, 2>, memref<128xf32>)
+  return
+}
+
+// -----
+
+// Test: dst_rank requires an enclosing air.rank scope.
+func.func @dma_dst_rank_no_enclosing_rank(%dst: memref<128xf32>, %src: memref<128xf32, 2>) {
+  // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank/dst_rank attributes require an enclosing air.rank scope}}
+  air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = 1 : i64}
+    : (memref<128xf32>, memref<128xf32, 2>)
+  return
+}
+
+// -----
+
+// Test: src_rank requires the source memref.alloc to carry the air.symmetric attribute.
+func.func @dma_src_rank_alloc_not_symmetric() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %src = memref.alloc() : memref<128xf32>
+    %dst = memref.alloc() : memref<128xf32, 2>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op src memref is referenced cross-rank but its memref.alloc lacks the "air.symmetric" attribute}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = 0 : i64}
+      : (memref<128xf32, 2>, memref<128xf32>)
+  }
+  return
+}
+
+// -----
+
+// Test: dst_rank requires the destination memref.alloc to carry the air.symmetric attribute.
+func.func @dma_dst_rank_alloc_not_symmetric() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %dst = memref.alloc() : memref<128xf32>
+    %src = memref.alloc() : memref<128xf32, 2>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op dst memref is referenced cross-rank but its memref.alloc lacks the "air.symmetric" attribute}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = 1 : i64}
+      : (memref<128xf32>, memref<128xf32, 2>)
+  }
+  return
+}
diff --git a/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir b/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
index 11d304bc1..c96fc096b 100644
--- a/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
+++ b/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
@@ -1942,7 +1942,7 @@ module {
 // -----

 // Prevent fusing channels with different channel_type attributes.
-// Channels with "dma_stream" and "dma_packet" should not be fused.
+// Channels with "npu_dma_stream" and "npu_dma_packet" should not be fused.
// CHECK-LABEL: func14 // CHECK: air.launch @@ -1967,8 +1967,8 @@ module { // AGGL1: air.channel.get{{.*}}@channel_packet module { - air.channel @channel_stream [1, 1] {channel_type = "dma_stream"} - air.channel @channel_packet [1, 1] {channel_type = "dma_packet"} + air.channel @channel_stream [1, 1] {channel_type = "npu_dma_stream"} + air.channel @channel_packet [1, 1] {channel_type = "npu_dma_packet"} func.func @func14(){ %c1 = arith.constant 1 : index air.launch (%arg3, %arg4) in (%arg5=%c1, %arg6=%c1) { diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir index 793834127..611ca41cc 100644 --- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir +++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir @@ -12,8 +12,8 @@ // RUN: air-opt %s -air-dma-to-channel | FileCheck %s -// CHECK-COUNT-4: air.channel {{.*}} {channel_type = "dma_packet"} -// CHECK-NOT: channel_type = "dma_packet" +// CHECK-COUNT-4: air.channel {{.*}} {channel_type = "npu_dma_packet"} +// CHECK-NOT: channel_type = "npu_dma_packet" module { func.func @dual_herd_overflow(%arg0: memref<1024xbf16>, diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir index a2df4fb82..123d32179 100644 --- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir +++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir @@ -22,7 +22,7 @@ // CHECK: air.channel @channel_2 [1, 1] {broadcast_shape = [8, 1]} // CHECK: air.channel @channel_3 [1, 1] {broadcast_shape = [8, 1]} // CHECK: air.channel @channel_4 [8, 4] -// CHECK-NOT: channel_type = "dma_packet" +// CHECK-NOT: channel_type = "npu_dma_packet" // CHECK-LABEL: func.func @broadcast_4x8_no_upgrade #set_ty0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 7 >= 0, s1 == 0)> @@ -89,12 
+89,12 @@ module { // Each broadcast channel has broadcast_shape=[2,1]. Pressure: // ceil(6/2) = 3 > 2 (per-column limit). All 6 inputs upgraded to dma_packet. -// CHECK: air.channel @channel_0 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_1 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_3 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_4 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_5 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"} +// CHECK: air.channel @channel_0 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_1 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_3 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_4 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_5 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"} // CHECK-LABEL: func.func @broadcast_6x2_upgrade #set2_ty0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 1 >= 0, s1 == 0)> @@ -179,11 +179,11 @@ module { // 2 non-broadcast inputs + 3 broadcast inputs spanning 4 columns. // Pressure: 2 + ceil(3/4) = 2 + 1 = 3 > 2. All 5 inputs upgraded. 
-// CHECK: air.channel @channel_0 {{.*}} {channel_type = "dma_packet"} -// CHECK: air.channel @channel_1 {{.*}} {channel_type = "dma_packet"} -// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_3 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"} -// CHECK: air.channel @channel_4 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"} +// CHECK: air.channel @channel_0 {{.*}} {channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_1 {{.*}} {channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_3 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_4 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"} // CHECK-LABEL: func.func @mixed_broadcast_upgrade #set3_bcast = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 3 >= 0, s1 == 0)> @@ -262,7 +262,7 @@ module { // CHECK: air.channel @channel_2 [1, 1] {broadcast_shape = [2, 1]} // CHECK: air.channel @channel_3 [1, 1] {broadcast_shape = [2, 1]} // CHECK: air.channel @channel_4 [8, 4] -// CHECK-NOT: channel_type = "dma_packet" +// CHECK-NOT: channel_type = "npu_dma_packet" // CHECK-LABEL: func.func @mixed_spans_no_upgrade #set4_wide0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 7 >= 0, s1 == 0)> @@ -331,9 +331,9 @@ module { // 3 > 2 -> upgrade. Tests that row-only broadcasts aren't incorrectly // discounted. 
-// CHECK: air.channel @channel_0 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"} -// CHECK: air.channel @channel_1 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"} -// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"} +// CHECK: air.channel @channel_0 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_1 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_2 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"} // CHECK-LABEL: func.func @row_broadcast_upgrade #set5_row0 = affine_set<()[s0, s1] : (s0 == 0, s1 >= 0, -s1 + 3 >= 0)> diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir index 631cad453..a0def63eb 100644 --- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir +++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir @@ -10,8 +10,8 @@ // RUN: air-opt %s -air-dma-to-channel 2>&1 | FileCheck %s -// CHECK-COUNT-3: air.channel {{.*}} {channel_type = "dma_packet"} -// CHECK-NOT: channel_type = "dma_packet" +// CHECK-COUNT-3: air.channel {{.*}} {channel_type = "npu_dma_packet"} +// CHECK-NOT: channel_type = "npu_dma_packet" module { func.func @single_herd_overflow(%arg0: memref<1024xbf16>, diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir index 68c8b2854..3a2bfbb19 100644 --- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir +++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir @@ -11,7 +11,7 @@ // RUN: air-opt %s -air-dma-to-channel -split-input-file | FileCheck %s // None of these cases should produce dma_packet channels. 
-// CHECK-NOT: channel_type = "dma_packet" +// CHECK-NOT: channel_type = "npu_dma_packet" // Test 1: Two 1x1 herds with 1 input each = 2 input channels. // Capacity = 2 channels/col * 1 col = 2. 2 <= 2 => no upgrade. diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir index 3d1119d72..2a113b54a 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir @@ -43,8 +43,8 @@ // CHECK: air.herd @conv1x1_skip {{.*}} attributes {{{.*}}x_loc = 1 : i64, y_loc = 3 : i64} module { - air.channel @L1ToL1_Conv3x3AToSkip [1] {channel_type = "cascade"} - air.channel @L1ToL1_Conv3x3BToSkip [1] {channel_type = "cascade"} + air.channel @L1ToL1_Conv3x3AToSkip [1] {channel_type = "npu_cascade"} + air.channel @L1ToL1_Conv3x3BToSkip [1] {channel_type = "npu_cascade"} air.channel @L1ToL1_Conv1ToConv3x3 [1, 1] {broadcast_shape = [2 : index, 1 : index]} air.channel @L2ToL1_ActIn [1, 1] air.channel @L1ToL2_ActOut [1, 1] diff --git a/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir b/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir index 496a05a7e..02d7cbfd1 100644 --- a/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir +++ b/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir @@ -104,7 +104,7 @@ func.func @test2() -> () { // MAXCOL-DAG: %[[CST2:.*]] = arith.constant 2 : index // MAXCOL: air.herd @cascade_herd tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]]) -air.channel @cascade_chan [3] {channel_type = "cascade"} +air.channel @cascade_chan [3] {channel_type = "npu_cascade"} func.func @test_intra_herd_cascade_skip() -> () { %c2 = arith.constant 2 : index air.herd @cascade_herd tile (%x, %y) in (%sx=%c2, %sy=%c2) { @@ -131,7 +131,7 @@ func.func @test_intra_herd_cascade_skip() -> () { // MAXCOL-DAG: %[[CST2:.*]] = arith.constant 2 : index // MAXCOL: air.herd @cascade_producer tile (%{{.*}}, 
%{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]]) -air.channel @inter_cascade [1] {channel_type = "cascade"} +air.channel @inter_cascade [1] {channel_type = "npu_cascade"} func.func @test_inter_herd_cascade_skip() -> () { %c2 = arith.constant 2 : index air.herd @cascade_producer tile (%x, %y) in (%sx=%c2, %sy=%c2) { @@ -157,7 +157,7 @@ func.func @test_inter_herd_cascade_skip() -> () { // MAXCOL: air.herd @no_cascade_herd tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]]) // MAXCOL: air.herd @cascade_herd_put tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]]) -air.channel @seg_cascade [1] {channel_type = "cascade"} +air.channel @seg_cascade [1] {channel_type = "npu_cascade"} func.func @test_non_cascade_herd_in_cascade_segment() -> () { %c1 = arith.constant 1 : index air.launch (%arg0) in (%arg1=%c1) { diff --git a/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir b/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir index f4234f6e2..5cc4f9705 100644 --- a/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir +++ b/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir @@ -2417,17 +2417,17 @@ module { // Verify that channel_type attribute is preserved after L2 memref splitting. // The split pass creates new channel declarations; non-default channel_type -// (e.g., "dma_packet") must be carried over from the original channel. +// (e.g., "npu_dma_packet") must be carried over from the original channel. 
-// CHECK: air.channel @channel_1 [1, 1] {channel_type = "dma_packet"} -// CHECK: air.channel @channel_0 [4, 4] {channel_type = "dma_packet"} -// CHECK: air.channel @channel_2 [4, 1] {channel_type = "dma_packet"} +// CHECK: air.channel @channel_1 [1, 1] {channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_0 [4, 4] {channel_type = "npu_dma_packet"} +// CHECK: air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"} // CHECK-LABEL: func.func @test_preserve_channel_type #map = affine_map<()[s0] -> (s0 * 256)> #map1 = affine_map<()[s0] -> (s0 * 64)> -air.channel @channel_1 [1, 1] {channel_type = "dma_packet"} -air.channel @channel_0 [4, 4] {channel_type = "dma_packet"} +air.channel @channel_1 [1, 1] {channel_type = "npu_dma_packet"} +air.channel @channel_0 [4, 4] {channel_type = "npu_dma_packet"} func.func @test_preserve_channel_type(%arg0: memref<512x1024xbf16>, %arg1: memref<1024x512xbf16>, %arg2: memref<512x512xbf16>) { %c2 = arith.constant 2 : index %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg2) : memref<512x512xbf16> attributes {id = 1 : i32} { diff --git a/programming_examples/cascade_reduction/cascade_reduction.py b/programming_examples/cascade_reduction/cascade_reduction.py index 10cb7ff92..287351736 100644 --- a/programming_examples/cascade_reduction/cascade_reduction.py +++ b/programming_examples/cascade_reduction/cascade_reduction.py @@ -54,7 +54,7 @@ def build_module(): # Channels: chan_in/chan_out use DMA (L3<->L1), chan_cascade uses # direct core-to-core cascade connections between adjacent tiles. 
channel("chan_in", size=[1]) - channel("chan_cascade", size=[NUM_TILES], channel_type="cascade") + channel("chan_cascade", size=[NUM_TILES], channel_type="npu_cascade") channel("chan_out", size=[1]) @FuncOp.from_py_func(l3MemrefTy, l3MemrefTy) diff --git a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py index 2581f1652..805ffa74b 100644 --- a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py +++ b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py @@ -72,7 +72,7 @@ def build_module(): # Cascade channel: per-segment, two independent chains. channel( - "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="cascade" + "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="npu_cascade" ) # Output channel: one per cascade column across all segments. diff --git a/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py b/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py index f009436eb..2725cc521 100644 --- a/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py +++ b/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py @@ -11,7 +11,7 @@ # The compiler automatically: # 1. Converts DMAs to channels (air-dma-to-channel) # 2. Detects that 4 input channels exceed the 2-per-column shim DMA limit -# 3. Upgrades input channels to channel_type="dma_packet" for packet-switched +# 3. 
Upgrades input channels to channel_type="npu_dma_packet" for packet-switched # time-multiplexing of shim DMA MM2S ports import argparse diff --git a/programming_examples/channel_examples/mmio/mmio.py b/programming_examples/channel_examples/mmio/mmio.py index 6cb020df0..bf95e2f97 100644 --- a/programming_examples/channel_examples/mmio/mmio.py +++ b/programming_examples/channel_examples/mmio/mmio.py @@ -1,7 +1,7 @@ # Copyright (C) 2026, Advanced Micro Devices, Inc. # SPDX-License-Identifier: MIT -"""mmio_add: end-to-end demonstration of `channel_type="mmio"`. +"""mmio_add: end-to-end demonstration of `channel_type="npu_mmio"`. A single AIE tile computes `out[i] = a[i] + c[i]` where: * `a` is a host input vector delivered to L1 via shim DMA; @@ -12,7 +12,7 @@ This is the AIR-Python equivalent of the hand-written `/tmp/mmio_bench/aie_q_n*_w*.mlir` benchmark variants. It exercises the -new `channel_type="mmio"` lowering on real NPU2 hardware. +new `channel_type="npu_mmio"` lowering on real NPU2 hardware. """ import argparse @@ -61,7 +61,7 @@ def build_module(): # Channels: `mmio` for the constant, ordinary dma_stream for input/out. 
channel("a_in", size=[1]) # default: dma_stream - channel("c_mmio", size=[1], channel_type="mmio") + channel("c_mmio", size=[1], channel_type="npu_mmio") channel("o_out", size=[1]) @FuncOp.from_py_func(l3_ty, l3_ty) diff --git a/programming_examples/flash_attention/dataflow_based/attn.py b/programming_examples/flash_attention/dataflow_based/attn.py index d31f338c5..6af066d52 100644 --- a/programming_examples/flash_attention/dataflow_based/attn.py +++ b/programming_examples/flash_attention/dataflow_based/attn.py @@ -158,7 +158,7 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat Channel("L1ToL1Chan2", size=[1, num_cascade_stages]) Channel("L1ToL1Chan3", size=[1, num_cascade_stages]) chan_cascade = Channel("cascade", size=[1, num_cascade_stages - 1]) - chan_cascade.attributes["channel_type"] = StringAttr.get("cascade") + chan_cascade.attributes["channel_type"] = StringAttr.get("npu_cascade") # Main attention function @FuncOp.from_py_func( diff --git a/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py b/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py index 5d8423478..d475f6023 100644 --- a/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py +++ b/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py @@ -249,9 +249,9 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat Channel(f"VIn_{s}", size=[num_heads_per_unroll]) # Cascade: 2D per-segment (shared within each segment instance) - channel("cascade_gp", size=[NQ, NS - 1], channel_type="cascade") - channel("cascade_up", size=[NQ, NS - 1], channel_type="cascade") - channel("cascade_sp", size=[NQ, NS - 1], channel_type="cascade") + channel("cascade_gp", size=[NQ, NS - 1], channel_type="npu_cascade") + channel("cascade_up", size=[NQ, NS - 1], channel_type="npu_cascade") + channel("cascade_sp", size=[NQ, NS - 1], channel_type="npu_cascade") # Output: L1-to-L2 gather, then L2-to-L3 
Channel("Gp2L2", size=[NQ, 1]) diff --git a/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py b/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py index f373f979d..dc1d44e41 100644 --- a/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py +++ b/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py @@ -239,9 +239,9 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat Channel(f"VIn_{s}", size=[num_heads_per_unroll]) # Cascade: 2D per-segment (shared within each segment instance) - channel("cascade_gp", size=[NQ, NS - 1], channel_type="cascade") - channel("cascade_up", size=[NQ, NS - 1], channel_type="cascade") - channel("cascade_sp", size=[NQ, NS - 1], channel_type="cascade") + channel("cascade_gp", size=[NQ, NS - 1], channel_type="npu_cascade") + channel("cascade_up", size=[NQ, NS - 1], channel_type="npu_cascade") + channel("cascade_sp", size=[NQ, NS - 1], channel_type="npu_cascade") # Output: L1-to-L2 gather, then L2-to-L3 Channel("Gp2L2", size=[NQ, 1]) diff --git a/programming_examples/herd_dataflow/air.mlir b/programming_examples/herd_dataflow/air.mlir index 249087c90..e92828b32 100644 --- a/programming_examples/herd_dataflow/air.mlir +++ b/programming_examples/herd_dataflow/air.mlir @@ -5,10 +5,10 @@ module { func.func private @add_3_bf16(memref<64x64xbf16, 2 : i32>, memref<64x64xbf16, 2 : i32>) attributes {link_with = "extern_func.o", llvm.emit_c_interface} // AIR channels model hardware FIFOs for inter-stage communication - air.channel @L2ToL1Chan1 [4, 1] // L2 to L1, input A; default channel_type is "dma_stream", representing data movement performed using DMA streaming interconnects + air.channel @L2ToL1Chan1 [4, 1] // L2 to L1, input A; default channel_type is "npu_dma_stream", representing data movement performed using DMA streaming interconnects air.channel @L2ToL1Chan2 [4, 1] // L2 to L1, input B air.channel @L1ToL1Chan1 [4, 1] // Between herd_0 and 
herd_1 - air.channel @L1ToL1Chan2 [4, 1] {channel_type = "cascade"} // Between herd_1 and herd_2; channel_type="cascade" means this channel is expected to map to cascade connections (peer-to-peer communication between compute tiles) + air.channel @L1ToL1Chan2 [4, 1] {channel_type = "npu_cascade"} // Between herd_1 and herd_2; channel_type="npu_cascade" means this channel is expected to map to cascade connections (peer-to-peer communication between compute tiles) air.channel @L1ToL2Chan1 [4, 1] // Output from herd_2 to L2 // Top-level function: runtime dispatch over a 4x1 iteration space (not necessarily hardware parallelism) diff --git a/programming_examples/herd_dataflow/run.py b/programming_examples/herd_dataflow/run.py index 177813e35..9e5b66120 100644 --- a/programming_examples/herd_dataflow/run.py +++ b/programming_examples/herd_dataflow/run.py @@ -102,13 +102,13 @@ def build_module(M_SIZE, N_SIZE): add_3_func.attributes["link_with"] = StringAttr.get("extern_func.o") # AIR channels model hardware FIFOs for inter-stage communication - # Default channel_type is "dma_stream", representing data movement performed using DMA streaming interconnects + # Default channel_type is "npu_dma_stream", representing data movement performed using DMA streaming interconnects channel("L2ToL1Chan1", size=[NUM_COLUMNS, 1]) # L2 to L1, input A channel("L2ToL1Chan2", size=[NUM_COLUMNS, 1]) # L2 to L1, input B channel("L1ToL1Chan1", size=[NUM_COLUMNS, 1]) # Between herd_0 and herd_1 channel( - "L1ToL1Chan2", size=[NUM_COLUMNS, 1], channel_type="cascade" - ) # Between herd_1 and herd_2; channel_type="cascade" means peer-to-peer communication between compute tiles + "L1ToL1Chan2", size=[NUM_COLUMNS, 1], channel_type="npu_cascade" + ) # Between herd_1 and herd_2; channel_type="npu_cascade" means peer-to-peer communication between compute tiles channel("L1ToL2Chan1", size=[NUM_COLUMNS, 1]) # Output from herd_2 to L2 @FuncOp.from_py_func(memrefMxN, memrefMxN, memrefMxN) diff --git 
a/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py b/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py index 6d60d5970..c35aadcc1 100644 --- a/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py +++ b/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py @@ -192,7 +192,7 @@ def build_module( # Cascade channel: per-column cascade links. # n_cascade tiles per column → n_cascade-1 links per column. - channel("chan_cascade", size=[herd_cols, n_cascade - 1], channel_type="cascade") + channel("chan_cascade", size=[herd_cols, n_cascade - 1], channel_type="npu_cascade") @FuncOp.from_py_func(memrefTyA, memrefTyB, memrefTyC) def matvec_cascade_bf16(arg0, arg1, arg2): From cb3a7a5199b1eeb027f599cd01edcecd58e69762 Mon Sep 17 00:00:00 2001 From: Erwei Wang Date: Wed, 6 May 2026 00:55:47 +0000 Subject: [PATCH 2/6] [multi-gpu] Phase 1: fix CI clang-format / black formatting Apply clang-format-17 reflow to three .cpp files (text-string wrapping across the renamed channel_type values "npu_mmio" / "npu_cascade" / "npu_dma_stream") and black reformat to one .py file (npu_cascade arg list now exceeds the line limit). These were reported by the lintAndFormat workflow on PR #1576; this commit folds them into Phase 1 so the diff CI saw is what's now in tree. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- mlir/lib/Conversion/AIRToAIEPass.cpp | 20 ++++++++++--------- mlir/lib/Dialect/AIR/IR/AIRDialect.cpp | 4 ++-- mlir/lib/Transform/AIRLinalgCodegen.cpp | 5 +++-- .../channel_3d_segment_unroll.py | 4 +++- 4 files changed, 19 insertions(+), 14 deletions(-) diff --git a/mlir/lib/Conversion/AIRToAIEPass.cpp b/mlir/lib/Conversion/AIRToAIEPass.cpp index 6591e63f6..6ef9956e9 100644 --- a/mlir/lib/Conversion/AIRToAIEPass.cpp +++ b/mlir/lib/Conversion/AIRToAIEPass.cpp @@ -5871,15 +5871,15 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { auto bufMemTy = bufferOp.getType(); auto srcMemTy = cast(getGlobalOp.getType()); if (bufMemTy.getElementType() != srcMemTy.getElementType()) - return get.emitOpError( - "channel_type=\"npu_mmio\" source/destination element type " - "mismatch (source: ") + return get.emitOpError("channel_type=\"npu_mmio\" " + "source/destination element type " + "mismatch (source: ") << srcMemTy.getElementType() << ", destination: " << bufMemTy.getElementType() << ")"; if (bufMemTy.getNumElements() != srcMemTy.getNumElements()) - return get.emitOpError( - "channel_type=\"npu_mmio\" source/destination element count " - "mismatch (source: ") + return get.emitOpError("channel_type=\"npu_mmio\" " + "source/destination element count " + "mismatch (source: ") << srcMemTy.getNumElements() << ", destination: " << bufMemTy.getNumElements() << ")"; @@ -5891,14 +5891,16 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase { if (auto existing = bufferOp.getInitialValue()) return bufferOp.emitOpError( - "channel_type=\"npu_mmio\" destination aie.buffer already has an " + "channel_type=\"npu_mmio\" destination aie.buffer already has " + "an " "initial_value; cannot stamp two sources into one buffer"); bufferOp.setInitialValueAttr(reshapedInit); ++matchCount; } if (matchCount == 0) - return put.emitOpError("channel_type=\"npu_mmio\" put has no matching " - "device-side air.channel.get"); + return put.emitOpError( + 
"channel_type=\"npu_mmio\" put has no matching " + "device-side air.channel.get"); } // Erase all mmio puts (host-side ones have been replaced with diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp index 2fc448a0a..edb8af6a8 100644 --- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp +++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp @@ -3105,8 +3105,8 @@ static LogicalResult ComposeMemrefOpOnChannelOp(OpT op, if (!chan) // If the channel declaration cannot be resolved, signal a failure. return failure(); - // If the channel is of type "npu_cascade", try to fold memref.cast but skip full - // composition + // If the channel is of type "npu_cascade", try to fold memref.cast but skip + // full composition if (chan.getChannelType() == "npu_cascade") return FoldMemrefCastOnChannelOp(op, rewriter); diff --git a/mlir/lib/Transform/AIRLinalgCodegen.cpp b/mlir/lib/Transform/AIRLinalgCodegen.cpp index 3236d8f17..f51fb116b 100644 --- a/mlir/lib/Transform/AIRLinalgCodegen.cpp +++ b/mlir/lib/Transform/AIRLinalgCodegen.cpp @@ -1204,8 +1204,9 @@ FailureOr static pipelineReduceLinalgOp( auto module = op->getParentOfType(); auto cname = createChannelName(module); b.setInsertionPointToStart(module.getBody()); - auto channel_op = air::ChannelOp::create( - b, loc, cname, b.getI64ArrayAttr({1}), b.getStringAttr("npu_dma_stream")); + auto channel_op = + air::ChannelOp::create(b, loc, cname, b.getI64ArrayAttr({1}), + b.getStringAttr("npu_dma_stream")); b.setInsertionPoint(stageBlock->getTerminator()); SmallVector src_offsets; SmallVector src_sizes; diff --git a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py index 805ffa74b..68440ba53 100644 --- a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py +++ 
b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py @@ -72,7 +72,9 @@ def build_module(): # Cascade channel: per-segment, two independent chains. channel( - "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="npu_cascade" + "chan_cascade", + size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], + channel_type="npu_cascade", ) # Output channel: one per cascade column across all segments. From 44d743d0449527b2c1572eec2c949c9ab51f121e Mon Sep 17 00:00:00 2001 From: Erwei Wang Date: Wed, 6 May 2026 04:24:18 +0000 Subject: [PATCH 3/6] [multi-gpu] Phase 1: address Copilot review comments MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six Copilot comments on PR #1576: 1. AIRToAIESchedulingUtils.cpp: four diagnostic strings still said "dma_stream / dma_packet" after the rename to "npu_dma_stream / npu_dma_packet". Updated. 2. docs/AIRComputeModel.md (cross-rank DMA, §2.4): said the GPU backend lowers src_rank/dst_rank, contradicting the summary table that calls it "planned". Reworded as "planned: air-cross-rank-dma-to-mgpu" to match. 3. docs/AIRComputeModel.md (air.symmetric, §2.7): same inconsistency for mgpuSymmetricAlloc routing. Reworded as "planned: air-symmetric-alloc-to-mgpu". 4. AIR.td (DmaMemcpyNdOp description): same inconsistency. Reworded. 5. AIR.td (gpu_symmetric_heap channel_type description): claimed "Lowered by air-to-rocdl to thread-cooperative loops..." with no such lowering yet in tree. Reworded as "planned: air-gpu-channel-to-mgpu". 6. AIRDialect.cpp DmaMemcpyNdOp::verify: rank indices are non-negative. Added explicit `>= 0` check, plus matching verifier-negative tests in air_memcpy_invalid.mlir for both src_rank=-1 and dst_rank=-3. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/AIRComputeModel.md | 16 +++++----- mlir/include/air/Dialect/AIR/AIR.td | 11 ++++--- .../Conversion/AIRToAIESchedulingUtils.cpp | 18 ++++++----- mlir/lib/Dialect/AIR/IR/AIRDialect.cpp | 8 +++++ mlir/test/Dialect/AIR/air_memcpy_invalid.mlir | 30 +++++++++++++++++++ 5 files changed, 64 insertions(+), 19 deletions(-) diff --git a/docs/AIRComputeModel.md b/docs/AIRComputeModel.md index a9688776e..2c9f7b73f 100644 --- a/docs/AIRComputeModel.md +++ b/docs/AIRComputeModel.md @@ -676,9 +676,10 @@ Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the enclosing `air.rank` scope. When present, the corresponding memref is interpreted as living on rank R's symmetric heap rather than on the local process. The verifier requires the op to be enclosed by an `air.rank` and the -referenced memref to be `air.symmetric`-tagged (see §2.7). The GPU backend -(`air-to-rocdl`) lowers cross-rank DMAs through `mgpuGetHeapBases()`-based -peer addressing; the NPU backend does not support these attributes. +referenced memref to be `air.symmetric`-tagged (see §2.7). Lowering for the +GPU backend (planned: `air-cross-rank-dma-to-mgpu`) will expand these into +`mgpuGetHeapBases()`-based peer-VA arithmetic + `mgpuMemcpy`; the NPU +backend does not support these attributes. ``` // Read 1024 floats from rank 0's symmetric buffer into local L1. @@ -830,10 +831,11 @@ ranks' symmetric memrefs at the same logical offset. %buf = memref.alloc() {air.symmetric} : memref<1024xf32> ``` -The GPU lowering routes such allocations through `mgpuSymmetricAlloc` -(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`. -Peer ranks' base pointers are obtained via `mgpuGetHeapBases()`. The NPU -backend does not interpret this attribute. +The GPU lowering (planned: `air-symmetric-alloc-to-mgpu`) will route such +allocations through `mgpuSymmetricAlloc` (`runtime_lib/airgpu/gpu_runtime.cpp`) +instead of plain `mgpuMemAlloc`. 
Peer ranks' base pointers are obtained at +runtime via `mgpuGetHeapBases()`. The NPU backend does not interpret this +attribute. --- diff --git a/mlir/include/air/Dialect/AIR/AIR.td b/mlir/include/air/Dialect/AIR/AIR.td index 475ab8bf7..19575bb3e 100644 --- a/mlir/include/air/Dialect/AIR/AIR.td +++ b/mlir/include/air/Dialect/AIR/AIR.td @@ -495,8 +495,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd", enclosing `air.rank` scope. When present, the corresponding memref is interpreted as living on rank R's symmetric heap rather than on the local process. These attributes are only valid for `air.symmetric`-tagged memref - allocations and require an enclosing `air.rank`. They are currently only - supported by the GPU lowering (`air-to-rocdl`). + allocations and require an enclosing `air.rank`. Lowering for these + attributes will be added by a future GPU pass (planned: `air-cross-rank- + dma-to-mgpu`); this PR introduces only the IR surface and verifier rules. }]; let extraClassDeclaration = [{ Value getSrcMemref() { return getSrc(); } @@ -607,8 +608,10 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>, Cross-GPU messaging through the symmetric heap runtime (`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). The channel must be enclosed by an `air.rank` op; the put/get sites use rank indices to address peer - heaps. Lowered by `air-to-rocdl` to thread-cooperative loops over peer-mapped - VMem buffers, with synchronization via in-heap notify flags or `mgpuBarrier`. + heaps. Lowering will be added by a future GPU pass (planned: + `air-gpu-channel-to-mgpu`) which expands put/get to peer-mapped + `mgpuMemcpy` calls plus a barrier; this PR introduces only the IR + surface and verifier rules. 
### Broadcasting If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute diff --git a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp index 8d429249b..617bf89ef 100644 --- a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp +++ b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp @@ -1585,8 +1585,9 @@ LogicalResult air::simpleDMAChannelAllocation( // (aie.flow) nor dma packet flow (aie.packet_flow). if (f.memcpyResourceType != "npu_dma_stream" && f.memcpyResourceType != "npu_dma_packet") - return memcpyOpIf->emitOpError("only supports dma_stream or " - "dma_packet connections at L2 memory"); + return memcpyOpIf->emitOpError( + "only supports npu_dma_stream or npu_dma_packet " + "connections at L2 memory"); auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf); if (failed(alloc_res) || !alloc_res->valid()) return failure(); @@ -1602,8 +1603,8 @@ LogicalResult air::simpleDMAChannelAllocation( if (f.memcpyResourceType != "npu_dma_stream" && f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError( - "only supports dma_stream or dma_packet connections at L2 " - "memory"); + "only supports npu_dma_stream or npu_dma_packet " + "connections at L2 memory"); auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf); if (failed(alloc_res) || !alloc_res->valid()) return failure(); @@ -1625,8 +1626,8 @@ LogicalResult air::simpleDMAChannelAllocation( if (f.memcpyResourceType != "npu_dma_stream" && f.memcpyResourceType != "npu_dma_packet") return memcpyOpIf->emitOpError( - "only supports dma_stream or dma_packet connections at L3 " - "memory"); + "only supports npu_dma_stream or npu_dma_packet " + "connections at L3 memory"); if (!f.S2MM_alloc[i].getDmaTile()) return memcpyOpIf->emitOpError( "failed to get S2MM tile for L3 allocation."); @@ -1652,8 +1653,9 @@ LogicalResult air::simpleDMAChannelAllocation( // (aie.flow) nor dma packet flow (aie.packet_flow). 
if (f.memcpyResourceType != "npu_dma_stream" && f.memcpyResourceType != "npu_dma_packet") - return memcpyOpIf->emitOpError("only supports dma_stream or " - "dma_packet connections at L3 memory"); + return memcpyOpIf->emitOpError( + "only supports npu_dma_stream or npu_dma_packet " + "connections at L3 memory"); if (!f.MM2S_alloc.getDmaTile()) return memcpyOpIf->emitOpError( "failed to get MM2S tile for L3 allocation."); diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp index edb8af6a8..cce9a9294 100644 --- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp +++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp @@ -2823,6 +2823,14 @@ LogicalResult air::DmaMemcpyNdOp::verify() { return emitOpError("src_rank/dst_rank attributes require an enclosing " "air.rank scope"); + // Rank indices are non-negative. + if (auto sr = getSrcRank()) + if (*sr < 0) + return emitOpError() << "src_rank must be >= 0, got " << *sr; + if (auto dr = getDstRank()) + if (*dr < 0) + return emitOpError() << "dst_rank must be >= 0, got " << *dr; + auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult { auto alloc = v.getDefiningOp(); if (!alloc) diff --git a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir index f82d215d3..230a2d2c9 100644 --- a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir +++ b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir @@ -106,3 +106,33 @@ func.func @dma_dst_rank_alloc_not_symmetric() { } return } + +// ----- + +// Test: src_rank must be non-negative. 
+func.func @dma_src_rank_negative() { + %c2 = arith.constant 2 : index + air.rank (%rx) in (%sx = %c2) { + %dst = memref.alloc() : memref<128xf32, 2> + %src = memref.alloc() {air.symmetric} : memref<128xf32> + // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank must be >= 0, got -1}} + air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = -1 : i64} + : (memref<128xf32, 2>, memref<128xf32>) + } + return +} + +// ----- + +// Test: dst_rank must be non-negative. +func.func @dma_dst_rank_negative() { + %c2 = arith.constant 2 : index + air.rank (%rx) in (%sx = %c2) { + %dst = memref.alloc() {air.symmetric} : memref<128xf32> + %src = memref.alloc() : memref<128xf32, 2> + // expected-error @+1 {{'air.dma_memcpy_nd' op dst_rank must be >= 0, got -3}} + air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = -3 : i64} + : (memref<128xf32>, memref<128xf32, 2>) + } + return +} From 45e543d41779b3927b87c9b64932ca1b93bc55b6 Mon Sep 17 00:00:00 2001 From: Erwei Wang Date: Wed, 6 May 2026 04:38:43 +0000 Subject: [PATCH 4/6] [multi-gpu] Phase 1: fix negative-rank verifier check MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous commit (888bcaa7) added a `>= 0` verifier on src_rank / dst_rank, but used `getSrcRank()` / `getDstRank()`; those return `std::optional<uint64_t>` (a TableGen quirk for `OptionalAttr`), so `*sr < 0` on the unsigned value is always false and the check never fired. The two new verifier-negative tests in air_memcpy_invalid.mlir silently regressed. Switch to the typed `getSrcRankAttr()` / `getDstRankAttr()` accessors, which return `IntegerAttr`, then call `.getInt()` for a real `int64_t`. The check now fires on negative values; both negative-rank tests pass under `lit -sv ../../mlir/test/Dialect/AIR` (21/21). 
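For reviewers, a minimal standalone sketch of the pitfall described above. It is not code from this tree: `rankAsUnsigned`, `negativeViaUnsigned`, and `negativeViaSigned` are hypothetical names, and the first one only assumes that the generated getter exposes the stored signed 64-bit value as an unsigned one.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a generated getter that hands back the stored
// signed 64-bit attribute value reinterpreted as unsigned.
std::optional<uint64_t> rankAsUnsigned(int64_t stored) {
  return static_cast<uint64_t>(stored);
}

// Mirrors the broken check: an unsigned value is never below zero, so this
// returns false for every input (compilers may warn about the comparison).
bool negativeViaUnsigned(int64_t stored) {
  std::optional<uint64_t> r = rankAsUnsigned(stored);
  return r.has_value() && *r < 0;
}

// Mirrors the fix: compare the signed value directly.
bool negativeViaSigned(int64_t stored) { return stored < 0; }
```

With a stored value of -1, the unsigned variant sees a very large positive number and accepts it, while the signed variant rejects it; this is the difference the switch to `.getInt()` relies on.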
Co-Authored-By: Claude Opus 4.7 (1M context) --- mlir/lib/Dialect/AIR/IR/AIRDialect.cpp | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp index cce9a9294..720d09a7f 100644 --- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp +++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp @@ -2823,13 +2823,20 @@ LogicalResult air::DmaMemcpyNdOp::verify() { return emitOpError("src_rank/dst_rank attributes require an enclosing " "air.rank scope"); - // Rank indices are non-negative. - if (auto sr = getSrcRank()) - if (*sr < 0) - return emitOpError() << "src_rank must be >= 0, got " << *sr; - if (auto dr = getDstRank()) - if (*dr < 0) - return emitOpError() << "dst_rank must be >= 0, got " << *dr; + // Rank indices are non-negative. Use the typed *Attr accessor instead + // of the generated getSrcRank()/getDstRank() (those return uint64_t + // for OptionalAttr, so a comparison against 0 is meaningless + // for negative values stored as i64). + if (auto srAttr = getSrcRankAttr()) { + int64_t sr = srAttr.getInt(); + if (sr < 0) + return emitOpError() << "src_rank must be >= 0, got " << sr; + } + if (auto drAttr = getDstRankAttr()) { + int64_t dr = drAttr.getInt(); + if (dr < 0) + return emitOpError() << "dst_rank must be >= 0, got " << dr; + } auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult { auto alloc = v.getDefiningOp(); From 965f853223bbcced478dc2d0cae4e77e948eaa0a Mon Sep 17 00:00:00 2001 From: Erwei Wang Date: Wed, 6 May 2026 04:48:33 +0000 Subject: [PATCH 5/6] [multi-gpu] Phase 1: rename channel_type in cascade_chain_* tests origin/main grew 5 new herd-placement tests via #1583 that use the pre-rename `channel_type = "cascade"`. After this PR's namespace rename ("cascade" -> "npu_cascade"), those tests fail under air-opt with the verifier rejecting the old name. Update them to "npu_cascade" so they keep passing on top of phase 1. 
Verified on rad-mi300a-sh5-1: AIRHerdPlacement 15/15 pass, Dialect/AIR 21/21 pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../AIRHerdPlacement/cascade_chain_3herd.mlir | 4 ++-- .../AIRHerdPlacement/cascade_chain_3herd_ew.mlir | 4 ++-- .../cascade_chain_3herd_with_l1_upstream.mlir | 4 ++-- .../AIRHerdPlacement/cascade_chain_4herd.mlir | 6 +++--- .../cascade_chain_multi_channel.mlir | 12 ++++++------ 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir index d1f5bb4f5..688e7e0b5 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir @@ -18,8 +18,8 @@ // CHECK: air.herd @consumer {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 2 : i64} module { - air.channel @ab [8, 1] {channel_type = "cascade"} - air.channel @bc [8, 1] {channel_type = "cascade"} + air.channel @ab [8, 1] {channel_type = "npu_cascade"} + air.channel @bc [8, 1] {channel_type = "npu_cascade"} func.func @three_herd_cascade_chain() { %c1 = arith.constant 1 : index diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir index 328e35995..74c1efe9c 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir @@ -17,8 +17,8 @@ // CHECK: air.herd @consumer {{.*}} attributes {{{.*}}x_loc = 2 : i64, y_loc = 2 : i64} module { - air.channel @ab [1, 8] {channel_type = "cascade"} - air.channel @bc [1, 8] {channel_type = "cascade"} + air.channel @ab [1, 8] {channel_type = "npu_cascade"} + air.channel @bc [1, 8] {channel_type = "npu_cascade"} func.func @three_herd_cascade_chain_ew() { %c1 = arith.constant 1 : index diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir 
b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir index 0b1bdac01..db262bd53 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir @@ -22,8 +22,8 @@ module { air.channel @upstream_to_a [1, 1] {broadcast_shape = [8 : index, 1 : index]} - air.channel @ab_q [8, 1] {channel_type = "cascade"} - air.channel @bc_q [8, 1] {channel_type = "cascade"} + air.channel @ab_q [8, 1] {channel_type = "npu_cascade"} + air.channel @bc_q [8, 1] {channel_type = "npu_cascade"} func.func @upstream_then_3_chain() { %c1 = arith.constant 1 : index diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir index e8564c28e..0638f361f 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir @@ -18,9 +18,9 @@ // CHECK: air.herd @h3 {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 0 : i64} module { - air.channel @c01 [8, 1] {channel_type = "cascade"} - air.channel @c12 [8, 1] {channel_type = "cascade"} - air.channel @c23 [8, 1] {channel_type = "cascade"} + air.channel @c01 [8, 1] {channel_type = "npu_cascade"} + air.channel @c12 [8, 1] {channel_type = "npu_cascade"} + air.channel @c23 [8, 1] {channel_type = "npu_cascade"} func.func @four_herd_cascade_chain() { %c1 = arith.constant 1 : index diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir index bb109f7cd..729069bcc 100644 --- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir +++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir @@ -19,12 +19,12 @@ // CHECK: air.herd @consumer {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 2 : i64} module { - air.channel @ab_q [8, 1] {channel_type = 
"cascade"} - air.channel @ab_k [8, 1] {channel_type = "cascade"} - air.channel @ab_v [8, 1] {channel_type = "cascade"} - air.channel @bc_q [8, 1] {channel_type = "cascade"} - air.channel @bc_k [8, 1] {channel_type = "cascade"} - air.channel @bc_v [8, 1] {channel_type = "cascade"} + air.channel @ab_q [8, 1] {channel_type = "npu_cascade"} + air.channel @ab_k [8, 1] {channel_type = "npu_cascade"} + air.channel @ab_v [8, 1] {channel_type = "npu_cascade"} + air.channel @bc_q [8, 1] {channel_type = "npu_cascade"} + air.channel @bc_k [8, 1] {channel_type = "npu_cascade"} + air.channel @bc_v [8, 1] {channel_type = "npu_cascade"} func.func @three_herd_multi_channel() { %c1 = arith.constant 1 : index From cb104a6eaee60f0195efb56443fe4577eadba4b7 Mon Sep 17 00:00:00 2001 From: Erwei Wang Date: Wed, 6 May 2026 05:16:31 +0000 Subject: [PATCH 6/6] [multi-gpu] Phase 1: rename channel_type in 34_cascade_vecadd peano test CI on 'Build and Test with AIE tools on Ryzen AI (amdhx370)' caught one more stale "cascade" reference: test/xrt/34_cascade_vecadd/run_peano.py embeds an inline MLIR string that declared `channel_type = "cascade"`. Update to "npu_cascade" to match the namespace rename. The corresponding run_chess.py variant didn't have this issue. 
Verifier diagnostic from the failing job: 'air.channel' op unsupported channel_type "cascade"; expected one of "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or "gpu_symmetric_heap" Co-Authored-By: Claude Opus 4.7 (1M context) --- test/xrt/34_cascade_vecadd/run_peano.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/xrt/34_cascade_vecadd/run_peano.py b/test/xrt/34_cascade_vecadd/run_peano.py index 3c28274a1..84c9f3070 100644 --- a/test/xrt/34_cascade_vecadd/run_peano.py +++ b/test/xrt/34_cascade_vecadd/run_peano.py @@ -51,7 +51,7 @@ #set = affine_set<()[s0] : (s0 == 3)> #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)> module { - air.channel @channel_0 [3] {channel_type = "cascade"} + air.channel @channel_0 [3] {channel_type = "npu_cascade"} air.channel @channel_1 [1] air.channel @channel_2 [1] func.func @scf1(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) {