From b6c08d925354f2e92758f197be28f3c827cdbc58 Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Sun, 3 May 2026 17:10:16 +0000
Subject: [PATCH 1/6] [multi-gpu] Phase 1: namespace channel_type, add
 cross-rank attrs, doc plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A step toward multi-GPU messaging support, per docs/MultiGPUPlan.md. Pure
IR/dialect changes; no lowering yet.

## channel_type namespace rename (Option 1)

Existing channel_type values gain an `npu_` prefix to make backend scope explicit:
- `dma_stream` → `npu_dma_stream` (default)
- `dma_packet` → `npu_dma_packet`
- `cascade`    → `npu_cascade`
- `mmio`       → `npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes, all
.mlir tests, Python programming examples).

## New channel_type for GPU multi-rank messaging

- `gpu_symmetric_heap`: cross-rank channel through the symmetric heap runtime
  (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites
  to be inside an `air.rank` scope.

## air.dma_memcpy_nd cross-rank addressing

- New optional integer attributes `src_rank` / `dst_rank` name a peer rank in
  the enclosing `air.rank` scope.
- When either attribute is present, the verifier requires:
  - an enclosing `air.rank` scope
  - the peer-side memref's `memref.alloc` (when directly available) to carry
    the `air.symmetric` attribute
- A backward-compatible builder keeps existing call sites compiling unchanged.

## air.symmetric memref attribute

A unit attribute on `memref.alloc` indicating the allocation is backed by the
symmetric heap. Documented in docs/AIRComputeModel.md §2.7.
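
As a rough illustration of how the two pieces fit together, the fragment below
adapts the examples this patch adds to docs/AIRComputeModel.md. It is a sketch,
not a test case: the enclosing `air.rank` op is omitted because its syntax is
not shown here, and `%local` is assumed to be an existing L1 buffer.

```mlir
// Assumed to sit inside an air.rank region (not shown).
// Symmetric-heap allocation, performed identically on every rank.
%sym = memref.alloc() {air.symmetric} : memref<1024xf32>

// Read 1024 floats from rank 0's copy of %sym into the local L1 buffer.
air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
    : (memref<1024xf32, 2>, memref<1024xf32>)
```

Dropping `{air.symmetric}` from the alloc, or moving the copy outside
`air.rank`, is what the new negative tests in air_memcpy_invalid.mlir exercise.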

## Documentation

- New docs/MultiGPUPlan.md: full design and 7-phase implementation plan
- docs/AIRComputeModel.md: §2.4 cross-rank addressing, §2.7 air.symmetric,
  §2.5 channel_type table updated, §5 summary table updated

## Tests

- mlir/test/Dialect/AIR/air_cross_rank_dma.mlir (new): positive round-trip
  for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel
  put/get inside air.rank
- mlir/test/Dialect/AIR/air_channel_invalid.mlir: gpu_symmetric_heap
  put/get outside air.rank rejected; updated unsupported channel_type
  error message
- mlir/test/Dialect/AIR/air_memcpy_invalid.mlir: src_rank/dst_rank
  outside air.rank rejected; missing air.symmetric on alloc rejected

All 21 mlir/test/Dialect/AIR/ tests pass; GPU dma_copy and 4k_4k_mul e2e
tests pass on MI300A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/AIRComputeModel.md                       | 65 +++++++++++--
 mlir/include/air/Dialect/AIR/AIR.td           | 93 ++++++++++++++-----
 mlir/lib/Conversion/AIRToAIEPass.cpp          | 46 ++++-----
 .../Conversion/AIRToAIESchedulingUtils.cpp    | 50 +++++-----
 mlir/lib/Conversion/ConvertToAIRPass.cpp      |  6 +-
 mlir/lib/Dialect/AIR/IR/AIRDialect.cpp        | 91 ++++++++++++++----
 mlir/lib/Transform/AIRDmaToChannel.cpp        | 12 +--
 mlir/lib/Transform/AIRHerdPlacementPass.cpp   |  2 +-
 mlir/lib/Transform/AIRLinalgCodegen.cpp       |  2 +-
 mlir/lib/Transform/AIRMiscPasses.cpp          |  2 +-
 mlir/lib/Util/Util.cpp                        |  2 +-
 .../Conversion/AIRToAIE/air_channel_mmio.mlir | 18 ++--
 .../AIRToAIE/air_channel_mmio_invalid.mlir    | 26 +++---
 .../air_channel_to_locks_core_to_core.mlir    | 12 +--
 .../AIRToAIE/air_shimcpy_to_npu.mlir          | 18 ++--
 .../bad_shim_packet_flow_npu_1col.mlir        |  2 +-
 .../good_shim_packet_flow_npu_4col.mlir       |  2 +-
 .../segment_unroll_packet_flow_ids.mlir       |  4 +-
 .../shared_shim_channel_packet_ids.mlir       |  4 +-
 .../AIRToAIE/shim_packet_flow_npu.mlir        |  4 +-
 .../AIRToAIE/shim_pkt_channel_sharing.mlir    |  6 +-
 .../ConvertToAIR/scf_parallel_to_herd.mlir    |  2 +-
 mlir/test/Dialect/AIR/air_canonicalize.mlir   |  2 +-
 mlir/test/Dialect/AIR/air_channel.mlir        | 25 +++--
 .../test/Dialect/AIR/air_channel_invalid.mlir | 38 ++++++--
 mlir/test/Dialect/AIR/air_cross_rank_dma.mlir | 75 +++++++++++++++
 mlir/test/Dialect/AIR/air_memcpy_invalid.mlir | 50 ++++++++++
 .../fuse_channels.mlir                        |  6 +-
 .../dma_to_channel_auto_packet.mlir           |  4 +-
 .../dma_to_channel_auto_packet_broadcast.mlir | 32 +++----
 ...ma_to_channel_auto_packet_single_herd.mlir |  4 +-
 .../dma_to_channel_no_auto_packet.mlir        |  2 +-
 .../AIRHerdPlacement/cascade_placement.mlir   |  4 +-
 .../AIRMiscPasses/air_collapse_herd.mlir      |  6 +-
 .../AIRMiscPasses/air_split_l2_memref.mlir    | 12 +--
 .../cascade_reduction/cascade_reduction.py    |  2 +-
 .../channel_3d_segment_unroll.py              |  2 +-
 .../dual_herd_packet_switch.py                |  2 +-
 .../channel_examples/mmio/mmio.py             |  6 +-
 .../flash_attention/dataflow_based/attn.py    |  2 +-
 .../kernel_fusion_based/attn_npu1.py          |  6 +-
 .../kernel_fusion_based/attn_npu2.py          |  6 +-
 programming_examples/herd_dataflow/air.mlir   |  4 +-
 programming_examples/herd_dataflow/run.py     |  6 +-
 .../bf16_cascade/matvec_cascade.py            |  2 +-
 45 files changed, 536 insertions(+), 231 deletions(-)
 create mode 100644 mlir/test/Dialect/AIR/air_cross_rank_dma.mlir

diff --git a/docs/AIRComputeModel.md b/docs/AIRComputeModel.md
index 0c45c3ff4..a9688776e 100644
--- a/docs/AIRComputeModel.md
+++ b/docs/AIRComputeModel.md
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
   The compiler may **reshape** the iteration space (e.g., collapse a 2D herd
   into a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
   automatically when the herd body uses cascade channels (`channel_type =
-  "cascade"`), because cascade connections are topology-dependent and cannot
+  "npu_cascade"`), because cascade connections are topology-dependent and cannot
   survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`,
   `x_size`, `y_size`) on the enclosing segment also constrain the legal shapes
   by fixing the tile footprint. The pass accepts a `max-col-size` option to
@@ -670,13 +670,29 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec
 An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref
 is addressed with unit strides.
 
+#### Cross-rank addressing (multi-GPU)
+
+Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+enclosing `air.rank` scope. When one is present, the corresponding memref is
+interpreted as living on the named rank's symmetric heap rather than on the
+local process. The verifier requires an enclosing `air.rank` and, when the
+memref's `memref.alloc` is directly visible, an `air.symmetric` tag on it
+(see §2.7). The GPU backend (`air-to-rocdl`) is planned to lower these via
+`mgpuGetHeapBases()` peer addressing; the NPU backend does not support them.
+
+```
+// Read 1024 floats from rank 0's symmetric buffer into local L1.
+air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
+    : (memref<1024xf32, 2>, memref<1024xf32, 0>)
+```
+
 ---
 
 ### 2.5 `air.channel`, `air.channel.put`, `air.channel.get`
 
 ```
 // Channel declaration — at module scope
-air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
+air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}
 
 // Synchronous put/get — block until the transfer completes
 air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
@@ -696,13 +712,17 @@ them independently and to introduce double-buffering.
 A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on
 `put`/`get` selects the specific channel within the array.
 
-The `channel_type` attribute controls the underlying mechanism:
+The `channel_type` attribute controls the underlying mechanism. Values are
+namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels
+use the `gpu_` prefix.
 
 | Value | Mechanism |
 |-------|-----------|
-| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect |
-| `"dma_packet"` | DMA engines with packet-switched interconnect |
-| `"cascade"` | Core-to-core cascade connections between adjacent tiles |
+| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect |
+| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect |
+| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles |
+| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
+| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. |
 
 The `broadcast_shape` attribute enables one-to-many communication following NumPy
 broadcasting rules.
@@ -796,6 +816,27 @@ in the async dependency graph.
 
 ---
 
+### 2.7 `air.symmetric` memref attribute (multi-GPU)
+
+A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate
+that the allocation should be backed by the **symmetric heap** runtime. Every
+rank in the enclosing `air.rank` scope performs the same allocation in lockstep,
+so each rank has a memref of the same size at the same offset within the heap.
+Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes
+or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer
+ranks' symmetric memrefs at the same logical offset.
+
+```
+%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
+```
+
+The planned GPU lowering routes such allocations through `mgpuSymmetricAlloc`
+(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`, and
+obtains peer ranks' base pointers via `mgpuGetHeapBases()`. The NPU backend
+does not interpret this attribute.
+
+---
+
 ## 3. NPU (AIE) Backend Mapping
 
 On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps
@@ -999,7 +1040,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete
 | L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
 | L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
 | L3 (space 0) | DDR via NOC | HBM via global memory |
-| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops |
-| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) |
-| Synchronization | AIE locks | `gpu.barrier` |
+| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops |
+| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) | n/a | Symmetric heap peer addressing (planned) |
+| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a |
+| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a |
+| `channel` (`npu_cascade`) | Core cascade interface | n/a |
+| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a |
+| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) |
+| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) |
+| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) |
 | `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies |
diff --git a/mlir/include/air/Dialect/AIR/AIR.td b/mlir/include/air/Dialect/AIR/AIR.td
index 4832355c4..475ab8bf7 100644
--- a/mlir/include/air/Dialect/AIR/AIR.td
+++ b/mlir/include/air/Dialect/AIR/AIR.td
@@ -477,7 +477,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
         Variadic<Index>:$src_sizes,
         Variadic<Index>:$src_strides,
         OptionalAttr<DenseI32ArrayAttr>:$pad_before,
-        OptionalAttr<DenseI32ArrayAttr>:$pad_after
+        OptionalAttr<DenseI32ArrayAttr>:$pad_after,
+        OptionalAttr<I64Attr>:$src_rank,
+        OptionalAttr<I64Attr>:$dst_rank
   );
   let results = (outs Optional<air_AsyncToken>:$async_token);
   let assemblyFormat = [{
@@ -487,7 +489,14 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
     `(` type($dst) `,` type($src) `)`
   }];
   let description = [{
-    dma operator
+    N-dimensional strided bulk copy between two memrefs.
+
+    Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+    enclosing `air.rank` scope. When one is present, the corresponding memref
+    is interpreted as living on the named rank's symmetric heap rather than on
+    the local process. These attributes are only valid for `air.symmetric`-tagged
+    memref allocations and require an enclosing `air.rank`. They target the GPU
+    lowering (`air-to-rocdl`) only; the NPU backend does not support them.
   }];
   let extraClassDeclaration = [{
     Value getSrcMemref() { return getSrc(); }
@@ -501,7 +510,31 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
     bool hasPadding() {
       return getPadBefore().has_value();
     }
+    bool hasCrossRank() {
+      return getSrcRank().has_value() || getDstRank().has_value();
+    }
   }];
+  let builders = [
+    // Backward-compatible builder: defaults src_rank/dst_rank to absent.
+    OpBuilder<(ins "::mlir::TypeRange":$resultTypes,
+                   "::mlir::ValueRange":$async_dependencies,
+                   "::mlir::Value":$dst,
+                   "::mlir::ValueRange":$dst_offsets,
+                   "::mlir::ValueRange":$dst_sizes,
+                   "::mlir::ValueRange":$dst_strides,
+                   "::mlir::Value":$src,
+                   "::mlir::ValueRange":$src_offsets,
+                   "::mlir::ValueRange":$src_sizes,
+                   "::mlir::ValueRange":$src_strides,
+                   "::mlir::DenseI32ArrayAttr":$pad_before,
+                   "::mlir::DenseI32ArrayAttr":$pad_after), [{
+      build($_builder, $_state, resultTypes, async_dependencies, dst,
+            dst_offsets, dst_sizes, dst_strides, src,
+            src_offsets, src_sizes, src_strides, pad_before, pad_after,
+            /*src_rank=*/IntegerAttr(),
+            /*dst_rank=*/IntegerAttr());
+    }]>
+  ];
   let hasCanonicalizer = 1;
   let hasVerifier = 1;
 }
@@ -535,7 +568,7 @@ def air_WaitAllOp: air_Op<"wait_all", [air_AsyncOpInterface]> {
 def air_ChannelOp : air_Op<"channel", [Symbol]>,
     Arguments<(ins SymbolNameAttr:$sym_name,
                    DefaultValuedAttr<I64ArrayAttr, "{}">:$size,
-                   DefaultValuedAttr<StrAttr, "\"dma_stream\"">:$channel_type)> {
+                   DefaultValuedAttr<StrAttr, "\"npu_dma_stream\"">:$channel_type)> {
   let assemblyFormat = [{
     $sym_name $size attr-dict
   }];
@@ -543,18 +576,22 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
   let description = [{
     Operation to represent a communication channel as a point-to-point connection between two memrefs.
     The array following the channel name symbol represents the channel's dimensional sizes. Default
-    size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled 
+    size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled
     by the `channel_type` attribute.
 
     ### Channel Types
-    The `channel_type` attribute is a string that determines the mechanism used for data movement:
-    - **"dma_stream"** (default):
+    The `channel_type` attribute is a string that determines the mechanism used for data movement.
+    Values are namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels use
+    the `gpu_` prefix.
+
+    NPU (AIE) channel types:
+    - **"npu_dma_stream"** (default):
       Use DMA engines to send and receive data, with routing performed over a streaming interconnect.
-    - **"dma_packet"**:
+    - **"npu_dma_packet"**:
       Use DMA engines to send and receive data, with routing performed over a packet-switched network.
-    - **"cascade"**:
+    - **"npu_cascade"**:
       Use processor cores to send and receive data via cascade connections between adjacent tiles.
-    - **"mmio"**:
+    - **"npu_mmio"**:
       Use host-side MMIO writes (e.g. `aiex.npu.blockwrite`) issued from the runtime
       sequence to deliver a constant payload directly into a tile-local L1 buffer.
       No DMA channel, no shim allocation, no flow is reserved.
@@ -565,32 +602,44 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
       `memref.get_global`. The consumer-side `get` lowers to a no-op
       because the L1 buffer is already populated when the core begins executing.
 
+    GPU channel types:
+    - **"gpu_symmetric_heap"**:
+      Cross-GPU messaging through the symmetric heap runtime
+      (`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). Every put/get site must be
+      enclosed by an `air.rank` op and uses rank indices to address peer heaps.
+      Planned lowering (`air-to-rocdl`): thread-cooperative loops over peer-mapped
+      VMem buffers, with synchronization via in-heap notify flags or `mgpuBarrier`.
+
     ### Broadcasting
-    If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute  
+    If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
     annotates the output sizes after broadcasting. Broadcasting follows NumPy's broadcasting rules.
 
     Example:
 
     ```mlir
-    // An array of 4 x 4 streaming DMA channels
-    air.channel @channel_0 [4, 4] {channel_type = "dma_stream"}
+    // An array of 4 x 4 streaming DMA channels (NPU)
+    air.channel @channel_0 [4, 4] {channel_type = "npu_dma_stream"}
 
-    // A streaming DMA channel broadcasting to 4 destinations
-    air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_stream"}
+    // A streaming DMA channel broadcasting to 4 destinations (NPU)
+    air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_stream"}
 
-    // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations.
+    // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations (NPU).
     // Broadcasting follows NumPy's rules.
-    air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "dma_stream"}
+    air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "npu_dma_stream"}
 
-    // A packet-switched DMA channel
-    air.channel @channel_3 [] {channel_type = "dma_packet"}
+    // A packet-switched DMA channel (NPU)
+    air.channel @channel_3 [] {channel_type = "npu_dma_packet"}
 
-    // A cascade channel using core-to-core cascade connections
-    air.channel @channel_4 [] {channel_type = "cascade"}
+    // A cascade channel using core-to-core cascade connections (NPU)
+    air.channel @channel_4 [] {channel_type = "npu_cascade"}
 
     // An MMIO channel: the put writes a constant from host into L1 of each
-    // get's destination tile via runtime-sequence blockwrites
-    air.channel @channel_5 [] {channel_type = "mmio"}
+    // get's destination tile via runtime-sequence blockwrites (NPU)
+    air.channel @channel_5 [] {channel_type = "npu_mmio"}
+
+    // A cross-GPU channel through the symmetric heap (GPU). Its put/get sites
+    // must be inside an air.rank scope; their indices encode the peer rank.
+    air.channel @channel_6 [] {channel_type = "gpu_symmetric_heap"}
     ```
   }];
   let extraClassDeclaration = [{
diff --git a/mlir/lib/Conversion/AIRToAIEPass.cpp b/mlir/lib/Conversion/AIRToAIEPass.cpp
index 71842b7f2..6591e63f6 100644
--- a/mlir/lib/Conversion/AIRToAIEPass.cpp
+++ b/mlir/lib/Conversion/AIRToAIEPass.cpp
@@ -2542,7 +2542,7 @@ struct SpecializeChannelBundlePattern
     // host-side puts (they sit outside the device, where this pattern's
     // rewrites don't reach), leaving them to fail later as
     // "no matching device-side air.channel.get".
-    if (channel.getChannelType() == "mmio")
+    if (channel.getChannelType() == "npu_mmio")
       return failure();
 
     std::vector<air::ChannelPutOp> channelPuts =
@@ -4017,7 +4017,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
         bool isShimFlow = f.MM2S_alloc.getDmaTile().isShimNOCorPLTile() ||
                           f.S2MM_alloc[i].getDmaTile().isShimNOCorPLTile();
 
-        if (f.memcpyResourceType == "dma_packet") {
+        if (f.memcpyResourceType == "npu_dma_packet") {
           // Use appropriate flow map based on whether flow involves shim tiles
           if (isShimFlow) {
             // Device-host flows use global shim flow ID
@@ -4059,12 +4059,12 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
             // assignment.
             intraDeviceFlowID = std::max(intraDeviceFlowID, flowID);
           }
-        } else if (f.memcpyResourceType == "dma_stream")
+        } else if (f.memcpyResourceType == "npu_dma_stream")
           getFlowOp(aie_device, f.MM2S_alloc.getDmaTile(), AIE::WireBundle::DMA,
                     (uint32_t)f.MM2S_alloc.dma_channel.channel,
                     f.S2MM_alloc[i].getDmaTile(), AIE::WireBundle::DMA,
                     (uint32_t)f.S2MM_alloc[i].dma_channel.channel);
-        else if (f.memcpyResourceType == "cascade") {
+        else if (f.memcpyResourceType == "npu_cascade") {
           getCascadeFlowOp(aie_device, f.MM2S_alloc.getDmaTile(),
                            AIE::WireBundle::DMA,
                            (uint32_t)f.MM2S_alloc.dma_channel.channel,
@@ -5408,19 +5408,19 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
     return aieDmaBdOp;
   }
 
-  // Converts an air.channel.put/get operation with channel_type = "cascade"
+  // Converts an air.channel.put/get operation with channel_type = "npu_cascade"
   // into aie.get/put_cascade + vector.transfer_read/write sequence.
   // The conversion flattens the entire memref into a 1-D vector to match
   // the cascade data format expected by the AIE put/get_cascade ops.
   LogicalResult ConvertCascadeChannelIfToAIE(RewriterBase &rewriter,
                                              air::ChannelInterface op) {
-    // Match only if the associated channel has channel_type = "cascade".
+    // Match only if the associated channel has channel_type = "npu_cascade".
     auto chan = air::getChannelDeclarationThroughSymbol(op);
     if (!chan)
       return op->emitOpError("cannot resolve channel symbol");
 
-    if (chan.getChannelType().str() != "cascade")
-      return op->emitOpError("channel_type is not cascade");
+    if (chan.getChannelType().str() != "npu_cascade")
+      return op->emitOpError("channel_type is not npu_cascade");
 
     Location loc = op.getLoc();
     Value memref = op.getMemref();
@@ -5492,13 +5492,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
 
   FailureOr<air::ChannelInterface> TileCascadeChannelIfUsingScfFor(
       RewriterBase &rewriter, air::ChannelInterface op, unsigned cascadeWidth) {
-    // Match only if the associated channel has channel_type = "cascade".
+    // Match only if the associated channel has channel_type = "npu_cascade".
     auto chan = air::getChannelDeclarationThroughSymbol(op);
     if (!chan)
       return op->emitOpError("cannot resolve channel symbol");
 
-    if (chan.getChannelType().str() != "cascade")
-      return op->emitOpError("channel_type is not cascade");
+    if (chan.getChannelType().str() != "npu_cascade")
+      return op->emitOpError("channel_type is not npu_cascade");
 
     Location loc = op.getLoc();
     Value memref = op.getMemref();
@@ -5610,7 +5610,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
 
   // Lower mmio-typed channels into runtime-sequence MMIO writes.
   //
-  // For each `air.channel @c [...] {channel_type = "mmio"}`:
+  // For each `air.channel @c [...] {channel_type = "npu_mmio"}`:
   //   * each `air.channel.get @c` inside an `aie.core` is replaced by an
   //     erase — the destination L1 `aie.buffer` is populated by the host
   //     before the core runs, so the get is a no-op;
@@ -5636,7 +5636,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
     SmallVector<air::ChannelOp> mmioChannels;
     auto collectMMIO = [&](Operation *root) {
       root->walk([&](air::ChannelOp chan) {
-        if (chan.getChannelType() == "mmio")
+        if (chan.getChannelType() == "npu_mmio")
           if (!llvm::is_contained(mmioChannels, chan))
             mmioChannels.push_back(chan);
       });
@@ -5802,7 +5802,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
                                     StringRef kind) -> LogicalResult {
           if (constIndices(indices))
             return success();
-          return op->emitOpError("channel_type=\"mmio\" non-broadcast ")
+          return op->emitOpError("channel_type=\"npu_mmio\" non-broadcast ")
                  << kind << " requires compile-time constant indices";
         };
         for (auto put : hostPuts)
@@ -5832,7 +5832,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
         memref::GetGlobalOp getGlobalOp = getSourceGlobal(src);
         if (!getGlobalOp)
           return put.emitOpError(
-              "channel_type=\"mmio\" put requires source memref defined by "
+              "channel_type=\"npu_mmio\" put requires source memref defined by "
               "memref.get_global of a constant memref.global");
 
         StringAttr origName = getGlobalOp.getNameAttr().getAttr();
@@ -5844,7 +5844,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
                      : nullptr);
         if (!moduleGlobal)
           return getGlobalOp.emitOpError(
-              "channel_type=\"mmio\" lowering: cannot find memref.global "
+              "channel_type=\"npu_mmio\" lowering: cannot find memref.global "
               "for the put source at module scope");
 
         auto initOpt = moduleGlobal.getInitialValue();
@@ -5852,7 +5852,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
             initOpt ? dyn_cast<DenseElementsAttr>(*initOpt) : nullptr;
         if (!initDense)
           return put.emitOpError(
-              "channel_type=\"mmio\" source memref.global must have a "
+              "channel_type=\"npu_mmio\" source memref.global must have a "
               "DenseElementsAttr initializer");
 
         unsigned matchCount = 0;
@@ -5862,7 +5862,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
           AIE::BufferOp bufferOp = getDefiningBuffer(get.getMemref());
           if (!bufferOp)
             return get.emitOpError(
-                "channel_type=\"mmio\" get destination does not resolve to "
+                "channel_type=\"npu_mmio\" get destination does not resolve to "
                 "an aie.buffer (must be an L1 allocation)");
 
           // Element type and total element count must match between source
@@ -5872,13 +5872,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
           auto srcMemTy = cast<MemRefType>(getGlobalOp.getType());
           if (bufMemTy.getElementType() != srcMemTy.getElementType())
             return get.emitOpError(
-                       "channel_type=\"mmio\" source/destination element type "
+                       "channel_type=\"npu_mmio\" source/destination element type "
                        "mismatch (source: ")
                    << srcMemTy.getElementType()
                    << ", destination: " << bufMemTy.getElementType() << ")";
           if (bufMemTy.getNumElements() != srcMemTy.getNumElements())
             return get.emitOpError(
-                       "channel_type=\"mmio\" source/destination element count "
+                       "channel_type=\"npu_mmio\" source/destination element count "
                        "mismatch (source: ")
                    << srcMemTy.getNumElements()
                    << ", destination: " << bufMemTy.getNumElements() << ")";
@@ -5891,13 +5891,13 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
 
           if (auto existing = bufferOp.getInitialValue())
             return bufferOp.emitOpError(
-                "channel_type=\"mmio\" destination aie.buffer already has an "
+                "channel_type=\"npu_mmio\" destination aie.buffer already has an "
                 "initial_value; cannot stamp two sources into one buffer");
           bufferOp.setInitialValueAttr(reshapedInit);
           ++matchCount;
         }
         if (matchCount == 0)
-          return put.emitOpError("channel_type=\"mmio\" put has no matching "
+          return put.emitOpError("channel_type=\"npu_mmio\" put has no matching "
                                  "device-side air.channel.get");
       }
 
@@ -6237,7 +6237,7 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
     for (auto &alloc : tileDmaAlloc.s2mm_allocs)
       alloc.memcpyOps.clear();
 
-    // Lower channel_type="mmio" puts/gets into runtime-sequence blockwrites
+    // Lower channel_type="npu_mmio" puts/gets into runtime-sequence blockwrites
     // before the generic erase loop below removes the underlying air ops.
     // Only meaningful for the ChannelInterface specialization; for the
     // DmaMemcpyNd specialization there are no air.channel ops to convert.
diff --git a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
index a7da49eef..8d429249b 100644
--- a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
+++ b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
@@ -599,7 +599,7 @@ bool xilinx::air::allocation_info_t::foundPacketFlowAllocInTile(int32_t col,
       continue;
     auto chanTypeRes = air::getChannelType(memcpy_op);
     if (succeeded(chanTypeRes))
-      return chanTypeRes.value().str() == "dma_packet";
+      return chanTypeRes.value().str() == "npu_dma_packet";
   }
   return false;
 }
@@ -865,7 +865,7 @@ air::TileDMAAllocator::simpleDmaChannelAlloc(air::MemcpyInterface &memcpyOp,
   bool isPacketFlowOp = false;
   auto chanTypeRes = getChannelType(memcpyOp);
   if (succeeded(chanTypeRes)) {
-    isPacketFlowOp = chanTypeRes.value().str() == "dma_packet";
+    isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet";
   }
 
   // Search for existing dma channel allocation
@@ -942,7 +942,7 @@ FailureOr<air::allocation_info_t> air::ShimDMAAllocator::allocNewDmaChannel(
   bool isPacketFlowOp = false;
   auto chanTypeRes = getChannelType(memcpyOp);
   if (succeeded(chanTypeRes)) {
-    isPacketFlowOp = chanTypeRes.value().str() == "dma_packet";
+    isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet";
   }
 
   // Search for existing dma channel allocation
@@ -1158,7 +1158,7 @@ air::MemTileDMAAllocator::simpleDmaChannelAlloc(air::MemcpyInterface &memcpyOp,
   bool isPacketFlowOp = false;
   auto chanTypeRes = getChannelType(memcpyOp);
   if (succeeded(chanTypeRes)) {
-    isPacketFlowOp = chanTypeRes.value().str() == "dma_packet";
+    isPacketFlowOp = chanTypeRes.value().str() == "npu_dma_packet";
   }
 
   // Search for existing dma channel allocation
@@ -1379,7 +1379,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::DmaMemcpyNdOp memcpyOp) {
   S2MM_memspace = *dstMS;
   MM2S.push_back(memcpyOp.getOperation());
   MM2S_memspace = *srcMS;
-  memcpyResourceType = "dma_stream";
+  memcpyResourceType = "npu_dma_stream";
   return success();
 }
 
@@ -1391,7 +1391,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::ChannelGetOp memcpyOp) {
   // broadcast/index-matching logic below, which assumes hardware fanout.
   // Record the resource type (so downstream code can skip mmio bundles)
   // and return — the dedicated mmio lowering pass handles the rest.
-  if (chan.getChannelType() == "mmio") {
+  if (chan.getChannelType() == "npu_mmio") {
     air_flow_op = chan.getOperation();
     S2MM[alloc_id].push_back(memcpyOp.getOperation());
     auto getMS = air::getMemorySpace(
@@ -1399,7 +1399,7 @@ air::MemcpyBundleAsFlow::pushBackMemcpyOpToBundle(air::ChannelGetOp memcpyOp) {
     if (!getMS)
       return memcpyOp->emitOpError("unrecognized memory space on memref");
     S2MM_memspace = *getMS;
-    memcpyResourceType = "mmio";
+    memcpyResourceType = "npu_mmio";
     return success();
   }
   if (chan->hasAttr("broadcast_shape")) {
@@ -1470,7 +1470,7 @@ air::MemcpyBundleAsFlow::MemcpyBundleAsFlow(air::DmaMemcpyNdOp dmaMemcpyOp) {
                                            std::vector<Operation *>());
   S2MM = v1;
   S2MM_alloc = std::vector<air::allocation_info_t>(numS2MMAllocs);
-  memcpyResourceType = "dma_stream";
+  memcpyResourceType = "npu_dma_stream";
 }
 
 air::MemcpyBundleAsFlow::MemcpyBundleAsFlow(air::ChannelOp chan) {
@@ -1509,7 +1509,7 @@ LogicalResult air::simpleDMAChannelAllocation(
     // not DMA. They consume no DMA channel, BD, or routing resource and
     // bypass allocation entirely. Their put/get pairs are converted by a
     // dedicated late pass (see lowerAIRMMIOChannelOps).
-    if (f.memcpyResourceType == "mmio")
+    if (f.memcpyResourceType == "npu_mmio")
       continue;
     if (f.MM2S_memspace == air::MemorySpace::L1) {
       for (auto o : f.MM2S) {
@@ -1524,13 +1524,13 @@ LogicalResult air::simpleDMAChannelAllocation(
         int y = tile.getRow();
 
         FailureOr<air::allocation_info_t> alloc_res;
-        if (f.memcpyResourceType == "dma_stream" ||
-            f.memcpyResourceType == "dma_packet") {
+        if (f.memcpyResourceType == "npu_dma_stream" ||
+            f.memcpyResourceType == "npu_dma_packet") {
           alloc_res = tile_dma_alloc.simpleDmaChannelAlloc(
               memcpyOpIf, x, y, f.MM2S_alloc.dma_channel.channel);
           if (failed(alloc_res))
             return failure();
-        } else if (f.memcpyResourceType == "cascade") {
+        } else if (f.memcpyResourceType == "npu_cascade") {
           alloc_res = core_cascade_alloc.coreCascadeAlloc(memcpyOpIf);
           if (failed(alloc_res))
             return failure();
@@ -1555,13 +1555,13 @@ LogicalResult air::simpleDMAChannelAllocation(
           int y = tile.getRow();
 
           FailureOr<air::allocation_info_t> alloc_res;
-          if (f.memcpyResourceType == "dma_stream" ||
-              f.memcpyResourceType == "dma_packet") {
+          if (f.memcpyResourceType == "npu_dma_stream" ||
+              f.memcpyResourceType == "npu_dma_packet") {
             alloc_res = tile_dma_alloc.simpleDmaChannelAlloc(
                 memcpyOpIf, x, y, f.S2MM_alloc[i].dma_channel.channel);
             if (failed(alloc_res))
               return failure();
-          } else if (f.memcpyResourceType == "cascade") {
+          } else if (f.memcpyResourceType == "npu_cascade") {
             alloc_res = core_cascade_alloc.coreCascadeAlloc(memcpyOpIf);
             if (failed(alloc_res))
               return failure();
@@ -1576,15 +1576,15 @@ LogicalResult air::simpleDMAChannelAllocation(
   }
   for (auto &f : memcpy_flows) {
     // MMIO channels are not allocated to any DMA resource at L2 either.
-    if (f.memcpyResourceType == "mmio")
+    if (f.memcpyResourceType == "npu_mmio")
       continue;
     if (f.MM2S_memspace == air::MemorySpace::L2) {
       for (auto o : f.MM2S) {
         auto memcpyOpIf = cast<air::MemcpyInterface>(o);
         // Report error if the data movement lowers to neither dma stream
         // (aie.flow) nor dma packet flow (aie.packet_flow).
-        if (f.memcpyResourceType != "dma_stream" &&
-            f.memcpyResourceType != "dma_packet")
+        if (f.memcpyResourceType != "npu_dma_stream" &&
+            f.memcpyResourceType != "npu_dma_packet")
           return memcpyOpIf->emitOpError("only supports dma_stream or "
                                          "dma_packet connections at L2 memory");
         auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf);
@@ -1599,8 +1599,8 @@ LogicalResult air::simpleDMAChannelAllocation(
           auto memcpyOpIf = cast<air::MemcpyInterface>(o);
           // Report error if the data movement lowers to neither dma stream
           // (aie.flow) nor dma packet flow (aie.packet_flow).
-          if (f.memcpyResourceType != "dma_stream" &&
-              f.memcpyResourceType != "dma_packet")
+          if (f.memcpyResourceType != "npu_dma_stream" &&
+              f.memcpyResourceType != "npu_dma_packet")
             return memcpyOpIf->emitOpError(
                 "only supports dma_stream or dma_packet connections at L2 "
                 "memory");
@@ -1614,7 +1614,7 @@ LogicalResult air::simpleDMAChannelAllocation(
   }
   for (auto &f : memcpy_flows) {
     // MMIO channels are not allocated to any shim DMA resource.
-    if (f.memcpyResourceType == "mmio")
+    if (f.memcpyResourceType == "npu_mmio")
       continue;
     if (f.MM2S_memspace == air::MemorySpace::L3) {
       for (size_t i = 0; i < f.S2MM.size(); i++) {
@@ -1622,8 +1622,8 @@ LogicalResult air::simpleDMAChannelAllocation(
           auto memcpyOpIf = cast<air::MemcpyInterface>(o);
           // Report error if the data movement lowers to neither dma stream
           // (aie.flow) nor dma packet flow (aie.packet_flow).
-          if (f.memcpyResourceType != "dma_stream" &&
-              f.memcpyResourceType != "dma_packet")
+          if (f.memcpyResourceType != "npu_dma_stream" &&
+              f.memcpyResourceType != "npu_dma_packet")
             return memcpyOpIf->emitOpError(
                 "only supports dma_stream or dma_packet connections at L3 "
                 "memory");
@@ -1650,8 +1650,8 @@ LogicalResult air::simpleDMAChannelAllocation(
         auto memcpyOpIf = cast<air::MemcpyInterface>(o);
         // Report error if the data movement lowers to neither dma stream
         // (aie.flow) nor dma packet flow (aie.packet_flow).
-        if (f.memcpyResourceType != "dma_stream" &&
-            f.memcpyResourceType != "dma_packet")
+        if (f.memcpyResourceType != "npu_dma_stream" &&
+            f.memcpyResourceType != "npu_dma_packet")
           return memcpyOpIf->emitOpError("only supports dma_stream or "
                                          "dma_packet connections at L3 memory");
         if (!f.MM2S_alloc.getDmaTile())
diff --git a/mlir/lib/Conversion/ConvertToAIRPass.cpp b/mlir/lib/Conversion/ConvertToAIRPass.cpp
index ed41b3d38..ed6f9b593 100644
--- a/mlir/lib/Conversion/ConvertToAIRPass.cpp
+++ b/mlir/lib/Conversion/ConvertToAIRPass.cpp
@@ -781,7 +781,7 @@ separateScfParallel(scf::ParallelOp op, unsigned innerNumLoops,
 
 // Create a new air.channel symbol in the module for the cascade pipeline.
 // The symbol name is unique in the module, and the channel is tagged with
-// the "cascade" attribute.
+// the "npu_cascade" attribute.
 air::ChannelOp
 createCascadeChannelOp(OpBuilder &builder, ModuleOp module, Location loc,
                        SmallVector<int64_t> channel_bundle_sizes) {
@@ -797,10 +797,10 @@ createCascadeChannelOp(OpBuilder &builder, ModuleOp module, Location loc,
     o = o->getNextNode();
   builder.setInsertionPoint(o);
 
-  // Create the channel op with the given bundle sizes and "cascade" tag.
+  // Create the channel op with the given bundle sizes and "npu_cascade" tag.
   auto channel_op = air::ChannelOp::create(
       builder, loc, cname, builder.getI64ArrayAttr(channel_bundle_sizes),
-      builder.getStringAttr("cascade"));
+      builder.getStringAttr("npu_cascade"));
 
   return channel_op;
 }
diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
index b0fe068a1..2fc448a0a 100644
--- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
+++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
@@ -2811,6 +2811,36 @@ LogicalResult air::DmaMemcpyNdOp::verify() {
         return emitOpError("padding values must be <= 65535");
     }
   }
+
+  // Cross-rank addressing requires an enclosing air.rank scope, and the
+  // peer-side memref (when it is a direct memref.alloc result) must carry
+  // the air.symmetric attribute.
+  if (hasCrossRank()) {
+    Operation *p = (*this)->getParentOp();
+    while (p && !isa<air::RankOp>(p))
+      p = p->getParentOp();
+    if (!p)
+      return emitOpError("src_rank/dst_rank attributes require an enclosing "
+                         "air.rank scope");
+
+    auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult {
+      auto alloc = v.getDefiningOp<memref::AllocOp>();
+      if (!alloc)
+        return success(); // Block args / non-alloc sources are user-trusted.
+      if (!alloc->hasAttr("air.symmetric"))
+        return emitOpError() << side
+                             << " memref is referenced cross-rank but its "
+                                "memref.alloc lacks the \"air.symmetric\" "
+                                "attribute";
+      return success();
+    };
+    if (getSrcRank().has_value() &&
+        failed(requireSymmetricAlloc(getSrc(), "src")))
+      return failure();
+    if (getDstRank().has_value() &&
+        failed(requireSymmetricAlloc(getDst(), "dst")))
+      return failure();
+  }
   return success();
 }
 
@@ -3075,9 +3105,9 @@ static LogicalResult ComposeMemrefOpOnChannelOp(OpT op,
   if (!chan)
     // If the channel declaration cannot be resolved, signal a failure.
     return failure();
-  // If the channel is of type "cascade", try to fold memref.cast but skip full
+  // If the channel type is "npu_cascade", try to fold memref.cast but skip full
   // composition
-  if (chan.getChannelType() == "cascade")
+  if (chan.getChannelType() == "npu_cascade")
     return FoldMemrefCastOnChannelOp(op, rewriter);
 
   // Init. memref type and offsets from memref's defining op's input type
@@ -3157,18 +3187,29 @@ LogicalResult air::ChannelPutOp::verify() {
                 "must not be temporal scf.for induction variables";
   }
 
-  // For channel_type="mmio", the put runs from the host runtime sequence
+  // For channel_type="npu_mmio", the put runs from the host runtime sequence
   // and writes into a tile-local L1 buffer. Its source memref must
   // therefore live in L3 (host memory). Allow lookup-failure to silently
   // pass — that's a separate diagnostic surface.
   if (auto chan = resolveChannelDecl(*this)) {
-    if (chan.getChannelType() == "mmio") {
+    if (chan.getChannelType() == "npu_mmio") {
       auto memrefTy = dyn_cast<MemRefType>(getMemref().getType());
       if (memrefTy && memrefTy.getMemorySpaceAsInt() != 0)
-        return emitOpError() << "channel_type=\"mmio\" put source must be "
+        return emitOpError() << "channel_type=\"npu_mmio\" put source must be "
                                 "in L3 (memory_space=0), got memory_space="
                              << memrefTy.getMemorySpaceAsInt();
     }
+    // For channel_type="gpu_symmetric_heap", the put must be inside an
+    // air.rank scope (cross-rank addressing requires a multi-rank world).
+    if (chan.getChannelType() == "gpu_symmetric_heap") {
+      Operation *p = (*this)->getParentOp();
+      while (p && !isa<air::RankOp>(p))
+        p = p->getParentOp();
+      if (!p)
+        return emitOpError()
+               << "channel_type=\"gpu_symmetric_heap\" put requires an "
+                  "enclosing air.rank scope";
+    }
   }
 
   auto padBefore = getPadBefore();
@@ -3246,17 +3287,29 @@ LogicalResult air::ChannelGetOp::verify() {
                 "must not be temporal scf.for induction variables";
   }
 
-  // For channel_type="mmio", the destination must be a tile-local L1 buffer
-  // (memory_space=2): the host writes into it via runtime-sequence
+  // For channel_type="npu_mmio", the destination must be a tile-local L1
+  // buffer (memory_space=2): the host writes into it via runtime-sequence
   // blockwrites before the consuming core starts. L2/L3 destinations have
   // no representation in the lowered IR.
   if (auto chan = resolveChannelDecl(*this)) {
-    if (chan.getChannelType() == "mmio") {
+    if (chan.getChannelType() == "npu_mmio") {
       auto memrefTy = dyn_cast<MemRefType>(getMemref().getType());
       if (memrefTy && memrefTy.getMemorySpaceAsInt() != 2)
-        return emitOpError() << "channel_type=\"mmio\" get destination must be "
-                                "in L1 (memory_space=2), got memory_space="
-                             << memrefTy.getMemorySpaceAsInt();
+        return emitOpError()
+               << "channel_type=\"npu_mmio\" get destination must be "
+                  "in L1 (memory_space=2), got memory_space="
+               << memrefTy.getMemorySpaceAsInt();
+    }
+    // For channel_type="gpu_symmetric_heap", the get must be inside an
+    // air.rank scope.
+    if (chan.getChannelType() == "gpu_symmetric_heap") {
+      Operation *p = (*this)->getParentOp();
+      while (p && !isa<air::RankOp>(p))
+        p = p->getParentOp();
+      if (!p)
+        return emitOpError()
+               << "channel_type=\"gpu_symmetric_heap\" get requires an "
+                  "enclosing air.rank scope";
     }
   }
 
@@ -3290,14 +3343,18 @@ void air::ChannelGetOp::getCanonicalizationPatterns(RewritePatternSet &patterns,
 //
 
 LogicalResult air::ChannelOp::verify() {
-  // Allow-list of supported channel_type values. Adding a new transport
-  // requires both an enum entry here and a lowering branch in air-to-aie.
+  // Allow-list of supported channel_type values. Values are namespaced by
+  // backend: NPU (AIE) channels use the "npu_" prefix, GPU channels use the
+  // "gpu_" prefix. Adding a new transport requires both an entry here and a
+  // lowering branch in the appropriate conversion pass.
   StringRef chanType = getChannelType();
-  if (chanType != "dma_stream" && chanType != "dma_packet" &&
-      chanType != "cascade" && chanType != "mmio")
+  if (chanType != "npu_dma_stream" && chanType != "npu_dma_packet" &&
+      chanType != "npu_cascade" && chanType != "npu_mmio" &&
+      chanType != "gpu_symmetric_heap")
     return emitOpError() << "unsupported channel_type \"" << chanType
-                         << "\"; expected one of \"dma_stream\", "
-                            "\"dma_packet\", \"cascade\", or \"mmio\"";
+                         << "\"; expected one of \"npu_dma_stream\", "
+                            "\"npu_dma_packet\", \"npu_cascade\", "
+                            "\"npu_mmio\", or \"gpu_symmetric_heap\"";
 
   if (isBroadcast()) {
     auto bundle_size = getSize();
diff --git a/mlir/lib/Transform/AIRDmaToChannel.cpp b/mlir/lib/Transform/AIRDmaToChannel.cpp
index 522d9809a..a68705a3d 100644
--- a/mlir/lib/Transform/AIRDmaToChannel.cpp
+++ b/mlir/lib/Transform/AIRDmaToChannel.cpp
@@ -495,7 +495,7 @@ createChannelOp(OpBuilder builder, ModuleOp module, std::string cname,
 
   auto channel_op = air::ChannelOp::create(
       builder, loc, cname, builder.getI64ArrayAttr(channel_bundle_sizes),
-      builder.getStringAttr("dma_stream"));
+      builder.getStringAttr("npu_dma_stream"));
 
   builder.restoreInsertionPoint(insertionCheckpoint);
 
@@ -1606,10 +1606,10 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase<DmaToChannelPass> {
         // mmio channels are runtime-sequence MMIO writes, not shim DMA, so
         // they neither contribute to per-column shim pressure nor are
         // eligible for dma_packet upgrade.
-        if (chanOp.getChannelType() == "mmio")
+        if (chanOp.getChannelType() == "npu_mmio")
           continue;
 
-        bool isAlreadyPacket = (chanOp.getChannelType() == "dma_packet");
+        bool isAlreadyPacket = (chanOp.getChannelType() == "npu_dma_packet");
         auto channelName = chanOp.getSymName();
 
         // Check if this channel has a herd-side endpoint in this segment.
@@ -1722,7 +1722,7 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase<DmaToChannelPass> {
                            << pressure << " exceeds shim DMA limit of "
                            << shimChannelsPerCol << ")";
         for (auto chanOp : channels) {
-          chanOp.setChannelType(StringAttr::get(context, "dma_packet"));
+          chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet"));
         }
       };
 
@@ -1733,9 +1733,9 @@ struct DmaToChannelPass : public air::impl::DmaToChannelBase<DmaToChannelPass> {
                             << inputChannels.size() + outputChannels.size()
                             << " shim-bound channels to dma_packet";
         for (auto chanOp : inputChannels)
-          chanOp.setChannelType(StringAttr::get(context, "dma_packet"));
+          chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet"));
         for (auto chanOp : outputChannels)
-          chanOp.setChannelType(StringAttr::get(context, "dma_packet"));
+          chanOp.setChannelType(StringAttr::get(context, "npu_dma_packet"));
         return;
       }
 
diff --git a/mlir/lib/Transform/AIRHerdPlacementPass.cpp b/mlir/lib/Transform/AIRHerdPlacementPass.cpp
index 87b181483..ecc30eefa 100644
--- a/mlir/lib/Transform/AIRHerdPlacementPass.cpp
+++ b/mlir/lib/Transform/AIRHerdPlacementPass.cpp
@@ -374,7 +374,7 @@ class AIRHerdPlacementPass
     // Collect cascade channel declarations
     std::map<StringRef, air::ChannelOp> cascadeChannels;
     module.walk([&](air::ChannelOp channelOp) {
-      if (channelOp.getChannelType() == "cascade") {
+      if (channelOp.getChannelType() == "npu_cascade") {
         cascadeChannels[channelOp.getSymName()] = channelOp;
       }
     });
diff --git a/mlir/lib/Transform/AIRLinalgCodegen.cpp b/mlir/lib/Transform/AIRLinalgCodegen.cpp
index e9d8d5b45..3236d8f17 100644
--- a/mlir/lib/Transform/AIRLinalgCodegen.cpp
+++ b/mlir/lib/Transform/AIRLinalgCodegen.cpp
@@ -1205,7 +1205,8 @@ FailureOr<linalg::TiledLinalgOp> static pipelineReduceLinalgOp(
       auto cname = createChannelName(module);
       b.setInsertionPointToStart(module.getBody());
       auto channel_op = air::ChannelOp::create(
-          b, loc, cname, b.getI64ArrayAttr({1}), b.getStringAttr("dma_stream"));
+          b, loc, cname, b.getI64ArrayAttr({1}),
+          b.getStringAttr("npu_dma_stream"));
       b.setInsertionPoint(stageBlock->getTerminator());
       SmallVector<Value> src_offsets;
       SmallVector<Value> src_sizes;
diff --git a/mlir/lib/Transform/AIRMiscPasses.cpp b/mlir/lib/Transform/AIRMiscPasses.cpp
index 814927de6..77f26ac98 100644
--- a/mlir/lib/Transform/AIRMiscPasses.cpp
+++ b/mlir/lib/Transform/AIRMiscPasses.cpp
@@ -1147,7 +1147,7 @@ static bool segmentUsesCascade(air::HerdOp herd) {
 
   auto result = container->walk([&](air::ChannelInterface chanOp) {
     auto channelDecl = air::getChannelDeclarationThroughSymbol(chanOp);
-    if (channelDecl && channelDecl.getChannelType() == "cascade")
+    if (channelDecl && channelDecl.getChannelType() == "npu_cascade")
       return WalkResult::interrupt();
     return WalkResult::advance();
   });
diff --git a/mlir/lib/Util/Util.cpp b/mlir/lib/Util/Util.cpp
index ac690ffcc..07aed55da 100644
--- a/mlir/lib/Util/Util.cpp
+++ b/mlir/lib/Util/Util.cpp
@@ -522,7 +522,7 @@ FailureOr<StringRef> air::getChannelType(air::MemcpyInterface memcpyIfOp) {
   auto chanIfOp =
       dyn_cast_if_present<air::ChannelInterface>(memcpyIfOp.getOperation());
   if (!chanIfOp)
-    return StringRef("dma_stream");
+    return StringRef("npu_dma_stream");
   auto chanOp = getChannelDeclarationThroughSymbol(chanIfOp);
   if (chanOp) {
     return chanOp.getChannelType();
diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir
index 040456b5e..cc0b248e9 100644
--- a/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir
+++ b/mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir
@@ -5,7 +5,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-// Positive tests for channel_type="mmio" in air-to-aie. Each split has
+// Positive tests for channel_type="npu_mmio" in air-to-aie. Each split has
 // its own CHECK prefix so directives don't leak across boundaries.
 // Negative cases live in `air_channel_mmio_invalid.mlir`.
 //
@@ -34,7 +34,7 @@
 // CHECK-SIMPLE-NOT:     aiex.npu.blockwrite
 
 memref.global "private" @const_data : memref<8xi32> = dense<42>
-air.channel @mmio_chan [] {channel_type = "mmio"}
+air.channel @mmio_chan [] {channel_type = "npu_mmio"}
 func.func @mmio_simple() {
   %src = memref.get_global @const_data : memref<8xi32>
   %c1 = arith.constant 1 : index
@@ -66,8 +66,8 @@ func.func @mmio_simple() {
 // CHECK-MIXED-NOT:     air.channel.get @mmio_chan2
 
 memref.global "private" @mmio_const : memref<8xi32> = dense<7>
-air.channel @mmio_chan2 [] {channel_type = "mmio"}
-air.channel @dma_chan [] {channel_type = "dma_stream"}
+air.channel @mmio_chan2 [] {channel_type = "npu_mmio"}
+air.channel @dma_chan [] {channel_type = "npu_dma_stream"}
 func.func @mixed(%dma_src: memref<16xi32>) {
   %src = memref.get_global @mmio_const : memref<8xi32>
   %c1 = arith.constant 1 : index
@@ -106,7 +106,7 @@ func.func @mixed(%dma_src: memref<16xi32>) {
 // CHECK-BCAST-NOT:     aiex.npu.blockwrite
 
 memref.global "private" @const_q : memref<8xi32> = dense<5>
-air.channel @bcast_mmio [1] {channel_type = "mmio", broadcast_shape = [2]}
+air.channel @bcast_mmio [1] {channel_type = "npu_mmio", broadcast_shape = [2]}
 func.func @bcast() {
   %src = memref.get_global @const_q : memref<8xi32>
   %c1 = arith.constant 1 : index
@@ -146,7 +146,7 @@ func.func @bcast() {
 
 memref.global "private" @c0 : memref<8xi32> = dense<10>
 memref.global "private" @c1 : memref<8xi32> = dense<20>
-air.channel @qm [2] {channel_type = "mmio"}
+air.channel @qm [2] {channel_type = "npu_mmio"}
 func.func @indexed() {
   %g0 = memref.get_global @c0 : memref<8xi32>
   %g1 = memref.get_global @c1 : memref<8xi32>
@@ -180,7 +180,7 @@ func.func @indexed() {
 // CHECK-BF16-NOT:     aiex.npu.blockwrite
 
 memref.global "private" @qbf16 : memref<2x2xbf16> = dense<1.5>
-air.channel @qbf16_chan [] {channel_type = "mmio"}
+air.channel @qbf16_chan [] {channel_type = "npu_mmio"}
 func.func @bf16_payload() {
   %src = memref.get_global @qbf16 : memref<2x2xbf16>
   %c1 = arith.constant 1 : index
@@ -208,7 +208,7 @@ func.func @bf16_payload() {
 // CHECK-BF16NS-NOT:     aiex.npu.blockwrite
 
 memref.global "private" @qbf16ns : memref<2x2xbf16> = dense<[[1.5, 2.5], [3.5, 4.5]]>
-air.channel @qbf16ns_chan [] {channel_type = "mmio"}
+air.channel @qbf16ns_chan [] {channel_type = "npu_mmio"}
 func.func @bf16_nonsplat() {
   %src = memref.get_global @qbf16ns : memref<2x2xbf16>
   %c1 = arith.constant 1 : index
@@ -237,7 +237,7 @@ func.func @bf16_nonsplat() {
 // CHECK-I8-NOT:     aiex.npu.blockwrite
 
 memref.global "private" @c8s : memref<4xi8> = dense<66>
-air.channel @c8s_chan [] {channel_type = "mmio"}
+air.channel @c8s_chan [] {channel_type = "npu_mmio"}
 func.func @i8_splat() {
   %src = memref.get_global @c8s : memref<4xi8>
   %c1 = arith.constant 1 : index
diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir
index 8a372db9d..df5decf6a 100644
--- a/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir
+++ b/mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir
@@ -5,7 +5,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-// Negative tests for channel_type="mmio". Each split runs under `not`
+// Negative tests for channel_type="npu_mmio". Each split runs under `not`
 // so FileCheck sees only that split's diagnostic.
 
 // RUN: not air-opt %s -split-input-file -air-to-aie="row-offset=2 col-offset=0 device=npu1" 2>&1 | FileCheck %s
@@ -13,8 +13,8 @@
 // The source data is stamped onto the destination L1 buffer's
 // initial_value, so the put source must be a compile-time constant
 // memref.global.
-// CHECK: channel_type="mmio" put requires source memref defined by memref.get_global
-air.channel @mmio_nc [] {channel_type = "mmio"}
+// CHECK: channel_type="npu_mmio" put requires source memref defined by memref.get_global
+air.channel @mmio_nc [] {channel_type = "npu_mmio"}
 func.func @mmio_nonconst(%h: memref<8xi32>) {
   %c1 = arith.constant 1 : index
   air.launch (%lx) in (%sx = %c1) args(%a = %h) : memref<8xi32> {
@@ -35,9 +35,9 @@ func.func @mmio_nonconst(%h: memref<8xi32>) {
 
 // Non-broadcast mmio with non-constant index can't match any get;
 // would silently erase the put. Reject up front.
-// CHECK: channel_type="mmio" non-broadcast put requires compile-time constant indices
+// CHECK: channel_type="npu_mmio" non-broadcast put requires compile-time constant indices
 memref.global "private" @nci_const : memref<8xi32> = dense<1>
-air.channel @nci_chan [1] {channel_type = "mmio"}
+air.channel @nci_chan [1] {channel_type = "npu_mmio"}
 func.func @mmio_nonconst_index(%n: index) {
   %src = memref.get_global @nci_const : memref<8xi32>
   %c1 = arith.constant 1 : index
@@ -59,9 +59,9 @@ func.func @mmio_nonconst_index(%n: index) {
 // -----
 
 // Constant-index put with no matching get would be silently erased.
-// CHECK: channel_type="mmio" put has no matching device-side air.channel.get
+// CHECK: channel_type="npu_mmio" put has no matching device-side air.channel.get
 memref.global "private" @nm_const : memref<8xi32> = dense<2>
-air.channel @nm_chan [2] {channel_type = "mmio"}
+air.channel @nm_chan [2] {channel_type = "npu_mmio"}
 func.func @mmio_no_match() {
   %src = memref.get_global @nm_const : memref<8xi32>
   %c1 = arith.constant 1 : index
@@ -85,9 +85,9 @@ func.func @mmio_no_match() {
 
 // The destination L1 buffer's element type must match the source so the
 // initializer is type-compatible.
-// CHECK: channel_type="mmio" source/destination element type mismatch
+// CHECK: channel_type="npu_mmio" source/destination element type mismatch
 memref.global "private" @i32_src : memref<4xi32> = dense<7>
-air.channel @typemis_chan [] {channel_type = "mmio"}
+air.channel @typemis_chan [] {channel_type = "npu_mmio"}
 func.func @mmio_type_mismatch() {
   %src = memref.get_global @i32_src : memref<4xi32>
   %c1 = arith.constant 1 : index
@@ -108,9 +108,9 @@ func.func @mmio_type_mismatch() {
 // -----
 
 // Source/destination must agree on total element count.
-// CHECK: channel_type="mmio" source/destination element count mismatch
+// CHECK: channel_type="npu_mmio" source/destination element count mismatch
 memref.global "private" @short_src : memref<4xi32> = dense<7>
-air.channel @sizemis_chan [] {channel_type = "mmio"}
+air.channel @sizemis_chan [] {channel_type = "npu_mmio"}
 func.func @mmio_size_mismatch() {
   %src = memref.get_global @short_src : memref<4xi32>
   %c1 = arith.constant 1 : index
@@ -132,9 +132,9 @@ func.func @mmio_size_mismatch() {
 
 // initial_value is set by the lowering, so the source memref.global
 // needs a DenseElementsAttr initializer to copy from.
-// CHECK: channel_type="mmio" source memref.global must have a DenseElementsAttr initializer
+// CHECK: channel_type="npu_mmio" source memref.global must have a DenseElementsAttr initializer
 memref.global "private" @uninit_bf16 : memref<2x2xbf16>
-air.channel @uninit_chan [] {channel_type = "mmio"}
+air.channel @uninit_chan [] {channel_type = "npu_mmio"}
 func.func @mmio_uninitialized_global() {
   %src = memref.get_global @uninit_bf16 : memref<2x2xbf16>
   %c1 = arith.constant 1 : index
diff --git a/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir b/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir
index 2a1820874..6fcf2d20e 100644
--- a/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir
+++ b/mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir
@@ -241,7 +241,7 @@ func.func @one_to_two() {
 
 #set = affine_set<()[s0] : (s0 - 3 == 0)>
 #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
-air.channel @channel_0 [3] {channel_type = "cascade"}
+air.channel @channel_0 [3] {channel_type = "npu_cascade"}
 air.channel @channel_1 [1]
 air.channel @channel_2 [1]
 func.func @cascade(%arg0: memref<2048xi32>, %arg1: memref<2048xi32>) {
@@ -393,7 +393,7 @@ func.func @cascade(%arg0: memref<2048xi32>, %arg1: memref<2048xi32>) {
 #set = affine_set<()[s0] : (s0 - 3 == 0)>
 #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
 module {
-  air.channel @channel_0 [3] {channel_type = "cascade"}
+  air.channel @channel_0 [3] {channel_type = "npu_cascade"}
   air.channel @channel_1 [1]
   air.channel @channel_2 [1]
   func.func @cascade2(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) {
@@ -496,7 +496,7 @@ module {
 // Test 2D memref flattening for cascade
 #set_2d = affine_set<()[s0] : (s0 - 1 == 0)>
 module {
-  air.channel @cascade_2d [1] {channel_type = "cascade"}
+  air.channel @cascade_2d [1] {channel_type = "npu_cascade"}
   func.func @cascade_2d_flatten(%arg0: memref<32x64xi32>) {
     %c1 = arith.constant 1 : index
     %0 = air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<32x64xi32> attributes {id = 1 : i32} {
@@ -543,7 +543,7 @@ module {
 // Test 4D memref flattening for cascade
 #set_4d = affine_set<()[s0] : (s0 - 1 == 0)>
 module {
-  air.channel @cascade_4d [1] {channel_type = "cascade"}
+  air.channel @cascade_4d [1] {channel_type = "npu_cascade"}
   func.func @cascade_4d_flatten(%arg0: memref<2x4x8x32xi32>) {
     %c1 = arith.constant 1 : index
     %0 = air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<2x4x8x32xi32> attributes {id = 1 : i32} {
@@ -589,7 +589,7 @@ module {
 // Test bf16 cascade flattening (different tile size due to element width)
 #set_bf16 = affine_set<()[s0] : (s0 - 1 == 0)>
 module {
-  air.channel @cascade_bf16 [1] {channel_type = "cascade"}
+  air.channel @cascade_bf16 [1] {channel_type = "npu_cascade"}
   func.func @cascade_bf16_flatten(%arg0: memref<32x32xbf16>) {
     %c1 = arith.constant 1 : index
     %0 = air.launch async (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0) : memref<32x32xbf16> attributes {id = 1 : i32} {
@@ -684,7 +684,7 @@ module {
 #set = affine_set<()[s0] : (s0 - 3 == 0)>
 #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
 module {
-  air.channel @channel_0 [3] {channel_type = "cascade"}
+  air.channel @channel_0 [3] {channel_type = "npu_cascade"}
   air.channel @channel_1 [1]
   air.channel @channel_2 [1]
   func.func @cascade3(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) {
diff --git a/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir b/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir
index b4813e7b4..e5c723abb 100644
--- a/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir
+++ b/mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir
@@ -1596,14 +1596,14 @@ module {
 #set5 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 3 >= 0, s1 - 2 == 0)>
 module {
   air.channel @L3ToL2Chan1 [1, 4]
-  air.channel @L2ToL1Chan1_0 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan1_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan1_2 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan1_3 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan2_0 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan2_1 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan2_2 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"}
-  air.channel @L2ToL1Chan2_3 [1, 1] {broadcast_shape = [4, 1], channel_type = "dma_packet"}
+  air.channel @L2ToL1Chan1_0 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan1_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan1_2 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan1_3 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan2_0 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan2_1 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan2_2 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
+  air.channel @L2ToL1Chan2_3 [1, 1] {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
   func.func @func20(%arg0: memref<128x64xbf16>, %arg1: memref<64x3072xbf16>, %arg2: memref<3072x64xbf16>, %arg3: memref<128x3072xbf16>, %arg4: memref<128x64xbf16>) {
     %c1 = arith.constant 1 : index
     %0 = air.launch async (%arg5, %arg6) in (%arg7=%c1, %arg8=%c1) args(%arg9=%arg0, %arg10=%arg1) : memref<128x64xbf16>, memref<64x3072xbf16> attributes {id = 1 : i32} {
@@ -1781,7 +1781,7 @@ module {
 // RACECONDFIX: @func21
 
 module {
-  air.channel @L1ToL3Pkt [1, 1] {channel_type = "dma_packet"}
+  air.channel @L1ToL3Pkt [1, 1] {channel_type = "npu_dma_packet"}
   func.func @func21(%arg0: memref<64xbf16>) {
     %c1 = arith.constant 1 : index
     %c0 = arith.constant 0 : index
diff --git a/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir b/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir
index a34ede4d5..d6c87875e 100644
--- a/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir
+++ b/mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir
@@ -22,7 +22,7 @@ module {
   air.channel @channel_13 [1, 1] {broadcast_shape = [1, 4]}
   air.channel @channel_14 [1, 1] {broadcast_shape = [1, 4]}
   air.channel @channel_15 [1, 1] {broadcast_shape = [1, 4]}
-  air.channel @channel_2 [4, 1] {channel_type = "dma_packet"}
+  air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"}
   func.func @func2(%arg0: memref<512x512xbf16>) {
     %c2 = arith.constant 2 : index
     %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg0) : memref<512x512xbf16> attributes {id = 1 : i32} {
diff --git a/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir b/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir
index c4d24f0e1..ac6af7d8a 100644
--- a/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir
+++ b/mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir
@@ -26,7 +26,7 @@ module {
   air.channel @channel_13 [1, 1] {broadcast_shape = [1, 4]}
   air.channel @channel_14 [1, 1] {broadcast_shape = [1, 4]}
   air.channel @channel_15 [1, 1] {broadcast_shape = [1, 4]}
-  air.channel @channel_2 [4, 1] {channel_type = "dma_packet"}
+  air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"}
   func.func @func2(%arg0: memref<512x512xbf16>) {
     %c2 = arith.constant 2 : index
     %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg0) : memref<512x512xbf16> attributes {id = 1 : i32} {
diff --git a/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir b/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir
index 19e53f61c..abad216f8 100644
--- a/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir
+++ b/mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir
@@ -34,8 +34,8 @@
 
 module {
   // Intra-device channels for L1-L2 communication (packet flow type)
-  air.channel @chan_intra_a [2, 1] {channel_type = "dma_packet"}
-  air.channel @chan_intra_b [2, 1] {channel_type = "dma_packet"}
+  air.channel @chan_intra_a [2, 1] {channel_type = "npu_dma_packet"}
+  air.channel @chan_intra_b [2, 1] {channel_type = "npu_dma_packet"}
 
   func.func @test_packet_flow_id_reset(%arg0: memref<128xbf16>) {
     %0 = air.launch async () in () args(%input=%arg0) : memref<128xbf16> attributes {id = 1 : i32} {
diff --git a/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir b/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir
index c6d7870af..e4a78c488 100644
--- a/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir
+++ b/mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir
@@ -25,8 +25,8 @@
 
 module {
-  // Two dma_packet channels from L3 to L1, sharing the same shim column.
+  // Two npu_dma_packet channels from L3 to L1, sharing the same shim column.
-  air.channel @chan_a [1, 1] {channel_type = "dma_packet"}
-  air.channel @chan_b [1, 1] {channel_type = "dma_packet"}
+  air.channel @chan_a [1, 1] {channel_type = "npu_dma_packet"}
+  air.channel @chan_b [1, 1] {channel_type = "npu_dma_packet"}
 
   func.func @test_shared_shim_packet_ids(%arg0: memref<64xbf16>, %arg1: memref<64xbf16>) {
     %0 = air.launch async () in () args(%in0=%arg0, %in1=%arg1) : memref<64xbf16>, memref<64xbf16> attributes {id = 1 : i32} {
diff --git a/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir b/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir
index 772133841..8e92bb45f 100644
--- a/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir
+++ b/mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir
@@ -23,7 +23,7 @@
 // CHECK: @func0
 // CHECK: air.channel.put  @channel_0[] {{.*}} metadataArray = [{base = "air_channel_0", index = 0 : i32}], packet = #aie.packet_info<pkt_type = 0, pkt_id = 0>
 #map2 = affine_map<(d0) -> (d0)>
-air.channel @channel_0 [1, 1] {channel_type = "dma_packet"}
+air.channel @channel_0 [1, 1] {channel_type = "npu_dma_packet"}
 air.channel @channel_1 [1, 1]
 air.channel @channel_2 [1, 1]
 air.channel @channel_3 [1, 1]
@@ -83,7 +83,7 @@ func.func @func0(%arg0 : memref<64xi32>, %arg1 : memref<64xi32>) -> () {
 // CHECK: air.channel.put async @channel_0[] {{.*}} metadataArray = [{base = "air_channel_0", index = 0 : i32}], packet = #aie.packet_info<pkt_type = 0, pkt_id = 0>
 #map = affine_map<(d0) -> (d0)>
 module {
-  air.channel @channel_0 [1, 1] {channel_type = "dma_packet"}
+  air.channel @channel_0 [1, 1] {channel_type = "npu_dma_packet"}
   air.channel @channel_1 [1, 1]
   air.channel @channel_2 [1, 1]
   air.channel @channel_3 [1, 1]
diff --git a/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir b/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir
index 8cd6485ec..cf950a44b 100644
--- a/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir
+++ b/mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir
@@ -24,9 +24,9 @@
 // CHECK: aie.shim_dma_allocation @air_pkt_in_2({{.*}}, MM2S, 0)
 
 module {
-  air.channel @pkt_in_0 [1, 1] {channel_type = "dma_packet"}
-  air.channel @pkt_in_1 [1, 1] {channel_type = "dma_packet"}
-  air.channel @pkt_in_2 [1, 1] {channel_type = "dma_packet"}
+  air.channel @pkt_in_0 [1, 1] {channel_type = "npu_dma_packet"}
+  air.channel @pkt_in_1 [1, 1] {channel_type = "npu_dma_packet"}
+  air.channel @pkt_in_2 [1, 1] {channel_type = "npu_dma_packet"}
   air.channel @to_core [1, 1]
   air.channel @from_core [1, 1]
   air.channel @out [1, 1]
diff --git a/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir b/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir
index 7fbabd53a..41eb0f9df 100644
--- a/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir
+++ b/mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir
@@ -388,7 +388,7 @@ module {
 
 // CHECK: [[$SET0:#set[0-9]*]] = affine_set<()[s0] : (s0 - 3 == 0)>
 // CHECK: [[$SET1:#set[0-9]+]] = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
-// CHECK: air.channel @channel_0 [3] {channel_type = "cascade"}
+// CHECK: air.channel @channel_0 [3] {channel_type = "npu_cascade"}
 // CHECK-LABEL: scf_reduce
 // CHECK: air.herd @herd_0  tile (%[[arg0:.*]], %[[arg1:.*]]) in (%{{.*}}=%c4{{.*}}, %{{.*}}=%c1{{.*}})
 // CHECK: %[[alloc_4:.*]] = memref.alloc() : memref<32xi32, 2 : i32>
diff --git a/mlir/test/Dialect/AIR/air_canonicalize.mlir b/mlir/test/Dialect/AIR/air_canonicalize.mlir
index 723f174e1..9cff8ba7d 100644
--- a/mlir/test/Dialect/AIR/air_canonicalize.mlir
+++ b/mlir/test/Dialect/AIR/air_canonicalize.mlir
@@ -1094,7 +1094,7 @@ func.func @channel_fold_reinterpret_cast_empty_access(%arg0: memref<*xbf16>) {
 
 // Test cascade channel type (should only fold memref.cast, not full composition)
 
-air.channel @channel_cascade [3] {channel_type = "cascade"}
+air.channel @channel_cascade [3] {channel_type = "npu_cascade"}
 
 // CHECK-LABEL: func.func @channel_cascade_fold_cast_only
 // CHECK-NOT: %[[CAST:.*]] = memref.cast %{{.*}} : memref<256x256xbf16> to memref<256x256xbf16, strided<[256, 1], offset: ?>>
diff --git a/mlir/test/Dialect/AIR/air_channel.mlir b/mlir/test/Dialect/AIR/air_channel.mlir
index e148282da..cb7a035a7 100644
--- a/mlir/test/Dialect/AIR/air_channel.mlir
+++ b/mlir/test/Dialect/AIR/air_channel.mlir
@@ -12,7 +12,7 @@
 // CHECK: func.func @channel
 // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @channel_1[{{.*}}, {{.*}}]
 // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @channel_1[{{.*}}, {{.*}}]
-air.channel @channel_1 [2,2] {channel_type = "dma_stream"}
+air.channel @channel_1 [2,2] {channel_type = "npu_dma_stream"}
 func.func @channel() {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
@@ -35,7 +35,7 @@ func.func @channel() {
 // CHECK: func.func @fork
 // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @bcast[] ({{.*}}[{{.*}},{{.*}}]
 // CHECK: air.channel.get @bcast[{{.*}}, {{.*}}] ({{.*}}[] [] []) 
-air.channel @bcast [2,1] {channel_type = "dma_stream"}
+air.channel @bcast [2,1] {channel_type = "npu_dma_stream"}
 func.func @fork() {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
@@ -55,7 +55,7 @@ func.func @fork() {
 // CHECK: func.func @distribute
 // CHECK: air.channel.put @merge[{{.*}}, {{.*}}] ({{.*}}[
 // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @merge[]
-air.channel @merge[2,2] {channel_type = "dma_stream"}
+air.channel @merge[2,2] {channel_type = "npu_dma_stream"}
 func.func @distribute() {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
@@ -74,11 +74,11 @@ func.func @distribute() {
   return 
 } 
 
-// CHECK: air.channel @packet_flow [2, 2] {channel_type = "dma_packet"}
+// CHECK: air.channel @packet_flow [2, 2] {channel_type = "npu_dma_packet"}
 // CHECK: func.func @packet_flow_func
 // CHECK: %[[V1:.*]] = air.channel.put async [{{.*}}] @packet_flow[{{.*}}, {{.*}}]
 // CHECK: %[[V2:.*]] = air.channel.get async [{{.*}}] @packet_flow[{{.*}}, {{.*}}]
-air.channel @packet_flow[2,2] {channel_type = "dma_packet"}
+air.channel @packet_flow[2,2] {channel_type = "npu_dma_packet"}
 func.func @packet_flow_func() {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
@@ -98,7 +98,7 @@ func.func @packet_flow_func() {
   return 
 } 
 
-// CHECK: air.channel @cascade [3] {channel_type = "cascade"}
+// CHECK: air.channel @cascade [3] {channel_type = "npu_cascade"}
 // CHECK: func.func @cascade_func
 // CHECK: affine.if
 // CHECK: air.channel.put  @cascade[%{{.*}}]
@@ -110,7 +110,7 @@ func.func @packet_flow_func() {
 // CHECK: air.channel.get  @cascade[%{{.*}}]
 #set = affine_set<()[s0] : (s0 == 0)>
 #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
-air.channel @cascade [3] {channel_type = "cascade"}
+air.channel @cascade [3] {channel_type = "npu_cascade"}
 func.func @cascade_func() {
   %c4 = arith.constant 4 : index
   %c1_0 = arith.constant 1 : index
@@ -141,5 +141,12 @@ func.func @cascade_func() {
 // -----
 
 // CHECK: air.channel @mmio_chan
-// CHECK-SAME: channel_type = "mmio"
-air.channel @mmio_chan [] {channel_type = "mmio"}
+// CHECK-SAME: channel_type = "npu_mmio"
+air.channel @mmio_chan [] {channel_type = "npu_mmio"}
+
+// -----
+
+// Round-trip: gpu_symmetric_heap channel parses and prints correctly.
+// CHECK: air.channel @sym_heap_chan
+// CHECK-SAME: channel_type = "gpu_symmetric_heap"
+air.channel @sym_heap_chan [] {channel_type = "gpu_symmetric_heap"}
diff --git a/mlir/test/Dialect/AIR/air_channel_invalid.mlir b/mlir/test/Dialect/AIR/air_channel_invalid.mlir
index 812af3462..51fb7d041 100644
--- a/mlir/test/Dialect/AIR/air_channel_invalid.mlir
+++ b/mlir/test/Dialect/AIR/air_channel_invalid.mlir
@@ -85,15 +85,15 @@ func.func @channel_get_temporal_for_iv(%m: memref<64xi32>) {
 // -----
 
 // Test: unsupported channel_type string is rejected by the verifier.
-// expected-error @+1 {{'air.channel' op unsupported channel_type "ddr_stream"; expected one of "dma_stream", "dma_packet", "cascade", or "mmio"}}
+// expected-error @+1 {{'air.channel' op unsupported channel_type "ddr_stream"; expected one of "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or "gpu_symmetric_heap"}}
 air.channel @bad_chan_type [] {channel_type = "ddr_stream"}
 
 // -----
 
 // Test: mmio channel put source must be in L3 (memory_space=0).
-air.channel @mmio_bad_put [] {channel_type = "mmio"}
+air.channel @mmio_bad_put [] {channel_type = "npu_mmio"}
 func.func @mmio_put_wrong_memspace(%m: memref<8xi32, 2>) {
-  // expected-error @+1 {{'air.channel.put' op channel_type="mmio" put source must be in L3 (memory_space=0), got memory_space=2}}
+  // expected-error @+1 {{'air.channel.put' op channel_type="npu_mmio" put source must be in L3 (memory_space=0), got memory_space=2}}
   air.channel.put @mmio_bad_put[] (%m[] [] []) : (memref<8xi32, 2>)
   return
 }
@@ -101,9 +101,9 @@ func.func @mmio_put_wrong_memspace(%m: memref<8xi32, 2>) {
 // -----
 
 // Test: mmio channel get destination must be in L1 (memory_space=2).
-air.channel @mmio_bad_get [] {channel_type = "mmio"}
+air.channel @mmio_bad_get [] {channel_type = "npu_mmio"}
 func.func @mmio_get_wrong_memspace(%m: memref<8xi32, 1>) {
-  // expected-error @+1 {{'air.channel.get' op channel_type="mmio" get destination must be in L1 (memory_space=2), got memory_space=1}}
+  // expected-error @+1 {{'air.channel.get' op channel_type="npu_mmio" get destination must be in L1 (memory_space=2), got memory_space=1}}
   air.channel.get @mmio_bad_get[] (%m[] [] []) : (memref<8xi32, 1>)
   return
 }
@@ -111,9 +111,9 @@ func.func @mmio_get_wrong_memspace(%m: memref<8xi32, 1>) {
 // -----
 
 // Test: mmio put with L2 source is also rejected (only L3 is allowed).
-air.channel @mmio_bad_put_l2 [] {channel_type = "mmio"}
+air.channel @mmio_bad_put_l2 [] {channel_type = "npu_mmio"}
 func.func @mmio_put_l2(%m: memref<8xi32, 1>) {
-  // expected-error @+1 {{'air.channel.put' op channel_type="mmio" put source must be in L3 (memory_space=0), got memory_space=1}}
+  // expected-error @+1 {{'air.channel.put' op channel_type="npu_mmio" put source must be in L3 (memory_space=0), got memory_space=1}}
   air.channel.put @mmio_bad_put_l2[] (%m[] [] []) : (memref<8xi32, 1>)
   return
 }
@@ -121,9 +121,29 @@ func.func @mmio_put_l2(%m: memref<8xi32, 1>) {
 // -----
 
 // Test: mmio get with L3 destination is rejected (only L1 is allowed).
-air.channel @mmio_bad_get_l3 [] {channel_type = "mmio"}
+air.channel @mmio_bad_get_l3 [] {channel_type = "npu_mmio"}
 func.func @mmio_get_l3(%m: memref<8xi32>) {
-  // expected-error @+1 {{'air.channel.get' op channel_type="mmio" get destination must be in L1 (memory_space=2), got memory_space=0}}
+  // expected-error @+1 {{'air.channel.get' op channel_type="npu_mmio" get destination must be in L1 (memory_space=2), got memory_space=0}}
   air.channel.get @mmio_bad_get_l3[] (%m[] [] []) : (memref<8xi32>)
   return
 }
+
+// -----
+
+// Test: gpu_symmetric_heap put outside an air.rank scope is rejected.
+air.channel @sym_chan_put [] {channel_type = "gpu_symmetric_heap"}
+func.func @sym_put_no_rank(%m: memref<128xf32>) {
+  // expected-error @+1 {{'air.channel.put' op channel_type="gpu_symmetric_heap" put requires an enclosing air.rank scope}}
+  air.channel.put @sym_chan_put[] (%m[] [] []) : (memref<128xf32>)
+  return
+}
+
+// -----
+
+// Test: gpu_symmetric_heap get outside an air.rank scope is rejected.
+air.channel @sym_chan_get [] {channel_type = "gpu_symmetric_heap"}
+func.func @sym_get_no_rank(%m: memref<128xf32>) {
+  // expected-error @+1 {{'air.channel.get' op channel_type="gpu_symmetric_heap" get requires an enclosing air.rank scope}}
+  air.channel.get @sym_chan_get[] (%m[] [] []) : (memref<128xf32>)
+  return
+}
diff --git a/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir b/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir
new file mode 100644
index 000000000..3ad0138ac
--- /dev/null
+++ b/mlir/test/Dialect/AIR/air_cross_rank_dma.mlir
@@ -0,0 +1,75 @@
+//===- air_cross_rank_dma.mlir ----------------------------------*- MLIR -*-===//
+//
+// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+// SPDX-License-Identifier: MIT
+//
+//===----------------------------------------------------------------------===//
+
+// Round-trip tests for air.dma_memcpy_nd src_rank/dst_rank, the air.symmetric
+// memref.alloc attribute, and gpu_symmetric_heap channel put/get. The cross-rank
+// attributes and the channel put/get sites require an enclosing air.rank scope.
+//
+// RUN: air-opt %s | FileCheck %s
+
+// CHECK-LABEL: func.func @test_dma_with_src_rank
+func.func @test_dma_with_src_rank() {
+  %c2 = arith.constant 2 : index
+  // CHECK: air.rank
+  air.rank (%rx) in (%sx = %c2) {
+    // CHECK: %[[BUF:.*]] = memref.alloc() {air.symmetric} : memref<128xf32>
+    %buf = memref.alloc() {air.symmetric} : memref<128xf32>
+    %local = memref.alloc() : memref<128xf32, 2>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: src_rank = 0
+    air.dma_memcpy_nd (%local[] [] [], %buf[] [] []) {src_rank = 0 : i64}
+        : (memref<128xf32, 2>, memref<128xf32>)
+  }
+  return
+}
+
+// CHECK-LABEL: func.func @test_dma_with_dst_rank
+func.func @test_dma_with_dst_rank() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %buf = memref.alloc() {air.symmetric} : memref<128xf32>
+    %local = memref.alloc() : memref<128xf32, 2>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: dst_rank = 1
+    air.dma_memcpy_nd (%buf[] [] [], %local[] [] []) {dst_rank = 1 : i64}
+        : (memref<128xf32>, memref<128xf32, 2>)
+  }
+  return
+}
+
+// CHECK-LABEL: func.func @test_dma_with_both_ranks
+func.func @test_dma_with_both_ranks() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %src = memref.alloc() {air.symmetric} : memref<128xf32>
+    %dst = memref.alloc() {air.symmetric} : memref<128xf32>
+    // CHECK: air.dma_memcpy_nd
+    // CHECK-SAME: dst_rank = 1
+    // CHECK-SAME: src_rank = 0
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] [])
+        {src_rank = 0 : i64, dst_rank = 1 : i64}
+        : (memref<128xf32>, memref<128xf32>)
+  }
+  return
+}
+
+// CHECK: air.channel @sym_chan
+// CHECK-SAME: channel_type = "gpu_symmetric_heap"
+air.channel @sym_chan [] {channel_type = "gpu_symmetric_heap"}
+
+// CHECK-LABEL: func.func @test_sym_channel_put_get_in_rank
+func.func @test_sym_channel_put_get_in_rank() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %buf = memref.alloc() : memref<128xf32>
+    // CHECK: air.channel.put @sym_chan
+    air.channel.put @sym_chan[] (%buf[] [] []) : (memref<128xf32>)
+    // CHECK: air.channel.get @sym_chan
+    air.channel.get @sym_chan[] (%buf[] [] []) : (memref<128xf32>)
+  }
+  return
+}
diff --git a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
index 2fba20890..f82d215d3 100644
--- a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
+++ b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
@@ -56,3 +56,53 @@ func.func @channel_get_dst_mismatch(%m: memref<64xi32>) {
   air.channel.get @channel_get_test[] (%m[%c0] [%c64, %c64] [%c1]) : (memref<64xi32>)
   return
 }
+
+// -----
+
+// Test: src_rank requires an enclosing air.rank scope.
+func.func @dma_src_rank_no_enclosing_rank(%dst: memref<128xf32, 2>, %src: memref<128xf32>) {
+  // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank/dst_rank attributes require an enclosing air.rank scope}}
+  air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = 0 : i64}
+      : (memref<128xf32, 2>, memref<128xf32>)
+  return
+}
+
+// -----
+
+// Test: dst_rank requires an enclosing air.rank scope.
+func.func @dma_dst_rank_no_enclosing_rank(%dst: memref<128xf32>, %src: memref<128xf32, 2>) {
+  // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank/dst_rank attributes require an enclosing air.rank scope}}
+  air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = 1 : i64}
+      : (memref<128xf32>, memref<128xf32, 2>)
+  return
+}
+
+// -----
+
+// Test: src_rank requires the source memref.alloc to carry the air.symmetric attribute.
+func.func @dma_src_rank_alloc_not_symmetric() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %src = memref.alloc() : memref<128xf32>
+    %dst = memref.alloc() : memref<128xf32, 2>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op src memref is referenced cross-rank but its memref.alloc lacks the "air.symmetric" attribute}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = 0 : i64}
+        : (memref<128xf32, 2>, memref<128xf32>)
+  }
+  return
+}
+
+// -----
+
+// Test: dst_rank requires the destination memref.alloc to carry the air.symmetric attribute.
+func.func @dma_dst_rank_alloc_not_symmetric() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %dst = memref.alloc() : memref<128xf32>
+    %src = memref.alloc() : memref<128xf32, 2>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op dst memref is referenced cross-rank but its memref.alloc lacks the "air.symmetric" attribute}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = 1 : i64}
+        : (memref<128xf32>, memref<128xf32, 2>)
+  }
+  return
+}
diff --git a/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir b/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
index 11d304bc1..c96fc096b 100644
--- a/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
+++ b/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
@@ -1942,7 +1942,7 @@ module {
 // -----
 
 // Prevent fusing channels with different channel_type attributes.
-// Channels with "dma_stream" and "dma_packet" should not be fused.
+// Channels with "npu_dma_stream" and "npu_dma_packet" should not be fused.
 
 // CHECK-LABEL: func14
 // CHECK: air.launch
@@ -1967,8 +1967,8 @@ module {
 // AGGL1: air.channel.get{{.*}}@channel_packet
 
 module {
-  air.channel @channel_stream [1, 1] {channel_type = "dma_stream"}
-  air.channel @channel_packet [1, 1] {channel_type = "dma_packet"}
+  air.channel @channel_stream [1, 1] {channel_type = "npu_dma_stream"}
+  air.channel @channel_packet [1, 1] {channel_type = "npu_dma_packet"}
   func.func @func14(){
     %c1 = arith.constant 1 : index
     air.launch (%arg3, %arg4) in (%arg5=%c1, %arg6=%c1) {
diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir
index 793834127..611ca41cc 100644
--- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir
+++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir
@@ -12,8 +12,8 @@
 
 // RUN: air-opt %s -air-dma-to-channel | FileCheck %s
 
-// CHECK-COUNT-4: air.channel {{.*}} {channel_type = "dma_packet"}
-// CHECK-NOT: channel_type = "dma_packet"
+// CHECK-COUNT-4: air.channel {{.*}} {channel_type = "npu_dma_packet"}
+// CHECK-NOT: channel_type = "npu_dma_packet"
 
 module {
   func.func @dual_herd_overflow(%arg0: memref<1024xbf16>,
diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir
index a2df4fb82..123d32179 100644
--- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir
+++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir
@@ -22,7 +22,7 @@
 // CHECK:       air.channel @channel_2 [1, 1] {broadcast_shape = [8, 1]}
 // CHECK:       air.channel @channel_3 [1, 1] {broadcast_shape = [8, 1]}
 // CHECK:       air.channel @channel_4 [8, 4]
-// CHECK-NOT:   channel_type = "dma_packet"
+// CHECK-NOT:   channel_type = "npu_dma_packet"
 // CHECK-LABEL: func.func @broadcast_4x8_no_upgrade
 
 #set_ty0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 7 >= 0, s1 == 0)>
@@ -89,12 +89,12 @@ module {
 // Each broadcast channel has broadcast_shape=[2,1]. Pressure:
-// ceil(6/2) = 3 > 2 (per-column limit). All 6 inputs upgraded to dma_packet.
+// ceil(6/2) = 3 > 2 (per-column limit). All 6 inputs upgraded to npu_dma_packet.
 
-// CHECK:       air.channel @channel_0 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_1 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_3 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_4 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_5 {{.*}} {broadcast_shape = [2, 1], channel_type = "dma_packet"}
+// CHECK:       air.channel @channel_0 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_1 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_3 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_4 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_5 {{.*}} {broadcast_shape = [2, 1], channel_type = "npu_dma_packet"}
 // CHECK-LABEL: func.func @broadcast_6x2_upgrade
 
 #set2_ty0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 1 >= 0, s1 == 0)>
@@ -179,11 +179,11 @@ module {
 // 2 non-broadcast inputs + 3 broadcast inputs spanning 4 columns.
 // Pressure: 2 + ceil(3/4) = 2 + 1 = 3 > 2. All 5 inputs upgraded.
 
-// CHECK:       air.channel @channel_0 {{.*}} {channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_1 {{.*}} {channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_3 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_4 {{.*}} {broadcast_shape = [4, 1], channel_type = "dma_packet"}
+// CHECK:       air.channel @channel_0 {{.*}} {channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_1 {{.*}} {channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_3 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_4 {{.*}} {broadcast_shape = [4, 1], channel_type = "npu_dma_packet"}
 // CHECK-LABEL: func.func @mixed_broadcast_upgrade
 
 #set3_bcast = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 3 >= 0, s1 == 0)>
@@ -262,7 +262,7 @@ module {
 // CHECK:       air.channel @channel_2 [1, 1] {broadcast_shape = [2, 1]}
 // CHECK:       air.channel @channel_3 [1, 1] {broadcast_shape = [2, 1]}
 // CHECK:       air.channel @channel_4 [8, 4]
-// CHECK-NOT:   channel_type = "dma_packet"
+// CHECK-NOT:   channel_type = "npu_dma_packet"
 // CHECK-LABEL: func.func @mixed_spans_no_upgrade
 
 #set4_wide0 = affine_set<()[s0, s1] : (s0 >= 0, -s0 + 7 >= 0, s1 == 0)>
@@ -331,9 +331,9 @@ module {
 // 3 > 2 -> upgrade. Tests that row-only broadcasts aren't incorrectly
 // discounted.
 
-// CHECK:       air.channel @channel_0 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_1 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"}
-// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [1, 4], channel_type = "dma_packet"}
+// CHECK:       air.channel @channel_0 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_1 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
+// CHECK:       air.channel @channel_2 {{.*}} {broadcast_shape = [1, 4], channel_type = "npu_dma_packet"}
 // CHECK-LABEL: func.func @row_broadcast_upgrade
 
 #set5_row0 = affine_set<()[s0, s1] : (s0 == 0, s1 >= 0, -s1 + 3 >= 0)>
diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir
index 631cad453..a0def63eb 100644
--- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir
+++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir
@@ -10,8 +10,8 @@
 
 // RUN: air-opt %s -air-dma-to-channel 2>&1 | FileCheck %s
 
-// CHECK-COUNT-3: air.channel {{.*}} {channel_type = "dma_packet"}
-// CHECK-NOT: channel_type = "dma_packet"
+// CHECK-COUNT-3: air.channel {{.*}} {channel_type = "npu_dma_packet"}
+// CHECK-NOT: channel_type = "npu_dma_packet"
 
 module {
   func.func @single_herd_overflow(%arg0: memref<1024xbf16>,
diff --git a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir
index 68c8b2854..3a2bfbb19 100644
--- a/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir
+++ b/mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir
@@ -11,7 +11,7 @@
 // RUN: air-opt %s -air-dma-to-channel -split-input-file | FileCheck %s
 
-// None of these cases should produce dma_packet channels.
+// None of these cases should produce npu_dma_packet channels.
-// CHECK-NOT: channel_type = "dma_packet"
+// CHECK-NOT: channel_type = "npu_dma_packet"
 
 // Test 1: Two 1x1 herds with 1 input each = 2 input channels.
 // Capacity = 2 channels/col * 1 col = 2. 2 <= 2 => no upgrade.
diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir
index 3d1119d72..2a113b54a 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir
@@ -43,8 +43,8 @@
 // CHECK: air.herd @conv1x1_skip {{.*}} attributes {{{.*}}x_loc = 1 : i64, y_loc = 3 : i64}
 
 module {
-  air.channel @L1ToL1_Conv3x3AToSkip [1] {channel_type = "cascade"}
-  air.channel @L1ToL1_Conv3x3BToSkip [1] {channel_type = "cascade"}
+  air.channel @L1ToL1_Conv3x3AToSkip [1] {channel_type = "npu_cascade"}
+  air.channel @L1ToL1_Conv3x3BToSkip [1] {channel_type = "npu_cascade"}
   air.channel @L1ToL1_Conv1ToConv3x3 [1, 1] {broadcast_shape = [2 : index, 1 : index]}
   air.channel @L2ToL1_ActIn [1, 1]
   air.channel @L1ToL2_ActOut [1, 1]
diff --git a/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir b/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir
index 496a05a7e..02d7cbfd1 100644
--- a/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir
+++ b/mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir
@@ -104,7 +104,7 @@ func.func @test2() -> () {
 // MAXCOL-DAG: %[[CST2:.*]] = arith.constant 2 : index
 // MAXCOL: air.herd @cascade_herd tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]])
 
-air.channel @cascade_chan [3] {channel_type = "cascade"}
+air.channel @cascade_chan [3] {channel_type = "npu_cascade"}
 func.func @test_intra_herd_cascade_skip() -> () {
   %c2 = arith.constant 2 : index
   air.herd @cascade_herd tile (%x, %y) in (%sx=%c2, %sy=%c2) {
@@ -131,7 +131,7 @@ func.func @test_intra_herd_cascade_skip() -> () {
 // MAXCOL-DAG: %[[CST2:.*]] = arith.constant 2 : index
 // MAXCOL: air.herd @cascade_producer tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]])
 
-air.channel @inter_cascade [1] {channel_type = "cascade"}
+air.channel @inter_cascade [1] {channel_type = "npu_cascade"}
 func.func @test_inter_herd_cascade_skip() -> () {
   %c2 = arith.constant 2 : index
   air.herd @cascade_producer tile (%x, %y) in (%sx=%c2, %sy=%c2) {
@@ -157,7 +157,7 @@ func.func @test_inter_herd_cascade_skip() -> () {
 // MAXCOL: air.herd @no_cascade_herd tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]])
 // MAXCOL: air.herd @cascade_herd_put tile (%{{.*}}, %{{.*}}) in (%{{.*}}=%[[CST2]], %{{.*}}=%[[CST2]])
 
-air.channel @seg_cascade [1] {channel_type = "cascade"}
+air.channel @seg_cascade [1] {channel_type = "npu_cascade"}
 func.func @test_non_cascade_herd_in_cascade_segment() -> () {
   %c1 = arith.constant 1 : index
   air.launch (%arg0) in (%arg1=%c1) {
diff --git a/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir b/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir
index f4234f6e2..5cc4f9705 100644
--- a/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir
+++ b/mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir
@@ -2417,17 +2417,17 @@ module {
 
 // Verify that channel_type attribute is preserved after L2 memref splitting.
 // The split pass creates new channel declarations; non-default channel_type
-// (e.g., "dma_packet") must be carried over from the original channel.
+// (e.g., "npu_dma_packet") must be carried over from the original channel.
 
-// CHECK: air.channel @channel_1 [1, 1] {channel_type = "dma_packet"}
-// CHECK: air.channel @channel_0 [4, 4] {channel_type = "dma_packet"}
-// CHECK: air.channel @channel_2 [4, 1] {channel_type = "dma_packet"}
+// CHECK: air.channel @channel_1 [1, 1] {channel_type = "npu_dma_packet"}
+// CHECK: air.channel @channel_0 [4, 4] {channel_type = "npu_dma_packet"}
+// CHECK: air.channel @channel_2 [4, 1] {channel_type = "npu_dma_packet"}
 // CHECK-LABEL: func.func @test_preserve_channel_type
 
 #map = affine_map<()[s0] -> (s0 * 256)>
 #map1 = affine_map<()[s0] -> (s0 * 64)>
-air.channel @channel_1 [1, 1] {channel_type = "dma_packet"}
-air.channel @channel_0 [4, 4] {channel_type = "dma_packet"}
+air.channel @channel_1 [1, 1] {channel_type = "npu_dma_packet"}
+air.channel @channel_0 [4, 4] {channel_type = "npu_dma_packet"}
 func.func @test_preserve_channel_type(%arg0: memref<512x1024xbf16>, %arg1: memref<1024x512xbf16>, %arg2: memref<512x512xbf16>) {
   %c2 = arith.constant 2 : index
   %0 = air.launch async (%arg3, %arg4) in (%arg5=%c2, %arg6=%c2) args(%arg7=%arg2) : memref<512x512xbf16> attributes {id = 1 : i32} {
diff --git a/programming_examples/cascade_reduction/cascade_reduction.py b/programming_examples/cascade_reduction/cascade_reduction.py
index 10cb7ff92..287351736 100644
--- a/programming_examples/cascade_reduction/cascade_reduction.py
+++ b/programming_examples/cascade_reduction/cascade_reduction.py
@@ -54,7 +54,7 @@ def build_module():
     # Channels: chan_in/chan_out use DMA (L3<->L1), chan_cascade uses
     # direct core-to-core cascade connections between adjacent tiles.
     channel("chan_in", size=[1])
-    channel("chan_cascade", size=[NUM_TILES], channel_type="cascade")
+    channel("chan_cascade", size=[NUM_TILES], channel_type="npu_cascade")
     channel("chan_out", size=[1])
 
     @FuncOp.from_py_func(l3MemrefTy, l3MemrefTy)
diff --git a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
index 2581f1652..805ffa74b 100644
--- a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
+++ b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
@@ -72,7 +72,7 @@ def build_module():
 
     # Cascade channel: per-segment, two independent chains.
     channel(
-        "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="cascade"
+        "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="npu_cascade"
     )
 
     # Output channel: one per cascade column across all segments.
diff --git a/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py b/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py
index f009436eb..2725cc521 100644
--- a/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py
+++ b/programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py
@@ -11,7 +11,7 @@
 # The compiler automatically:
 #   1. Converts DMAs to channels (air-dma-to-channel)
 #   2. Detects that 4 input channels exceed the 2-per-column shim DMA limit
-#   3. Upgrades input channels to channel_type="dma_packet" for packet-switched
+#   3. Upgrades input channels to channel_type="npu_dma_packet" for packet-switched
 #      time-multiplexing of shim DMA MM2S ports
 
 import argparse
diff --git a/programming_examples/channel_examples/mmio/mmio.py b/programming_examples/channel_examples/mmio/mmio.py
index 6cb020df0..bf95e2f97 100644
--- a/programming_examples/channel_examples/mmio/mmio.py
+++ b/programming_examples/channel_examples/mmio/mmio.py
@@ -1,7 +1,7 @@
 # Copyright (C) 2026, Advanced Micro Devices, Inc.
 # SPDX-License-Identifier: MIT
 
-"""mmio_add: end-to-end demonstration of `channel_type="mmio"`.
+"""mmio_add: end-to-end demonstration of `channel_type="npu_mmio"`.
 
 A single AIE tile computes `out[i] = a[i] + c[i]` where:
   * `a` is a host input vector delivered to L1 via shim DMA;
@@ -12,7 +12,7 @@
 
 This is the AIR-Python equivalent of the hand-written
 `/tmp/mmio_bench/aie_q_n*_w*.mlir` benchmark variants. It exercises the
-new `channel_type="mmio"` lowering on real NPU2 hardware.
+new `channel_type="npu_mmio"` lowering on real NPU2 hardware.
 """
 
 import argparse
@@ -61,7 +61,7 @@ def build_module():
 
     # Channels: `mmio` for the constant, ordinary dma_stream for input/out.
     channel("a_in", size=[1])  # default: dma_stream
-    channel("c_mmio", size=[1], channel_type="mmio")
+    channel("c_mmio", size=[1], channel_type="npu_mmio")
     channel("o_out", size=[1])
 
     @FuncOp.from_py_func(l3_ty, l3_ty)
diff --git a/programming_examples/flash_attention/dataflow_based/attn.py b/programming_examples/flash_attention/dataflow_based/attn.py
index d31f338c5..6af066d52 100644
--- a/programming_examples/flash_attention/dataflow_based/attn.py
+++ b/programming_examples/flash_attention/dataflow_based/attn.py
@@ -158,7 +158,7 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat
     Channel("L1ToL1Chan2", size=[1, num_cascade_stages])
     Channel("L1ToL1Chan3", size=[1, num_cascade_stages])
     chan_cascade = Channel("cascade", size=[1, num_cascade_stages - 1])
-    chan_cascade.attributes["channel_type"] = StringAttr.get("cascade")
+    chan_cascade.attributes["channel_type"] = StringAttr.get("npu_cascade")
 
     # Main attention function
     @FuncOp.from_py_func(
diff --git a/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py b/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py
index 5d8423478..d475f6023 100644
--- a/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py
+++ b/programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py
@@ -249,9 +249,9 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat
         Channel(f"VIn_{s}", size=[num_heads_per_unroll])
 
     # Cascade: 2D per-segment (shared within each segment instance)
-    channel("cascade_gp", size=[NQ, NS - 1], channel_type="cascade")
-    channel("cascade_up", size=[NQ, NS - 1], channel_type="cascade")
-    channel("cascade_sp", size=[NQ, NS - 1], channel_type="cascade")
+    channel("cascade_gp", size=[NQ, NS - 1], channel_type="npu_cascade")
+    channel("cascade_up", size=[NQ, NS - 1], channel_type="npu_cascade")
+    channel("cascade_sp", size=[NQ, NS - 1], channel_type="npu_cascade")
 
     # Output: L1-to-L2 gather, then L2-to-L3
     Channel("Gp2L2", size=[NQ, 1])
diff --git a/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py b/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py
index f373f979d..dc1d44e41 100644
--- a/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py
+++ b/programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py
@@ -239,9 +239,9 @@ def external_func(name, inputs, outputs=None, link_with=None, visibility="privat
         Channel(f"VIn_{s}", size=[num_heads_per_unroll])
 
     # Cascade: 2D per-segment (shared within each segment instance)
-    channel("cascade_gp", size=[NQ, NS - 1], channel_type="cascade")
-    channel("cascade_up", size=[NQ, NS - 1], channel_type="cascade")
-    channel("cascade_sp", size=[NQ, NS - 1], channel_type="cascade")
+    channel("cascade_gp", size=[NQ, NS - 1], channel_type="npu_cascade")
+    channel("cascade_up", size=[NQ, NS - 1], channel_type="npu_cascade")
+    channel("cascade_sp", size=[NQ, NS - 1], channel_type="npu_cascade")
 
     # Output: L1-to-L2 gather, then L2-to-L3
     Channel("Gp2L2", size=[NQ, 1])
diff --git a/programming_examples/herd_dataflow/air.mlir b/programming_examples/herd_dataflow/air.mlir
index 249087c90..e92828b32 100644
--- a/programming_examples/herd_dataflow/air.mlir
+++ b/programming_examples/herd_dataflow/air.mlir
@@ -5,10 +5,10 @@ module {
   func.func private @add_3_bf16(memref<64x64xbf16, 2 : i32>, memref<64x64xbf16, 2 : i32>) attributes {link_with = "extern_func.o", llvm.emit_c_interface}
 
   // AIR channels model hardware FIFOs for inter-stage communication
-  air.channel @L2ToL1Chan1 [4, 1]         // L2 to L1, input A; default channel_type is "dma_stream", representing data movement performed using DMA streaming interconnects
+  air.channel @L2ToL1Chan1 [4, 1]         // L2 to L1, input A; default channel_type is "npu_dma_stream", representing data movement performed using DMA streaming interconnects
   air.channel @L2ToL1Chan2 [4, 1]         // L2 to L1, input B
   air.channel @L1ToL1Chan1 [4, 1]         // Between herd_0 and herd_1
-  air.channel @L1ToL1Chan2 [4, 1] {channel_type = "cascade"} // Between herd_1 and herd_2; channel_type="cascade" means this channel is expected to map to cascade connections (peer-to-peer communication between compute tiles)
+  air.channel @L1ToL1Chan2 [4, 1] {channel_type = "npu_cascade"} // Between herd_1 and herd_2; channel_type="npu_cascade" means this channel is expected to map to cascade connections (peer-to-peer communication between compute tiles)
   air.channel @L1ToL2Chan1 [4, 1]         // Output from herd_2 to L2
 
   // Top-level function: runtime dispatch over a 4x1 iteration space (not necessarily hardware parallelism)
diff --git a/programming_examples/herd_dataflow/run.py b/programming_examples/herd_dataflow/run.py
index 177813e35..9e5b66120 100644
--- a/programming_examples/herd_dataflow/run.py
+++ b/programming_examples/herd_dataflow/run.py
@@ -102,13 +102,13 @@ def build_module(M_SIZE, N_SIZE):
     add_3_func.attributes["link_with"] = StringAttr.get("extern_func.o")
 
     # AIR channels model hardware FIFOs for inter-stage communication
-    # Default channel_type is "dma_stream", representing data movement performed using DMA streaming interconnects
+    # Default channel_type is "npu_dma_stream", representing data movement performed using DMA streaming interconnects
     channel("L2ToL1Chan1", size=[NUM_COLUMNS, 1])  # L2 to L1, input A
     channel("L2ToL1Chan2", size=[NUM_COLUMNS, 1])  # L2 to L1, input B
     channel("L1ToL1Chan1", size=[NUM_COLUMNS, 1])  # Between herd_0 and herd_1
     channel(
-        "L1ToL1Chan2", size=[NUM_COLUMNS, 1], channel_type="cascade"
-    )  # Between herd_1 and herd_2; channel_type="cascade" means peer-to-peer communication between compute tiles
+        "L1ToL1Chan2", size=[NUM_COLUMNS, 1], channel_type="npu_cascade"
+    )  # Between herd_1 and herd_2; channel_type="npu_cascade" means peer-to-peer communication between compute tiles
     channel("L1ToL2Chan1", size=[NUM_COLUMNS, 1])  # Output from herd_2 to L2
 
     @FuncOp.from_py_func(memrefMxN, memrefMxN, memrefMxN)
diff --git a/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py b/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py
index 6d60d5970..c35aadcc1 100644
--- a/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py
+++ b/programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py
@@ -192,7 +192,7 @@ def build_module(
 
     # Cascade channel: per-column cascade links.
     # n_cascade tiles per column → n_cascade-1 links per column.
-    channel("chan_cascade", size=[herd_cols, n_cascade - 1], channel_type="cascade")
+    channel("chan_cascade", size=[herd_cols, n_cascade - 1], channel_type="npu_cascade")
 
     @FuncOp.from_py_func(memrefTyA, memrefTyB, memrefTyC)
     def matvec_cascade_bf16(arg0, arg1, arg2):

From cb3a7a5199b1eeb027f599cd01edcecd58e69762 Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Wed, 6 May 2026 00:55:47 +0000
Subject: [PATCH 2/6] [multi-gpu] Phase 1: fix CI clang-format / black
 formatting

Apply the clang-format-17 reflow to three .cpp files (string-literal
wrapping around the renamed channel_type values "npu_mmio" /
"npu_cascade" / "npu_dma_stream") and the black reformat to one .py file
(the npu_cascade argument list now exceeds the line limit).

These were reported by the lintAndFormat workflow on PR #1576; this
commit folds them into Phase 1 so the tree matches the formatting that
CI expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 mlir/lib/Conversion/AIRToAIEPass.cpp          | 20 ++++++++++---------
 mlir/lib/Dialect/AIR/IR/AIRDialect.cpp        |  4 ++--
 mlir/lib/Transform/AIRLinalgCodegen.cpp       |  5 +++--
 .../channel_3d_segment_unroll.py              |  4 +++-
 4 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mlir/lib/Conversion/AIRToAIEPass.cpp b/mlir/lib/Conversion/AIRToAIEPass.cpp
index 6591e63f6..6ef9956e9 100644
--- a/mlir/lib/Conversion/AIRToAIEPass.cpp
+++ b/mlir/lib/Conversion/AIRToAIEPass.cpp
@@ -5871,15 +5871,15 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
           auto bufMemTy = bufferOp.getType();
           auto srcMemTy = cast<MemRefType>(getGlobalOp.getType());
           if (bufMemTy.getElementType() != srcMemTy.getElementType())
-            return get.emitOpError(
-                       "channel_type=\"npu_mmio\" source/destination element type "
-                       "mismatch (source: ")
+            return get.emitOpError("channel_type=\"npu_mmio\" "
+                                   "source/destination element type "
+                                   "mismatch (source: ")
                    << srcMemTy.getElementType()
                    << ", destination: " << bufMemTy.getElementType() << ")";
           if (bufMemTy.getNumElements() != srcMemTy.getNumElements())
-            return get.emitOpError(
-                       "channel_type=\"npu_mmio\" source/destination element count "
-                       "mismatch (source: ")
+            return get.emitOpError("channel_type=\"npu_mmio\" "
+                                   "source/destination element count "
+                                   "mismatch (source: ")
                    << srcMemTy.getNumElements()
                    << ", destination: " << bufMemTy.getNumElements() << ")";
 
@@ -5891,14 +5891,16 @@ class AIRToAIEPass : public air::impl::AIRToAIEBase<AIRToAIEPass> {
 
           if (auto existing = bufferOp.getInitialValue())
             return bufferOp.emitOpError(
-                "channel_type=\"npu_mmio\" destination aie.buffer already has an "
+                "channel_type=\"npu_mmio\" destination aie.buffer already has "
+                "an "
                 "initial_value; cannot stamp two sources into one buffer");
           bufferOp.setInitialValueAttr(reshapedInit);
           ++matchCount;
         }
         if (matchCount == 0)
-          return put.emitOpError("channel_type=\"npu_mmio\" put has no matching "
-                                 "device-side air.channel.get");
+          return put.emitOpError(
+              "channel_type=\"npu_mmio\" put has no matching "
+              "device-side air.channel.get");
       }
 
       // Erase all mmio puts (host-side ones have been replaced with
diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
index 2fc448a0a..edb8af6a8 100644
--- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
+++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
@@ -3105,8 +3105,8 @@ static LogicalResult ComposeMemrefOpOnChannelOp(OpT op,
   if (!chan)
     // If the channel declaration cannot be resolved, signal a failure.
     return failure();
-  // If the channel is of type "npu_cascade", try to fold memref.cast but skip full
-  // composition
+  // If the channel is of type "npu_cascade", try to fold memref.cast but skip
+  // full composition
   if (chan.getChannelType() == "npu_cascade")
     return FoldMemrefCastOnChannelOp(op, rewriter);
 
diff --git a/mlir/lib/Transform/AIRLinalgCodegen.cpp b/mlir/lib/Transform/AIRLinalgCodegen.cpp
index 3236d8f17..f51fb116b 100644
--- a/mlir/lib/Transform/AIRLinalgCodegen.cpp
+++ b/mlir/lib/Transform/AIRLinalgCodegen.cpp
@@ -1204,8 +1204,9 @@ FailureOr<linalg::TiledLinalgOp> static pipelineReduceLinalgOp(
       auto module = op->getParentOfType<ModuleOp>();
       auto cname = createChannelName(module);
       b.setInsertionPointToStart(module.getBody());
-      auto channel_op = air::ChannelOp::create(
-          b, loc, cname, b.getI64ArrayAttr({1}), b.getStringAttr("npu_dma_stream"));
+      auto channel_op =
+          air::ChannelOp::create(b, loc, cname, b.getI64ArrayAttr({1}),
+                                 b.getStringAttr("npu_dma_stream"));
       b.setInsertionPoint(stageBlock->getTerminator());
       SmallVector<Value> src_offsets;
       SmallVector<Value> src_sizes;
diff --git a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
index 805ffa74b..68440ba53 100644
--- a/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
+++ b/programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py
@@ -72,7 +72,9 @@ def build_module():
 
     # Cascade channel: per-segment, two independent chains.
     channel(
-        "chan_cascade", size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS], channel_type="npu_cascade"
+        "chan_cascade",
+        size=[NUM_SEGMENTS, NUM_TILES, NUM_COLS],
+        channel_type="npu_cascade",
     )
 
     # Output channel: one per cascade column across all segments.

From 44d743d0449527b2c1572eec2c949c9ab51f121e Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Wed, 6 May 2026 04:24:18 +0000
Subject: [PATCH 3/6] [multi-gpu] Phase 1: address Copilot review comments
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Six Copilot comments on PR #1576:

1. AIRToAIESchedulingUtils.cpp: four diagnostic strings still said
   "dma_stream / dma_packet" after the rename to "npu_dma_stream /
   npu_dma_packet". Updated.

2. docs/AIRComputeModel.md (cross-rank DMA, §2.4): said the GPU
   backend lowers src_rank/dst_rank, contradicting the summary table
   that calls it "planned". Reworded as
   "planned: air-cross-rank-dma-to-mgpu" to match.

3. docs/AIRComputeModel.md (air.symmetric, §2.7): same inconsistency
   for mgpuSymmetricAlloc routing. Reworded as "planned:
   air-symmetric-alloc-to-mgpu".

4. AIR.td (DmaMemcpyNdOp description): same inconsistency. Reworded.

5. AIR.td (gpu_symmetric_heap channel_type description): claimed
   "Lowered by air-to-rocdl to thread-cooperative loops..." with no
   such lowering yet in tree. Reworded as "planned:
   air-gpu-channel-to-mgpu".

6. AIRDialect.cpp DmaMemcpyNdOp::verify: rank indices are
   non-negative. Added explicit `>= 0` check, plus matching
   verifier-negative tests in air_memcpy_invalid.mlir for both
   src_rank=-1 and dst_rank=-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/AIRComputeModel.md                       | 16 +++++-----
 mlir/include/air/Dialect/AIR/AIR.td           | 11 ++++---
 .../Conversion/AIRToAIESchedulingUtils.cpp    | 18 ++++++-----
 mlir/lib/Dialect/AIR/IR/AIRDialect.cpp        |  8 +++++
 mlir/test/Dialect/AIR/air_memcpy_invalid.mlir | 30 +++++++++++++++++++
 5 files changed, 64 insertions(+), 19 deletions(-)

diff --git a/docs/AIRComputeModel.md b/docs/AIRComputeModel.md
index a9688776e..2c9f7b73f 100644
--- a/docs/AIRComputeModel.md
+++ b/docs/AIRComputeModel.md
@@ -676,9 +676,10 @@ Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
 enclosing `air.rank` scope. When present, the corresponding memref is
 interpreted as living on rank R's symmetric heap rather than on the local
 process. The verifier requires the op to be enclosed by an `air.rank` and the
-referenced memref to be `air.symmetric`-tagged (see §2.7). The GPU backend
-(`air-to-rocdl`) lowers cross-rank DMAs through `mgpuGetHeapBases()`-based
-peer addressing; the NPU backend does not support these attributes.
+referenced memref to be `air.symmetric`-tagged (see §2.7). Lowering for the
+GPU backend (planned: `air-cross-rank-dma-to-mgpu`) will expand these into
+`mgpuGetHeapBases()`-based peer-VA arithmetic followed by `mgpuMemcpy`; the
+NPU backend does not support these attributes.
 
 ```
 // Read 1024 floats from rank 0's symmetric buffer into local L1.
@@ -830,10 +831,11 @@ ranks' symmetric memrefs at the same logical offset.
 %buf = memref.alloc() {air.symmetric} : memref<1024xf32>
 ```
 
-The GPU lowering routes such allocations through `mgpuSymmetricAlloc`
-(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`.
-Peer ranks' base pointers are obtained via `mgpuGetHeapBases()`. The NPU
-backend does not interpret this attribute.
+The GPU lowering (planned: `air-symmetric-alloc-to-mgpu`) will route such
+allocations through `mgpuSymmetricAlloc` (`runtime_lib/airgpu/gpu_runtime.cpp`)
+instead of plain `mgpuMemAlloc`. Peer ranks' base pointers are obtained at
+runtime via `mgpuGetHeapBases()`. The NPU backend does not interpret this
+attribute.
 
 ---
 
diff --git a/mlir/include/air/Dialect/AIR/AIR.td b/mlir/include/air/Dialect/AIR/AIR.td
index 475ab8bf7..19575bb3e 100644
--- a/mlir/include/air/Dialect/AIR/AIR.td
+++ b/mlir/include/air/Dialect/AIR/AIR.td
@@ -495,8 +495,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
     enclosing `air.rank` scope. When present, the corresponding memref is
     interpreted as living on rank R's symmetric heap rather than on the local
     process. These attributes are only valid for `air.symmetric`-tagged memref
-    allocations and require an enclosing `air.rank`. They are currently only
-    supported by the GPU lowering (`air-to-rocdl`).
+    allocations and require an enclosing `air.rank`. Lowering for these
+    attributes is planned in a future GPU pass (`air-cross-rank-dma-to-mgpu`);
+    this PR introduces only the IR surface and verifier rules.
   }];
   let extraClassDeclaration = [{
     Value getSrcMemref() { return getSrc(); }
@@ -607,8 +608,10 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
       Cross-GPU messaging through the symmetric heap runtime
       (`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). The channel must be enclosed
       by an `air.rank` op; the put/get sites use rank indices to address peer
-      heaps. Lowered by `air-to-rocdl` to thread-cooperative loops over peer-mapped
-      VMem buffers, with synchronization via in-heap notify flags or `mgpuBarrier`.
+      heaps. Lowering will be added by a future GPU pass (planned:
+      `air-gpu-channel-to-mgpu`) which expands put/get to peer-mapped
+      `mgpuMemcpy` calls plus a barrier; this PR introduces only the IR
+      surface and verifier rules.
 
     ### Broadcasting
     If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
diff --git a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
index 8d429249b..617bf89ef 100644
--- a/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
+++ b/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp
@@ -1585,8 +1585,9 @@ LogicalResult air::simpleDMAChannelAllocation(
         // (aie.flow) nor dma packet flow (aie.packet_flow).
         if (f.memcpyResourceType != "npu_dma_stream" &&
             f.memcpyResourceType != "npu_dma_packet")
-          return memcpyOpIf->emitOpError("only supports dma_stream or "
-                                         "dma_packet connections at L2 memory");
+          return memcpyOpIf->emitOpError(
+              "only supports npu_dma_stream or npu_dma_packet "
+              "connections at L2 memory");
         auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf);
         if (failed(alloc_res) || !alloc_res->valid())
           return failure();
@@ -1602,8 +1603,8 @@ LogicalResult air::simpleDMAChannelAllocation(
           if (f.memcpyResourceType != "npu_dma_stream" &&
               f.memcpyResourceType != "npu_dma_packet")
             return memcpyOpIf->emitOpError(
-                "only supports dma_stream or dma_packet connections at L2 "
-                "memory");
+                "only supports npu_dma_stream or npu_dma_packet "
+                "connections at L2 memory");
           auto alloc_res = memtile_dma_alloc.simpleDmaChannelAlloc(memcpyOpIf);
           if (failed(alloc_res) || !alloc_res->valid())
             return failure();
@@ -1625,8 +1626,8 @@ LogicalResult air::simpleDMAChannelAllocation(
           if (f.memcpyResourceType != "npu_dma_stream" &&
               f.memcpyResourceType != "npu_dma_packet")
             return memcpyOpIf->emitOpError(
-                "only supports dma_stream or dma_packet connections at L3 "
-                "memory");
+                "only supports npu_dma_stream or npu_dma_packet "
+                "connections at L3 memory");
           if (!f.S2MM_alloc[i].getDmaTile())
             return memcpyOpIf->emitOpError(
                 "failed to get S2MM tile for L3 allocation.");
@@ -1652,8 +1653,9 @@ LogicalResult air::simpleDMAChannelAllocation(
         // (aie.flow) nor dma packet flow (aie.packet_flow).
         if (f.memcpyResourceType != "npu_dma_stream" &&
             f.memcpyResourceType != "npu_dma_packet")
-          return memcpyOpIf->emitOpError("only supports dma_stream or "
-                                         "dma_packet connections at L3 memory");
+          return memcpyOpIf->emitOpError(
+              "only supports npu_dma_stream or npu_dma_packet "
+              "connections at L3 memory");
         if (!f.MM2S_alloc.getDmaTile())
           return memcpyOpIf->emitOpError(
               "failed to get MM2S tile for L3 allocation.");
diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
index edb8af6a8..cce9a9294 100644
--- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
+++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
@@ -2823,6 +2823,14 @@ LogicalResult air::DmaMemcpyNdOp::verify() {
       return emitOpError("src_rank/dst_rank attributes require an enclosing "
                          "air.rank scope");
 
+    // Rank indices are non-negative.
+    if (auto sr = getSrcRank())
+      if (*sr < 0)
+        return emitOpError() << "src_rank must be >= 0, got " << *sr;
+    if (auto dr = getDstRank())
+      if (*dr < 0)
+        return emitOpError() << "dst_rank must be >= 0, got " << *dr;
+
     auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult {
       auto alloc = v.getDefiningOp<memref::AllocOp>();
       if (!alloc)
diff --git a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
index f82d215d3..230a2d2c9 100644
--- a/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
+++ b/mlir/test/Dialect/AIR/air_memcpy_invalid.mlir
@@ -106,3 +106,33 @@ func.func @dma_dst_rank_alloc_not_symmetric() {
   }
   return
 }
+
+// -----
+
+// Test: src_rank must be non-negative.
+func.func @dma_src_rank_negative() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %dst = memref.alloc() : memref<128xf32, 2>
+    %src = memref.alloc() {air.symmetric} : memref<128xf32>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op src_rank must be >= 0, got -1}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {src_rank = -1 : i64}
+        : (memref<128xf32, 2>, memref<128xf32>)
+  }
+  return
+}
+
+// -----
+
+// Test: dst_rank must be non-negative.
+func.func @dma_dst_rank_negative() {
+  %c2 = arith.constant 2 : index
+  air.rank (%rx) in (%sx = %c2) {
+    %dst = memref.alloc() {air.symmetric} : memref<128xf32>
+    %src = memref.alloc() : memref<128xf32, 2>
+    // expected-error @+1 {{'air.dma_memcpy_nd' op dst_rank must be >= 0, got -3}}
+    air.dma_memcpy_nd (%dst[] [] [], %src[] [] []) {dst_rank = -3 : i64}
+        : (memref<128xf32>, memref<128xf32, 2>)
+  }
+  return
+}

From 45e543d41779b3927b87c9b64932ca1b93bc55b6 Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Wed, 6 May 2026 04:38:43 +0000
Subject: [PATCH 4/6] [multi-gpu] Phase 1: fix negative-rank verifier check
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous commit (888bcaa7) added a `>= 0` verifier on src_rank /
dst_rank, but used `getSrcRank()` / `getDstRank()` — those return
`std::optional<uint64_t>` (a TableGen quirk for `OptionalAttr<I64Attr>`),
so `*sr < 0` on the unsigned value is always false and the check never
fired. The two new verifier-negative tests in air_memcpy_invalid.mlir
silently regressed.

Switch to the typed `getSrcRankAttr()` / `getDstRankAttr()` accessors
which return `IntegerAttr`, then call `.getInt()` for a real `int64_t`.
The check now fires on negative values; both negative-rank tests pass
under `lit -sv ../../mlir/test/Dialect/AIR` (21/21).
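
For illustration only (not code from this patch), the pitfall can be
reproduced in a few lines of standalone C++. The names `rankAsUnsigned`,
`buggyIsNegative`, and `fixedIsNegative` are hypothetical; they only mimic
the accessor behavior described above and are not part of the AIR dialect:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the generated getSrcRank()/getDstRank():
// the stored i64 value is handed back as an unsigned 64-bit integer.
std::optional<uint64_t> rankAsUnsigned(int64_t stored) {
  return static_cast<uint64_t>(stored);
}

// Shape of the original check: an unsigned value is never below zero,
// so this returns false for every input, including -1.
bool buggyIsNegative(int64_t stored) {
  std::optional<uint64_t> r = rankAsUnsigned(stored);
  return r.has_value() && *r < 0;
}

// Shape of the fixed check: compare the signed value directly, as the
// verifier now does with the int64_t obtained from getInt().
bool fixedIsNegative(int64_t stored) { return stored < 0; }
```

Compilers may warn that the comparison in `buggyIsNegative` is always
false (for example GCC with -Wtype-limits), which is one way to catch
this class of bug early.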

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 mlir/lib/Dialect/AIR/IR/AIRDialect.cpp | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
index cce9a9294..720d09a7f 100644
--- a/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
+++ b/mlir/lib/Dialect/AIR/IR/AIRDialect.cpp
@@ -2823,13 +2823,20 @@ LogicalResult air::DmaMemcpyNdOp::verify() {
       return emitOpError("src_rank/dst_rank attributes require an enclosing "
                          "air.rank scope");
 
-    // Rank indices are non-negative.
-    if (auto sr = getSrcRank())
-      if (*sr < 0)
-        return emitOpError() << "src_rank must be >= 0, got " << *sr;
-    if (auto dr = getDstRank())
-      if (*dr < 0)
-        return emitOpError() << "dst_rank must be >= 0, got " << *dr;
+    // Rank indices are non-negative. Read the value through the *Attr
+    // accessors: the generated getSrcRank()/getDstRank() return
+    // std::optional<uint64_t> for OptionalAttr<I64Attr>, so a `< 0` check on
+    // them is always false, even for negative values stored as i64.
+    if (auto srAttr = getSrcRankAttr()) {
+      int64_t sr = srAttr.getInt();
+      if (sr < 0)
+        return emitOpError() << "src_rank must be >= 0, got " << sr;
+    }
+    if (auto drAttr = getDstRankAttr()) {
+      int64_t dr = drAttr.getInt();
+      if (dr < 0)
+        return emitOpError() << "dst_rank must be >= 0, got " << dr;
+    }
 
     auto requireSymmetricAlloc = [&](Value v, StringRef side) -> LogicalResult {
       auto alloc = v.getDefiningOp<memref::AllocOp>();

From 965f853223bbcced478dc2d0cae4e77e948eaa0a Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Wed, 6 May 2026 04:48:33 +0000
Subject: [PATCH 5/6] [multi-gpu] Phase 1: rename channel_type in
 cascade_chain_* tests

origin/main grew 5 new herd-placement tests via #1583 that use the
pre-rename `channel_type = "cascade"`. After this PR's namespace rename
("cascade" -> "npu_cascade"), those tests fail under air-opt with the
verifier rejecting the old name. Update them to "npu_cascade" so they
keep passing on top of phase 1.

Verified on rad-mi300a-sh5-1: AIRHerdPlacement 15/15 pass, Dialect/AIR
21/21 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../AIRHerdPlacement/cascade_chain_3herd.mlir        |  4 ++--
 .../AIRHerdPlacement/cascade_chain_3herd_ew.mlir     |  4 ++--
 .../cascade_chain_3herd_with_l1_upstream.mlir        |  4 ++--
 .../AIRHerdPlacement/cascade_chain_4herd.mlir        |  6 +++---
 .../cascade_chain_multi_channel.mlir                 | 12 ++++++------
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir
index d1f5bb4f5..688e7e0b5 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd.mlir
@@ -18,8 +18,8 @@
 // CHECK: air.herd @consumer   {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 2 : i64}
 
 module {
-  air.channel @ab [8, 1] {channel_type = "cascade"}
-  air.channel @bc [8, 1] {channel_type = "cascade"}
+  air.channel @ab [8, 1] {channel_type = "npu_cascade"}
+  air.channel @bc [8, 1] {channel_type = "npu_cascade"}
 
   func.func @three_herd_cascade_chain() {
     %c1 = arith.constant 1 : index
diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir
index 328e35995..74c1efe9c 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_ew.mlir
@@ -17,8 +17,8 @@
 // CHECK: air.herd @consumer   {{.*}} attributes {{{.*}}x_loc = 2 : i64, y_loc = 2 : i64}
 
 module {
-  air.channel @ab [1, 8] {channel_type = "cascade"}
-  air.channel @bc [1, 8] {channel_type = "cascade"}
+  air.channel @ab [1, 8] {channel_type = "npu_cascade"}
+  air.channel @bc [1, 8] {channel_type = "npu_cascade"}
 
   func.func @three_herd_cascade_chain_ew() {
     %c1 = arith.constant 1 : index
diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir
index 0b1bdac01..db262bd53 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_3herd_with_l1_upstream.mlir
@@ -22,8 +22,8 @@
 
 module {
   air.channel @upstream_to_a [1, 1] {broadcast_shape = [8 : index, 1 : index]}
-  air.channel @ab_q [8, 1] {channel_type = "cascade"}
-  air.channel @bc_q [8, 1] {channel_type = "cascade"}
+  air.channel @ab_q [8, 1] {channel_type = "npu_cascade"}
+  air.channel @bc_q [8, 1] {channel_type = "npu_cascade"}
 
   func.func @upstream_then_3_chain() {
     %c1 = arith.constant 1 : index
diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir
index e8564c28e..0638f361f 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_4herd.mlir
@@ -18,9 +18,9 @@
 // CHECK: air.herd @h3 {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 0 : i64}
 
 module {
-  air.channel @c01 [8, 1] {channel_type = "cascade"}
-  air.channel @c12 [8, 1] {channel_type = "cascade"}
-  air.channel @c23 [8, 1] {channel_type = "cascade"}
+  air.channel @c01 [8, 1] {channel_type = "npu_cascade"}
+  air.channel @c12 [8, 1] {channel_type = "npu_cascade"}
+  air.channel @c23 [8, 1] {channel_type = "npu_cascade"}
 
   func.func @four_herd_cascade_chain() {
     %c1 = arith.constant 1 : index
diff --git a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir
index bb109f7cd..729069bcc 100644
--- a/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir
+++ b/mlir/test/Transform/AIRHerdPlacement/cascade_chain_multi_channel.mlir
@@ -19,12 +19,12 @@
 // CHECK: air.herd @consumer   {{.*}} attributes {{{.*}}x_loc = 0 : i64, y_loc = 2 : i64}
 
 module {
-  air.channel @ab_q [8, 1] {channel_type = "cascade"}
-  air.channel @ab_k [8, 1] {channel_type = "cascade"}
-  air.channel @ab_v [8, 1] {channel_type = "cascade"}
-  air.channel @bc_q [8, 1] {channel_type = "cascade"}
-  air.channel @bc_k [8, 1] {channel_type = "cascade"}
-  air.channel @bc_v [8, 1] {channel_type = "cascade"}
+  air.channel @ab_q [8, 1] {channel_type = "npu_cascade"}
+  air.channel @ab_k [8, 1] {channel_type = "npu_cascade"}
+  air.channel @ab_v [8, 1] {channel_type = "npu_cascade"}
+  air.channel @bc_q [8, 1] {channel_type = "npu_cascade"}
+  air.channel @bc_k [8, 1] {channel_type = "npu_cascade"}
+  air.channel @bc_v [8, 1] {channel_type = "npu_cascade"}
 
   func.func @three_herd_multi_channel() {
     %c1 = arith.constant 1 : index

From cb104a6eaee60f0195efb56443fe4577eadba4b7 Mon Sep 17 00:00:00 2001
From: Erwei Wang <erwei.wang@amd.com>
Date: Wed, 6 May 2026 05:16:31 +0000
Subject: [PATCH 6/6] [multi-gpu] Phase 1: rename channel_type in
 34_cascade_vecadd peano test

CI on 'Build and Test with AIE tools on Ryzen AI (amdhx370)' caught one
more stale "cascade" reference: test/xrt/34_cascade_vecadd/run_peano.py
embeds an inline MLIR string that declared `channel_type = "cascade"`.
Update to "npu_cascade" to match the namespace rename. The corresponding
run_chess.py variant didn't have this issue.
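
Stale names embedded in inline MLIR strings are easy to miss in a
mechanical rename. A minimal sketch of a scan for them follows; it is
not part of this patch. The helper names `find_stale` and
`rename_stale` are hypothetical, and the list of old names is taken
from the rename table earlier in this series.

```python
import re

# Pre-rename channel_type values (assumption: the four names listed in
# the rename table of this series; gpu_symmetric_heap is new, so it is
# never stale).
OLD_NAMES = ("dma_stream", "dma_packet", "cascade", "mmio")

# The opening quote must directly precede the old name, so already
# renamed values such as "npu_cascade" do not match.
STALE = re.compile(r'channel_type\s*=\s*"(%s)"' % "|".join(OLD_NAMES))


def find_stale(text):
    """Return (line_number, old_name) for each stale channel_type."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


def rename_stale(text):
    """Rewrite each stale value to its npu_-prefixed form."""
    return STALE.sub(
        lambda m: 'channel_type = "npu_%s"' % m.group(1), text
    )
```

Note that `rename_stale` also normalizes the spacing around `=`, which
is harmless for MLIR but would show up in a diff.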

Verifier diagnostic from the failing job:
  'air.channel' op unsupported channel_type "cascade"; expected one of
  "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or
  "gpu_symmetric_heap"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 test/xrt/34_cascade_vecadd/run_peano.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/xrt/34_cascade_vecadd/run_peano.py b/test/xrt/34_cascade_vecadd/run_peano.py
index 3c28274a1..84c9f3070 100644
--- a/test/xrt/34_cascade_vecadd/run_peano.py
+++ b/test/xrt/34_cascade_vecadd/run_peano.py
@@ -51,7 +51,7 @@
     #set = affine_set<()[s0] : (s0 == 3)>
     #set1 = affine_set<()[s0] : (s0 - 1 >= 0, -s0 + 2 >= 0)>
     module {
-    air.channel @channel_0 [3] {channel_type = "cascade"}
+      air.channel @channel_0 [3] {channel_type = "npu_cascade"}
       air.channel @channel_1 [1]
       air.channel @channel_2 [1]
       func.func @scf1(%arg0: memref<1x1x2048xi32>, %arg1: memref<1x1x2048xi32>) {