[airrt-to-npu] Honor inner-dim alignment when tiling oversized wraps (#1586)

erwei-xilinx · claude · web-flow · commit dcbe37e21626 · 2026-05-06T20:19:09.000Z
* [airrt-to-npu] Honor inner-dim alignment when tiling oversized wraps

`tileIllegalWrapDim` currently calls `findLargestFactor(wrap, 1023)` to
split a wrap that exceeds the shim 10-bit limit. For a contiguous bf16
transfer of length 131136 the factor it picks is 683 (an odd prime),
producing an inner segment of 683 elem * 2 B = 1366 B that fails the
`aie.dma_bd` 4-byte-alignment verifier.
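
The failure can be reproduced with plain integer arithmetic. The sketch
below is illustrative only and is not the pass code: `largestFactorAtMost`
is a hypothetical brute-force stand-in for `findLargestFactor`, and
`isByteAligned` is a hypothetical helper that mimics the 4-byte check.

```cpp
// Illustrative stand-in for findLargestFactor (brute force, assumes num >= 1).
// Returns the largest factor of `num` that is <= `max`, or 0 if max <= 0.
int largestFactorAtMost(int num, int max) {
  for (int f = (num < max ? num : max); f >= 1; --f)
    if (num % f == 0)
      return f;
  return 0;
}

// Hypothetical check: does an inner wrap of `elems` elements, each
// `elemBytes` wide, cover a whole multiple of `granularityBytes`?
bool isByteAligned(int elems, int elemBytes, int granularityBytes) {
  return (elems * elemBytes) % granularityBytes == 0;
}
```

With 131136 = 2^6 * 3 * 683, `largestFactorAtMost(131136, 1023)` lands on
683, and `isByteAligned(683, 2, 4)` is false because 1366 % 4 == 2.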

Add `findLargestAlignedFactor`, used only when tiling the contiguous
innermost dim (`stride == 1`), with `alignment = addressGenGranularity /
elemBits` (= 2 for bf16, 1 for f32, 4 for i8). For the bf16 length-131136
case it now picks 192 instead, yielding `[<size=683, stride=192>,
<size=192, stride=1>]`, with both dims 4-B aligned.
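
As a worked instance of the split described above, here is a minimal
sketch under stated assumptions. `Split`, `elementAlignment`,
`largestAlignedFactorAtMost` and `splitContiguousDim` are hypothetical
names chosen for illustration; only the descending aligned search mirrors
what `findLargestAlignedFactor` does in the diff.

```cpp
// Illustrative only; these names are not part of the AIR API.
struct Split {
  int outerSize;   // number of inner chunks
  int outerStride; // stride between chunks, in elements
  int innerSize;   // contiguous elements per chunk (stride 1)
};

// Elements per address-generation word; 1 when an element is already at
// least as wide as the granularity.
int elementAlignment(int addrGenBits, int elemBits) {
  return elemBits >= addrGenBits ? 1 : addrGenBits / elemBits;
}

// Largest factor of `num` that is <= `max` and a multiple of `alignment`;
// 0 when none exists. Assumes num >= 1 and alignment >= 1.
int largestAlignedFactorAtMost(int num, int max, int alignment) {
  for (int f = (max / alignment) * alignment; f >= alignment; f -= alignment)
    if (num % f == 0)
      return f;
  return 0;
}

// Split a contiguous (stride == 1) dim of `size` elements into outer x inner.
// Returns an all-zero Split when no aligned factor exists.
Split splitContiguousDim(int size, int max, int alignment) {
  int inner = largestAlignedFactorAtMost(size, max, alignment);
  if (inner == 0)
    return {0, 0, 0};
  return {size / inner, inner, inner};
}
```

For bf16 on a 32-bit granularity, `elementAlignment(32, 16)` is 2, and
`splitContiguousDim(131136, 1023, 2)` gives outer size 683 with stride 192
over an inner size of 192.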

This bug fires for any sub-32-bit transfer (bf16, i8, ...) whose largest
factor <= 1023 is not a multiple of the alignment, typically a length with
an odd prime factor above ~511. Surfaced by an LLaMA-3.2-1B decode
attention design where the K/V cache load is `(pos+1) * head_dim` and
`pos+1 = 2049 = 3 * 683` (2049 * 64 = 131136 for `head_dim = 64`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Emit diagnostic when no aligned factor exists

Drop the silent fallback in `findLargestAlignedFactor` and have it
return 0 when no factor of `num` in `[alignment, max]` is a multiple of
`alignment`. Plumb the failure through `tileIllegalWrapDim` and
`enforceAIE2WrapLimit` so the pass emits an op-level error and
`signalPassFailure()` instead of producing IR that the downstream
`aie.dma_bd` verifier rejects with a generic alignment message.

The diagnostic names the offending dim, its size, the legal factor range,
and the required element alignment (shim address granularity / element
size), so the user knows whether to reshape the transfer or pad the inner
dimension.

Add a third sub-test exercising bf16 length 2049 (= 3 * 683): the only
factors <= 1023 are 1, 3, and 683, all odd, so no aligned factor exists
and the diagnostic fires.
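
The return-0 contract and the resulting message can be sketched as
follows. This is a simplified model, not the pass: `diagnoseTiling` is a
hypothetical helper that returns a string where the real code calls
`emitOpError()`, and the wording is modeled on the message in the diff
without the trailing advice sentence.

```cpp
#include <string>

// Largest factor of `num` that is <= `max` and a multiple of `alignment`;
// 0 when none exists. Assumes num >= 1 and alignment >= 1.
int largestAlignedFactorAtMost(int num, int max, int alignment) {
  for (int f = (max / alignment) * alignment; f >= alignment; f -= alignment)
    if (num % f == 0)
      return f;
  return 0;
}

// Hypothetical helper: empty string when the dim can be tiled, otherwise a
// message naming the dim, its size, the legal range and the alignment.
std::string diagnoseTiling(int dim, int size, int max, int alignment) {
  if (largestAlignedFactorAtMost(size, max, alignment) != 0)
    return "";
  return "cannot tile dim " + std::to_string(dim) + " of size " +
         std::to_string(size) + " into shim-legal chunks: no factor in [" +
         std::to_string(alignment) + ", " + std::to_string(max) +
         "] is a multiple of " + std::to_string(alignment) + " elements";
}
```

Calling `diagnoseTiling(3, 2049, 1023, 2)` yields a non-empty message
because 2049 has no even factor, while the 131136 case returns an empty
string. The dim index 3 is only an example value.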

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Consolidate alignment-aware tiling into air::Util; cover canonicalizeWrapAndStrideList

- Move findLargestAlignedFactor into air:: (Util.h/Util.cpp); delete the
  private duplicate of findLargestFactor in AIRRtToNpuPass.cpp that had
  been accumulating helpers next to the canonical one in air::.
- Add air::getDmaInnerElementAlignment(memrefTy, op) so both tileIllegalWrapDim
  and canonicalizeWrapAndStrideList derive alignment the same way (DataLayout
  for the element width, fixed 32-bit shim address granularity matching
  AIE2/AIE2P AIETargetModel::getAddressGenGranularity).
- Apply the alignment fix to air::canonicalizeWrapAndStrideList via a new
  innerAlignment parameter (default 1, no behavior change for callers that
  don't opt in). Update the three shim-bound call sites in AIRToAIEPass.cpp
  and AIRDependencyScheduleOpt.cpp to pass the derived alignment so the bug
  is caught earlier in the pipeline. When no aligned factor exists at this
  layer, leave the dim oversized so AIRRtToNpuPass emits the diagnostic with
  full op context — avoids a bare LogicalResult failure that callers ignore.
- Tighten the inline comment in tileIllegalWrapDim now that the bug story
  lives in the commit message and the helper docstring.
- Add an i8 sub-test (length 1028 = 4*257; alignment=4 forces inner wrap to
  drop from 514 to 4) and an NPU2 sub-test (mirrors the bf16 case, guarding
  against a future device divergence in addressGenGranularity).
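
The opt-in behavior of the new parameter and the "leave the dim
oversized" rule can be illustrated with a small model. This is a sketch
on plain integers rather than `Value` lists, and `splitOversizedDim` is a
hypothetical name; it only mirrors the control flow described in the
bullets above.

```cpp
#include <vector>

// Largest factor of `num` that is <= `max` and a multiple of `alignment`;
// 0 when none exists. Assumes num >= 1 and alignment >= 1.
int largestAlignedFactorAtMost(int num, int max, int alignment) {
  for (int f = (max / alignment) * alignment; f >= alignment; f -= alignment)
    if (num % f == 0)
      return f;
  return 0;
}

// Hypothetical model of splitting one contiguous dim. Returns {size} when no
// split is needed or no aligned factor exists, otherwise {outer, inner}.
std::vector<int> splitOversizedDim(int size, int maxSize,
                                   int innerAlignment = 1) {
  if (size <= maxSize)
    return {size};
  int inner = largestAlignedFactorAtMost(size, maxSize, innerAlignment);
  if (inner == 0)
    return {size}; // left oversized so a later pass can report it
  return {size / inner, inner};
}
```

For the i8 sub-test length, `splitOversizedDim(1028, 1023)` gives
`{2, 514}` with the default alignment and `{257, 4}` with
`innerAlignment = 4`; `splitOversizedDim(2049, 1023, 2)` leaves the dim as
`{2049}`.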

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Query AIE target model for shim address-gen granularity when reachable

getDmaInnerElementAlignment now consults the parent AIE::DeviceOp's
target model via getAddressGenGranularity() instead of hardcoding 32.
The hardcoded 32 stays as a fallback for when the op has no DeviceOp
ancestor (early-pipeline contexts) or when AIR is built with
AIR_ENABLE_AIE=OFF.

Pulls AIE into AIRUtil's link libs conditionally on AIR_ENABLE_AIE,
matching the pattern used by AIRConversionPasses and AIRTransformPasses.
The include is similarly guarded so the AIE-disabled build still works.

For both AIE2 and AIE2P (the only current devices) this reads the same
32 the fallback would have produced, so no test output changes — but a
future device with a different addressGenGranularity will now Just Work
without anyone having to remember to update a constant.
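
The lookup-with-fallback shape can be sketched without MLIR, assuming a
C++17 compiler. Here `deviceAddrGenBits` is a hypothetical stand-in for
the target-model query: `std::nullopt` models both "no DeviceOp ancestor"
and an AIE-disabled build, and `innerElementAlignment` is an illustrative
name, not the function in Util.cpp.

```cpp
#include <optional>

// Fallback granularity in bits, matching the AIE2 / AIE2P value cited above.
constexpr unsigned kFallbackAddrGenBits = 32;

// Elements per address-generation word for an element of `elemBits` bits.
// Uses the device-reported granularity when present, else the fallback.
int innerElementAlignment(unsigned elemBits,
                          std::optional<unsigned> deviceAddrGenBits) {
  if (elemBits == 0)
    return 1;
  unsigned addrGenBits = deviceAddrGenBits.value_or(kFallbackAddrGenBits);
  if (elemBits >= addrGenBits)
    return 1;
  return static_cast<int>(addrGenBits / elemBits);
}
```

With no device reachable, bf16 still resolves to 2. A device reporting a
different granularity (64 is a made-up value for illustration) would
change the result without touching the constant.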

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/mlir/include/air/Util/Util.h b/mlir/include/air/Util/Util.h
@@ -192,10 +192,33 @@ LogicalResult foldForLoopNestAsExtendedSizesAndStrides(
 // Find the largest factor of 'num' which is not larger than 'max'.
 int findLargestFactor(int num, int max);
 
+// Largest factor of 'num' that is <= 'max' and a multiple of 'alignment'.
+// Returns 0 when no aligned factor exists, so the caller can emit a
+// diagnostic instead of silently producing misaligned IR. With alignment<=1
+// behaves as findLargestFactor.
+int findLargestAlignedFactor(int num, int max, int alignment);
+
+// Element-count alignment required so that an inner DMA wrap stays a
+// multiple of the AIE shim address granularity. Queries the parent
+// AIE::DeviceOp's target model when reachable (preferred); otherwise falls
+// back to 32 bits (the AIE2 / AIE2P value). Returns 1 when each element
+// already meets or exceeds the granularity (e.g. f32, i32); 2 for bf16/i16;
+// 4 for i8/ui8. 'op' is used both to find the DeviceOp ancestor and to
+// resolve the DataLayout via DataLayout::closest().
+int getDmaInnerElementAlignment(mlir::BaseMemRefType memrefTy,
+                                mlir::Operation *op);
+
 // Canonicalize wrap and stride lists, by removing redundant dimensions.
-LogicalResult canonicalizeWrapAndStrideList(
-    OpBuilder &builder, SmallVector<Value> &offsets, SmallVector<Value> &sizes,
-    SmallVector<Value> &strides, int memref_volume, int maxSize = -1);
+// 'innerAlignment' constrains the contiguous innermost dim (stride==1) when
+// it must be split: the new inner wrap is forced to be a multiple of
+// 'innerAlignment' elements (e.g. 2 for bf16 on a 4-byte shim BD). Pass 1
+// (the default) when no extra constraint applies.
+LogicalResult canonicalizeWrapAndStrideList(OpBuilder &builder,
+                                            SmallVector<Value> &offsets,
+                                            SmallVector<Value> &sizes,
+                                            SmallVector<Value> &strides,
+                                            int memref_volume, int maxSize = -1,
+                                            int innerAlignment = 1);
 
 // If wrap-and-stride lists are empty, populate them with default data access
 // layout (contiguous, row-major).
diff --git a/mlir/lib/Conversion/AIRRtToNpuPass.cpp b/mlir/lib/Conversion/AIRRtToNpuPass.cpp
@@ -1179,40 +1179,7 @@ bool violatesAIE2WrapLimit(airrt::DmaMemcpyNdOp dma) {
   return false;
 }
 
-// Find the largest factor of 'num' which is not larger than 'max'. Ref:
-// https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIEUtils.cpp#L334
-int findLargestFactor(int num, int max) {
-  // No factors less than or equal to 0 exist
-  if (max <= 0)
-    return 0;
-
-  // Do O(1) instead of O(sqrt(num)) computation for this common case.
-  if (num <= max) {
-    return num;
-  }
-
-  int largestLowFactor = 1;
-  for (int lowFactor = 2; lowFactor <= max; ++lowFactor) {
-    const int highFactor = num / lowFactor;
-
-    // This early exit is what makes this O(sqrt(num)) instead of O(num).
-    if (highFactor < lowFactor)
-      return largestLowFactor;
-
-    const bool areActuallyFactors = num % lowFactor == 0;
-    if (areActuallyFactors) {
-      // We're certain that here lowFactor <= highFactor, and highFactor is
-      // descending in this loop. So we can return immediately if highFactor
-      // is good.
-      if (highFactor <= max)
-        return highFactor;
-      largestLowFactor = lowFactor;
-    }
-  }
-  return largestLowFactor;
-}
-
-void tileIllegalWrapDim(airrt::DmaMemcpyNdOp memcpy_op) {
+LogicalResult tileIllegalWrapDim(airrt::DmaMemcpyNdOp memcpy_op) {
   auto loc = memcpy_op->getLoc();
   auto ctx = memcpy_op->getContext();
   auto oper_begin = memcpy_op.getOperands().begin();
@@ -1221,13 +1188,30 @@ void tileIllegalWrapDim(airrt::DmaMemcpyNdOp memcpy_op) {
   SmallVector<Value> strides(oper_begin + 12, oper_begin + 16);
   OpBuilder builder(memcpy_op);
 
+  auto memrefTy =
+      llvm::dyn_cast<BaseMemRefType>(memcpy_op.getMemref().getType());
+  int innerAlignment =
+      memrefTy ? air::getDmaInnerElementAlignment(memrefTy, memcpy_op) : 1;
+
   for (int i = wraps.size() - 1; i >= 0; i--) {
     auto const_wrap = *getConstantIntValue(wraps[i]);
     auto const_stride = *getConstantIntValue(strides[i]);
     if (const_wrap >= AIE2_WRAP_UPPER_BOUNDS[i]) {
-      // Found dimension with illegal wrap. Tiling. (Prefers smaller outer
-      // wrap values, as long as stride fits)
-      int a_wrap = findLargestFactor(const_wrap, AIE2_WRAP_UPPER_BOUNDS[i] - 1);
+      // Found dimension with illegal wrap. Prefers smaller outer wrap as
+      // long as stride fits. For stride==1, force the inner wrap to a
+      // multiple of innerAlignment elements so its byte size stays aligned
+      // to the shim address granularity (otherwise aie.dma_bd rejects it).
+      int alignment = (const_stride == 1) ? innerAlignment : 1;
+      int a_wrap = air::findLargestAlignedFactor(
+          const_wrap, AIE2_WRAP_UPPER_BOUNDS[i] - 1, alignment);
+      if (a_wrap == 0) {
+        return memcpy_op.emitOpError()
+               << "cannot tile dim " << i << " of size " << const_wrap
+               << " into shim-legal chunks: no factor in [" << alignment << ", "
+               << (AIE2_WRAP_UPPER_BOUNDS[i] - 1) << "] is a multiple of "
+               << alignment
+               << " elements. Reshape the transfer or pad the inner dimension.";
+      }
       int b_wrap = llvm::divideCeilSigned(const_wrap, a_wrap);
       int new_a_stride = const_stride * a_wrap;
       auto volume = air::getTensorVolume(
@@ -1357,9 +1341,10 @@ void tileIllegalWrapDim(airrt::DmaMemcpyNdOp memcpy_op) {
   }
 
   memcpy_op.erase();
+  return success();
 }
 
-void enforceAIE2WrapLimit(ModuleOp module) {
+LogicalResult enforceAIE2WrapLimit(ModuleOp module) {
   // Identify airrt.dma_memcpy_nd ops that violate the AIE2 wrap size
   // constraint.
   SmallVector<airrt::DmaMemcpyNdOp> target_airrt_dmas;
@@ -1374,7 +1359,9 @@ void enforceAIE2WrapLimit(ModuleOp module) {
 
   // Enforce the AIE2 wrap limit by tiling that dimension.
   for (auto memcpy_op : target_airrt_dmas)
-    tileIllegalWrapDim(memcpy_op);
+    if (failed(tileIllegalWrapDim(memcpy_op)))
+      return failure();
+  return success();
 }
 
 struct AIRRtToNpuPass : public impl::AIRRtToNpuBase<AIRRtToNpuPass> {
@@ -1455,7 +1442,10 @@ struct AIRRtToNpuPass : public impl::AIRRtToNpuBase<AIRRtToNpuPass> {
     generateNpuWaitFromAIRRtWaitAll(module);
 
     // Enforce AIE2 hardware constraints.
-    enforceAIE2WrapLimit(module);
+    if (failed(enforceAIE2WrapLimit(module))) {
+      signalPassFailure();
+      return;
+    }
 
     // Simplify arith ops (from airrt)
     RewritePatternSet canoPatterns_3(ctx);
diff --git a/mlir/lib/Conversion/AIRToAIEPass.cpp b/mlir/lib/Conversion/AIRToAIEPass.cpp
@@ -2686,9 +2686,13 @@ struct SpecializeChannelBundlePattern
     SmallVector<Value> offsets = ci.getOffsets();
     SmallVector<Value> wraps = ci.getSizes();
     SmallVector<Value> strides = ci.getStrides();
+    auto memrefTy = llvm::dyn_cast<BaseMemRefType>(ci.getMemref().getType());
+    int innerAlignment =
+        memrefTy ? air::getDmaInnerElementAlignment(memrefTy, ci) : 1;
     (void)air::canonicalizeWrapAndStrideList(
         builder, offsets, wraps, strides,
-        air::getTensorVolume(ci.getMemref().getType()), maxSize);
+        air::getTensorVolume(ci.getMemref().getType()), maxSize,
+        innerAlignment);
     air::ChannelInterface new_ci = nullptr;
     if (isa<air::ChannelPutOp>(ci))
       new_ci = air::ChannelPutOp::create(
diff --git a/mlir/lib/Transform/AIRDependencyScheduleOpt.cpp b/mlir/lib/Transform/AIRDependencyScheduleOpt.cpp
@@ -2196,9 +2196,14 @@ struct AIRSpecializeChannelWrapAndStrideInScfFor
     SmallVector<Value> strides = channel_op.getStrides();
 
     OpBuilder b(channel_op);
+    auto memrefTy =
+        llvm::dyn_cast<BaseMemRefType>(channel_op.getMemref().getType());
+    int innerAlignment =
+        memrefTy ? air::getDmaInnerElementAlignment(memrefTy, channel_op) : 1;
     (void)canonicalizeWrapAndStrideList(
         b, offsets, wraps, strides,
-        air::getTensorVolume(channel_op.getMemref().getType()), maxSize);
+        air::getTensorVolume(channel_op.getMemref().getType()), maxSize,
+        innerAlignment);
 
     // If empty offsets/sizes/strides, then populate the lists with default
     // values.
@@ -2225,7 +2230,8 @@ struct AIRSpecializeChannelWrapAndStrideInScfFor
 
     (void)canonicalizeWrapAndStrideList(
         rewriter, offsets, wraps, strides,
-        air::getTensorVolume(channel_op.getMemref().getType()), maxSize);
+        air::getTensorVolume(channel_op.getMemref().getType()), maxSize,
+        innerAlignment);
 
     // Whether repeat (i.e. stride = 0) is supported at highest dimension.
     if (enableRepeatAtHighestDim && !wraps.empty()) {
@@ -2605,9 +2611,13 @@ struct AIRCanonicalizeChannelPutGetOpWrapAndStrideList
       if (padBeforeCheck)
         return failure();
       // Canonicalize offsets/sizes/strides using a helper function.
+      auto memrefTy = llvm::dyn_cast<BaseMemRefType>(op.getMemref().getType());
+      int innerAlignment =
+          memrefTy ? air::getDmaInnerElementAlignment(memrefTy, op) : 1;
       if (failed(canonicalizeWrapAndStrideList(
               rewriter, offsets, sizes, strides,
-              air::getTensorVolume(op.getMemref().getType()), maxSize)))
+              air::getTensorVolume(op.getMemref().getType()), maxSize,
+              innerAlignment)))
         return failure();
 
       // When highest-dimension repeat is active, pad offsets/sizes/strides to
diff --git a/mlir/lib/Util/CMakeLists.txt b/mlir/lib/Util/CMakeLists.txt
@@ -2,6 +2,19 @@
 # Copyright (C) 2022, Advanced Micro Devices, Inc. All rights reserved.
 # SPDX-License-Identifier: MIT
 
+set(AIRUTIL_LINK_LIBS
+  MLIRIR
+  MLIRTransforms
+)
+
+# AIE target model is queried for the shim address-gen granularity when
+# computing DMA inner-element alignment. Conditional on AIR_ENABLE_AIE; the
+# fallback path in Util.cpp is used when AIE is disabled or no DeviceOp is
+# reachable from the op.
+if(AIR_ENABLE_AIE)
+  list(APPEND AIRUTIL_LINK_LIBS AIE)
+endif()
+
 add_mlir_library(AIRUtil
   Util.cpp
   Outliner.cpp
@@ -15,6 +28,5 @@ add_mlir_library(AIRUtil
   AIRDialect
 
   LINK_LIBS PUBLIC
-  MLIRIR
-  MLIRTransforms
+  ${AIRUTIL_LINK_LIBS}
 )
diff --git a/mlir/lib/Util/Util.cpp b/mlir/lib/Util/Util.cpp
@@ -9,6 +9,11 @@
 #include "air/Util/Util.h"
 #include "air/Dialect/AIR/AIRDialect.h"
 
+#if AIR_ENABLE_AIE
+#include "aie/Dialect/AIE/IR/AIEDialect.h"
+#include "aie/Dialect/AIE/IR/AIETargetModel.h"
+#endif
+
 #include "mlir/Analysis/SliceAnalysis.h"
 #include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
 #include "mlir/Dialect/Affine/IR/AffineOps.h"
@@ -21,6 +26,7 @@
 #include "mlir/IR/IntegerSet.h"
 #include "mlir/IR/Iterators.h"
 #include "mlir/IR/OperationSupport.h"
+#include "mlir/Interfaces/DataLayoutInterfaces.h"
 
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/Support/Debug.h"
@@ -1126,10 +1132,52 @@ int air::findLargestFactor(int num, int max) {
   return largestLowFactor;
 }
 
+// Fallback shim address-gen granularity when we can't reach an AIE::DeviceOp
+// to query the target model. Matches AIETargetModel::getAddressGenGranularity
+// for AIE2 and AIE2P. The dynamic lookup below is preferred when available so
+// future devices with a different value just work.
+static constexpr unsigned kAIEShimAddrGenBitsFallback = 32;
+
+int air::getDmaInnerElementAlignment(BaseMemRefType memrefTy, Operation *op) {
+  if (!memrefTy || !op)
+    return 1;
+  DataLayout dl = DataLayout::closest(op);
+  unsigned elemBits = dl.getTypeSizeInBits(memrefTy.getElementType());
+  if (elemBits == 0)
+    return 1;
+  unsigned addrGenBits = kAIEShimAddrGenBitsFallback;
+#if AIR_ENABLE_AIE
+  if (auto dev = op->getParentOfType<AIE::DeviceOp>())
+    addrGenBits = dev.getTargetModel().getAddressGenGranularity();
+#endif
+  if (elemBits >= addrGenBits)
+    return 1;
+  return addrGenBits / elemBits;
+}
+
+// Largest factor of 'num' that is <= 'max' and a multiple of 'alignment'.
+// See header for rationale.
+int air::findLargestAlignedFactor(int num, int max, int alignment) {
+  if (alignment <= 1)
+    return findLargestFactor(num, max);
+  if (max < alignment)
+    return 0;
+  int alignedMax = (max / alignment) * alignment;
+  for (int candidate = alignedMax; candidate >= alignment;
+       candidate -= alignment) {
+    if (num % candidate == 0)
+      return candidate;
+  }
+  return 0;
+}
+
 // Canonicalize wrap and stride lists by removing redundant dimensions.
-LogicalResult air::canonicalizeWrapAndStrideList(
-    OpBuilder &builder, SmallVector<Value> &offsets, SmallVector<Value> &sizes,
-    SmallVector<Value> &strides, int memref_volume, int maxSize) {
+LogicalResult air::canonicalizeWrapAndStrideList(OpBuilder &builder,
+                                                 SmallVector<Value> &offsets,
+                                                 SmallVector<Value> &sizes,
+                                                 SmallVector<Value> &strides,
+                                                 int memref_volume, int maxSize,
+                                                 int innerAlignment) {
   // AIE2 hardware constraints. TODO: import these info from target model.
   const int AIE2_STRIDE_UPPER_BOUND = 1048576;
   bool listsHaveChanged = false;
@@ -1159,8 +1207,20 @@ LogicalResult air::canonicalizeWrapAndStrideList(
     if (const_wrap <= maxSize)
       continue;
     // Found dimension with illegal wrap. Tiling. (Prefers smaller outer wrap
-    // values, as long as stride fits)
-    int a_wrap = findLargestFactor(const_wrap, maxSize);
+    // values, as long as stride fits.) For the contiguous innermost dim
+    // (stride==1), require the new inner wrap to be a multiple of
+    // innerAlignment elements so the resulting d0 byte size stays aligned to
+    // the shim address granularity (e.g. 4 B for bf16 / i8). Falls back to
+    // findLargestFactor when innerAlignment <= 1 or stride != 1.
+    int a_wrap =
+        (const_stride == 1)
+            ? findLargestAlignedFactor(const_wrap, maxSize, innerAlignment)
+            : findLargestFactor(const_wrap, maxSize);
+    // No aligned factor exists. Leave the dim oversized and let the
+    // downstream shim lowering (tileIllegalWrapDim in AIRRtToNpuPass) emit
+    // the diagnostic with full op context.
+    if (a_wrap == 0)
+      continue;
     int b_wrap = llvm::divideCeilSigned(const_wrap, a_wrap);
     int new_a_stride = const_stride * a_wrap;
     if (memref_volume != 1)
diff --git a/mlir/test/Conversion/AIRRtToNpu/tile_illegal_wrap_alignment.mlir b/mlir/test/Conversion/AIRRtToNpu/tile_illegal_wrap_alignment.mlir