[airrt-to-npu] Honor inner-dim alignment when tiling oversized wraps#1586
Merged
erwei-xilinx merged 4 commits intoMay 6, 2026
Merged
Conversation
`tileIllegalWrapDim` currently calls `findLargestFactor(wrap, 1023)` to split a wrap that exceeds the shim 10-bit limit. For a contiguous bf16 transfer of length 131136 the factor it picks is 683 (an odd prime), producing an inner segment of 683 elem * 2 B = 1366 B that fails the `aie.dma_bd` 4-byte-alignment verifier. Add `findLargestAlignedFactor`, used only when tiling the contiguous innermost dim (`stride == 1`), with `alignment = addressGenGranularity / elemBits` (= 2 for bf16, 1 for f32, 4 for i8). For the bf16 length-131136 case it now picks 192 instead, yielding `[<size=683, stride=192>, <size=192, stride=1>]` — both dims 4-B aligned. This bug fires for any bf16 / sub-32-bit transfer whose length factors cleanly only into an odd prime above ~511. Surfaced by an LLaMA-3.2-1B decode attention design where the K/V cache load is `(pos+1) * head_dim` and `pos+1 = 2049 = 3 * 683`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the silent fallback in `findLargestAlignedFactor` and have it return 0 when no factor of `num` in `[alignment, max]` is a multiple of `alignment`. Plumb the failure through `tileIllegalWrapDim` and `enforceAIE2WrapLimit` so the pass emits an op-level error and `signalPassFailure()` instead of producing IR that the downstream `aie.dma_bd` verifier rejects with a generic alignment message. The diagnostic names the offending dim, the size, the legal range, and the byte ratio (shim address granularity / element size), so the user knows whether to reshape the transfer or pad the inner dimension. Add a third sub-test exercising bf16 length 2049 (= 3 * 683) — the only factors <= 1023 are 1, 3, and 683, all odd, so no aligned factor exists and the diagnostic fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…WrapAndStrideList - Move findLargestAlignedFactor into air:: (Util.h/Util.cpp); delete the private duplicate of findLargestFactor in AIRRtToNpuPass.cpp that had been accumulating helpers next to the canonical one in air::. - Add air::getDmaInnerElementAlignment(memrefTy, op) so both tileIllegalWrapDim and canonicalizeWrapAndStrideList derive alignment the same way (DataLayout for the element width, fixed 32-bit shim address granularity matching AIE2/AIE2P AIETargetModel::getAddressGenGranularity). - Apply the alignment fix to air::canonicalizeWrapAndStrideList via a new innerAlignment parameter (default 1, no behavior change for callers that don't opt in). Update the three shim-bound call sites in AIRToAIEPass.cpp and AIRDependencyScheduleOpt.cpp to pass the derived alignment so the bug is caught earlier in the pipeline. When no aligned factor exists at this layer, leave the dim oversized so AIRRtToNpuPass emits the diagnostic with full op context — avoids a bare LogicalResult failure that callers ignore. - Tighten the inline comment in tileIllegalWrapDim now that the bug story lives in the commit message and the helper docstring. - Add an i8 sub-test (length 1028 = 4*257; alignment=4 forces inner wrap to drop from 514 to 4) and an NPU2 sub-test (mirrors the bf16 case, guarding against a future device divergence in addressGenGranularity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
getDmaInnerElementAlignment now consults the parent AIE::DeviceOp's target model via getAddressGenGranularity() instead of hardcoding 32. The hardcoded 32 stays as a fallback for when the op has no DeviceOp ancestor (early-pipeline contexts) or when AIR is built with AIR_ENABLE_AIE=OFF. Pulls AIE into AIRUtil's link libs conditionally on AIR_ENABLE_AIE, matching the pattern used by AIRConversionPasses and AIRTransformPasses. The include is similarly guarded so the AIE-disabled build still works. For both AIE2 and AIE2P (the only current devices) this reads the same 32 the fallback would have produced, so no test output changes — but a future device with a different addressGenGranularity will now Just Work without anyone having to remember to update a constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tileIllegalWrapDiminAIRRtToNpuPass.cppcallsfindLargestFactor(wrap, 1023)to split a wrap that exceeds the shim 10-bit limit. For a contiguous bf16 transfer of length 131136 the factor it picks is 683 (an odd prime), producing an inner segment of683 elem × 2 B = 1366 Bthat fails theaie.dma_bd4-byte-alignment verifier:Add
findLargestAlignedFactor, used only when tiling the contiguous innermost dim (stride == 1), withalignment = addressGenGranularity / elemBits:For the bf16 length-131136 case it now picks 192 instead of 683, yielding
[<size=683, stride=192>, <size=192, stride=1>]— both dims 4-B aligned.Why this matters
This bug fires for any sub-32-bit transfer whose length factors cleanly only into an odd prime above ~511. Surfaced by an LLaMA-3.2-1B decode attention design where the K/V cache load is
(pos+1) × head_dimandpos+1 = 2049 = 3 × 683. Other prime-heavy lengths in the same regime would hit the same verifier error.Test plan
mlir/test/Conversion/AIRRtToNpu/tile_illegal_wrap_alignment.mlircovering the bf16 (alignment-aware) and f32 (no-alignment-needed) pathsninja check-air-mlir— new test passes; no regressions in the AIRRtToNpu suiteOutput 0 correlation: 1.000000 (threshold: 0.99) PASS!🤖 Generated with Claude Code