Skip to content

[airrt-to-npu] Honor inner-dim alignment when tiling oversized wraps#1586

Merged
erwei-xilinx merged 4 commits into
Xilinx:mainfrom
erwei-xilinx:mlir-air-fix-tile-illegal-wrap-alignment
May 6, 2026
Merged

[airrt-to-npu] Honor inner-dim alignment when tiling oversized wraps#1586
erwei-xilinx merged 4 commits into
Xilinx:mainfrom
erwei-xilinx:mlir-air-fix-tile-illegal-wrap-alignment

Conversation

@erwei-xilinx

Copy link
Copy Markdown
Collaborator

Summary

tileIllegalWrapDim in AIRRtToNpuPass.cpp calls findLargestFactor(wrap, 1023) to split a wrap that exceeds the shim 10-bit limit. For a contiguous bf16 transfer of length 131136 the factor it picks is 683 (an odd prime), producing an inner segment of 683 elem × 2 B = 1366 B that fails the aie.dma_bd 4-byte-alignment verifier:

error: 'aie.dma_bd' op Transfer sizes must be multiples of 4 bytes.
       683 elements at 2 bytes each equal 1366 bytes, which is not divisible by 4.
       aie.dma_bd(... [<size = 192, stride = 683>, <size = 683, stride = 1>])

Add findLargestAlignedFactor, used only when tiling the contiguous innermost dim (stride == 1), with alignment = addressGenGranularity / elemBits:

  • bf16 (16 b): alignment = 2 → must pick even factor
  • f32 (32 b): alignment = 1 → no extra constraint, behaves as before
  • i8 (8 b): alignment = 4 → must pick multiple-of-4 factor

For the bf16 length-131136 case it now picks 192 instead of 683, yielding [<size=683, stride=192>, <size=192, stride=1>] — both dims 4-B aligned.

Why this matters

This bug fires for any sub-32-bit transfer whose length factors cleanly only into an odd prime above ~511. Surfaced by an LLaMA-3.2-1B decode attention design where the K/V cache load is (pos+1) × head_dim and pos+1 = 2049 = 3 × 683. Other prime-heavy lengths in the same regime would hit the same verifier error.

Test plan

  • Added mlir/test/Conversion/AIRRtToNpu/tile_illegal_wrap_alignment.mlir covering the bf16 (alignment-aware) and f32 (no-alignment-needed) paths
  • ninja check-air-mlir — new test passes; no regressions in the AIRRtToNpu suite
  • End-to-end: built and ran the LLaMA-3.2-1B decode attention kernel that originally triggered the bug — Output 0 correlation: 1.000000 (threshold: 0.99) PASS!

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 6, 2026 17:44
@erwei-xilinx erwei-xilinx requested a review from fifield as a code owner May 6, 2026 17:44
erwei-xilinx and others added 4 commits May 6, 2026 10:45
`tileIllegalWrapDim` currently calls `findLargestFactor(wrap, 1023)` to
split a wrap that exceeds the shim 10-bit limit. For a contiguous bf16
transfer of length 131136 the factor it picks is 683 (an odd prime),
producing an inner segment of 683 elem * 2 B = 1366 B that fails the
`aie.dma_bd` 4-byte-alignment verifier.

Add `findLargestAlignedFactor`, used only when tiling the contiguous
innermost dim (`stride == 1`), with `alignment = addressGenGranularity /
elemBits` (= 2 for bf16, 1 for f32, 4 for i8). For the bf16 length-131136
case it now picks 192 instead, yielding `[<size=683, stride=192>,
<size=192, stride=1>]` — both dims 4-B aligned.

This bug fires for any bf16 / sub-32-bit transfer whose length factors
cleanly only into an odd prime above ~511. Surfaced by an LLaMA-3.2-1B
decode attention design where the K/V cache load is `(pos+1) * head_dim`
and `pos+1 = 2049 = 3 * 683`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the silent fallback in `findLargestAlignedFactor` and have it
return 0 when no factor of `num` in `[alignment, max]` is a multiple of
`alignment`. Plumb the failure through `tileIllegalWrapDim` and
`enforceAIE2WrapLimit` so the pass emits an op-level error and
`signalPassFailure()` instead of producing IR that the downstream
`aie.dma_bd` verifier rejects with a generic alignment message.

The diagnostic names the offending dim, the size, the legal range, and
the byte ratio (shim address granularity / element size), so the user
knows whether to reshape the transfer or pad the inner dimension.

Add a third sub-test exercising bf16 length 2049 (= 3 * 683) — the only
factors <= 1023 are 1, 3, and 683, all odd, so no aligned factor exists
and the diagnostic fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…WrapAndStrideList

- Move findLargestAlignedFactor into air:: (Util.h/Util.cpp); delete the
  private duplicate of findLargestFactor in AIRRtToNpuPass.cpp that had
  been accumulating helpers next to the canonical one in air::.
- Add air::getDmaInnerElementAlignment(memrefTy, op) so both tileIllegalWrapDim
  and canonicalizeWrapAndStrideList derive alignment the same way (DataLayout
  for the element width, fixed 32-bit shim address granularity matching
  AIE2/AIE2P AIETargetModel::getAddressGenGranularity).
- Apply the alignment fix to air::canonicalizeWrapAndStrideList via a new
  innerAlignment parameter (default 1, no behavior change for callers that
  don't opt in). Update the three shim-bound call sites in AIRToAIEPass.cpp
  and AIRDependencyScheduleOpt.cpp to pass the derived alignment so the bug
  is caught earlier in the pipeline. When no aligned factor exists at this
  layer, leave the dim oversized so AIRRtToNpuPass emits the diagnostic with
  full op context — avoids a bare LogicalResult failure that callers ignore.
- Tighten the inline comment in tileIllegalWrapDim now that the bug story
  lives in the commit message and the helper docstring.
- Add an i8 sub-test (length 1028 = 4*257; alignment=4 forces inner wrap to
  drop from 514 to 4) and an NPU2 sub-test (mirrors the bf16 case, guarding
  against a future device divergence in addressGenGranularity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
getDmaInnerElementAlignment now consults the parent AIE::DeviceOp's
target model via getAddressGenGranularity() instead of hardcoding 32.
The hardcoded 32 stays as a fallback for when the op has no DeviceOp
ancestor (early-pipeline contexts) or when AIR is built with
AIR_ENABLE_AIE=OFF.

Pulls AIE into AIRUtil's link libs conditionally on AIR_ENABLE_AIE,
matching the pattern used by AIRConversionPasses and AIRTransformPasses.
The include is similarly guarded so the AIE-disabled build still works.

For both AIE2 and AIE2P (the only current devices) this reads the same
32 the fallback would have produced, so no test output changes — but a
future device with a different addressGenGranularity will now Just Work
without anyone having to remember to update a constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx added this pull request to the merge queue May 6, 2026
Merged via the queue into Xilinx:main with commit dcbe37e May 6, 2026
27 checks passed
@erwei-xilinx erwei-xilinx deleted the mlir-air-fix-tile-illegal-wrap-alignment branch May 6, 2026 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant