Fix air-to-std crash for DMA ops with >4 offset/size dimensions#1517

Draft
erwei-xilinx wants to merge 3 commits into Xilinx:main from erwei-xilinx:erwei/fix-air-to-std-oob

@erwei-xilinx
Collaborator

Summary

  • Fix AIRDmaMemcpyNdToAIRRtConversion SmallVector out-of-bounds crash when DMA ops have >4 offset/size dimensions
  • Apply drop_front logic (already used for strides) to offsets and sizes to fit the 4D airrt format

Root cause

The conversion pattern assumed that DMA offsets and sizes have at most 4 elements (matching the memref rank). However, block layout lowering produces 6D patterns (e.g., [herd_m, herd_n, n_blks, m_blks, mmul_m, mmul_k]), and the BD optimization pass can also create offsets with more than 4 dimensions. When src.getRank() > 4, idx = 4 - rank becomes negative, so offsets[idx++] accesses a SmallVector<Value, 4> out of bounds.
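
The index arithmetic behind the crash can be modeled in isolation. The following is a minimal sketch, not the code from AIRLoweringPass.cpp: `kSlots`, `firstSlotIndex`, and `fillStaysInBounds` are hypothetical names, and plain integers stand in for the `SmallVector<Value, 4>` slots. It only assumes what the description above states, namely that the fill starts at `idx = 4 - rank` and advances with `idx++`.

```cpp
#include <cassert>

// Hypothetical stand-in for the number of slots in the 4D airrt operand list.
constexpr int kSlots = 4;

// First slot index a right-aligned fill would start at (idx = 4 - rank).
int firstSlotIndex(int rank) { return kSlots - rank; }

// Returns true if writing `rank` consecutive entries, starting at
// firstSlotIndex(rank), never touches an index outside [0, kSlots).
bool fillStaysInBounds(int rank) {
  int idx = firstSlotIndex(rank);
  for (int i = 0; i < rank; ++i, ++idx) {
    if (idx < 0 || idx >= kSlots)
      return false; // this is the access that would be out of bounds
  }
  return true;
}
```

With this model, `fillStaysInBounds(4)` and `fillStaysInBounds(2)` hold, while a 6D pattern gives `firstSlotIndex(6) == -2`, so the very first write is already out of range.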

Impact

Fixes crashes at large problem sizes for:

  • Reference matmul (run.py without --direct-codegen) at 2048×2048×2048 and larger
  • Fused SwiGLU at 2048×2048×8192

Performance after fix

| Workload | Size | Result |
|---|---|---|
| Reference matmul | 2048×2048×2048 | 5,048 GFLOPS (was crashing) |
| Reference matmul | 2048×2048×8192 | 4,599 GFLOPS (was crashing) |
| Fused SwiGLU | 512×512×512 | PASS (correctness) |
| Fused SwiGLU | 2048×2048×8192 | 1,813 GFLOPS (was crashing) |

Test plan

  • All 365 check-air-mlir tests pass
  • Reference matmul correctness and profiling at 2048×2048×2048, 2048×2048×8192
  • Fused SwiGLU correctness at 512×512×512

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx requested a review from fifield as a code owner April 8, 2026 23:45
Copilot AI review requested due to automatic review settings April 8, 2026 23:45
AIRDmaMemcpyNdToAIRRtConversion assumed DMA offsets and sizes have at
most 4 elements (matching memref rank). However, the BD optimization
pass and block layout lowering can produce DMA ops with 6+ dimensions
in offsets/sizes (e.g., 6D block layout for matmul). This caused a
SmallVector out-of-bounds access when converting to the 4D airrt
format.

Apply the same drop_front logic (already used for strides) to offsets
and sizes: when N > 4, take the last 4 elements. This matches the
hardware's 4D BD dimension limit.
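
The "take the last 4 elements" rule can be illustrated on plain integers. This is a sketch under stated assumptions: `takeLast` is a hypothetical helper, and `std::vector<int64_t>` stands in for the MLIR value lists that the real pattern truncates.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: keep only the trailing `maxDims` entries of `values`.
// Lists that already fit are returned unchanged.
std::vector<int64_t> takeLast(const std::vector<int64_t> &values,
                              std::size_t maxDims = 4) {
  if (values.size() <= maxDims)
    return values;
  // Copy the last `maxDims` elements, dropping the leading ones.
  return std::vector<int64_t>(
      values.end() - static_cast<std::ptrdiff_t>(maxDims), values.end());
}
```

For example, the made-up 6D size list `{1, 1, 8, 4, 64, 32}` becomes `{8, 4, 64, 32}`, while a 2D list passes through untouched.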

Fixes crashes at large problem sizes (>1k) for both reference matmul
(run.py without --direct-codegen) and fused SwiGLU designs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Fixes a crash in the -air-to-std lowering when converting air.dma_memcpy_nd ops whose offsets/sizes have more than 4 dimensions (e.g., 6D patterns produced by block layout lowering / BD optimization), by truncating offsets and sizes to the last 4 elements to match the 4D AIRRt DMA format.

Changes:

  • Drop leading offset elements when offsets.size() > 4, mirroring existing stride behavior.
  • Drop leading size elements when sizes.size() > 4, preventing negative indexing / SmallVector<Value, 4> OOB writes.

Comment thread mlir/lib/Conversion/AIRLoweringPass.cpp Outdated
erwei-xilinx and others added 2 commits April 8, 2026 17:13
- Replace duplicated if/drop_front truncation with a shared
  truncateToLast4 lambda using take_back(4)
- Add assertion verifying dropped leading offsets are zero
- Add comment documenting the safety invariant
- Add LIT test (air_dma_nd_6d_to_airrt.mlir) exercising the 6D→4D path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
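
Dropping leading offsets is only lossless when those offsets are zero, which is the invariant the new assertion guards. A minimal sketch of that check, assuming integer offsets: `droppedOffsetsAreZero` is a hypothetical name, and the actual pass works on MLIR values, where testing for a constant zero takes extra steps not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical check: returns true when every leading offset that a
// "keep the last maxDims entries" truncation would drop is zero.
bool droppedOffsetsAreZero(const std::vector<int64_t> &offsets,
                           std::size_t maxDims = 4) {
  if (offsets.size() <= maxDims)
    return true; // nothing is dropped
  auto droppedEnd = offsets.end() - static_cast<std::ptrdiff_t>(maxDims);
  return std::all_of(offsets.begin(), droppedEnd,
                     [](int64_t offset) { return offset == 0; });
}
```

In this model, `{0, 0, 1, 2, 3, 4}` passes, whereas `{0, 5, 1, 2, 3, 4}` fails because truncation would silently discard the nonzero offset 5.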
Extract the hardcoded airrt 4-dimension limit into a named constant
kAIRRtMaxNDims, used in both AIRDmaMemcpyNdToAIRRtConversion and
AIRChannelInterfaceToAIRRtConversionImpl. This makes the relationship
to the airrt.dma_memcpy_nd / airrt.memcpy_nd TableGen definitions
explicit and easier to update if future architectures change the limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
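
A sketch of how one named constant can drive both truncation and right-alignment to a fixed arity. The constant name `kAIRRtMaxNDims` and its value of 4 come from this PR. Everything else is an assumption made for illustration: `rightAlign` is a hypothetical helper, integers stand in for MLIR values, and the pad value is left to the caller because the fill value the real patterns use is not stated here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Name taken from the PR; the value reflects the 4D airrt format.
constexpr std::size_t kAIRRtMaxNDims = 4;

// Hypothetical helper: produce exactly kAIRRtMaxNDims entries by keeping the
// trailing entries of `values` and filling any unused leading slots with `pad`.
std::array<int64_t, kAIRRtMaxNDims>
rightAlign(const std::vector<int64_t> &values, int64_t pad) {
  std::array<int64_t, kAIRRtMaxNDims> out;
  out.fill(pad);
  const std::size_t n = std::min(values.size(), kAIRRtMaxNDims);
  for (std::size_t i = 0; i < n; ++i)
    out[kAIRRtMaxNDims - n + i] = values[values.size() - n + i];
  return out;
}
```

Because every bound is written in terms of `kAIRRtMaxNDims`, changing the limit would only require touching the constant in this sketch.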
@erwei-xilinx erwei-xilinx force-pushed the erwei/fix-air-to-std-oob branch 2 times, most recently from cb22460 to 767dd14 April 9, 2026 04:49
@erwei-xilinx erwei-xilinx marked this pull request as draft April 9, 2026 23:39