Refactor rms_norm transform script to follow mlir-air layernorm prototype #65

Merged
erwei-xilinx merged 1 commit into amd:main from erwei-xilinx:fix-rms-norm-aie2p-transform
May 11, 2026

@erwei-xilinx erwei-xilinx commented May 11, 2026

Summary

Refactors examples/rms_norm/transform_aie2p.mlir to follow the structure of the mlir-air xrt prototype test/xrt/43_triton_layernorm/transform_aie2p.mlir exactly.

Fixes the spurious stderr diagnostic reported during testing of #64 on Windows aie2p:
```
loc("-":83:11): error: application of transform.air.copy_to_dma expected to produce 1 results (actually produced 0).
```

Root cause

The previous script used `transform.air.linalg_promote` (post-bufferize) to stage L2 subviews into L1. When the linalg op's destination is itself an L2 subview, `linalg_promote` emits `memref.copy %sv, %sv` (a literal SSA self-copy). `transform.air.copy_to_dma` then detects each self-copy in `CopyToDmaOp::applyToOne`, erases the op, and returns success without pushing a result handle. This violates the framework's 1-result contract for `applyToOne` and produces the diagnostic above. The failure reproduces deterministically on Linux aie2p.
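
For readers unfamiliar with the one-result contract, the failure mode can be modelled in a few lines of Python. This is a toy sketch only: `Copy`, `copy_to_dma_buggy`, `apply_to_each`, and `ContractError` are hypothetical names invented for illustration and are not part of mlir-air or the MLIR transform dialect.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """Toy stand-in for a memref.copy op; src and dst are SSA value names."""
    src: str
    dst: str


class ContractError(Exception):
    """Raised when a per-op transform returns the wrong number of results."""


def copy_to_dma_buggy(op: Copy) -> list[str]:
    # Models the behaviour described above: a self-copy is erased and
    # "success" is reported, but no result handle is pushed.
    if op.src == op.dst:
        return []
    return [f"dma({op.src} -> {op.dst})"]


def apply_to_each(transform, ops, expected_results=1):
    """Toy driver that enforces a fixed result count per payload op."""
    handles = []
    for op in ops:
        results = transform(op)
        if len(results) != expected_results:
            raise ContractError(
                f"expected to produce {expected_results} results "
                f"(actually produced {len(results)})"
            )
        handles.extend(results)
    return handles
```

In this model, `apply_to_each(copy_to_dma_buggy, [Copy("%sv", "%sv")])` raises `ContractError`, while a copy between two distinct values yields one handle. The model only illustrates why never emitting self-copies avoids the diagnostic.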

Fix

Replace the post-bufferize `linalg_promote` mechanism with the layernorm prototype's pre-bufferize approach:

  • Per-op `bufferize_to_allocation {memory_space = 2}` for each linalg op (fills, intermediate generic, reduce, output generic) → L1 alloc as the bufferization destination
  • `promote_tensor to 2` for the input tensor → L1 input copy
  • Drop `linalg_promote` and the custom `fuse_multi_op_linalg` of sq+reduce
  • Add `generalize` of remaining `linalg.reduce` (delayed to PHASE 9), `convert_divf_sqrt_to_rsqrt`, and `broadcast_before_unary` for math.rsqrt — all matching the layernorm prototype

Self-copies are never emitted, so the diagnostic disappears structurally rather than via a workaround. Resulting AIR shows clean separated loops (square → reduce → output) with a single shared L1 input alloc.
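
As a reference for the loop structure named above (square → reduce → output) and for the arithmetic behind `convert_divf_sqrt_to_rsqrt`, here is a minimal pure-Python sketch. It assumes the common RMS-norm formula `x / sqrt(mean(x^2) + eps)` with no learned weight; the `eps` value and the function names are assumptions, since the PR does not state the kernel's exact formula.

```python
import math


def rms_norm_divf_sqrt(row, eps=1e-5):
    """Three separate loops: square, reduce, then divide by a square root."""
    # Loop 1: square each element.
    squares = [v * v for v in row]
    # Loop 2: reduce the squares to a single sum.
    total = 0.0
    for s in squares:
        total += s
    mean_sq = total / len(row)
    # Loop 3: output, dividing each element by sqrt(mean_sq + eps).
    denom = math.sqrt(mean_sq + eps)
    return [v / denom for v in row]


def rms_norm_rsqrt(row, eps=1e-5):
    """Same computation with the divide-by-sqrt replaced by a multiply."""
    squares = [v * v for v in row]
    mean_sq = sum(squares) / len(row)
    # 1/sqrt(x) stands in for a reciprocal-square-root op such as math.rsqrt.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in row]
```

Both functions agree up to floating-point rounding. The second form computes the reciprocal square root once per row and turns the per-element divide into a multiply.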

Test plan

  • `scripts/run_tests.py --device aie2p` on local NPU2 (Strix): 17/18 pass (the only failure is the pre-existing matvec one, unrelated to this change)
  • Standalone `python examples/rms_norm/rms_norm.py` on aie2p: pass, no stderr diagnostics
  • Compilation produces `aie.elf` + `full_elf_config.json` cleanly
  • Verify on Windows aie2p (the platform where #64, "Add Windows wheel release job to nightly-wheels CI", reported it); expected to pass since the fix is host-OS-independent
  • aie2 path unaffected (no `transform_aie2.mlir` exists for this kernel; aie2p-only)
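
The "no stderr diagnostics" check in the plan above could be automated with a small helper along these lines. This is a sketch, not part of the repository: `find_copy_to_dma_diagnostics` is a hypothetical name, and the pattern is derived only from the single error line quoted in the Summary.

```python
import re

# Matches the diagnostic quoted in the Summary, with any result counts.
DIAGNOSTIC = re.compile(
    r"application of transform\.air\.copy_to_dma expected to produce "
    r"\d+ results \(actually produced \d+\)"
)


def find_copy_to_dma_diagnostics(stderr_text):
    """Return every stderr line that contains the copy_to_dma diagnostic."""
    return [line for line in stderr_text.splitlines() if DIAGNOSTIC.search(line)]
```

Feeding it the captured stderr of a run (for example from `subprocess.run(..., capture_output=True, text=True).stderr`) and asserting that the returned list is empty gives a regression check.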

🤖 Generated with Claude Code

…type

Replace post-bufferize linalg_promote (which leaks self-copies that crash
transform.air.copy_to_dma) with pre-bufferize bufferize_to_allocation +
promote_tensor for L1 staging, mirroring mlir-air xrt 43_triton_layernorm.
Eliminates "expected to produce 1 results (actually produced 0)" stderr
on aie2p reported in amd#64.
Copilot AI review requested due to automatic review settings May 11, 2026 21:36

Copilot AI left a comment

Pull request overview

This PR refactors the AIE2P MLIR transform sequence for the rms_norm example to mirror the mlir-air layernorm XRT prototype structure, replacing the prior post-bufferize linalg_promote-based L1 staging approach that could generate SSA self-copies and trigger transform.air.copy_to_dma result-contract diagnostics.

Changes:

  • Reworks the transform pipeline into phased steps (canonicalize/fuse, dataflow navigation, tiling+fusion, pre-bufferize allocations/promotions, bufferize, vectorization prep, herd/DMA/vectorization, type casts).
  • Replaces post-bufferize transform.air.linalg_promote staging with pre-bufferize bufferize_to_allocation {memory_space = 2} and promote_tensor to 2 to avoid self-copy patterns.
  • Adds prototype-aligned transforms: delayed generalize of remaining linalg.reduce, convert_divf_sqrt_to_rsqrt, and broadcast_before_unary for math.rsqrt.

@erwei-xilinx erwei-xilinx merged commit 4bcbe55 into amd:main May 11, 2026
16 of 17 checks passed
@erwei-xilinx erwei-xilinx deleted the fix-rms-norm-aie2p-transform branch May 11, 2026 22:37