Refactor rms_norm transform script to follow mlir-air layernorm prototype by erwei-xilinx · Pull Request #65 · amd/Triton-XDNA

erwei-xilinx · 2026-05-11T21:36:29Z

Summary

Refactors examples/rms_norm/transform_aie2p.mlir to follow the structure of the mlir-air xrt prototype test/xrt/43_triton_layernorm/transform_aie2p.mlir exactly.

Fixes the spurious stderr diagnostic reported during testing of #64 on Windows aie2p:
```
loc("-":83:11): error: application of transform.air.copy_to_dma expected to produce 1 results (actually produced 0).
```

Root cause

The previous script used `transform.air.linalg_promote` (post-bufferize) to stage L2 subviews into L1. When the linalg op's destination is itself an L2 subview, `linalg_promote` emits `memref.copy %sv, %sv` (literal SSA self-copy). `transform.air.copy_to_dma` then detects each self-copy in `CopyToDmaOp::applyToOne`, erases the op, and returns success without pushing a result handle — violating the framework's 1-result contract for `applyToOne` and producing the diagnostic above. Reproduces deterministically on Linux aie2p.
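The contract violation can be illustrated with a toy model in Python. This is a sketch only: `Copy`, `copy_to_dma_buggy`, `apply_each`, and `ContractError` are hypothetical names made up for illustration, not the MLIR transform dialect API. It assumes only what the description above states, namely that each per-op application must yield exactly one result handle.

```python
from collections import namedtuple

# Hypothetical stand-in for a memref.copy payload op.
Copy = namedtuple("Copy", ["src", "dst"])


class ContractError(Exception):
    """Raised when a per-op application yields the wrong number of results."""


def copy_to_dma_buggy(op, results):
    """Toy per-op callback that mimics the behaviour described above."""
    if op.src == op.dst:
        # Self-copy: the op is "erased" and success is reported,
        # but no result handle is pushed.
        return True
    results.append(f"dma({op.src} -> {op.dst})")
    return True


def apply_each(apply_to_one, payload, expected=1):
    """Toy driver that enforces the one-result-per-op contract."""
    handles = []
    for op in payload:
        results = []
        succeeded = apply_to_one(op, results)
        if succeeded and len(results) != expected:
            raise ContractError(
                f"expected to produce {expected} results "
                f"(actually produced {len(results)})"
            )
        handles.extend(results)
    return handles
```

In this model, an ordinary copy yields one handle, while a payload containing `Copy("%sv", "%sv")` raises `ContractError` with the same wording as the diagnostic.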

Fix

Replace the post-bufferize `linalg_promote` mechanism with the layernorm prototype's pre-bufferize approach:

- Per-op `bufferize_to_allocation {memory_space = 2}` for each linalg op (fills, intermediate generic, reduce, output generic) → L1 alloc as the bufferization destination
- `promote_tensor to 2` for the input tensor → L1 input copy
- Drop `linalg_promote` and the custom `fuse_multi_op_linalg` of sq+reduce
- Add `generalize` of remaining `linalg.reduce` (delayed to PHASE 9), `convert_divf_sqrt_to_rsqrt`, and `broadcast_before_unary` for math.rsqrt, all matching the layernorm prototype

Self-copies are never emitted, so the diagnostic disappears structurally rather than via a workaround. Resulting AIR shows clean separated loops (square → reduce → output) with a single shared L1 input alloc.
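As a rough reference for the loop structure described above, the computation can be sketched in plain Python. This is an illustrative model, not the generated AIR: the function name, the `eps` parameter, and the list-based "L1 copy" are assumptions, and the use of a reciprocal square root reflects only what the name `convert_divf_sqrt_to_rsqrt` suggests.

```python
import math


def rms_norm_row(row, eps=0.0):
    """Reference RMS norm over one row, written as three separate loops."""
    # Single shared input copy, standing in for the L1 input alloc.
    x = list(row)

    # Loop 1: square each element.
    squares = [v * v for v in x]

    # Loop 2: reduce the squares to a scalar.
    total = 0.0
    for s in squares:
        total += s

    # Reciprocal square root in place of a division by sqrt.
    inv_rms = 1.0 / math.sqrt(total / len(x) + eps)

    # Loop 3: produce the output, reusing the same input copy.
    return [v * inv_rms for v in x]
```

Both the square loop and the output loop read from the same `x`, which mirrors the single shared L1 input alloc mentioned above.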

Test plan

- `scripts/run_tests.py --device aie2p` on local NPU2 (Strix): 17/18 pass (the one failure is the pre-existing, unrelated matvec failure)
- Standalone `python examples/rms_norm/rms_norm.py` on aie2p: pass, no stderr diagnostics
- Compilation produces `aie.elf` + `full_elf_config.json` cleanly
- Verify on Windows aie2p (the platform where #64 reported it): expected to pass since the fix is host-OS-independent
- aie2 path unaffected (no `transform_aie2.mlir` exists for this kernel; aie2p-only)
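A "no stderr diagnostics" check like the one above could be automated with a small helper. The sketch below is hypothetical and not part of the repository's test scripts; it only scans captured stderr text for the wording of the diagnostic quoted in the summary.

```python
import re

# Matches the wording of the diagnostic quoted in the summary.
RESULT_CONTRACT_DIAGNOSTIC = re.compile(
    r"expected to produce (\d+) results \(actually produced (\d+)\)"
)


def find_result_contract_errors(stderr_text):
    """Return (expected, actual) pairs for each matching diagnostic."""
    return [
        (int(m.group(1)), int(m.group(2)))
        for m in RESULT_CONTRACT_DIAGNOSTIC.finditer(stderr_text)
    ]
```

The text could come from, for example, `subprocess.run([...], capture_output=True, text=True).stderr`; an empty list means the diagnostic did not appear.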

🤖 Generated with Claude Code

…type Replace post-bufferize linalg_promote (which leaks self-copies that crash transform.air.copy_to_dma) with pre-bufferize bufferize_to_allocation + promote_tensor for L1 staging, mirroring mlir-air xrt 43_triton_layernorm. Eliminates "expected to produce 1 results (actually produced 0)" stderr on aie2p reported in amd#64.

Copilot

Pull request overview

This PR refactors the AIE2P MLIR transform sequence for the rms_norm example to mirror the mlir-air layernorm XRT prototype structure, replacing the prior post-bufferize linalg_promote-based L1 staging approach that could generate SSA self-copies and trigger transform.air.copy_to_dma result-contract diagnostics.

Changes:

- Reworks the transform pipeline into phased steps (canonicalize/fuse, dataflow navigation, tiling+fusion, pre-bufferize allocations/promotions, bufferize, vectorization prep, herd/DMA/vectorization, type casts).
- Replaces post-bufferize transform.air.linalg_promote staging with pre-bufferize bufferize_to_allocation {memory_space = 2} and promote_tensor to 2 to avoid self-copy patterns.
- Adds prototype-aligned transforms: delayed generalize of remaining linalg.reduce, convert_divf_sqrt_to_rsqrt, and broadcast_before_unary for math.rsqrt.


Copilot AI review requested due to automatic review settings May 11, 2026 21:36

Copilot started reviewing on behalf of erwei-xilinx May 11, 2026 21:37

Copilot AI reviewed May 11, 2026


erwei-xilinx merged commit 4bcbe55 into amd:main May 11, 2026
16 of 17 checks passed

erwei-xilinx deleted the fix-rms-norm-aie2p-transform branch May 11, 2026 22:37

erwei-xilinx mentioned this pull request May 11, 2026

Add Windows wheel release job to nightly-wheels CI #64

Merged

4 tasks

erwei-xilinx merged 1 commit into amd:main from erwei-xilinx:fix-rms-norm-aie2p-transform