
Add average pooling example (row-wise mean reduction) #17

Merged: erwei-xilinx merged 4 commits into main from add-average-pool-example on Mar 11, 2026


erwei-xilinx (Collaborator) commented Mar 10, 2026

Summary

  • Adds a new examples/average_pool/ directory with a Triton kernel and AIE2 and AIE2P transform scripts
  • Computes per-row mean subtraction of a 2D [M, N] BF16 input: y = x - mean(x, dim=-1)
  • Uses the rms_norm reduction pattern: transpose_reduce + fuse_elementwise_linalg + tile_using_forall [1] + linalg_promote
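For checking results on the host, a plain-Python reference of the same computation may help. This is a minimal sketch, assuming round-to-nearest-even bf16 conversion and double-precision row accumulation (the AIE kernel's accumulation precision is not stated here); `to_bf16` and `mean_subtract_rows` are hypothetical helper names, not functions from the example.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the upper 16 bits of an IEEE float32. NaN/inf are not
    handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1  # lowest bit that survives truncation
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_subtract_rows(x: list[list[float]]) -> list[list[float]]:
    """Reference for y = x - mean(x, dim=-1) on an [M, N] input.

    Inputs and outputs are rounded to bf16; the row sum is accumulated in
    Python float (double) precision, which is an assumption, not a
    statement about how the AIE kernel accumulates.
    """
    y = []
    for row in x:
        row_bf16 = [to_bf16(v) for v in row]
        mean = sum(row_bf16) / len(row_bf16)             # per-row reduction
        y.append([to_bf16(v - mean) for v in row_bf16])  # broadcast subtract
    return y


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
    print(mean_subtract_rows(x))
```

With the small input above, the first row becomes [-1.5, -0.5, 0.5, 1.5] and the constant row becomes all zeros.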

Dependencies

Test plan

  • Verified on NPU2 (Strix/AIE2P) hardware: 31/32 elements correct for M=32, N=256
  • Known issue: the element at index 1 has an incorrect value. The suspected cause is the reduce init or vectorization; under investigation
  • Verified on NPU1 (Phoenix/AIE2) hardware: M=32 and M=64 with N=256 pass assert_close
  • CI build validation
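The test-plan figures (elements correct, max diff, elements above a tolerance) can be reproduced with a small comparison helper. The sketch below is hypothetical: `compare` is not part of the example, and it only mirrors the |a - e| <= atol + rtol * |e| closeness rule that `torch.testing.assert_close` uses. The tolerances are chosen by the caller, not taken from the example's test script.

```python
def compare(actual: list[float], expected: list[float],
            atol: float = 0.0, rtol: float = 0.0) -> tuple[float, list[int]]:
    """Return (max |a - e|, indices where |a - e| > atol + rtol * |e|)."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    max_diff = 0.0
    mismatches: list[int] = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        diff = abs(a - e)
        max_diff = max(max_diff, diff)
        if diff > atol + rtol * abs(e):
            mismatches.append(i)
    return max_diff, mismatches


if __name__ == "__main__":
    # Synthetic data only: 32 outputs with a single bad value at index 1.
    expected = [0.0] * 32
    actual = list(expected)
    actual[1] = 3.0
    max_diff, bad = compare(actual, expected, atol=0.5)
    print(f"{len(expected) - len(bad)}/{len(expected)} correct, "
          f"max diff {max_diff}, bad at {bad}")
```

Returning the mismatching indices, rather than a single pass/fail, is what makes a report like "1 element at index 1" possible.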

🤖 Generated with Claude Code

New reduction example computing the per-row mean of a 2D input, verified
on NPU2 (Strix/AIE2P). Uses the rms_norm reduction pattern with
linalg_promote for L1 staging and tile_sizes [2] to satisfy the
4-byte DMA alignment constraint (a single bf16 is only 2 bytes, so each
tile must produce two means).

Requires mlir-air >= 4bc5734 (fix for linalg_promote memref.cast on
linalg.reduce operands, Xilinx/mlir-air#1399).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
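The alignment arithmetic behind tile_sizes [2] is small enough to spell out. This sketch models the constraint as "each DMA transfer must be a nonzero multiple of 4 bytes", an assumption based on the wording of the commit message; the real AIE rule may be stricter. `min_elements_per_transfer` and `transfer_ok` are hypothetical names.

```python
import math

BF16_BYTES = 2     # one bfloat16 element
MIN_DMA_BYTES = 4  # 4-byte DMA constraint cited in the commit message


def min_elements_per_transfer(elem_bytes: int, min_bytes: int = MIN_DMA_BYTES) -> int:
    """Smallest element count whose byte size reaches min_bytes."""
    return math.ceil(min_bytes / elem_bytes)


def transfer_ok(num_elems: int, elem_bytes: int = BF16_BYTES,
                min_bytes: int = MIN_DMA_BYTES) -> bool:
    """Assumed rule: a transfer must be a nonzero multiple of min_bytes."""
    nbytes = num_elems * elem_bytes
    return nbytes > 0 and nbytes % min_bytes == 0


if __name__ == "__main__":
    # In the 1D-output design each row contributes one bf16 mean, so the
    # output transfer per tile holds tile_rows elements.
    for tile_rows in (1, 2):
        print(tile_rows, tile_rows * BF16_BYTES, transfer_ok(tile_rows))
```

This prints `1 2 False` and `2 4 True`: one row's mean is a 2-byte transfer, while two rows' means fill the 4-byte minimum.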
erwei-xilinx force-pushed the add-average-pool-example branch from 14825c7 to 426cbd6 (March 10, 2026 23:10)
mlir-air: 4ef22a2 -> 4bc5734 (includes #1402 fix for linalg_promote
memref.cast on linalg.reduce operands)
mlir-aie: c5d4bef -> c668d2c (matching mlir-air's clone-mlir-aie pin)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
erwei-xilinx force-pushed the add-average-pool-example branch from 426cbd6 to ca31d57 (March 10, 2026 23:15)
erwei-xilinx and others added 2 commits March 10, 2026 16:16
The 1D output form (storing just the mean per row) hits a 4-byte DMA
alignment constraint on AIE (memref<1xbf16> = 2 bytes < 4-byte min).
Redesigned as mean subtraction: y = x - mean(x), which broadcasts
the mean back to [BLOCK_M, BLOCK_N] and follows the rms_norm
reduction pattern exactly (tile [1], linalg_promote, 2D output DMA).

Verified on NPU2: max diff 0.016, 0/8192 elements above 0.5 tolerance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
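The key point of the redesign is that broadcasting the mean back makes each output tile a full row of bf16 values instead of a single bf16. A host-side sketch shows the difference. Assumptions: one loop iteration stands in for one forall instance from tile_using_forall [1], the whole row is treated as one tile (BLOCK_N = N), the DMA rule is modelled as a nonzero multiple of 4 bytes, and `tiled_mean_subtract`, `output_tile_bytes`, and `meets_dma_rule` are hypothetical names. This models the data flow, not the generated AIE code.

```python
BF16_BYTES = 2
MIN_DMA_BYTES = 4  # modelled as: transfer must be a nonzero multiple of 4 bytes


def output_tile_bytes(tile_m: int, n: int, broadcast: bool) -> int:
    """Bytes written back per row tile: [tile_m] means, or [tile_m, n] results."""
    elems = tile_m * n if broadcast else tile_m
    return elems * BF16_BYTES


def meets_dma_rule(nbytes: int) -> bool:
    return nbytes > 0 and nbytes % MIN_DMA_BYTES == 0


def tiled_mean_subtract(x: list[list[float]], tile_m: int = 1) -> list[list[float]]:
    """y = x - mean(x, dim=-1), processed in tiles of tile_m rows.

    Each loop iteration stands in for one forall instance: reduce each row
    to its mean, then broadcast the mean back over the row so the tile
    writes a full [tile_m, N] block.
    """
    m, n = len(x), len(x[0])
    if m % tile_m:
        raise ValueError("this sketch assumes M is divisible by tile_m")
    if not meets_dma_rule(output_tile_bytes(tile_m, n, broadcast=True)):
        raise ValueError("output tile violates the assumed 4-byte DMA rule")
    y: list[list[float]] = []
    for start in range(0, m, tile_m):
        for row in x[start:start + tile_m]:
            mean = sum(row) / n                # reduction over N
            y.append([v - mean for v in row])  # broadcast subtract
    return y


if __name__ == "__main__":
    M, N = 32, 256
    print(output_tile_bytes(1, N, broadcast=False),  # 1D mean output: 2 bytes
          output_tile_bytes(1, N, broadcast=True))   # 2D output: 512 bytes
    x = [[float(j) for j in range(N)] for _ in range(M)]
    y = tiled_mean_subtract(x)
    print(sum(len(r) for r in y))                    # 8192 elements
```

With M=32 and N=256 the output holds 8192 elements, matching the element count in the NPU2 result above.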
Verified average_pool on NPU1 (Phoenix/AIE2): both M=32 and M=64
with N=256 pass assert_close.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
erwei-xilinx merged commit 6f6c73f into main on Mar 11, 2026
8 of 9 checks passed
erwei-xilinx deleted the add-average-pool-example branch on March 11, 2026 04:00


1 participant