Add padded matmul with BF16 emulation and non-aligned dimensions by erwei-xilinx · Pull Request #28 · amd/Triton-XDNA

erwei-xilinx · 2026-03-26T05:12:25Z

Summary

Port mlir-air test 54 (54_matmul_padding_f32_bf16_emulation) to Triton-XDNA as examples/padded_matmul/, driven from a @triton.jit kernel
Add driver support for non-tile-aligned dimensions via actual-sizes extraction from JIT constexprs
Add AMD_TRITON_NPU_BF16_EMULATION env var for BF16 hardware emulation (f32 inputs truncated to bf16 before multiply, f32 accumulation)
Bump mlir-air/mlir-aie to versions that include the padding rank validation fix

Details

New example (examples/padded_matmul/):

F32 matmul with BF16 emulation on NPU2/Strix (AIE2P)
A stored in K×M layout (transposed), B in K×N layout
Tile sizes: TILE_M=64, TILE_N=32, HERD=4×4, LAUNCH_TILE=256×128
Default test: M=500, N=500, K=1024 (non-tile-aligned M and N)
Stochastic validation with bf16-aware golden reference

Driver changes (amd_triton_npu/backend/driver.py):

NPULauncher.__init__: extracts M/N from src.constants + src.fn.arg_names at JIT compile time; only sets actual-sizes when M % BLOCK_SIZE_M != 0 or N % BLOCK_SIZE_N != 0
_ttshared_to_air: passes actual-sizes to air-wrap-func-with-parallel to enable air-split-launch-for-padding
compile_module: threads actual_sizes and adds --bf16-emulation to aircc when env var is set
Cache key includes bf16emu flag

Dependency bumps:

mlir-air: f954272 → b312418 (includes fix for air.channel.put padding rank validation)
mlir-aie: d8acbc6 → df5c9a4 (matching mlir-air pin)

Test plan

padded_matmul with M=500, N=500, K=1024 (non-aligned): PASS on NPU2
padded_matmul with M=256, N=128, K=1024 (aligned): PASS on NPU2
All 15 existing examples pass on NPU2 (no regressions)
CI build validation

🤖 Generated with Claude Code

…on support Port mlir-air test 54 (54_matmul_padding_f32_bf16_emulation) to Triton-XDNA as a new example driven from a @triton.jit kernel. This demonstrates F32 matmul with BF16 hardware emulation on NPU2/Strix, supporting non-tile-aligned M and N dimensions via DMA padding. Driver changes: - Extract actual problem sizes (M, N) from JIT constexpr args at compile time and pass them as actual-sizes to air-wrap-func-with-parallel, enabling air-split-launch-for-padding on boundary tiles - Add AMD_TRITON_NPU_BF16_EMULATION env var support to pass --bf16-emulation flag to aircc - Include bf16_emulation in compilation cache key Dependency updates: - Bump mlir-air to b312418 (fixes air.channel.put padding rank validation) - Bump mlir-aie to df5c9a4 (matching mlir-air pin) Tested on NPU2 hardware: padded_matmul passes with M=500, N=500, K=1024; all 15 existing examples pass (no regressions). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Ports an MLIR-AIR matmul padding test to Triton-XDNA and extends the NPU driver to support non-tile-aligned problem sizes and optional BF16 emulation during AIR compilation.

Changes:

Add a new examples/padded_matmul/ example (Triton kernel + AIE2P transform script) demonstrating padding + BF16 emulation behavior.
Extend amd_triton_npu/backend/driver.py to plumb actual-sizes into air-wrap-func-with-parallel and add a BF16-emulation aircc flag + cache-key component.
Bump pinned mlir-air and mlir-aie hashes to newer commits.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
utils/mlir-air-hash.txt	Update pinned `mlir-air` commit/timestamp.
utils/mlir-aie-hash.txt	Update pinned `mlir-aie` commit/timestamp.
examples/padded_matmul/transform_aie2p.mlir	New AIE2P transform pipeline for padded matmul + BF16 emulation.
examples/padded_matmul/padded_matmul.py	New runnable example + stochastic validation for non-aligned M/N.
amd_triton_npu/backend/driver.py	Add `actual-sizes` plumbing + BF16 emulation flag and caching behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Use helper that tries multiple key forms ((idx,), idx, name) for constexpr lookup, ensuring padding support works across Triton versions - Fix comment typo: "BFP16 emulation" -> "BF16 emulation" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 26, 2026 05:12

Copilot started reviewing on behalf of erwei-xilinx March 26, 2026 05:13 View session

Fix black formatting in driver.py and padded_matmul.py

5eaeb71

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI reviewed Mar 26, 2026

View reviewed changes

Comment thread amd_triton_npu/backend/driver.py Outdated

Comment thread examples/padded_matmul/transform_aie2p.mlir Outdated

erwei-xilinx merged commit f7cad78 into main Mar 26, 2026
8 of 9 checks passed

erwei-xilinx deleted the padded-matmul-bf16-emulation branch March 26, 2026 16:21

erwei-xilinx mentioned this pull request Mar 26, 2026

Add padded_matmul to examples dashboard registry #29

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add padded matmul with BF16 emulation and non-aligned dimensions#28

Add padded matmul with BF16 emulation and non-aligned dimensions#28
erwei-xilinx merged 3 commits into
mainfrom
padded-matmul-bf16-emulation

erwei-xilinx commented Mar 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

erwei-xilinx commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erwei-xilinx commented Mar 26, 2026 •

edited

Loading