Skip to content

Add padded matmul with BF16 emulation and non-aligned dimensions#28

Merged
erwei-xilinx merged 3 commits into
mainfrom
padded-matmul-bf16-emulation
Mar 26, 2026
Merged

Add padded matmul with BF16 emulation and non-aligned dimensions#28
erwei-xilinx merged 3 commits into
mainfrom
padded-matmul-bf16-emulation

Conversation

@erwei-xilinx

@erwei-xilinx erwei-xilinx commented Mar 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Port mlir-air test 54 (54_matmul_padding_f32_bf16_emulation) to Triton-XDNA as examples/padded_matmul/, driven from a @triton.jit kernel
  • Add driver support for non-tile-aligned dimensions via actual-sizes extraction from JIT constexprs
  • Add AMD_TRITON_NPU_BF16_EMULATION env var for BF16 hardware emulation (f32 inputs truncated to bf16 before multiply, f32 accumulation)
  • Bump mlir-air/mlir-aie to versions that include the padding rank validation fix

Details

New example (examples/padded_matmul/):

  • F32 matmul with BF16 emulation on NPU2/Strix (AIE2P)
  • A stored in K×M layout (transposed), B in K×N layout
  • Tile sizes: TILE_M=64, TILE_N=32, HERD=4×4, LAUNCH_TILE=256×128
  • Default test: M=500, N=500, K=1024 (non-tile-aligned M and N)
  • Stochastic validation with bf16-aware golden reference

Driver changes (amd_triton_npu/backend/driver.py):

  • NPULauncher.__init__: extracts M/N from src.constants + src.fn.arg_names at JIT compile time; only sets actual-sizes when M % BLOCK_SIZE_M != 0 or N % BLOCK_SIZE_N != 0
  • _ttshared_to_air: passes actual-sizes to air-wrap-func-with-parallel to enable air-split-launch-for-padding
  • compile_module: threads actual_sizes and adds --bf16-emulation to aircc when env var is set
  • Cache key includes bf16emu flag

Dependency bumps:

  • mlir-air: f954272b312418 (includes fix for air.channel.put padding rank validation)
  • mlir-aie: d8acbc6df5c9a4 (matching mlir-air pin)

Test plan

  • padded_matmul with M=500, N=500, K=1024 (non-aligned): PASS on NPU2
  • padded_matmul with M=256, N=128, K=1024 (aligned): PASS on NPU2
  • All 15 existing examples pass on NPU2 (no regressions)
  • CI build validation

🤖 Generated with Claude Code

…on support

Port mlir-air test 54 (54_matmul_padding_f32_bf16_emulation) to Triton-XDNA
as a new example driven from a @triton.jit kernel. This demonstrates F32
matmul with BF16 hardware emulation on NPU2/Strix, supporting non-tile-aligned
M and N dimensions via DMA padding.

Driver changes:
- Extract actual problem sizes (M, N) from JIT constexpr args at compile time
  and pass them as actual-sizes to air-wrap-func-with-parallel, enabling
  air-split-launch-for-padding on boundary tiles
- Add AMD_TRITON_NPU_BF16_EMULATION env var support to pass --bf16-emulation
  flag to aircc
- Include bf16_emulation in compilation cache key

Dependency updates:
- Bump mlir-air to b312418 (fixes air.channel.put padding rank validation)
- Bump mlir-aie to df5c9a4 (matching mlir-air pin)

Tested on NPU2 hardware: padded_matmul passes with M=500, N=500, K=1024;
all 15 existing examples pass (no regressions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 05:12
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ports an MLIR-AIR matmul padding test to Triton-XDNA and extends the NPU driver to support non-tile-aligned problem sizes and optional BF16 emulation during AIR compilation.

Changes:

  • Add a new examples/padded_matmul/ example (Triton kernel + AIE2P transform script) demonstrating padding + BF16 emulation behavior.
  • Extend amd_triton_npu/backend/driver.py to plumb actual-sizes into air-wrap-func-with-parallel and add a BF16-emulation aircc flag + cache-key component.
  • Bump pinned mlir-air and mlir-aie hashes to newer commits.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
utils/mlir-air-hash.txt Update pinned mlir-air commit/timestamp.
utils/mlir-aie-hash.txt Update pinned mlir-aie commit/timestamp.
examples/padded_matmul/transform_aie2p.mlir New AIE2P transform pipeline for padded matmul + BF16 emulation.
examples/padded_matmul/padded_matmul.py New runnable example + stochastic validation for non-aligned M/N.
amd_triton_npu/backend/driver.py Add actual-sizes plumbing + BF16 emulation flag and caching behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread amd_triton_npu/backend/driver.py Outdated
Comment thread examples/padded_matmul/transform_aie2p.mlir Outdated
- Use helper that tries multiple key forms ((idx,), idx, name) for
  constexpr lookup, ensuring padding support works across Triton versions
- Fix comment typo: "BFP16 emulation" -> "BF16 emulation"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx merged commit f7cad78 into main Mar 26, 2026
8 of 9 checks passed
@erwei-xilinx erwei-xilinx deleted the padded-matmul-bf16-emulation branch March 26, 2026 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants