Add padded matmul with BF16 emulation and non-aligned dimensions #28
Merged
…on support

Port mlir-air test 54 (`54_matmul_padding_f32_bf16_emulation`) to Triton-XDNA as a new example driven from a `@triton.jit` kernel. This demonstrates F32 matmul with BF16 hardware emulation on NPU2/Strix, supporting non-tile-aligned M and N dimensions via DMA padding.

Driver changes:
- Extract actual problem sizes (M, N) from JIT constexpr args at compile time and pass them as `actual-sizes` to `air-wrap-func-with-parallel`, enabling `air-split-launch-for-padding` on boundary tiles
- Add `AMD_TRITON_NPU_BF16_EMULATION` env var support to pass the `--bf16-emulation` flag to aircc
- Include `bf16_emulation` in the compilation cache key

Dependency updates:
- Bump mlir-air to b312418 (fixes `air.channel.put` padding rank validation)
- Bump mlir-aie to df5c9a4 (matching mlir-air pin)

Tested on NPU2 hardware: `padded_matmul` passes with M=500, N=500, K=1024; all 15 existing examples pass (no regressions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
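To make "non-tile-aligned" concrete, here is a minimal sketch of the tiling arithmetic for one dimension. The block size of 64 is a hypothetical value chosen for illustration (the commit message does not state the tile sizes), and `boundary_tiles` and `needs_padding` are illustrative names, not functions from the driver.

```python
def boundary_tiles(size: int, block: int) -> tuple[int, int]:
    """Return (number of tiles, valid extent of the last tile) along one dimension."""
    num_tiles = -(-size // block)  # ceiling division
    last_extent = size - (num_tiles - 1) * block
    return num_tiles, last_extent


def needs_padding(m: int, n: int, block_m: int, block_n: int) -> bool:
    """True when M or N is not a multiple of its block size."""
    return m % block_m != 0 or n % block_n != 0


if __name__ == "__main__":
    # Hypothetical 64-wide tiles: M=500 needs 8 tiles, and the last one
    # covers only 52 valid rows, so the remainder must be padded.
    print(boundary_tiles(500, 64))
    print(needs_padding(500, 500, 64, 64))
```

Under that assumed block size, only the final tile along each dimension is partial, which is why the padding path applies to boundary tiles only.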
Pull request overview
Ports an MLIR-AIR matmul padding test to Triton-XDNA and extends the NPU driver to support non-tile-aligned problem sizes and optional BF16 emulation during AIR compilation.
Changes:
- Add a new `examples/padded_matmul/` example (Triton kernel + AIE2P transform script) demonstrating padding + BF16 emulation behavior.
- Extend `amd_triton_npu/backend/driver.py` to plumb `actual-sizes` into `air-wrap-func-with-parallel` and add a BF16-emulation aircc flag + cache-key component.
- Bump pinned `mlir-air` and `mlir-aie` hashes to newer commits.
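The cache-key component matters because a binary compiled without BF16 emulation must not be reused once the flag is turned on. A rough sketch of that idea follows; the function names, the key layout, and the assumption that the value `"1"` enables the feature are all hypothetical, and only the environment variable name comes from the PR.

```python
import hashlib
import os


def bf16_emulation_enabled(env=None) -> bool:
    """Read the BF16-emulation switch from the environment (or a dict for testing)."""
    env = os.environ if env is None else env
    # Assumption: "1" enables the feature; the real driver may accept other values.
    return env.get("AMD_TRITON_NPU_BF16_EMULATION", "0") == "1"


def cache_key(source: str, actual_sizes, bf16_emulation: bool) -> str:
    """Hash everything that changes the compiled artifact, including the flag."""
    parts = [
        source,
        f"actual_sizes={actual_sizes}",
        f"bf16emu={int(bf16_emulation)}",
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

With a key built this way, toggling the environment variable between runs yields a different key and therefore a fresh compile rather than a stale cache hit.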
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| utils/mlir-air-hash.txt | Update pinned mlir-air commit/timestamp. |
| utils/mlir-aie-hash.txt | Update pinned mlir-aie commit/timestamp. |
| examples/padded_matmul/transform_aie2p.mlir | New AIE2P transform pipeline for padded matmul + BF16 emulation. |
| examples/padded_matmul/padded_matmul.py | New runnable example + stochastic validation for non-aligned M/N. |
| amd_triton_npu/backend/driver.py | Add actual-sizes plumbing + BF16 emulation flag and caching behavior. |
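As an illustration of what "stochastic validation" can mean for a matmul, the sketch below spot-checks randomly sampled output elements against a reference dot product instead of comparing the whole matrix. This is an assumed approach written in plain Python; the function name, sample count, and tolerance are hypothetical and are not taken from `padded_matmul.py`.

```python
import random


def spot_check(a, b, c, m, n, k, samples=16, tol=1e-2, seed=0):
    """Compare randomly chosen entries of c against a reference a @ b."""
    rng = random.Random(seed)
    for _ in range(samples):
        i, j = rng.randrange(m), rng.randrange(n)
        ref = sum(a[i][p] * b[p][j] for p in range(k))
        # Relative tolerance, with a floor so near-zero references still work.
        if abs(c[i][j] - ref) > tol * max(1.0, abs(ref)):
            return False
    return True
```

Sampling indices over the full `m` by `n` range means the partial boundary tiles of a non-aligned problem are checked as well as the interior ones.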
- Use helper that tries multiple key forms (`(idx,)`, `idx`, `name`) for constexpr lookup, ensuring padding support works across Triton versions
- Fix comment typo: "BFP16 emulation" -> "BF16 emulation"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
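A helper of the kind described in this commit might look roughly like the following. It is a sketch under assumptions: the name `lookup_constexpr`, its signature, and the order in which the key forms are tried are hypothetical, and only the three key forms themselves come from the commit message.

```python
_MISSING = object()


def lookup_constexpr(constants: dict, arg_names: list, name: str):
    """Find a constexpr value whether it is keyed by (idx,), idx, or name."""
    idx = arg_names.index(name) if name in arg_names else None
    candidates = []
    if idx is not None:
        candidates += [(idx,), idx]  # tuple-index form, then plain-index form
    candidates.append(name)          # finally the argument name itself
    for key in candidates:
        value = constants.get(key, _MISSING)
        if value is not _MISSING:
            return value
    return None  # not a constexpr, or not present under any known key form
```

A sentinel is used instead of a plain `None` check so that a constexpr whose value is legitimately falsy (for example `0`) is still returned.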
Summary
- Port mlir-air test 54 (`54_matmul_padding_f32_bf16_emulation`) to Triton-XDNA as `examples/padded_matmul/`, driven from a `@triton.jit` kernel
- `actual-sizes` extraction from JIT constexprs
- `AMD_TRITON_NPU_BF16_EMULATION` env var for BF16 hardware emulation (f32 inputs truncated to bf16 before multiply, f32 accumulation)

Details
New example (`examples/padded_matmul/`):

Driver changes (`amd_triton_npu/backend/driver.py`):
- `NPULauncher.__init__`: extracts M/N from `src.constants` + `src.fn.arg_names` at JIT compile time; only sets `actual-sizes` when `M % BLOCK_SIZE_M != 0` or `N % BLOCK_SIZE_N != 0`
- `_ttshared_to_air`: passes `actual-sizes` to `air-wrap-func-with-parallel` to enable `air-split-launch-for-padding`
- `compile_module`: threads `actual_sizes` and adds `--bf16-emulation` to aircc when env var is set
- `bf16emu` flag

Dependency bumps:
- mlir-air: `f954272` → `b312418` (includes fix for `air.channel.put` padding rank validation)
- mlir-aie: `d8acbc6` → `df5c9a4` (matching mlir-air pin)

Test plan
- `padded_matmul` with M=500, N=500, K=1024 (non-aligned): PASS on NPU2
- `padded_matmul` with M=256, N=128, K=1024 (aligned): PASS on NPU2

🤖 Generated with Claude Code
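The numeric behaviour named in the Summary (f32 inputs truncated to bf16 before the multiply, f32 accumulation) can be illustrated in plain Python. This is only a host-side sketch of the arithmetic, not the NPU implementation: the function names are hypothetical, and truncation is modelled by zeroing the low 16 bits of the float32 encoding.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_to_bf16(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (sign, exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot_bf16_emulated(a_row, b_col) -> float:
    """Dot product with bf16-truncated operands and a float32 accumulator."""
    acc = 0.0
    for a, b in zip(a_row, b_col):
        prod = truncate_to_bf16(a) * truncate_to_bf16(b)
        acc = f32(acc + prod)  # accumulate at float32 precision
    return acc


if __name__ == "__main__":
    print(truncate_to_bf16(3.14159))  # low mantissa bits are dropped
    print(dot_bf16_emulated([1.0 + 2**-10, 2.0], [1.0, 3.0]))
```

Because each input keeps only 8 significant bits, results differ slightly from a full f32 matmul, so a validation against an f32 reference needs a correspondingly loose tolerance.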