Fix elementwise herd width: tile by num_threads, not fixed tile size by erwei-xilinx · Pull Request #70 · amd/Triton-XDNA

erwei-xilinx · 2026-06-29T23:07:01Z

Summary

@flatten_tile_forall (shared by all elementwise examples) tiled the flattened op with tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is unbounded in both directions and breaks AIR→AIE lowering at the edges. This PR switches the forall tiling to num_threads [4], fixing the herd width at ≤ 4 columns for any block size.

Root cause (what `tile_sizes [256]` did wrong)

- width 1 (BLOCK ≤ 256): the single-trip scf.forall is canonicalized away before herd mapping, so par_to_herd matches nothing (it emits "expected a single payload op", which is silently non-fatal) and the compute is never placed in a herd. aircc then fails downstream in air-dma-to-channel with "operand does not dominate this use".
- width ≥ 8 (BLOCK ≥ 2048): the herd is wider than the 4-column array, so placement fails with "No valid placement found".

num_threads [4] always produces ≤ 4 tiles, so the forall is never folded away as a single trip and never overflows the array.
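The width arithmetic above can be sketched in a few lines of Python. This is a simplified model, not code from the PR: the helper names (`width_with_tile_sizes`, `width_with_num_threads`, `outcome`) are hypothetical, and the model assumes the herd width equals the forall trip count and that the array has 4 columns, as on npu1.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def width_with_tile_sizes(block: int, tile: int = 256) -> int:
    # Fixed tile size: the trip count (herd width) grows with the block size.
    return ceildiv(block, tile)


def width_with_num_threads(block: int, num_threads: int = 4) -> int:
    # Fixed thread count: each thread takes ceildiv(block, num_threads)
    # elements, so the trip count never exceeds num_threads.
    per_thread = ceildiv(block, num_threads)
    return ceildiv(block, per_thread)


def outcome(width: int, columns: int = 4) -> str:
    # Assumed failure model, taken from the two cases described above.
    if width == 1:
        return "single-trip forall folded away"
    if width > columns:
        return "no valid placement"
    return "ok"


if __name__ == "__main__":
    for block in (128, 256, 512, 1024, 2048, 32768):
        before = width_with_tile_sizes(block)
        after = width_with_num_threads(block)
        print(block, before, outcome(before), after, outcome(after))
```

Under this model, the only power-of-2 block sizes in [128, 32768] that yield a usable width with a fixed tile of 256 are 512 and 1024, which lines up with the "before" results reported under Validation.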

Validation

Tested on npu1 hardware with maskless vec-add; the Triton cache was cleared before every config:

| BLOCK_SIZE_N | before (`tile_sizes [256]`) | after (`num_threads [4]`) |
| --- | --- | --- |
| ≤ 64 | FAIL | MISMATCH (see limitations) |
| 128 – 32768 (pow2) | only 512, 1024 PASS | all PASS |
| ≥ 65536 | FAIL | OOM (L2 capacity) |

Multi-block (e.g. N=8192, BLOCK=1024) and ragged tails (e.g. N=1500, BLOCK=512) PASS. The canonical examples/vec-add/vec-add.py sweep (BLOCK=1024, N=1024…32768) still passes — no regression.

Shape coverage after this change

Supported ⟺ BLOCK_SIZE_N is a power of 2 AND 128 ≤ BLOCK_SIZE_N ≤ 32768.

- N (data size) is unrestricted: the grid, cdiv(N, BLOCK), scales with it, and the ragged last block is handled by the launch-level size clamp. For example, N=8 works fine with BLOCK=128 or 1024; only BLOCK matters.
- Lower bound BLOCK ≥ 128: the per-core tile, BLOCK/4, must hold at least 2 vectors (≥ 32 elements for vector-16).
- Upper bound BLOCK ≤ 32768: L2/memtile capacity (a binary op needs 3 bf16 buffers); BLOCK=65536 OOMs.
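The coverage rule and the launch arithmetic can be written down as a small checker. This is only a sketch: the function names are hypothetical, and `raw_buffer_bytes` computes nothing more than the 3 × bf16 footprint mentioned above. It makes no assumption about the actual memtile capacity or about any extra buffering the backend allocates.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used for the launch grid."""
    return -(-a // b)


def is_supported_block(block: int) -> bool:
    # Supported iff BLOCK_SIZE_N is a power of 2 in [128, 32768].
    is_pow2 = block > 0 and (block & (block - 1)) == 0
    return is_pow2 and 128 <= block <= 32768


def launch_shape(n: int, block: int) -> tuple[int, int]:
    # Returns (grid size, number of valid elements in the last block).
    # A last-block extent smaller than `block` is the ragged tail that
    # the launch-level size clamp is described as handling.
    grid = cdiv(n, block)
    tail = n - (grid - 1) * block
    return grid, tail


def raw_buffer_bytes(block: int, num_buffers: int = 3, bytes_per_elem: int = 2) -> int:
    # Raw footprint of a binary op: two inputs plus one output, bf16 each.
    return block * num_buffers * bytes_per_elem


if __name__ == "__main__":
    for n, block in ((8192, 1024), (1500, 512), (8, 128)):
        print(n, block, is_supported_block(block), launch_shape(n, block))
    for block in (32768, 65536):
        print(block, raw_buffer_bytes(block))
```

The cases in the `__main__` block are the multi-block, ragged-tail, and small-N examples quoted in this description.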

Known limitations

- Masks are not supported. Any kernel using tl.load/store(..., mask=...) aborts aircc with SIGABRT, regardless of size, even with a statically all-true mask. This is an independent backend bug, not addressed here. (Note: ragged-tail bounds safety is already provided by AIR's launch-level size clamp, so maskless kernels with a partial last block still compute correct results.)
- BLOCK_SIZE_N < 128 produces wrong results (compiles, MISMATCH). At BLOCK=64 each core's tile is a single 16-lane vector, so the vectorize tile_using_for [16] loop becomes single-trip, is folded away, and the results are corrupted. This is the same single-trip pathology as the herd forall, one level down. Lowering this floor needs an adaptive vector width or a scalar fallback in @vectorize_generics_at_*.
- num_threads [4] is hardcoded to npu1's 4 columns. On npu2/AIE2P (more columns) this is correct but under-utilizes the array (caps at 4 cores). A follow-up should make the thread count target-aware (the textual transform.include mechanism can't pass params, so this likely needs the driver to inject it).
- Scope of validation: only maskless vec-add on npu1 was run on hardware. @flatten_tile_forall is shared by relu, silu, gelu, swiglu, leaky_relu (aie2 and aie2p). The change is structurally identical for those, but they have not been validated here; reviewers should confirm before relying on them, especially on npu2.
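The BLOCK ≥ 128 floor and the target-dependent thread count can be illustrated with the sketch below. All names are hypothetical and none of this is code from the backend. The 16-lane vector width and the 4-thread count come from this description; the 8-column figure for npu2 comes from the follow-up PR #71 referenced on this page.

```python
VECTOR_LANES = 16  # vector-16, as stated in the limitations above

# Hypothetical target table: herd columns per target.
COLUMNS_BY_TARGET = {"npu1": 4, "npu2": 8}


def per_core_tile(block: int, num_threads: int = 4) -> int:
    # Elements each core processes when the block is split across threads.
    return block // num_threads


def vector_trips(block: int, num_threads: int = 4, lanes: int = VECTOR_LANES) -> int:
    # Trip count of the inner vectorization loop on one core.
    return per_core_tile(block, num_threads) // lanes


def is_single_trip(block: int, num_threads: int = 4) -> bool:
    # A single-trip inner loop is the case reported to fold and MISMATCH.
    return vector_trips(block, num_threads) == 1


def threads_for_target(target: str) -> int:
    # Sketch of a target-aware thread count a driver could inject.
    return COLUMNS_BY_TARGET[target]


if __name__ == "__main__":
    for block in (64, 128, 1024):
        print(block, per_core_tile(block), vector_trips(block), is_single_trip(block))
    print(threads_for_target("npu1"), threads_for_target("npu2"))
```

With 4 threads, BLOCK=64 gives a 16-element per-core tile and a single vector trip, while BLOCK=128 gives two trips, which is where the lower bound comes from.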

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR updates the shared elementwise transform sequence to avoid producing an unbounded herd width by changing the scf.forall tiling strategy from a fixed tile size to a fixed number of threads, improving robustness of AIR→AIE lowering across block sizes.

Changes:

Switch @flatten_tile_forall from tile_sizes [256] to num_threads [4] to keep herd width bounded.
Expand the in-file rationale comment describing why num_threads is used.

The @flatten_tile_forall sequence tiled the flattened elementwise op with tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is unbounded in both directions and breaks lowering at the edges:

- width 1 (BLOCK <= 256): the single-trip scf.forall is canonicalized away before herd mapping, so par_to_herd matches nothing ("expected a single payload op", silently non-fatal) and the compute is never placed in a herd. aircc then fails in air-dma-to-channel with "operand does not dominate this use".
- width >= 8 (BLOCK >= 2048): the herd is wider than the 4-column array, so placement fails with "No valid placement found".

Switching to num_threads [4] fixes the herd width at <= 4 columns for any block size: it never folds to zero and never overflows the array.

Validated on npu1 hardware (vec-add, maskless): PASS for any N with BLOCK_SIZE_N a power of 2 in [128, 32768], single- and multi-block, including ragged tails. Previously only BLOCK in {512, 1024} worked.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

Address review feedback: the rationale referenced npu1-specific numbers (BLOCK/256, 4-column array, BLOCK>=2048). Reword in terms of ceildiv(block, tile) and the target's column count, and note the thread count is sized for npu1. Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

Address review feedback: explain in-code that the thread count is the npu1 4-column count, that AIE2P/npu2 scripts sharing this sequence are capped at 4 of 8 columns (correct but under-utilizing), and that a target-aware count is a follow-up. 4 is kept as the npu1-hardware-validated value. Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 29, 2026 23:07

Copilot started reviewing on behalf of erwei-xilinx June 29, 2026 23:07 View session

Copilot AI reviewed Jun 29, 2026

Comment thread amd_triton_npu/backend/transform_library/elementwise.mlir Outdated

Comment thread amd_triton_npu/backend/transform_library/elementwise.mlir

erwei-xilinx and others added 3 commits June 29, 2026 16:16

erwei-xilinx mentioned this pull request Jun 30, 2026

Add AIE2P (npu2) elementwise herd width: 8 threads for 8-column array #71

Merged

4 tasks

erwei-xilinx merged commit 81da30a into main Jun 30, 2026
12 of 13 checks passed

erwei-xilinx deleted the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:01

erwei-xilinx restored the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:04

erwei-xilinx deleted the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:04
