Fix elementwise herd width: tile by num_threads, not fixed tile size #70

Merged
erwei-xilinx merged 3 commits into main from fix-elementwise-herd-width-numthreads on Jun 30, 2026

@erwei-xilinx

Collaborator

Summary

@flatten_tile_forall (shared by all elementwise examples) tiled the flattened op with tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is unbounded in both directions and breaks AIR→AIE lowering at the edges. This PR switches the forall tiling to num_threads [4], fixing the herd width at ≤ 4 columns for any block size.

Root cause (what tile_sizes [256] did wrong)

  • width 1 (BLOCK ≤ 256): the single-trip scf.forall is canonicalized away before herd mapping, so par_to_herd matches nothing — it emits "expected a single payload op" which is silently non-fatal — and the compute is never placed in a herd. aircc then fails downstream in air-dma-to-channel with "operand does not dominate this use".
  • width ≥ 8 (BLOCK ≥ 2048): the herd is wider than the 4-column array → "No valid placement found".

num_threads [4] always produces ≤ 4 tiles, so the forall never folds to zero and never overflows the array.
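
The width arithmetic behind both failure modes can be sketched in a few lines of Python. This is an illustrative model, not code from the repository: the helper names are hypothetical, the 4-column limit is the npu1 figure quoted above, and treating num_threads [4] as a width of min(4, BLOCK) is an assumption about how the forall is tiled.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


NPU1_COLUMNS = 4  # array width quoted in this PR for npu1


def width_with_tile_sizes(block: int, tile: int = 256) -> int:
    # Old scheme: one forall iteration per 256-element tile.
    return ceildiv(block, tile)


def width_with_num_threads(block: int, num_threads: int = 4) -> int:
    # New scheme (assumed model): never more than num_threads iterations.
    return min(num_threads, block)


def classify(width: int, columns: int = NPU1_COLUMNS) -> str:
    # Labels mirror the two failure modes described in the root cause.
    if width <= 1:
        return "single-trip forall (folded away)"
    if width > columns:
        return "no valid placement"
    return "ok"


for block in (128, 256, 512, 1024, 2048, 32768):
    old = classify(width_with_tile_sizes(block))
    new = classify(width_with_num_threads(block))
    print(f"BLOCK={block:>5}  tile_sizes[256]: {old:<34} num_threads[4]: {new}")
```

Under this model only BLOCK=512 and BLOCK=1024 map to a legal width in the old scheme, which is consistent with the "before" results reported in the validation section.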

Validation

Tested on npu1 hardware, vec-add, maskless, triton cache cleared before every config:

| BLOCK_SIZE_N | before (tile_sizes [256]) | after (num_threads [4]) |
|---|---|---|
| ≤ 64 | FAIL | MISMATCH (see limitations) |
| 128 – 32768 (pow2) | only 512, 1024 PASS | all PASS |
| ≥ 65536 | FAIL | OOM (L2 capacity) |

Multi-block (e.g. N=8192, BLOCK=1024) and ragged tails (e.g. N=1500, BLOCK=512) PASS. The canonical examples/vec-add/vec-add.py sweep (BLOCK=1024, N=1024…32768) still passes — no regression.

Shape coverage after this change

Supported ⟺ BLOCK_SIZE_N is a power of 2 AND 128 ≤ BLOCK_SIZE_N ≤ 32768.

  • N (data size) is unrestricted: grid = cdiv(N, BLOCK) scales with it, and the ragged last block is handled by the launch-level size clamp. For example, N=8 works fine with BLOCK=128 or 1024; only BLOCK matters.
  • Lower bound BLOCK ≥ 128: per-core tile = BLOCK/4 must be ≥ 2 vectors (≥ 32 for vector-16).
  • Upper bound BLOCK ≤ 32768: L2/memtile capacity (binary op = 3 bf16 buffers); BLOCK=65536 OOMs.
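
The support rule and the launch arithmetic above can be written down as a small check. This is a sketch under stated assumptions: the function names are hypothetical, the ragged tail is modelled simply as the remainder left for the last block, and the L2 figure is only the raw footprint of three bf16 buffers (the actual memtile capacity is not stated in this PR, so it is not modelled).

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as used for the launch grid."""
    return -(-a // b)


MIN_BLOCK = 128     # lower bound stated in this PR
MAX_BLOCK = 32768   # upper bound stated in this PR


def is_supported_block(block: int) -> bool:
    # Supported iff BLOCK_SIZE_N is a power of 2 within [128, 32768].
    is_pow2 = block > 0 and (block & (block - 1)) == 0
    return is_pow2 and MIN_BLOCK <= block <= MAX_BLOCK


def launch_shape(n: int, block: int) -> tuple[int, int]:
    # Returns (grid, number of valid elements in the last block) for n > 0.
    grid = cdiv(n, block)
    last_block_size = n - (grid - 1) * block
    return grid, last_block_size


def l2_bytes_binary_op(block: int, bytes_per_elem: int = 2, buffers: int = 3) -> int:
    # Raw footprint of a binary op: two inputs and one output, bf16 each.
    return buffers * block * bytes_per_elem


print(is_supported_block(1024), launch_shape(1500, 512), l2_bytes_binary_op(32768))
```

For the ragged-tail case quoted in the validation (N=1500, BLOCK=512), this gives a grid of 3 with 476 valid elements in the last block.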

Known limitations

  1. Masks are not supported. Any kernel using tl.load/store(..., mask=...) aborts aircc with SIGABRT, regardless of size — even a statically all-true mask. This is an independent backend bug, not addressed here. (Note: ragged-tail bounds safety is already provided by AIR's launch-level size clamp, so maskless kernels with a partial last block still compute correct results.)
  2. BLOCK_SIZE_N < 128 produces wrong results (compiles, MISMATCH). At BLOCK=64 each core's tile is a single 16-lane vector, so the vectorize tile_using_for [16] loop becomes single-trip, gets folded away, and the output is corrupted. This is the same single-trip pathology as the herd forall, one level down. Lowering this floor needs an adaptive vector width or a scalar fallback in @vectorize_generics_at_*.
  3. num_threads [4] is hardcoded to npu1's 4 columns. On npu2/AIE2P (more columns) this is correct but under-utilizes the array (caps at 4 cores). A follow-up should make the thread count target-aware (the textual transform.include mechanism can't pass params, so this likely needs the driver to inject it).
  4. Scope of validation: only maskless vec-add on npu1 was run on hardware. @flatten_tile_forall is shared by relu, silu, gelu, swiglu, leaky_relu (aie2 and aie2p). The change is structurally identical for those, but they have not been validated here; reviewers should confirm before relying on them, especially on npu2.
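
Limitations 2 and 3 both come down to simple arithmetic, sketched below. This is a hypothetical model rather than code from the backend: the names are illustrative, the 16-lane vector width and the column counts (4 on npu1, 8 on npu2) are the figures quoted in this PR, and the target lookup only shows what a target-aware thread count might look like.

```python
NUM_THREADS = 4     # thread count hardcoded by this PR
VECTOR_LANES = 16   # vector-16 width quoted above


def vector_trips(block: int, num_threads: int = NUM_THREADS) -> int:
    # Trip count of the per-core vectorization loop (tile_using_for [16]).
    per_core_tile = block // num_threads
    return per_core_tile // VECTOR_LANES


# Hypothetical target table; column counts as stated in this PR.
TARGET_COLUMNS = {"npu1": 4, "npu2": 8}


def column_utilization(target: str, num_threads: int = NUM_THREADS) -> float:
    # Fraction of the target's columns that a fixed thread count can occupy.
    return min(num_threads, TARGET_COLUMNS[target]) / TARGET_COLUMNS[target]


for block in (64, 128, 256):
    print(f"BLOCK={block}: {vector_trips(block)} vector trip(s) per core")
for target in TARGET_COLUMNS:
    print(f"{target}: {column_utilization(target):.0%} of columns used")
```

BLOCK=64 yields a single vector trip per core, which is the single-trip case described in limitation 2, while BLOCK=128 yields two. With the thread count fixed at 4, the npu2 entry shows half of the columns in use, matching the under-utilization noted in limitation 3.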

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 29, 2026 23:07

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the shared elementwise transform sequence to avoid producing an unbounded herd width by changing the scf.forall tiling strategy from a fixed tile size to a fixed number of threads, improving robustness of AIR→AIE lowering across block sizes.

Changes:

  • Switch @flatten_tile_forall from tile_sizes [256] to num_threads [4] to keep herd width bounded.
  • Expand the in-file rationale comment describing why num_threads is used.

Comment thread amd_triton_npu/backend/transform_library/elementwise.mlir Outdated
Comment thread amd_triton_npu/backend/transform_library/elementwise.mlir
erwei-xilinx and others added 3 commits June 29, 2026 16:16
The @flatten_tile_forall sequence tiled the flattened elementwise op with
tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is
unbounded in both directions and breaks lowering at the edges:

  - width 1 (BLOCK <= 256): the single-trip scf.forall is canonicalized away
    before herd mapping, so par_to_herd matches nothing ("expected a single
    payload op", silently non-fatal) and the compute is never placed in a
    herd. aircc then fails in air-dma-to-channel with "operand does not
    dominate this use".
  - width >= 8 (BLOCK >= 2048): the herd is wider than the 4-column array,
    so placement fails with "No valid placement found".

Switching to num_threads [4] fixes the herd width at <= 4 columns for any
block size: it never folds to zero and never overflows the array.

Validated on npu1 hardware (vec-add, maskless): PASS for any N with
BLOCK_SIZE_N a power of 2 in [128, 32768], single- and multi-block,
including ragged tails. Previously only BLOCK in {512, 1024} worked.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

Address review feedback: the rationale referenced npu1-specific numbers
(BLOCK/256, 4-column array, BLOCK>=2048). Reword in terms of
ceildiv(block, tile) and the target's column count, and note the thread
count is sized for npu1.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>

Address review feedback: explain in-code that the thread count is the npu1
4-column count, that AIE2P/npu2 scripts sharing this sequence are capped at
4 of 8 columns (correct but under-utilizing), and that a target-aware count
is a follow-up. 4 is kept as the npu1-hardware-validated value.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx merged commit 81da30a into main Jun 30, 2026
12 of 13 checks passed
@erwei-xilinx erwei-xilinx deleted the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:01
@erwei-xilinx erwei-xilinx restored the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:04
@erwei-xilinx erwei-xilinx deleted the fix-elementwise-herd-width-numthreads branch June 30, 2026 04:04