Fix elementwise herd width: tile by num_threads, not fixed tile size#70
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the shared elementwise transform sequence to avoid producing an unbounded herd width by changing the scf.forall tiling strategy from a fixed tile size to a fixed number of threads, improving robustness of AIR→AIE lowering across block sizes.
Changes:
- Switch
@flatten_tile_forallfromtile_sizes [256]tonum_threads [4]to keep herd width bounded. - Expand the in-file rationale comment describing why
num_threadsis used.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The @flatten_tile_forall sequence tiled the flattened elementwise op with
tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is
unbounded in both directions and breaks lowering at the edges:
- width 1 (BLOCK <= 256): the single-trip scf.forall is canonicalized away
before herd mapping, so par_to_herd matches nothing ("expected a single
payload op", silently non-fatal) and the compute is never placed in a
herd. aircc then fails in air-dma-to-channel with "operand does not
dominate this use".
- width >= 8 (BLOCK >= 2048): the herd is wider than the 4-column array,
so placement fails with "No valid placement found".
Switching to num_threads [4] fixes the herd width at <= 4 columns for any
block size: it never folds to zero and never overflows the array.
Validated on npu1 hardware (vec-add, maskless): PASS for any N with
BLOCK_SIZE_N a power of 2 in [128, 32768], single- and multi-block,
including ragged tails. Previously only BLOCK in {512, 1024} worked.
Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Address review feedback: the rationale referenced npu1-specific numbers (BLOCK/256, 4-column array, BLOCK>=2048). Reword in terms of ceildiv(block, tile) and the target's column count, and note the thread count is sized for npu1. Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Address review feedback: explain in-code that the thread count is the npu1 4-column count, that AIE2P/npu2 scripts sharing this sequence are capped at 4 of 8 columns (correct but under-utilizing), and that a target-aware count is a follow-up. 4 is kept as the npu1-hardware-validated value. Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
@flatten_tile_forall(shared by all elementwise examples) tiled the flattened op withtile_sizes [256], making the herd width =BLOCK_SIZE_N / 256. That width is unbounded in both directions and breaks AIR→AIE lowering at the edges. This PR switches the forall tiling tonum_threads [4], fixing the herd width at ≤ 4 columns for any block size.Root cause (what
tile_sizes [256]did wrong)BLOCK ≤ 256): the single-tripscf.forallis canonicalized away before herd mapping, sopar_to_herdmatches nothing — it emits"expected a single payload op"which is silently non-fatal — and the compute is never placed in a herd.airccthen fails downstream inair-dma-to-channelwith"operand does not dominate this use".BLOCK ≥ 2048): the herd is wider than the 4-column array →"No valid placement found".num_threads [4]always produces ≤ 4 tiles, so the forall never folds to zero and never overflows the array.Validation
Tested on npu1 hardware, vec-add, maskless, triton cache cleared before every config:
tile_sizes [256])num_threads [4])Multi-block (e.g.
N=8192, BLOCK=1024) and ragged tails (e.g.N=1500, BLOCK=512) PASS. The canonicalexamples/vec-add/vec-add.pysweep (BLOCK=1024, N=1024…32768) still passes — no regression.Shape coverage after this change
Supported ⟺
BLOCK_SIZE_Nis a power of 2 AND128 ≤ BLOCK_SIZE_N ≤ 32768.grid = cdiv(N, BLOCK)scales it and the ragged last block is handled by the launch-level size clamp. e.g.N=8works fine withBLOCK=128or1024; onlyBLOCKmatters.BLOCK ≥ 128: per-core tile =BLOCK/4must be ≥ 2 vectors (≥ 32 for vector-16).BLOCK ≤ 32768: L2/memtile capacity (binary op = 3 bf16 buffers);BLOCK=65536OOMs.Known limitations
tl.load/store(..., mask=...)abortsairccwith SIGABRT, regardless of size — even a statically all-true mask. This is an independent backend bug, not addressed here. (Note: ragged-tail bounds safety is already provided by AIR's launch-level size clamp, so maskless kernels with a partial last block still compute correct results.)BLOCK_SIZE_N < 128produces wrong results (compiles, MISMATCH). AtBLOCK=64each core's tile is a single 16-lane vector, so the vectorizetile_using_for [16]becomes single-trip, gets folded, and corrupts — the same single-trip pathology as the herd forall, one level down. Lowering this floor needs an adaptive vector width / scalar fallback in@vectorize_generics_at_*.num_threads [4]is hardcoded to npu1's 4 columns. On npu2/AIE2P (more columns) this is correct but under-utilizes the array (caps at 4 cores). A follow-up should make the thread count target-aware (the textualtransform.includemechanism can't pass params, so this likely needs the driver to inject it).@flatten_tile_forallis shared by relu, silu, gelu, swiglu, leaky_relu (aie2 and aie2p). The change is structurally identical for those, but they have not been validated here — reviewers should confirm before relying on them, especially on npu2.🤖 Generated with Claude Code