From 7b7e8230a7cfa88c738f7bc997a78d996a04d4b4 Mon Sep 17 00:00:00 2001
From: erwei-xilinx
Date: Mon, 29 Jun 2026 16:16:04 -0700
Subject: [PATCH 1/3] Fix elementwise herd width: tile by num_threads, not fixed tile size

The @flatten_tile_forall sequence tiled the flattened elementwise op with
tile_sizes [256], making the herd width = BLOCK_SIZE_N / 256. That width is
unbounded in both directions and breaks lowering at the edges:

- width 1 (BLOCK <= 256): the single-trip scf.forall is canonicalized away
  before herd mapping, so par_to_herd matches nothing ("expected a single
  payload op", silently non-fatal) and the compute is never placed in a herd.
  aircc then fails in air-dma-to-channel with "operand does not dominate this
  use".
- width >= 8 (BLOCK >= 2048): the herd is wider than the 4-column array, so
  placement fails with "No valid placement found".

Switching to num_threads [4] fixes the herd width at <= 4 columns for any
block size: it never folds to zero and never overflows the array.

Validated on npu1 hardware (vec-add, maskless): PASS for any N with
BLOCK_SIZE_N a power of 2 in [128, 32768], single- and multi-block, including
ragged tails. Previously only BLOCK in {512, 1024} worked.

Co-Authored-By: Claude Sonnet 4 (1M context)
---
 amd_triton_npu/backend/transform_library/elementwise.mlir | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index 26fda74..eaf95c8 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -21,7 +21,10 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
   transform.yield
 }
 
-// Flatten to 1D, allocate result in L2, tile forall [256] for multi-core.
+// Flatten to 1D, allocate result in L2, split across a fixed number of cores.
+// num_threads (not tile_sizes) fixes the herd width regardless of block size:
+// tile_sizes [256] made width = BLOCK/256, which folds to no herd when
+// BLOCK <= 256 and overflows the 4-column array when BLOCK >= 2048.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module
@@ -35,7 +38,7 @@ transform.named_sequence @flatten_tile_forall(
   %op_1 = transform.structured.match ops{["linalg.generic"]} in %module
     : (!transform.any_op) -> !transform.any_op
   %tiled_op_1, %forall_op_1 =
-    transform.structured.tile_using_forall %op_1 tile_sizes [256]
+    transform.structured.tile_using_forall %op_1 num_threads [4]
     : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
   transform.yield
 }

From b8efaf13de957459a77f8d9752c25bd17855f1d4 Mon Sep 17 00:00:00 2001
From: erwei-xilinx
Date: Mon, 29 Jun 2026 16:22:22 -0700
Subject: [PATCH 2/3] Generalize flatten_tile_forall comment to be target-agnostic

Address review feedback: the rationale referenced npu1-specific numbers
(BLOCK/256, 4-column array, BLOCK>=2048). Reword in terms of
ceildiv(block, tile) and the target's column count, and note the thread count
is sized for npu1.

Co-Authored-By: Claude Sonnet 4 (1M context)
---
 .../backend/transform_library/elementwise.mlir | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index eaf95c8..51d78f7 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -22,9 +22,12 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
 }
 
 // Flatten to 1D, allocate result in L2, split across a fixed number of cores.
-// num_threads (not tile_sizes) fixes the herd width regardless of block size:
-// tile_sizes [256] made width = BLOCK/256, which folds to no herd when
-// BLOCK <= 256 and overflows the 4-column array when BLOCK >= 2048.
+// num_threads (not tile_sizes) keeps the herd width independent of block size.
+// With tile_sizes the width was ceildiv(block, tile): a single trip when the
+// block fits one tile (the forall is then folded away, leaving no herd) and
+// wider than the target's column count for large blocks (placement fails). A
+// fixed thread count avoids both. NOTE: the count below is sized for the npu1
+// 4-column array; targets with more columns (AIE2P) may want a larger value.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module

From 4eba6c291e93a0473f082db887b45d9b4b9b488f Mon Sep 17 00:00:00 2001
From: erwei-xilinx
Date: Mon, 29 Jun 2026 16:24:50 -0700
Subject: [PATCH 3/3] Document why num_threads is hardcoded to 4 (npu1 columns)

Address review feedback: explain in-code that the thread count is the npu1
4-column count, that AIE2P/npu2 scripts sharing this sequence are capped at 4
of 8 columns (correct but under-utilizing), and that a target-aware count is a
follow-up. 4 is kept as the npu1-hardware-validated value.

Co-Authored-By: Claude Sonnet 4 (1M context)
---
 .../backend/transform_library/elementwise.mlir | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index 51d78f7..2fbe3ac 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -26,8 +26,14 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
 // With tile_sizes the width was ceildiv(block, tile): a single trip when the
 // block fits one tile (the forall is then folded away, leaving no herd) and
 // wider than the target's column count for large blocks (placement fails). A
-// fixed thread count avoids both. NOTE: the count below is sized for the npu1
-// 4-column array; targets with more columns (AIE2P) may want a larger value.
+// fixed thread count avoids both.
+//
+// The count is intentionally hardcoded to 4 for the npu1 4-column array. This
+// sequence is also included by AIE2P (npu2) elementwise scripts, where 4 caps
+// the herd at 4 of the 8 available columns -- correct, but it under-utilizes
+// the array for large blocks. Making the count target-aware (a per-target
+// sequence, or a driver-injected parameter) is left as a follow-up; 4 is kept
+// for now because it is the value validated on npu1 hardware.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module
@@ -41,6 +47,7 @@ transform.named_sequence @flatten_tile_forall(
   %op_1 = transform.structured.match ops{["linalg.generic"]} in %module
     : (!transform.any_op) -> !transform.any_op
   %tiled_op_1, %forall_op_1 =
+    // 4 = npu1 column count (hardcoded; see note above for AIE2P/npu2).
     transform.structured.tile_using_forall %op_1 num_threads [4]
     : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
   transform.yield