From 7b7e8230a7cfa88c738f7bc997a78d996a04d4b4 Mon Sep 17 00:00:00 2001
From: erwei-xilinx <erwei.wang@amd.com>
Date: Mon, 29 Jun 2026 16:16:04 -0700
Subject: [PATCH 1/3] Fix elementwise herd width: tile by num_threads, not
 fixed tile size

The @flatten_tile_forall sequence tiled the flattened elementwise op with
tile_sizes [256], making the herd width = ceildiv(BLOCK_SIZE_N, 256). That
width scales with the block size, and lowering breaks at both extremes:

  - width 1 (BLOCK <= 256): the single-trip scf.forall is canonicalized away
    before herd mapping, so par_to_herd matches nothing ("expected a single
    payload op", silently non-fatal) and the compute is never placed in a
    herd. aircc then fails in air-dma-to-channel with "operand does not
    dominate this use".
  - width >= 8 (BLOCK >= 2048): the herd is wider than the 4-column array,
    so placement fails with "No valid placement found".

Switching to num_threads [4] caps the herd width at 4 columns for any
block size: the forall no longer collapses to a single trip, and the herd
never overflows the array.
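
As a rough illustration of the width arithmetic above, here is a small
Python model. It is a sketch, not part of the patch: the helper names
(ceildiv, width_with_tile_sizes, width_with_num_threads, classify) are
hypothetical, and it assumes that a single-trip forall is always folded
away and that any width above the column count fails placement.

```python
NPU1_COLUMNS = 4  # column count of the npu1 array, as stated in this message


def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def width_with_tile_sizes(block: int, tile: int = 256) -> int:
    """Herd width under tile_sizes [tile]: one forall trip per tile."""
    return ceildiv(block, tile)


def width_with_num_threads(block: int, num_threads: int = 4) -> int:
    """Herd width under num_threads [n]: independent of the block size
    (simplified model; assumes block >= num_threads)."""
    return num_threads


def classify(width: int, columns: int = NPU1_COLUMNS) -> str:
    """Assumed outcome for a given herd width."""
    if width <= 1:
        return "folded"    # single-trip forall removed, no herd is formed
    if width > columns:
        return "overflow"  # herd wider than the array, placement fails
    return "ok"


if __name__ == "__main__":
    # Powers of two in [128, 32768], the range validated on hardware.
    for exp in range(7, 16):
        block = 1 << exp
        old = classify(width_with_tile_sizes(block))
        new = classify(width_with_num_threads(block))
        print(f"BLOCK={block:6d}  tile_sizes[256]: {old:8s}  num_threads[4]: {new}")
```

Under these assumptions the tile_sizes model is "ok" only for BLOCK in
{512, 1024}, which matches the behavior observed before this change.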

Validated on npu1 hardware (vec-add, maskless): PASS for any N with
BLOCK_SIZE_N a power of 2 in [128, 32768], single- and multi-block,
including ragged tails. Previously only BLOCK in {512, 1024} worked.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
---
 amd_triton_npu/backend/transform_library/elementwise.mlir | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index 26fda74..eaf95c8 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -21,7 +21,10 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
   transform.yield
 }
 
-// Flatten to 1D, allocate result in L2, tile forall [256] for multi-core.
+// Flatten to 1D, allocate result in L2, split across a fixed number of cores.
+// num_threads (not tile_sizes) fixes the herd width regardless of block size:
+// tile_sizes [256] made width = BLOCK/256, which folds to no herd when
+// BLOCK <= 256 and overflows the 4-column array when BLOCK >= 2048.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module
@@ -35,7 +38,7 @@ transform.named_sequence @flatten_tile_forall(
   %op_1 = transform.structured.match ops{["linalg.generic"]} in %module
       : (!transform.any_op) -> !transform.any_op
   %tiled_op_1, %forall_op_1 =
-      transform.structured.tile_using_forall %op_1 tile_sizes [256]
+      transform.structured.tile_using_forall %op_1 num_threads [4]
       : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
   transform.yield
 }

From b8efaf13de957459a77f8d9752c25bd17855f1d4 Mon Sep 17 00:00:00 2001
From: erwei-xilinx <erwei.wang@amd.com>
Date: Mon, 29 Jun 2026 16:22:22 -0700
Subject: [PATCH 2/3] Generalize flatten_tile_forall comment to be
 target-agnostic

Address review feedback: the rationale referenced npu1-specific numbers
(BLOCK/256, 4-column array, BLOCK>=2048). Reword in terms of
ceildiv(block, tile) and the target's column count, and note the thread
count is sized for npu1.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
---
 .../backend/transform_library/elementwise.mlir           | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index eaf95c8..51d78f7 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -22,9 +22,12 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
 }
 
 // Flatten to 1D, allocate result in L2, split across a fixed number of cores.
-// num_threads (not tile_sizes) fixes the herd width regardless of block size:
-// tile_sizes [256] made width = BLOCK/256, which folds to no herd when
-// BLOCK <= 256 and overflows the 4-column array when BLOCK >= 2048.
+// num_threads (not tile_sizes) keeps the herd width independent of block size.
+// With tile_sizes the width was ceildiv(block, tile): a single trip when the
+// block fits one tile (the forall is then folded away, leaving no herd) and
+// wider than the target's column count for large blocks (placement fails). A
+// fixed thread count avoids both. NOTE: the count below is sized for the npu1
+// 4-column array; targets with more columns (AIE2P) may want a larger value.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module

From 4eba6c291e93a0473f082db887b45d9b4b9b488f Mon Sep 17 00:00:00 2001
From: erwei-xilinx <erwei.wang@amd.com>
Date: Mon, 29 Jun 2026 16:24:50 -0700
Subject: [PATCH 3/3] Document why num_threads is hardcoded to 4 (npu1 columns)

Address review feedback: explain in-code that the thread count is the npu1
4-column count, that AIE2P/npu2 scripts sharing this sequence are capped at
4 of 8 columns (correct but under-utilizing), and that a target-aware count
is a follow-up. 4 is kept as the npu1-hardware-validated value.

Co-Authored-By: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
---
 .../backend/transform_library/elementwise.mlir        | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/amd_triton_npu/backend/transform_library/elementwise.mlir b/amd_triton_npu/backend/transform_library/elementwise.mlir
index 51d78f7..2fbe3ac 100644
--- a/amd_triton_npu/backend/transform_library/elementwise.mlir
+++ b/amd_triton_npu/backend/transform_library/elementwise.mlir
@@ -26,8 +26,14 @@ transform.named_sequence @fuse_elementwise_and_canonicalize(
 // With tile_sizes the width was ceildiv(block, tile): a single trip when the
 // block fits one tile (the forall is then folded away, leaving no herd) and
 // wider than the target's column count for large blocks (placement fails). A
-// fixed thread count avoids both. NOTE: the count below is sized for the npu1
-// 4-column array; targets with more columns (AIE2P) may want a larger value.
+// fixed thread count avoids both.
+//
+// The count is intentionally hardcoded to 4 for the npu1 4-column array. This
+// sequence is also included by AIE2P (npu2) elementwise scripts, where 4 caps
+// the herd at 4 of the 8 available columns -- correct, but it under-utilizes
+// the array for large blocks. Making the count target-aware (a per-target
+// sequence, or a driver-injected parameter) is left as a follow-up; 4 is kept
+// for now because it is the value validated on npu1 hardware.
 transform.named_sequence @flatten_tile_forall(
     %module: !transform.any_op {transform.readonly}) {
   %op = transform.structured.match ops{["linalg.generic"]} in %module
@@ -41,6 +47,7 @@ transform.named_sequence @flatten_tile_forall(
   %op_1 = transform.structured.match ops{["linalg.generic"]} in %module
       : (!transform.any_op) -> !transform.any_op
   %tiled_op_1, %forall_op_1 =
+      // 4 = npu1 column count (hardcoded; see note above for AIE2P/npu2).
       transform.structured.tile_using_forall %op_1 num_threads [4]
       : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
   transform.yield