The shared @flatten_tile_forall sequence tiles into num_threads [4] for
npu1's 4-column array. On npu2 (AIE2P / Strix) the array is 8 columns
wide, so 4 threads leave half the array idle.
Add @flatten_tile_forall_aie2p, an 8-thread variant, and point every
AIE2P elementwise script (vec-add, relu, silu, gelu, sigmoid, swiglu,
axpy, leaky_relu) at it. The npu1 sequence and the aie2 scripts are
unchanged.
NOTE: correct multi-program (grid > 1) execution on npu2 depends on
mlir-air PR #1696 (Xilinx/mlir-air), which fixes air-split-l2-memref
dropping the per-iteration air.launch base offset when it splits the L2
buffer across the 8 columns. The 8-way split added here is what exposes
that bug. Without an mlir-air build containing the fix, grid > 1
elementwise kernels move only the first program's data on npu2; grid ==
1 (one large block split across the herd) is correct regardless. See the
dependency note on @flatten_tile_forall_aie2p in elementwise.mlir.
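
As a rough illustration of the failure mode described above, the sketch
below models which L2 elements get moved when one program's block is
split across 8 columns. It is a simplified Python model, not mlir-air
code: column_ranges, covered, block_size, and keep_launch_offset are
hypothetical names, and it assumes each program owns one contiguous
block that is split evenly across the columns.

```python
# Hypothetical model of the L2 split; not taken from mlir-air.
def column_ranges(program_id, block_size, num_columns=8, keep_launch_offset=True):
    """Return one (start, end) element range per column for one program."""
    chunk = block_size // num_columns
    # The per-iteration launch base offset; the bug amounts to treating it as 0.
    base = program_id * block_size if keep_launch_offset else 0
    return [(base + c * chunk, base + (c + 1) * chunk) for c in range(num_columns)]


def covered(grid, block_size, keep_launch_offset):
    """Return the set of element indices touched by all programs in the grid."""
    elems = set()
    for pid in range(grid):
        for start, end in column_ranges(
            pid, block_size, keep_launch_offset=keep_launch_offset
        ):
            elems.update(range(start, end))
    return elems
```

In this model, dropping the base offset with grid == 2 makes every
program touch the first block again, while grid == 1 covers the same
elements either way because its base offset is already 0.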