Description
Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast 16 by 16 matrix multiplications with a single instruction. We can reshape the input array of length
In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.
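To make the two-stage idea concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions, not SimSIMD code: the helper names `matvec` and `reduce_via_matvec` are hypothetical, the tile width defaults to the 16 mentioned above, and zero-padding of the tail is my assumption for inputs whose length is not a multiple of the tile width. On real hardware each stage would map to a matrix instruction rather than Python loops.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def reduce_via_matvec(values, tile=16):
    """Sum `values` using two multiplications by all-ones vectors (hypothetical helper)."""
    # Zero-pad so the length is a multiple of the tile width; zeros do not change the sum.
    padded = list(values) + [0.0] * (-len(values) % tile)
    # Reshape the flat array into rows of `tile` elements each.
    matrix = [padded[i:i + tile] for i in range(0, len(padded), tile)]
    # Stage 1: X @ ones collapses every row into its partial sum.
    row_sums = matvec(matrix, [1.0] * tile)
    # Stage 2: ones^T @ row_sums collapses the partial sums into one scalar.
    return matvec([row_sums], [1.0] * len(row_sums))[0]
```

For example, `reduce_via_matvec([float(i) for i in range(256)])` treats the input as a 16 by 16 matrix and should agree with the built-in `sum` on the same values.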