Using Mat-Mul Instructions, like Arm SME and Intel AMX

Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones. 
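As a sketch of why this works, assume $N$ is a multiple of $256$ and write the reshaped input as a matrix $A$ with $16$ rows and $N/16$ columns, split into $16 \times 16$ tiles $A_t$. The symbols $A$, $A_t$, $p$, and $\mathbf{1}_{16}$ are notation introduced here for illustration, not taken from the issue:

$$
p = \sum_{t=1}^{N/256} A_t \, \mathbf{1}_{16}, \qquad \sum_{i=1}^{N} x_i = \mathbf{1}_{16}^{\top} p
$$

The first stage produces the $16$-element vector of row sums $p$, and the second stage collapses $p$ into a scalar.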

Let's say our hardware supports fast $16 \times 16$ matrix multiplications with a single instruction. We can reshape the input array of length $N$ into a matrix of $16$ rows and $N/16$ columns, then slide a tiled matrix-multiplication instruction along that wide matrix, multiplying each tile by a $16$-element vector of ones and accumulating the results into $16$ other floats. A final pass sums those $16$ partial results into a single scalar.
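The tiling scheme above can be emulated in scalar code to check the indexing before writing any intrinsics. This is a minimal sketch, not the SME or AMX implementation: the function name `tiled_ones_reduce`, the row-major reshape, and the scalar handling of a tail that doesn't fill a whole tile are all assumptions made for illustration.

```python
TILE = 16  # assumed tile side, matching the 16x16 example above


def tiled_ones_reduce(values):
    """Sum `values` by emulating 16x16 tile-by-ones-vector products."""
    block = TILE * TILE
    full = len(values) - len(values) % block  # largest prefix made of whole tiles
    cols = full // TILE  # columns of the 16-row, row-major matrix
    ones = [1.0] * TILE
    acc = [0.0] * TILE  # one accumulator per matrix row

    # Stage 1: slide a 16x16 tile along the wide matrix.
    # Each inner step is one row of the tile dotted with the ones vector.
    for col in range(0, cols, TILE):
        for row in range(TILE):
            start = row * cols + col
            tile_row = values[start:start + TILE]
            acc[row] += sum(a * b for a, b in zip(tile_row, ones))

    # Stage 2: collapse the 16 partial sums into a scalar.
    total = sum(acc)

    # Elements that don't fill a whole tile are summed directly.
    return total + sum(values[full:])
```

A real kernel would replace the inner two loops with a single tile instruction. The summation order differs from a sequential loop, so floating-point results may differ in the last bits.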

In practice, we can't use Intel AMX with `float32` inputs, but we can use Arm SME, and later apply similar techniques to [SimSIMD](https://github.com/ashvardanian/SimSIMD/pull/218).
