Optimize Arm f32 GEMM using FMLA by element #679
Conversation
The broadcast lane + FMA fusion is convenient, as broadcasting a lane is a more generally useful operation.
Suppress this new lint until I have time to fix existing failures.
This method is no longer `unsafe`.
On Arm this maps to the `vdupq_laneq_*` instructions. LLVM is conveniently able to fuse `vdupq_laneq` + `vfmaq` into an indexed FMLA operation. For other architectures we currently fall back to store + load.
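A minimal sketch of what this mapping can look like, assuming the `broadcast_lane` name from this PR's description (the crate's actual trait plumbing will differ; the fallback arm is purely illustrative):

```rust
// Sketch: on AArch64, broadcasting a lane maps to `vdupq_laneq_f32`;
// elsewhere, fall back to a store + load round-trip through memory.
#[cfg(target_arch = "aarch64")]
fn broadcast_lane<const LANE: i32>(
    v: std::arch::aarch64::float32x4_t,
) -> std::arch::aarch64::float32x4_t {
    // NEON is baseline on AArch64, so the intrinsic is always available;
    // this is why the method no longer needs to be `unsafe` for callers.
    unsafe { std::arch::aarch64::vdupq_laneq_f32::<LANE>(v) }
}

#[cfg(not(target_arch = "aarch64"))]
fn broadcast_lane<const LANE: i32>(v: [f32; 4]) -> [f32; 4] {
    // Store + load fallback: read one element back and splat it.
    [v[LANE as usize]; 4]
}
```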
Tested with a ModernBERT base model on an M3 Pro:

Before: ~420ms mean
Similar to #679, use the by-element variant of UDOT [^1] to reduce the number of loads in the int8 GEMM microkernel. Instead of performing 4 scalar 32-bit loads and broadcasting the result, perform one 128-bit load and use indexed UDOT to implicitly broadcast a 32-bit lane for each of the 4 rows. [^1]: https://developer.arm.com/documentation/100069/0609/A64-SIMD-Vector-Instructions/UDOT--vector--by-element-
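A hedged sketch of that indexed-UDOT pattern, assuming the `vdotq_laneq_u32` intrinsic from `std::arch::aarch64` (gated on the `dotprod` target feature; availability may depend on toolchain version). The function and variable names here are invented for illustration:

```rust
// Hypothetical helper: accumulate 4 rows x 4 columns of u8 dot products.
// One 128-bit load of A supplies four 32-bit lanes; each indexed UDOT
// broadcasts one lane implicitly instead of a separate scalar load.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "dotprod")]
unsafe fn udot_rows(
    acc: &mut [std::arch::aarch64::uint32x4_t; 4],
    a_quad: std::arch::aarch64::uint8x16_t, // 4 groups of 4 u8 values from A
    b: std::arch::aarch64::uint8x16_t,      // 4 columns x 4 u8 values from B
) {
    use std::arch::aarch64::vdotq_laneq_u32;
    // The const generic selects which 32-bit lane of `a_quad` to use.
    acc[0] = vdotq_laneq_u32::<0>(acc[0], b, a_quad);
    acc[1] = vdotq_laneq_u32::<1>(acc[1], b, a_quad);
    acc[2] = vdotq_laneq_u32::<2>(acc[2], b, a_quad);
    acc[3] = vdotq_laneq_u32::<3>(acc[3], b, a_quad);
}
```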
Arm supports an FMLA instruction variant [^1] which multiplies one lane of a vector by values in another vector and then accumulates into a destination. This enables reducing the number of loads from A in each iteration of the GEMM microkernel by a factor of 4. Instead of broadcasting one scalar value from A at a time into a register used as an FMA operand, we can load a vector of 4 elements from A and then each FMA specifies which of the 4 elements to use.

This is implemented using the `broadcast_lane` + `mul_add` generic operations (`vdupq_laneq_f32` followed by `fmla`), which LLVM will fuse into a single `fmla`.
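A simplified sketch of the resulting inner-loop pattern, written directly against NEON intrinsics rather than the crate's generic SIMD layer (the function and variable names are invented for illustration, not the crate's actual API):

```rust
// One step of the microkernel pattern: a single 4-wide load from A replaces
// four scalar broadcasts, and each `vdupq_laneq_f32` + `vfmaq_f32` pair is
// fused by LLVM into an indexed FMLA.
#[cfg(target_arch = "aarch64")]
unsafe fn fma_tile(
    acc: &mut [std::arch::aarch64::float32x4_t; 4],
    a: *const f32, // points at 4 consecutive f32 values from A
    b: std::arch::aarch64::float32x4_t,
) {
    use std::arch::aarch64::{vdupq_laneq_f32, vfmaq_f32, vld1q_f32};
    let a_vec = vld1q_f32(a); // one load instead of four broadcasts
    acc[0] = vfmaq_f32(acc[0], vdupq_laneq_f32::<0>(a_vec), b);
    acc[1] = vfmaq_f32(acc[1], vdupq_laneq_f32::<1>(a_vec), b);
    acc[2] = vfmaq_f32(acc[2], vdupq_laneq_f32::<2>(a_vec), b);
    acc[3] = vfmaq_f32(acc[3], vdupq_laneq_f32::<3>(a_vec), b);
}
```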
Future work:

[^1]: https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--by-element---Floating-point-fused-Multiply-Add-to-accumulator--by-element--?lang=en