Optimize Arm f32 GEMM using FMLA (by element) #679
Merged
+224
−21
Arm supports an FMLA instruction variant[^1] which multiplies one lane of a
vector by the values in another vector and accumulates the result into a destination.
This reduces the number of loads from A in each iteration of the GEMM
microkernel by a factor of 4: instead of broadcasting one scalar value from A at
a time into a register used as an FMA operand, we can load a vector of 4
elements from A and have each FMA specify which of the 4 elements to use.
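The restructuring can be sketched in portable scalar Rust (no intrinsics; the function and tile shape here are illustrative, not the crate's actual microkernel):

```rust
/// One K-step of a 4x4 GEMM micro-tile. A single load of 4 elements from
/// the current column of A replaces 4 separate scalar broadcasts; each
/// inner loop plays the role of one FMLA (by element), multiplying every
/// element of the B row by a single lane of the A vector and accumulating
/// into the corresponding row of the output tile.
fn microkernel_step(acc: &mut [[f32; 4]; 4], a_col: &[f32; 4], b_row: &[f32; 4]) {
    for lane in 0..4 {
        for j in 0..4 {
            acc[lane][j] += a_col[lane] * b_row[j];
        }
    }
}
```

On AArch64 each `lane` iteration maps onto one `FMLA Vd.4S, Vn.4S, Vm.S[lane]`, so the four loads per K-step collapse into one vector load.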
This is implemented using `broadcast_lane` + `mul_add` generic operations
(`vdupq_laneq_f32` followed by `fmla`), which LLVM will fuse into a single
`fmla` (by element) instruction.

Future work:
Footnotes

[^1]: https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--by-element---Floating-point-fused-Multiply-Add-to-accumulator--by-element--?lang=en