Improve performance of simde_mm512_add_epi32 #1126
Merged
Improve and simplify the implementation of `simde_mm512_add_epi32` as follows:

- Remove the explicit SVE implementation. For SVE vector lengths of VL={128, 256}, this explicit vector-length-agnostic (VLA) SVE loop performs significantly worse than the Neon equivalent, which can be executed using fewer instructions. This sequence of SVE intrinsics is also malformed according to clang, so it fails to compile altogether.
- Preferentially use GCC's vector extension if available, instead of repeated calls to `simde_mm256_add_epi32`. There are a couple of reasons for this:
  - The added indirection results in worse code generation. See the code generation attached to the commit message for an example with GCC 13.
  - GCC's vector extension is an easier optimization target for compilers, allowing them to emit performant code according to their own internal cost and tuning models. See the snippets attached to the commit message for an example of improved code generation in a vector-length-specific (VLS) context.

This brings the implementation of `simde_mm512_add_epi32` back in line with other similar AVX512 intrinsics, such as `simde_mm512_sub_epi32` and `simde_mm512_mul_ps`.

Fixes #980.
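
For readers unfamiliar with the two approaches compared above, here is a minimal standalone sketch. It is not the SIMDe source. The type names `v16u32` and `v8u32` and all three function names are hypothetical, chosen for illustration only. It assumes a compiler that supports the GCC `vector_size` attribute (GCC or clang), and it uses unsigned lanes so that wraparound on overflow is well defined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical vector types built with the GCC/clang vector_size attribute.
   These are illustration-only stand-ins, not SIMDe's private types. */
typedef uint32_t v16u32 __attribute__((vector_size(64))); /* 512 bits */
typedef uint32_t v8u32  __attribute__((vector_size(32))); /* 256 bits */

/* Vector-extension form: a single element-wise add over all sixteen lanes.
   The compiler is free to lower this however its cost model prefers. */
static void add_epi32_vecext(v16u32 *r, const v16u32 *a, const v16u32 *b) {
  *r = *a + *b;
}

/* Hypothetical 256-bit helper, standing in for a call such as
   simde_mm256_add_epi32. */
static void add_epi32_256(v8u32 *r, const v8u32 *a, const v8u32 *b) {
  *r = *a + *b;
}

/* Split form: view the 512-bit operands as two 256-bit halves and call the
   256-bit helper once per half. This is the extra indirection. */
static void add_epi32_halves(v16u32 *r, const v16u32 *a, const v16u32 *b) {
  v8u32 a_half[2], b_half[2], r_half[2];
  memcpy(a_half, a, sizeof *a);
  memcpy(b_half, b, sizeof *b);
  for (int i = 0; i < 2; i++) {
    add_epi32_256(&r_half[i], &a_half[i], &b_half[i]);
  }
  memcpy(r, r_half, sizeof *r);
}
```

Both functions compute the same result; the difference lies only in how much structure the compiler has to see through. Compiling the two with the same flags and comparing the assembly is one way to reproduce the kind of comparison the commit message describes.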