Investigate performance of integer comparisons for mask reductions #157

Open
@gnzlbg

Description

Currently, the all / any mask vector reductions are implemented as follows (a sketch of the sse2 case appears after this list):

  • x86/x86_64:
    • mmx: using _mm_movemask_pi8
    • sse2: using _mm_movemask_epi8
    • avx: using _mm256_testc_si256, _mm256_testz_si256
  • arm/aarch64:
    • neon: using vpmin, vpmax
  • other architectures call the llvm.experimental.vector.reduce.{and, or} intrinsics.
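For reference, a minimal sketch of what the current sse2 strategy looks like for a 16-lane 8-bit mask. The function names `all_m8x16` / `any_m8x16` are illustrative, not the crate's actual API:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_movemask_epi8};

// _mm_movemask_epi8 packs the sign bit of each of the 16 byte lanes into
// the low 16 bits of an i32, so the reductions become scalar compares.
#[cfg(target_arch = "x86_64")]
unsafe fn all_m8x16(mask: __m128i) -> bool {
    _mm_movemask_epi8(mask) == 0xFFFF // every lane's high bit is set
}

#[cfg(target_arch = "x86_64")]
unsafe fn any_m8x16(mask: __m128i) -> bool {
    _mm_movemask_epi8(mask) != 0 // at least one lane's high bit is set
}
```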

It might be much better to simply cast all vectors <= 128 bits wide to the corresponding integer type and do a != 0 comparison for the any reduction, and a == iX::max_value() comparison for the all reduction.
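A minimal sketch of that idea for a 128-bit mask, modeling the mask as its raw bytes and assuming each lane is either all-zeros or all-ones; the function names are hypothetical:

```rust
fn any_128(mask: [u8; 16]) -> bool {
    // Reinterpret the 16 mask bytes as a single u128: any set lane
    // means at least one set bit in the integer.
    u128::from_ne_bytes(mask) != 0
}

fn all_128(mask: [u8; 16]) -> bool {
    // All lanes set means every bit of the integer is set.
    u128::from_ne_bytes(mask) == u128::max_value()
}
```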

For 256-bit and 512-bit wide vectors we could cast them to a [u128; N] array and perform the integer comparisons against each element.
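For example, a 256-bit mask viewed as [u128; 2], under the same all-zeros/all-ones lane assumption (names again hypothetical):

```rust
fn any_256(mask: [u128; 2]) -> bool {
    // Any set bit in either half means some lane was set.
    mask.iter().any(|&half| half != 0)
}

fn all_256(mask: [u128; 2]) -> bool {
    // Both halves must be entirely ones.
    mask.iter().all(|&half| half == u128::max_value())
}
```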

Labels

A-AArch64 (ARM 64-bit architecture), A-arm (ARM 32-bit architecture), A-x86_64 (x86_64 architecture), Enhancement (New feature or request), P-low, Performance (Something isn't fast)
