L2Sqr NEON, unrolled loop, prefetching #86

Draft
wants to merge 1 commit into
base: main
from

xbasel
Member

@xbasel xbasel commented Apr 5, 2025

DRAFT - not ready

L2Sqr SIMD version for Arm NEON, with an unrolled loop and prefetching.

Graviton 3:

  Scalar time:     0.771 sec
  NEON time:       0.125 sec
  Speedup:          6.19x

Please note that the scalar and SIMD results differ by a small relative error, most likely because of floating-point rounding and summation order: floating-point addition is not associative.
This was not compiled with -ffast-math (I'm not sure whether, or how, that affects the result).
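
To make the rounding point concrete, here is a minimal sketch, not code from this PR. All function names are hypothetical. It shows that float addition depends on grouping, and that accumulating in four partial sums (the order a 4-lane NEON accumulator would use) can differ slightly from a sequential sum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when the two groupings of a + b + c round differently.
bool GroupingChangesResult(float a, float b, float c) {
  return (a + b) + c != a + (b + c);
}

// Sequential left-to-right sum, the order a plain scalar loop uses.
float SumSequential(const std::vector<float>& v) {
  float s = 0.0f;
  for (float x : v) s += x;
  return s;
}

// Four independent partial sums combined at the end, mimicking the
// summation order of a 4-lane SIMD accumulator (illustrative only).
float SumFourLanes(const std::vector<float>& v) {
  float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  std::size_t i = 0;
  for (; i + 4 <= v.size(); i += 4) {
    for (int k = 0; k < 4; ++k) lane[k] += v[i + k];
  }
  float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
  for (; i < v.size(); ++i) s += v[i];  // leftover tail elements
  return s;
}

// Relative difference between two finite results, guarded against 0/0.
float RelativeError(float a, float b) {
  assert(std::isfinite(a) && std::isfinite(b));
  float denom = std::fmax(std::fabs(a), std::fabs(b));
  return denom == 0.0f ? 0.0f : std::fabs(a - b) / denom;
}
```

For example, `GroupingChangesResult(1e8f, -1e8f, 1.0f)` is true: `(a + b) + c` gives 1, while `b + c` rounds back to `-1e8f`, so `a + (b + c)` gives 0. A check like `RelativeError(scalar, simd)` against a small tolerance is one way to compare the two kernels instead of testing for exact equality.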

Please note that there is already a SIMD implementation in https://github.com/valkey-io/valkey-search/blob/main/third_party/simsimd/include/simsimd/spatial.h , but I believe this implementation will outperform it because it unrolls the loop and prefetches memory.
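
For readers who have not seen the pattern, the following is a rough sketch of what an unrolled, prefetching NEON squared-L2 kernel can look like. It is not the code in this PR: the name `L2SqrSketch`, the unroll factor of 4, and the prefetch distance are all assumptions made for illustration. The non-Arm branch is only there so the sketch builds on other machines.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical name and signature, for illustration only.
float L2SqrSketch(const float* a, const float* b, std::size_t n) {
  assert((a != nullptr && b != nullptr) || n == 0);
  std::size_t i = 0;
  float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
  // Assumed prefetch distance in floats; a real kernel would tune this.
  constexpr std::size_t kPrefetchAhead = 64;
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  float32x4_t acc2 = vdupq_n_f32(0.0f);
  float32x4_t acc3 = vdupq_n_f32(0.0f);
  // Unrolled by 4: 16 floats per iteration, one accumulator per vector
  // so the four multiply-accumulate chains are independent.
  for (; i + 16 <= n; i += 16) {
    if (i + kPrefetchAhead < n) {  // keep the hinted address in bounds
      __builtin_prefetch(a + i + kPrefetchAhead);
      __builtin_prefetch(b + i + kPrefetchAhead);
    }
    float32x4_t d0 = vsubq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
    float32x4_t d1 = vsubq_f32(vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
    float32x4_t d2 = vsubq_f32(vld1q_f32(a + i + 8), vld1q_f32(b + i + 8));
    float32x4_t d3 = vsubq_f32(vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    acc0 = vfmaq_f32(acc0, d0, d0);  // acc0 += d0 * d0
    acc1 = vfmaq_f32(acc1, d1, d1);
    acc2 = vfmaq_f32(acc2, d2, d2);
    acc3 = vfmaq_f32(acc3, d3, d3);
  }
  // Combine the four accumulators, then add the four lanes together.
  sum = vaddvq_f32(vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3)));
#else
  // Portable fallback so the sketch builds off Arm: four scalar partial sums.
  float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (; i + 4 <= n; i += 4) {
    for (int k = 0; k < 4; ++k) {
      float d = a[i + k] - b[i + k];
      acc[k] += d * d;
    }
  }
  sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif
  // Scalar tail for the elements the main loop did not cover.
  for (; i < n; ++i) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}
```

The separate accumulators are the design choice that matters here: a single accumulator would serialize every multiply-accumulate behind the previous one. Whether the explicit prefetch helps is hardware dependent and would need to be measured.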

=====
Side note: the scalar implementation is already PARTIALLY vectorized by the compiler, as I can see in the generated assembly:

ldr     q16, [x0, x2]
ldr     q5, [x1, x2]
fsub    v5.4s, v16.4s, v5.4s
fmul    v5.4s, v5.4s, v5.4s

However, the summation is not vectorized (I see fadd, which is scalar).
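
This matches what one would expect from a loop shaped like the hypothetical scalar reference below (a sketch, not the project's actual code). The subtraction and multiplication are independent per element, so the compiler can emit vector `fsub`/`fmul`. The accumulation is a loop-carried dependency, and without permission to reassociate (for example GCC's `-ffast-math` or `-fassociative-math`) the compiler most likely has to keep the additions in their original order.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference, for illustration only.
float L2SqrScalarSketch(const float* a, const float* b, std::size_t n) {
  assert((a != nullptr && b != nullptr) || n == 0);
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    float d = a[i] - b[i];  // independent per element: can become a vector fsub
    float sq = d * d;       // independent per element: can become a vector fmul
    sum += sq;              // loop-carried: each add depends on the previous sum
  }
  return sum;
}
```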

Benchmark Results (1M elements):
  Scalar time:     0.771 sec
  NEON time:       0.125 sec
  Speedup:          6.19x
@xbasel xbasel changed the title from "L2Sqr NEON, unrolled loop, + prefetching" to "L2Sqr NEON, unrolled loop, prefetching" Apr 5, 2025
@xbasel xbasel marked this pull request as draft April 5, 2025 01:32
@yairgott
Collaborator

yairgott commented May 5, 2025

Can you provide VectorDBBench benchmark numbers that show the benefit of this change?

We have been using a forked version of VectorDBBench. For the client, please use memorydb, and for --case-type, please use Performance768D10M or Performance768D1M.
