perf: SIMD optimizations (+70% query efficiency, +30% indexing speed) #567
base: main
Conversation
- Enable Graviton-specific compiler flags (armv8.2-a, neoverse-n1 tuning)
- Add SVE support detection
- Increase allocator alignment to 16 bytes for better SIMD performance
- Add is_closer_distance helper in HNSW to handle fast-math edge cases
- Relax vector test tolerance
- Enable USE_SIMSIMD

Signed-off-by: Zvi Schneider <[email protected]>
Force-pushed from 6d80ea0 to da0892d
Raw results data can be reviewed here: https://github.com/zvi-code/valkey-bench-rs/tree/unstable/results
| target_compile_options(${TARGET} PRIVATE -mprfchw)
| elseif(VALKEY_SEARCH_IS_GRAV)
| # Graviton-optimized compilation flags
| # Reference: https://aws.github.io/graviton/perfrunbook/optimization_recommendation.html
I briefly reviewed the referenced link but couldn't find specific mentions of the flags listed below. Could we add direct references/links for each of these compilation optimizations?
| # - fp16: Half-precision floating point (critical for ML workloads)
| # - rcpc: Release Consistent Processor Consistent (better atomics)
| # - dotprod: Dot product instructions (critical for vector operations)
| target_compile_options(${TARGET} PRIVATE
Reading online, it seems that for x86_64 it is safe to set -mavx512vnni (equivalent to +dotprod) and -mf16c. WDYT?
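For reference, a sketch of what that x86_64 branch could look like. The flag names (-mavx512vnni, -mf16c) are real GCC/Clang options; the VALKEY_SEARCH_IS_X86 guard mirrors the PR's variable naming, and the compiler-support probe is my assumption, not code from the PR. Note that enabling these flags globally lets the compiler emit those instructions anywhere, so binaries can SIGILL on CPUs without AVX512-VNNI/F16C unless the hot kernels are runtime-dispatched:

```cmake
# Hypothetical sketch of per-ISA flags for the x86_64 branch.
elseif(VALKEY_SEARCH_IS_X86)
  include(CheckCXXCompilerFlag)
  check_cxx_compiler_flag("-mavx512vnni" COMPILER_HAS_AVX512VNNI)
  check_cxx_compiler_flag("-mf16c" COMPILER_HAS_F16C)
  # Caveat: these only check compiler support, not the CPU the binary
  # will eventually run on; runtime dispatch is still needed for
  # portable release builds.
  if(COMPILER_HAS_AVX512VNNI)
    target_compile_options(${TARGET} PRIVATE -mavx512vnni)
  endif()
  if(COMPILER_HAS_F16C)
    target_compile_options(${TARGET} PRIVATE -mf16c)
  endif()
endif()
```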
| message(STATUS "Current platform is aarch64")
| set(VALKEY_SEARCH_IS_ARM 1)
| set(VALKEY_SEARCH_IS_X86 0)
| set(VALKEY_SEARCH_IS_GRAV 0)
VALKEY_SEARCH_IS_ARM is currently only used on line 233:
target_link_libraries(${__TARGET} PRIVATE pthread)
Since Graviton is ARM-based, it's curious that this isn't required there as well. Is it possible that explicit pthread linkage is no longer necessary for ARM compilation in our environment? If so, we should consider removing VALKEY_SEARCH_IS_ARM entirely or consolidating it with VALKEY_SEARCH_IS_GRAV.
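If pthread linkage turns out to be the only consumer of VALKEY_SEARCH_IS_ARM, one portable option is to drop the per-arch flag and use CMake's standard Threads module on every platform. This is a sketch of that alternative, not code from the PR:

```cmake
# Sketch: replace the ARM-only explicit pthread link with CMake's
# portable Threads abstraction, which resolves to -pthread/-lpthread
# (or nothing) as the toolchain requires, on every architecture.
find_package(Threads REQUIRED)
target_link_libraries(${__TARGET} PRIVATE Threads::Threads)
```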
I think ARM things should be separated from GRAVITON things (see below)
| : size_(size), require_ptr_alignment_(require_ptr_alignment) {
| if (require_ptr_alignment_) {
| size_ = UpperBoundToMultipleOf8(size);
| size_ = UpperBoundToMultipleOf16(size);
Should such UpperBoundToMultipleOf16 usage be conditioned on the CPU SIMD support?
It'll probably help non SIMD also, but maybe not as much. IMO SIMD is likely to be present on > 99% of the CPUs that we'll be running on, so I'd vote to avoid the complexity of testing for 8 vs 16 alignment issues.
| // NaN handling by disabling fast-math optimizations only for this comparison.
| template <typename dist_t>
| #if defined(__GNUC__) && !defined(__clang__)
| __attribute__((optimize("no-fast-math", "no-unsafe-math-optimizations")))
Should we address the NaN comparison itself rather than disabling unsafe-math-optimizations for this function? Also, can you elaborate on why this is conditioned on not clang?
I had to address the NaN issue in the FT.AGGREGATE code. If we start fiddling with the compilation options it may or may not affect that code. It might simplify it and speed things up.
@yairgott, thanks for the feedback! I will review your comments; I was looking to get initial feedback and it's not ready for merge, so I converted it to a draft. I will address the comments and then change it back to a PR.
| # Try to detect SVE support for Graviton3+ optimization
| # SVE provides 30-50% improvement on vector operations on Graviton3
| execute_process(
| COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
It appears that the flags -E -x c are generic, applicable to both ARM and X86_64, rather than specific to Graviton.
I think restructuring the options into three buckets makes sense.
- Architecture-independent options, i.e., -E, -x c, etc.
- Architecture dependent options (ARM vs x86 vs Power, etc.)
- CPU dependent options (Graviton, ...)
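The three buckets above could be sketched roughly as follows. The structure is the suggestion; the specific flags are illustrative examples taken from this PR's discussion, not a proposed final set:

```cmake
# 1. Architecture-independent options
target_compile_options(${TARGET} PRIVATE -O3)

# 2. Architecture-dependent options (ARM vs x86 vs Power, ...)
if(VALKEY_SEARCH_IS_ARM)
  target_compile_options(${TARGET} PRIVATE -march=armv8.2-a)
elseif(VALKEY_SEARCH_IS_X86)
  target_compile_options(${TARGET} PRIVATE -mprfchw)
endif()

# 3. CPU/micro-architecture-dependent options (Graviton, ...)
if(VALKEY_SEARCH_IS_GRAV)
  target_compile_options(${TARGET} PRIVATE -mtune=neoverse-n1)
endif()
```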
| # Try to detect SVE support for Graviton3+ optimization
| # SVE provides 30-50% improvement on vector operations on Graviton3
| execute_process(
| COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
Does it make sense to use -march=armv8.5-a+sve2 -E -x c /dev/null for Graviton4+ targets?
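A tiered probe (newest ISA first, then fall back) could look like the sketch below. The -march strings are real GCC/Clang targets; the variable names and the tiering itself are my assumptions. As noted elsewhere in this thread, the probe only checks what the compiler accepts, not what the target CPU supports:

```cmake
# Sketch: probe SVE2 first (Graviton4-class), fall back to SVE
# (Graviton3-class) if the compiler rejects it.
execute_process(
  COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.5-a+sve2 -E -x c /dev/null
  RESULT_VARIABLE SVE2_PROBE OUTPUT_QUIET ERROR_QUIET)
if(SVE2_PROBE EQUAL 0)
  set(VALKEY_SEARCH_ARM_MARCH "armv8.5-a+sve2")
else()
  execute_process(
    COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
    RESULT_VARIABLE SVE_PROBE OUTPUT_QUIET ERROR_QUIET)
  if(SVE_PROBE EQUAL 0)
    set(VALKEY_SEARCH_ARM_MARCH "armv8.2-a+sve")
  endif()
endif()
```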
@zvi-code, kudos on https://github.com/zvi-code/valkey-bench-rs! Having a consolidated perf testing framework would be very useful. I wonder if you plan to make it an official project under valkey?
Thanks, appreciate the feedback. I would love to see this contributed and used/evolved; this was my goal from the start (I started with a C version and recently decided to move to Rust for safety and code maintenance: you want to trust your loader). I will be happy to chat about this when we meet if there is interest.
| if (isCancelled && isCancelled->isCancelled()) { // VALKEYSEARCH
| flag_stop_search = true; // VALKEYSEARCH
| } else // VALKEYSEARCH
| if (stop_condition) {
| flag_stop_search =
|     stop_condition->should_stop_search(candidate_dist, lowerBound);
| } else {
| flag_stop_search =
|     candidate_dist > lowerBound && top_candidates.size() == ef;
| }
Looks like this should be reverted along with a bunch of the changes after this.
This branch includes various performance optimizations.
Below is a performance comparison of the Valkey-Search baseline implementation against a new branch containing SIMD (Single Instruction, Multiple Data) enhancements and other performance optimizations.
The results demonstrate substantial gains across all metrics, with the most dramatic improvements observed in vector loading speed (+30%) and query efficiency at lower thread counts (+50-70%).
Performance Comparison: SIMD Optimizations vs Baseline
Comparison: Baseline (commit 83c133a) vs SIMD Optimized (commit 6d80ea0)
Hardware: AWS EC2 r7g.16xlarge (64 vCPUs, Graviton3)
Dataset: Cohere-Large-10M (768d, Cosine)
The benchmark was run using: https://github.com/zvi-code/valkey-bench-rs
Head-to-Head Summary
1. Vector Loading Performance
The SIMD optimizations provide a massive boost to the ingestion pipeline, likely due to faster distance calculations during the HNSW graph construction phase.
2. Query Throughput & Scaling Analysis
The SIMD branch demonstrates significantly higher per-thread efficiency.
Throughput by Reader Thread Count
Analysis
3. Latency Comparison
Latency improvements are consistent with the throughput gains, offering faster response times at the same concurrency levels.
4. Recall Stability
It is critical to ensure that performance optimizations do not degrade search accuracy.