
Conversation

@zvi-code
Collaborator

This branch includes various performance optimizations:

  • Enable Graviton-specific compiler flags (armv8.2-a, neoverse-n1 tuning)
  • Add SVE support detection
  • Increase allocator alignment to 16 bytes for better SIMD performance
  • Add is_closer_distance helper in HNSW to handle fast-math edge cases
  • Relax vector test tolerance
  • Enable USE_SIMSIMD (a usage sketch follows this list)
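
To illustrate the last bullet: SimSIMD ships pre-tuned distance kernels with runtime dispatch, so enabling USE_SIMSIMD swaps hand-written loops for whichever kernel the CPU supports (NEON, SVE, AVX-512, ...). A minimal usage sketch, assuming SimSIMD's C interface; the exact integration points in valkey-search are not shown:

```cpp
#include <simsimd/simsimd.h>

int main() {
  // Two 768-dimensional vectors, matching the Cohere-Large dataset below.
  simsimd_f32_t a[768] = {0}, b[768] = {0};
  a[0] = 1.0f;
  b[0] = 1.0f;

  // SimSIMD dispatches at runtime to the best kernel for this CPU.
  simsimd_distance_t dist;  // results are reported in double precision
  simsimd_cos_f32(a, b, 768, &dist);

  return dist < 1e-6 ? 0 : 1;  // identical vectors -> cosine distance ~0
}
```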

Below is a performance comparison of the Valkey-Search baseline implementation against a new branch containing SIMD (Single Instruction, Multiple Data) enhancements and other performance optimizations.

The results demonstrate substantial gains across all metrics, with the most dramatic improvements observed in vector loading speed (+30%) and query efficiency at lower thread counts (+50-70%).

Performance Comparison: SIMD Optimizations vs Baseline

Comparison: Baseline (commit 83c133a) vs SIMD Optimized (commit 6d80ea0)
Hardware: AWS EC2 r7g.16xlarge (64 vCPUs, Graviton3)
Dataset: Cohere-Large-10M (768d, Cosine)

The benchmark was run using https://github.com/zvi-code/valkey-bench-rs

Head-to-Head Summary

| Metric | Baseline | SIMD Optimized | Improvement |
|---|---|---|---|
| Vector Load Throughput | 8,842 req/s | 11,510 req/s | +30.2% |
| Peak Query Throughput | 8,309 req/s | 9,158 req/s | +10.2% |
| Min Query Latency | 7.11 ms | 6.67 ms | -6.2% |
| Per-Thread Efficiency | ~200 req/s/thread | ~340 req/s/thread | ~70% |

1. Vector Loading Performance

The SIMD optimizations provide a massive boost to the ingestion pipeline, likely due to faster distance calculations during the HNSW graph construction phase.

| Metric | Baseline | SIMD Optimized | Delta |
|---|---|---|---|
| Throughput | 8,842 req/s | 11,510 req/s | +30.2% |
| Duration (10M vectors) | 1,131 sec | 869 sec | -23.2% |
| P99 Latency | 260.74 ms | 245.76 ms | -5.7% |

Impact: Indexing time for 10 million vectors was reduced by over 4 minutes (from ~19 min to ~14.5 min).


2. Query Throughput & Scaling Analysis

The SIMD branch demonstrates significantly higher per-thread efficiency.

Throughput by Reader Thread Count

| Reader Threads | Baseline (req/s) | SIMD Optimized (req/s) | Improvement |
|---|---|---|---|
| 2 | 404 | 687 | +70.0% |
| 4 | 810 | 1,349 | +66.5% |
| 8 | 1,607 | 2,596 | +61.5% |
| 12 | 2,382 | 3,643 | +52.9% |
| 16 | 3,140 | 4,704 | +49.8% |
| 24 | 4,570 | 6,789 | +48.6% |
| 32 | 5,876 | 8,158 | +38.8% |
| 56 (Low Conc.) | 7,768 | 8,360 | +7.6% |
| 56 (High Conc.) | 8,309 | 9,158 | +10.2% |

Analysis

  1. Massive Efficiency Gains: At lower thread counts (2-16), the SIMD branch delivers 50-70% higher throughput. This indicates that the core vector distance calculation—the "hot loop" of the search—is significantly faster (a minimal sketch of such a loop follows this list).
  2. Scalability Limit: As thread count approaches the physical core count (64 vCPUs), the gap narrows to ~10%. This suggests that at high concurrency the system shifts from being compute-bound (where SIMD helps most) to being bound by other factors; for the most part it becomes main-thread bound.
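
To make the "hot loop" point concrete, here is a minimal NEON sketch of an inner-product kernel (illustrative only; the branch delegates to SimSIMD's tuned kernels rather than hand-written intrinsics like these). Processing four lanes per iteration is where the per-thread gains come from:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative AArch64 NEON dot product: 4 floats per iteration instead of 1.
static float DotProductNeon(const float* a, const float* b, size_t dim) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  size_t i = 0;
  for (; i + 4 <= dim; i += 4) {
    // Fused multiply-accumulate across 4 lanes at once.
    acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
  }
  float sum = vaddvq_f32(acc);              // horizontal add of the 4 lanes
  for (; i < dim; ++i) sum += a[i] * b[i];  // scalar tail for dim % 4 != 0
  return sum;
}
```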

3. Latency Comparison

Latency improvements are consistent with the throughput gains, offering faster response times at the same concurrency levels.

| Scenario | Baseline Latency | SIMD Latency | Improvement |
|---|---|---|---|
| Min Latency (56 threads) | 7.11 ms | 6.67 ms | 6.2% faster |
| Avg Latency @ 32 threads | 141.21 ms | 6.76 ms* | N/A |

4. Recall Stability

It is critical to ensure that performance optimizations do not degrade search accuracy.

| Metric | Baseline | SIMD Optimized | Status |
|---|---|---|---|
| Recall@100 | 93.42% | 93.48% | ✅ Stable |
| Perfect Matches | ~25.3% | ~26.5% | ✅ Stable |

- Enable Graviton-specific compiler flags (armv8.2-a, neoverse-n1 tuning)
- Add SVE support detection
- Increase allocator alignment to 16 bytes for better SIMD performance
- Add is_closer_distance helper in HNSW to handle fast-math edge cases
- Relax vector test tolerance
- Enable USE_SIMSIMD

Signed-off-by: Zvi Schneider <[email protected]>
@zvi-code force-pushed the performance-optimizations branch from 6d80ea0 to da0892d on December 29, 2025 at 15:11
@zvi-code
Collaborator Author

raw results data can be reviewed here: https://github.com/zvi-code/valkey-bench-rs/tree/unstable/results

target_compile_options(${TARGET} PRIVATE -mprfchw)
elseif(VALKEY_SEARCH_IS_GRAV)
# Graviton-optimized compilation flags
# Reference: https://aws.github.io/graviton/perfrunbook/optimization_recommendation.html
Collaborator

I briefly reviewed the referenced link but couldn't find specific mentions of the flags listed below. Could we add direct references/links for each of these compilation optimizations?

# - fp16: Half-precision floating point (critical for ML workloads)
# - rcpc: Release Consistent Processor Consistent (better atomics)
# - dotprod: Dot product instructions (critical for vector operations)
target_compile_options(${TARGET} PRIVATE
Collaborator

@yairgott Dec 29, 2025

Reading online, it seems that for x86_64 it is safe to set -mavx512vnni (equivalent to +dotprod) and -mf16c. WDYT?

message(STATUS "Current platform is aarch64")
set(VALKEY_SEARCH_IS_ARM 1)
set(VALKEY_SEARCH_IS_X86 0)
set(VALKEY_SEARCH_IS_GRAV 0)
Collaborator

VALKEY_SEARCH_IS_ARM is currently only used on line 233:

target_link_libraries(${__TARGET} PRIVATE pthread)

Since Graviton is ARM-based, it’s curious that this isn't required there as well. Is it possible that explicit pthread linkage is no longer necessary for ARM compilation in our environment? If so, we should consider removing VALKEY_SEARCH_IS_ARM entirely or consolidating it with VALKEY_SEARCH_IS_GRAV.

Member

I think ARM things should be separated from GRAVITON things (see below)

  : size_(size), require_ptr_alignment_(require_ptr_alignment) {
    if (require_ptr_alignment_) {
-     size_ = UpperBoundToMultipleOf8(size);
+     size_ = UpperBoundToMultipleOf16(size);
Collaborator

Should this UpperBoundToMultipleOf16 usage be conditioned on CPU SIMD support?

Member

It'll probably help non-SIMD CPUs also, but maybe not as much. IMO SIMD is likely to be present on >99% of the CPUs we'll be running on, so I'd vote to avoid the complexity of testing for 8- vs 16-byte alignment.
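
For reference, rounding a size up to the next multiple of 16 is a one-liner; a sketch of what a helper like UpperBoundToMultipleOf16 presumably reduces to (the actual definition lives elsewhere in the codebase):

```cpp
#include <cstddef>

// Round size up to the next multiple of 16 so allocations satisfy
// 16-byte SIMD load/store alignment.
inline size_t UpperBoundToMultipleOf16(size_t size) {
  return (size + 15) & ~static_cast<size_t>(15);
}
```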

// NaN handling by disabling fast-math optimizations only for this comparison.
template <typename dist_t>
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-fast-math", "no-unsafe-math-optimizations")))
Collaborator

@yairgott Dec 29, 2025

Should we address the NaN comparison directly rather than disabling unsafe-math-optimizations? Also, can you elaborate on why this is conditioned on not being Clang?

Member

I had to address the NaN issue in the FT.AGGREGATE code. If we start fiddling with the compilation options it may or may not affect that code. It might simplify it and speed things up.

See https://github.com/valkey-io/valkey-search/blob/1b165430e7eda1f70ea1e15234f6660fae735abf/src/expr/value.cc#L28C13-L28C18
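
For background on why the attribute exists at all: under -ffast-math the compiler may assume NaNs never occur, so a plain a < b can silently mis-handle NaN distances. Clang ignores GCC's optimize attribute, which likely explains the !defined(__clang__) guard. A sketch of the kind of NaN-safe comparison being protected (illustrative; not the PR's exact helper):

```cpp
#include <cmath>

// NaN-safe "is the candidate closer?" comparison. A NaN distance never wins,
// so poisoned values cannot displace valid candidates in the HNSW search.
template <typename dist_t>
bool IsCloserDistance(dist_t candidate, dist_t current_best) {
  if (std::isnan(candidate)) return false;    // NaN never improves the result
  if (std::isnan(current_best)) return true;  // any real value beats a NaN
  return candidate < current_best;
}
```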

@zvi-code marked this pull request as draft on December 29, 2025 at 17:16
@zvi-code
Collaborator Author

@yairgott, thanks for the feedback! I'll review your comments. I was mainly looking for initial feedback; it's not ready to merge, so I've converted it to a draft. I'll address the comments and then mark it ready for review again.
[Need to fix the macOS build as well.]

# Try to detect SVE support for Graviton3+ optimization
# SVE provides 30-50% improvement on vector operations on Graviton3
execute_process(
COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
Collaborator

It appears that the flags -E -x c are generic, applicable to both ARM and X86_64, rather than specific to Graviton.

Member

I think restructuring the options into three buckets makes sense.

  1. Architecture-independent options, i.e., -E -x c, etc.
  2. Architecture-dependent options (ARM vs x86 vs Power, etc.)
  3. CPU-dependent options (Graviton, ...)
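
As an aside on detection: the compile check above only proves the toolchain accepts +sve; whether the CPU actually has SVE is a runtime question. On Linux/aarch64 that is typically answered via hwcaps (a sketch, assuming a Linux target; HWCAP_SVE comes from <asm/hwcap.h>):

```cpp
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#if defined(__aarch64__)
#include <asm/hwcap.h>  // HWCAP_SVE
#endif

// Runtime check: does this CPU (not just the compiler) support SVE?
bool CpuSupportsSve() {
#if defined(__aarch64__) && defined(HWCAP_SVE)
  return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
  return false;
#endif
}
```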

# Try to detect SVE support for Graviton3+ optimization
# SVE provides 30-50% improvement on vector operations on Graviton3
execute_process(
COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
Collaborator

@yairgott Dec 29, 2025

Does it make sense to use -march=armv8.5-a+sve2 -E -x c /dev/null for Graviton 4+ targets?

@yairgott
Collaborator

yairgott commented Dec 29, 2025

@zvi-code, kudos on https://github.com/zvi-code/valkey-bench-rs! Having a consolidated perf testing framework would be very useful. I wonder if you plan to make it an official project under valkey?

@zvi-code
Collaborator Author

> @zvi-code, kudos on https://github.com/zvi-code/valkey-bench-rs! Having a consolidated perf testing framework would be very useful. I wonder if you plan to make it an official project under valkey?

Thanks, appreciate the feedback! I would love to see this contributed and used/evolved. This was my goal from the start (I started with a C version and recently decided to move to Rust for safety and code maintenance: you want to trust your loader). I'll be happy to chat about this when we meet if there is interest...

Comment on lines +407 to +416
if (isCancelled && isCancelled->isCancelled()) { // VALKEYSEARCH
flag_stop_search = true; // VALKEYSEARCH
} else // VALKEYSEARCH
if (stop_condition) {
flag_stop_search =
stop_condition->should_stop_search(candidate_dist, lowerBound);
} else {
flag_stop_search =
candidate_dist > lowerBound && top_candidates.size() == ef;
}
Member

Looks like this should be reverted along with a bunch of the changes after this.

