
Conversation

@zvi-code
Collaborator

This branch includes various performance optimizations:

  • Enable Graviton-specific compiler flags (armv8.2-a, neoverse-n1 tuning)
  • Add SVE support detection
  • Increase allocator alignment to 16 bytes for better SIMD performance
  • Add is_closer_distance helper in HNSW to handle fast-math edge cases
  • Relax vector test tolerance
  • Enable USE_SIMSIMD (a usage sketch follows this list)
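
To illustrate the last bullet: SimSIMD ships pre-tuned distance kernels with runtime dispatch, so enabling USE_SIMSIMD swaps hand-written loops for whichever kernel the CPU supports (NEON, SVE, AVX-512, ...). A minimal usage sketch, assuming SimSIMD's C interface; the exact integration points in valkey-search are not shown:

```cpp
#include <simsimd/simsimd.h>

int main() {
  // Two 768-dimensional vectors, matching the Cohere-Large dataset below.
  simsimd_f32_t a[768] = {0}, b[768] = {0};
  a[0] = 1.0f;
  b[0] = 1.0f;

  // SimSIMD dispatches at runtime to the best kernel for this CPU.
  simsimd_distance_t dist;  // results are reported in double precision
  simsimd_cos_f32(a, b, 768, &dist);

  return dist < 1e-6 ? 0 : 1;  // identical vectors -> cosine distance ~0
}
```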

Below is a performance comparison of the Valkey-Search baseline implementation against a new branch containing SIMD (Single Instruction, Multiple Data) enhancements and other performance optimizations.

The results demonstrate substantial gains across all metrics, with the most dramatic improvements observed in vector loading speed (+30%) and query efficiency at lower thread counts (+50-70%).

Performance Comparison: SIMD Optimizations vs Baseline

Comparison: Baseline (commit 83c133a) vs SIMD Optimized (commit 6d80ea0)
Hardware: AWS EC2 r7g.16xlarge (64 vCPUs, Graviton3)
Dataset: Cohere-Large-10M (768d, Cosine)

The benchmark was run using https://github.com/zvi-code/valkey-bench-rs

Head-to-Head Summary

| Metric | Baseline | SIMD Optimized | Improvement |
|---|---|---|---|
| Vector Load Throughput | 8,842 req/s | 11,510 req/s | +30.2% |
| Peak Query Throughput | 8,309 req/s | 9,158 req/s | +10.2% |
| Min Query Latency | 7.11 ms | 6.67 ms | -6.2% |
| Per-Thread Efficiency | ~200 req/s/thread | ~340 req/s/thread | ~70% |

1. Vector Loading Performance

The SIMD optimizations provide a massive boost to the ingestion pipeline, likely due to faster distance calculations during the HNSW graph construction phase.

| Metric | Baseline | SIMD Optimized | Delta |
|---|---|---|---|
| Throughput | 8,842 req/s | 11,510 req/s | +30.2% |
| Duration (10M vectors) | 1,131 sec | 869 sec | -23.2% |
| P99 Latency | 260.74 ms | 245.76 ms | -5.7% |

Impact: Indexing time for 10 million vectors was reduced by over 4 minutes (from ~19 min to ~14.5 min).


2. Query Throughput & Scaling Analysis

The SIMD branch demonstrates significantly higher per-thread efficiency.

Throughput by Reader Thread Count

| Reader Threads | Baseline (req/s) | SIMD Optimized (req/s) | Improvement |
|---|---|---|---|
| 2 | 404 | 687 | +70.0% |
| 4 | 810 | 1,349 | +66.5% |
| 8 | 1,607 | 2,596 | +61.5% |
| 12 | 2,382 | 3,643 | +52.9% |
| 16 | 3,140 | 4,704 | +49.8% |
| 24 | 4,570 | 6,789 | +48.6% |
| 32 | 5,876 | 8,158 | +38.8% |
| 56 (Low Conc.) | 7,768 | 8,360 | +7.6% |
| 56 (High Conc.) | 8,309 | 9,158 | +10.2% |

Analysis

  1. Massive Efficiency Gains: At lower thread counts (2-16), the SIMD branch delivers 50-70% higher throughput. This indicates that the core vector distance calculation—the "hot loop" of the search—is significantly faster (a minimal sketch of such a loop follows this list).
  2. Scalability Limit: As thread count approaches the physical core count (64 vCPUs), the gap narrows to ~10%. This suggests that at high concurrency the system shifts from being compute-bound (where SIMD helps most) to being bound by other factors; for the most part it becomes main-thread bound.
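
To make the "hot loop" point concrete, here is a minimal NEON sketch of an inner-product kernel (illustrative only; the branch delegates to SimSIMD's tuned kernels rather than hand-written intrinsics like these). Processing four lanes per iteration is where the per-thread gains come from:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative AArch64 NEON dot product: 4 floats per iteration instead of 1.
static float DotProductNeon(const float* a, const float* b, size_t dim) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  size_t i = 0;
  for (; i + 4 <= dim; i += 4) {
    // Fused multiply-accumulate across 4 lanes at once.
    acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
  }
  float sum = vaddvq_f32(acc);              // horizontal add of the 4 lanes
  for (; i < dim; ++i) sum += a[i] * b[i];  // scalar tail for dim % 4 != 0
  return sum;
}
```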

3. Latency Comparison

Latency improvements are consistent with the throughput gains, offering faster response times at the same concurrency levels.

| Scenario | Baseline Latency | SIMD Latency | Improvement |
|---|---|---|---|
| Min Latency (56 threads) | 7.11 ms | 6.67 ms | 6.2% faster |
| Avg Latency @ 32 threads | 141.21 ms | 6.76 ms* | N/A |

4. Recall Stability

It is critical to ensure that performance optimizations do not degrade search accuracy.

| Metric | Baseline | SIMD Optimized | Status |
|---|---|---|---|
| Recall@100 | 93.42% | 93.48% | ✅ Stable |
| Perfect Matches | ~25.3% | ~26.5% | ✅ Stable |

- Enable Graviton-specific compiler flags (armv8.2-a, neoverse-n1 tuning)
- Add SVE support detection
- Increase allocator alignment to 16 bytes for better SIMD performance
- Add is_closer_distance helper in HNSW to handle fast-math edge cases
- Relax vector test tolerance
- Enable USE_SIMSIMD

Signed-off-by: Zvi Schneider <[email protected]>
@zvi-code force-pushed the performance-optimizations branch from 6d80ea0 to da0892d on December 29, 2025 at 15:11
@zvi-code
Collaborator Author

raw results data can be reviewed here: https://github.com/zvi-code/valkey-bench-rs/tree/unstable/results

target_compile_options(${TARGET} PRIVATE -mprfchw)
elseif(VALKEY_SEARCH_IS_GRAV)
# Graviton-optimized compilation flags
# Reference: https://aws.github.io/graviton/perfrunbook/optimization_recommendation.html
Collaborator

I briefly reviewed the referenced link but couldn't find specific mentions of the flags listed below. Could we add direct references/links for each of these compilation optimizations?

# - fp16: Half-precision floating point (critical for ML workloads)
# - rcpc: Release Consistent Processor Consistent (better atomics)
# - dotprod: Dot product instructions (critical for vector operations)
target_compile_options(${TARGET} PRIVATE
Collaborator

@yairgott Dec 29, 2025

Reading online, it seems that for x86_64 it is safe to set -mavx512vnni (equivalent to +dotprod) and -mf16c. WDYT?

message(STATUS "Current platform is aarch64")
set(VALKEY_SEARCH_IS_ARM 1)
set(VALKEY_SEARCH_IS_X86 0)
set(VALKEY_SEARCH_IS_GRAV 0)
Collaborator

VALKEY_SEARCH_IS_ARM is currently only used on line 233:

target_link_libraries(${__TARGET} PRIVATE pthread)

Since Graviton is ARM-based, it’s curious that this isn't required there as well. Is it possible that explicit pthread linkage is no longer necessary for ARM compilation in our environment? If so, we should consider removing VALKEY_SEARCH_IS_ARM entirely or consolidating it with VALKEY_SEARCH_IS_GRAV.

Member

I think ARM things should be separated from GRAVITON things (see below)

  : size_(size), require_ptr_alignment_(require_ptr_alignment) {
    if (require_ptr_alignment_) {
-     size_ = UpperBoundToMultipleOf8(size);
+     size_ = UpperBoundToMultipleOf16(size);
Collaborator

Should this UpperBoundToMultipleOf16 usage be conditioned on CPU SIMD support?

Member

It'll probably help non-SIMD CPUs also, but maybe not as much. IMO SIMD is likely to be present on >99% of the CPUs we'll be running on, so I'd vote to avoid the complexity of testing for 8- vs 16-byte alignment.
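
For reference, rounding a size up to the next multiple of 16 is a one-liner; a sketch of what a helper like UpperBoundToMultipleOf16 presumably reduces to (the actual definition lives elsewhere in the codebase):

```cpp
#include <cstddef>

// Round size up to the next multiple of 16 so allocations satisfy
// 16-byte SIMD load/store alignment.
inline size_t UpperBoundToMultipleOf16(size_t size) {
  return (size + 15) & ~static_cast<size_t>(15);
}
```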

// NaN handling by disabling fast-math optimizations only for this comparison.
template <typename dist_t>
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-fast-math", "no-unsafe-math-optimizations")))
Collaborator

@yairgott Dec 29, 2025

Should we address the NaN comparison directly rather than disabling unsafe-math-optimizations? Also, can you elaborate on why this is conditioned on not being Clang?

Member

I had to address the NaN issue in the FT.AGGREGATE code. If we start fiddling with the compilation options it may or may not affect that code. It might simplify it and speed things up.

See https://github.com/valkey-io/valkey-search/blob/1b165430e7eda1f70ea1e15234f6660fae735abf/src/expr/value.cc#L28C13-L28C18
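
For background on why the attribute exists at all: under -ffast-math the compiler may assume NaNs never occur, so a plain a < b can silently mis-handle NaN distances. Clang ignores GCC's optimize attribute, which likely explains the !defined(__clang__) guard. A sketch of the kind of NaN-safe comparison being protected (illustrative; not the PR's exact helper):

```cpp
#include <cmath>

// NaN-safe "is the candidate closer?" comparison. A NaN distance never wins,
// so poisoned values cannot displace valid candidates in the HNSW search.
template <typename dist_t>
bool IsCloserDistance(dist_t candidate, dist_t current_best) {
  if (std::isnan(candidate)) return false;    // NaN never improves the result
  if (std::isnan(current_best)) return true;  // any real value beats a NaN
  return candidate < current_best;
}
```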

@zvi-code marked this pull request as draft on December 29, 2025 at 17:16
@zvi-code
Collaborator Author

@yairgott, thanks for the feedback! I'll review your comments. I was mainly looking for initial feedback; it's not ready to merge, so I've converted it to a draft. I'll address the comments and then mark it ready for review again.
[Need to fix the macOS build as well.]

# Try to detect SVE support for Graviton3+ optimization
# SVE provides 30-50% improvement on vector operations on Graviton3
execute_process(
COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
Collaborator

It appears that the flags -E -x c are generic, applicable to both ARM and X86_64, rather than specific to Graviton.

Member

I think restructuring the options into three buckets makes sense.

  1. Architecture-independent options, i.e., -E -x c, etc.
  2. Architecture-dependent options (ARM vs x86 vs Power, etc.)
  3. CPU-dependent options (Graviton, ...)
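
As an aside on detection: the compile check above only proves the toolchain accepts +sve; whether the CPU actually has SVE is a runtime question. On Linux/aarch64 that is typically answered via hwcaps (a sketch, assuming a Linux target; HWCAP_SVE comes from <asm/hwcap.h>):

```cpp
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#if defined(__aarch64__)
#include <asm/hwcap.h>  // HWCAP_SVE
#endif

// Runtime check: does this CPU (not just the compiler) support SVE?
bool CpuSupportsSve() {
#if defined(__aarch64__) && defined(HWCAP_SVE)
  return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
  return false;
#endif
}
```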

# Try to detect SVE support for Graviton3+ optimization
# SVE provides 30-50% improvement on vector operations on Graviton3
execute_process(
COMMAND ${CMAKE_CXX_COMPILER} -march=armv8.2-a+sve -E -x c /dev/null
Collaborator

@yairgott Dec 29, 2025

Does it make sense to use -march=armv8.5-a+sve2 -E -x c /dev/null for Graviton 4+ targets?

@yairgott
Collaborator

yairgott commented Dec 29, 2025

@zvi-code, kudos on https://github.com/zvi-code/valkey-bench-rs! Having a consolidated perf testing framework would be very useful. I wonder if you plan to make it an official project under valkey?

@zvi-code
Collaborator Author

> @zvi-code, kudos on https://github.com/zvi-code/valkey-bench-rs! Having a consolidated perf testing framework would be very useful. I wonder if you plan to make it an official project under valkey?

Thanks, appreciate the feedback! I would love to see this contributed and used/evolved. This was my goal from the start (I started with a C version and recently decided to move to Rust for safety and code maintenance: you want to trust your loader). I'll be happy to chat about this when we meet if there is interest...

Comment on lines +407 to +416
if (isCancelled && isCancelled->isCancelled()) { // VALKEYSEARCH
flag_stop_search = true; // VALKEYSEARCH
} else // VALKEYSEARCH
if (stop_condition) {
flag_stop_search =
stop_condition->should_stop_search(candidate_dist, lowerBound);
} else {
flag_stop_search =
candidate_dist > lowerBound && top_candidates.size() == ef;
}
Member

Looks like this should be reverted along with a bunch of the changes after this.

