faiss: parallelize post-BLAS reduction loop and end_multiple() in result handlers (#5185)

alibeklfc · meta-codesync[bot] · commit 2322afd2d848 · 2026-05-06T15:29:03.000-07:00
Summary: Pull Request resolved: #5185 Three sequential post-BLAS / end_multiple loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads. **Changes** 1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]/ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations. 2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`. 3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. `// maybe parallel for` comment removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`. The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`. **Benchmark results** A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions. `HeapBlockResultHandler::end_multiple` (us, lower better): | nq | k | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16 | |------:|-----:|-----------:|---------:|------------:|----------:|--------------:| | 64 | 10 | 9.2 | 7.2 | 8.1 | 8.3 | 0.98× (gated) | | 64 | 100 | 340 | 345 | 318 | 67 | 4.79× | | 64 | 1000 | 5,796 | 5,700 | 5,886 | 501 | 11.76× | | 512 | 100 | 2,811 | 2,769 | 2,677 | 312 | 8.59× | | 512 | 1000 | 46,109 | 46,070 | 45,758 | 3,778 | 12.11× | | 4096 | 100 | 22,041 | 21,588 | 21,672 | 1,869 | 11.60× | | 4096 | 1000 | 369,069 | 376,541 | 372,481 | 25,442 | 14.64× | `ReservoirBlockResultHandler::end_multiple` (us): | nq | k | parent t=16 | this t=16 | speedup | |------:|-----:|------------:|----------:|--------------:| | 64 | 10 | 18.0 | 18.1 | 0.99× (gated) | | 64 | 100 | 659 | 96 | 6.86× | | 64 | 1000 | 7,592 | 553 | 13.73× | | 512 | 100 | 5,498 | 490 | 11.21× | | 512 | 1000 | 59,548 | 4,677 | 12.73× | | 4096 | 100 | 44,064 | 3,230 | 13.64× | | 4096 | 1000 | 476,388 | 32,237 | 14.78× | `IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms): | nb | nq | k | parent t=16 | this t=16 | speedup | |------:|------:|----:|------------:|----------:|--------:| | 1024 | 1024 | 10 | 1.71 | 1.45 | 1.18× | | 1024 | 4096 | 100 | 58.5 | 15.5 | 3.78× | | 4096 | 4096 | 100 | 76.9 | 39.4 | 1.95× | Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers. Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (heap handler, also called by this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy. Reviewed By: mnorris11 Differential Revision: D103830810 fbshipit-source-id: 8434fa6f16b78c5ff7b2244ac5d5fe9cc8c012a5
diff --git a/faiss/impl/ResultHandler.h b/faiss/impl/ResultHandler.h
@@ -377,8 +377,9 @@ struct HeapBlockResultHandler : TopkBlockResultHandler<C, use_sel> {
 
     /// series of results for queries i0..i1 is done
     void end_multiple() final {
-        // maybe parallel for
-        for (size_t i = i0; i < i1; i++) {
+#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)
+        for (int64_t i = static_cast<int64_t>(i0); i < static_cast<int64_t>(i1);
+             i++) {
             heap_reorder<C>(k, this->dis_tab + i * k, this->ids_tab + i * k);
         }
     }
@@ -568,9 +569,10 @@ struct ReservoirBlockResultHandler : TopkBlockResultHandler<C, use_sel> {
 
     /// series of results for queries i0..i1 is done
     void end_multiple() final {
-        // maybe parallel for
-        for (size_t i = i0; i < i1; i++) {
-            reservoirs[i - i0].to_result(
+#pragma omp parallel for schedule(static) if ((i1 - i0) * this->k >= 1024)
+        for (int64_t i = static_cast<int64_t>(i0); i < static_cast<int64_t>(i1);
+             i++) {
+            reservoirs[i - static_cast<int64_t>(i0)].to_result(
                     this->dis_tab + i * this->k, this->ids_tab + i * this->k);
         }
     }
diff --git a/faiss/utils/simd_impl/distances_arm_sve.cpp b/faiss/utils/simd_impl/distances_arm_sve.cpp
@@ -655,7 +655,10 @@ void exhaustive_L2sqr_blas_cmax<SIMDLevel::ARM_SVE>(
                        ip_block.get(),
                        &nyi);
             }
-            for (int64_t i = i0; i < i1; i++) {
+#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)
+            for (int64_t i = static_cast<int64_t>(i0);
+                 i < static_cast<int64_t>(i1);
+                 i++) {
                 const size_t count = j1 - j0;
                 float* ip_line = ip_block.get() + (i - i0) * count;
 
diff --git a/faiss/utils/simd_impl/distances_avx2.cpp b/faiss/utils/simd_impl/distances_avx2.cpp
@@ -1276,7 +1276,10 @@ void exhaustive_L2sqr_blas_cmax<SIMDLevel::AVX2>(
                        ip_block.get(),
                        &nyi);
             }
-            for (int64_t i = i0; i < i1; i++) {
+#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)
+            for (int64_t i = static_cast<int64_t>(i0);
+                 i < static_cast<int64_t>(i1);
+                 i++) {
                 float* ip_line = ip_block.get() + (i - i0) * (j1 - j0);
 
                 _mm_prefetch((const char*)ip_line, _MM_HINT_NTA);

Original file line number	Diff line number	Diff line change
`@@ -377,8 +377,9 @@ struct HeapBlockResultHandler : TopkBlockResultHandler<C, use_sel> {`
`377`	`377`
`378`	`378`	`/// series of results for queries i0..i1 is done`
`379`	`379`	`void end_multiple() final {`
`380`		`- // maybe parallel for`
`381`		`- for (size_t i = i0; i < i1; i++) {`
	`380`	`+#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)`
	`381`	`+ for (int64_t i = static_cast<int64_t>(i0); i < static_cast<int64_t>(i1);`
	`382`	`+ i++) {`
`382`	`383`	`heap_reorder<C>(k, this->dis_tab + i * k, this->ids_tab + i * k);`
`383`	`384`	`}`
`384`	`385`	`}`
`@@ -568,9 +569,10 @@ struct ReservoirBlockResultHandler : TopkBlockResultHandler<C, use_sel> {`
`568`	`569`
`569`	`570`	`/// series of results for queries i0..i1 is done`
`570`	`571`	`void end_multiple() final {`
`571`		`- // maybe parallel for`
`572`		`- for (size_t i = i0; i < i1; i++) {`
`573`		`- reservoirs[i - i0].to_result(`
	`572`	`+#pragma omp parallel for schedule(static) if ((i1 - i0) * this->k >= 1024)`
	`573`	`+ for (int64_t i = static_cast<int64_t>(i0); i < static_cast<int64_t>(i1);`
	`574`	`+ i++) {`
	`575`	`+ reservoirs[i - static_cast<int64_t>(i0)].to_result(`
`574`	`576`	`this->dis_tab + i * this->k, this->ids_tab + i * this->k);`
`575`	`577`	`}`
`576`	`578`	`}`