Skip to content

Commit 70e3d52

Browse files
authored
Revert pruning PRs (#331)
# Revert #282 and #327 to restore green CI for VecSim tests (via Claude) ## Problem Private CI `build-share-lib.yml` workflow's `test-shared-libs` jobs (VecSim tests) — reduced, gcc-11, and gcc-14 on rockylinux8 — started failing between **Apr 29** (last green: `0ce34e1`) and **May 6** (first red: `8050bec`). ### Failing tests - `SVSTest.quant_modes` - `SVSTest.save_load` from `VectorSimilarity/tests/unit/test_svs.cpp:2975`, with assertions like: ``` Expected equality of these values: id Which is: 55 (idx + 45) Which is: 54 ``` ## Root cause The `test-shared-libs` (reduced) job clones `RedisAI/VectorSimilarity` and runs its test suite against the SVS shared library. The failure window coincided with SVS commit **`629a79c`** — *"Enhance heuristic pruning to handle duplicate clusters (#282)"* — which added a post-pruning step to both `IterativePruneStrategy` and `ProgressivePruneStrategy` in `include/svs/index/vamana/prune.h`: > If `all_duplicates && anchor_set && !result.empty()`, overwrite `result.back()` with the closest candidate from the pool whose distance ≠ `anchor_dist`. On the VectorSimilarity test fixture (`v[i] = (i,i,i,i)`, dim=4, n=100), every node has **symmetric pairs of equidistant neighbors** (e.g. nodes `i-1` and `i+1` both at `d²=4` from node `i`). These look like "duplicate clusters" to the heuristic, but geometrically they are symmetric pairs *in opposite directions*, not redundant edges. The post-prune step: 1. Overwrote a genuine nearest-neighbor edge with a strictly farther candidate. 2. Biased replacement toward one direction (the first non-tied pool entry), which systematically removed edges on one side of every node. **Result:** the constructed Vamana graph is no longer traversable to certain neighbors for certain queries. For `query = (50,50,50,50)`, the exact top-10 by ID (`[45..54]`) can't all be reached; search returns `55` in place of `54`. ## Attempted fixes ### Attempt 1 — `9cee0f1` (#327, "Fix duplicate-cluster trigger to require >=2 kept candidates") Tightened the guard from `!result.empty()` to `result.size() >= 2`. Correctly prevented the branch from firing when only one neighbor was retained, but did **not** address the core bug: the heuristic still overwrites a genuine nearest edge with a farther candidate when 2+ retained neighbors tie in distance. **Result:** the `test-shared-libs` failures persisted. ### Attempt 2 — append-only variant Branch: [`dev/eglaser-sharedci-fix`](https://github.com/intel/ScalableVectorSearch/tree/dev/eglaser-sharedci-fix) Changed both strategies in `include/svs/index/vamana/prune.h`: - `result.back() = ...` → `result.push_back(...)` (append instead of overwrite). - Added `result.size() < max_result_size` guard (only fire when there's room). **Rationale:** if every kept neighbor ties at one distance, each one is a legitimate nearest edge and must not be replaced. Append a diversity edge only when capacity allows. **Result:** the failure pattern changed but the test still failed: | | Top-10 by ID | Missing | |---|---|---| | Before attempt 2 | `[45..53, 55]` | vector **54** | | After attempt 2 | `[46..55]` | vector **45** (entire set shifted +1) | The same graph-asymmetry problem reappeared on the opposite side because the appended "diversity" candidate is still systematically biased (always the first pool entry past the tie band, which stably falls on one side of the symmetric ring). The heuristic gains an asymmetric edge toward lower IDs instead of losing an asymmetric edge toward higher IDs. ## Conclusion Both #327 and the append-only variant confirm that the diversity heuristic's **detection criterion is fundamentally wrong**: it uses distance equality to infer directional redundancy, but same-distance neighbors can be (and in symmetric data, typically are) in *opposite directions*. No local tweak to the guard or the insertion method fixes this — the criterion itself needs redesign to detect **directional collapse** rather than **distance collapse**. That's a nontrivial design task and out of scope for an immediate unblock. ## This PR Reverts: - **#282** (`629a79c`) — *Enhance heuristic pruning to handle duplicate clusters* - **#327** (`9cee0f1`) — *Fix duplicate-cluster trigger to require >=2 kept candidates* (only exists to patch #282) This restores the pre-April pruning behavior, which restores green CI for VecSim tests. ## Follow-up Issue **#80** (the original motivation for #282) should be reopened with a note that the fix needs to detect **directional** rather than **distance** redundancy.
1 parent 80e4a6c commit 70e3d52

2 files changed

Lines changed: 0 additions & 167 deletions

File tree

include/svs/index/vamana/prune.h

Lines changed: 0 additions & 102 deletions
Original file line numberDiff line numberDiff line change
@@ -130,9 +130,6 @@ void heuristic_prune_neighbors(
130130

131131
auto pruned = std::vector<PruneState>(poolsize, PruneState::Available);
132132
float current_alpha = 1.0f;
133-
float anchor_dist = 0.0f;
134-
bool anchor_set = false;
135-
bool all_duplicates = true;
136133
while (result.size() < max_result_size && !cmp(alpha, current_alpha)) {
137134
size_t start = 0;
138135
while (result.size() < max_result_size && start < poolsize) {
@@ -148,16 +145,6 @@ void heuristic_prune_neighbors(
148145
const auto& query = accessor(dataset, id);
149146
distance::maybe_fix_argument(distance_function, query);
150147
result.push_back(detail::construct_as(lib::Type<I>(), pool[start]));
151-
152-
if (all_duplicates) {
153-
if (!anchor_set) {
154-
anchor_dist = pool[start].distance();
155-
anchor_set = true;
156-
} else if (pool[start].distance() != anchor_dist) {
157-
all_duplicates = false;
158-
}
159-
}
160-
161148
for (size_t t = start + 1; t < poolsize; ++t) {
162149
if (excluded(pruned[t])) {
163150
continue;
@@ -184,44 +171,6 @@ void heuristic_prune_neighbors(
184171
}
185172
current_alpha *= alpha;
186173
}
187-
188-
// Add a diversity edge if a duplicate cluster is detected.
189-
// A "cluster" requires at least 2 kept candidates sharing the same
190-
// distance; a single retained neighbor is not a cluster and must not
191-
// be replaced (doing so would discard the only true nearest-neighbor
192-
// edge for that node).
193-
if (all_duplicates && anchor_set && result.size() >= 2) {
194-
auto result_id = [](const I& r) -> size_t {
195-
if constexpr (std::integral<I>) {
196-
return static_cast<size_t>(r);
197-
} else {
198-
return static_cast<size_t>(r.id());
199-
}
200-
};
201-
for (size_t t = 0; t < poolsize; ++t) {
202-
const auto& candidate = pool[t];
203-
auto cid = candidate.id();
204-
if (cid == current_node_id || candidate.distance() == anchor_dist) {
205-
continue;
206-
}
207-
bool in_result = false;
208-
for (const auto& r : result) {
209-
if (result_id(r) == static_cast<size_t>(cid)) {
210-
in_result = true;
211-
break;
212-
}
213-
}
214-
assert(
215-
!in_result &&
216-
"Candidate with non-anchor distance should not already be in result"
217-
);
218-
if (in_result) {
219-
continue;
220-
}
221-
result.back() = detail::construct_as(lib::Type<I>(), candidate);
222-
break;
223-
}
224-
}
225174
}
226175

227176
template <
@@ -254,9 +203,6 @@ void heuristic_prune_neighbors(
254203
std::vector<float> pruned(poolsize, type_traits::tombstone_v<float, decltype(cmp)>);
255204

256205
float current_alpha = 1.0f;
257-
float anchor_dist = 0.0f;
258-
bool anchor_set = false;
259-
bool all_duplicates = true;
260206
while (result.size() < max_result_size && !cmp(alpha, current_alpha)) {
261207
size_t start = 0;
262208
while (result.size() < max_result_size && start < poolsize) {
@@ -272,16 +218,6 @@ void heuristic_prune_neighbors(
272218
const auto& query = accessor(dataset, id);
273219
distance::maybe_fix_argument(distance_function, query);
274220
result.push_back(detail::construct_as(lib::Type<I>(), pool[start]));
275-
276-
if (all_duplicates) {
277-
if (!anchor_set) {
278-
anchor_dist = pool[start].distance();
279-
anchor_set = true;
280-
} else if (pool[start].distance() != anchor_dist) {
281-
all_duplicates = false;
282-
}
283-
}
284-
285221
for (size_t t = start + 1; t < poolsize; ++t) {
286222
if (cmp(current_alpha, pruned[t])) {
287223
continue;
@@ -300,44 +236,6 @@ void heuristic_prune_neighbors(
300236
}
301237
current_alpha *= alpha;
302238
}
303-
304-
// Add a diversity edge if a duplicate cluster is detected.
305-
// A "cluster" requires at least 2 kept candidates sharing the same
306-
// distance; a single retained neighbor is not a cluster and must not
307-
// be replaced (doing so would discard the only true nearest-neighbor
308-
// edge for that node).
309-
if (all_duplicates && anchor_set && result.size() >= 2) {
310-
auto result_id = [](const I& r) -> size_t {
311-
if constexpr (std::integral<I>) {
312-
return static_cast<size_t>(r);
313-
} else {
314-
return static_cast<size_t>(r.id());
315-
}
316-
};
317-
for (size_t t = 0; t < poolsize; ++t) {
318-
const auto& candidate = pool[t];
319-
auto cid = candidate.id();
320-
if (cid == current_node_id || candidate.distance() == anchor_dist) {
321-
continue;
322-
}
323-
bool in_result = false;
324-
for (const auto& r : result) {
325-
if (result_id(r) == static_cast<size_t>(cid)) {
326-
in_result = true;
327-
break;
328-
}
329-
}
330-
assert(
331-
!in_result &&
332-
"Candidate with non-anchor distance should not already be in result"
333-
);
334-
if (in_result) {
335-
continue;
336-
}
337-
result.back() = detail::construct_as(lib::Type<I>(), candidate);
338-
break;
339-
}
340-
}
341239
}
342240

343241
///

tests/svs/index/vamana/prune.cpp

Lines changed: 0 additions & 65 deletions
Original file line numberDiff line numberDiff line change
@@ -17,10 +17,6 @@
1717
// header under test
1818
#include "svs/index/vamana/prune.h"
1919

20-
// core
21-
#include "svs/core/data/simple.h"
22-
#include "svs/core/distance/euclidean.h"
23-
2420
// catch2
2521
#include "catch2/catch_test_macros.hpp"
2622

@@ -50,65 +46,4 @@ CATCH_TEST_CASE("Pruning", "[index][vamana]") {
5046
CATCH_REQUIRE(v::excluded(v::PruneState::Pruned) == true);
5147
}
5248
}
53-
54-
CATCH_SECTION("Duplicate Cluster Trap") {
55-
auto data = svs::data::SimpleData<float>(6, 4);
56-
auto d0 = std::vector<float>{1.0f, 1.0f, 1.0f, 1.0f};
57-
auto d4 = std::vector<float>{2.0f, 1.0f, 1.0f, 1.0f};
58-
auto d5 = std::vector<float>{1.5f, 1.0f, 1.0f, 1.0f};
59-
60-
for (size_t i = 0; i < 4; ++i) {
61-
data.set_datum(i, d0);
62-
}
63-
data.set_datum(4, d4);
64-
data.set_datum(5, d5);
65-
66-
auto dist = svs::distance::DistanceL2();
67-
auto accessor = svs::data::GetDatumAccessor{};
68-
69-
std::vector<svs::Neighbor<size_t>> pool = {
70-
{size_t{0}, 0.0f},
71-
{size_t{1}, 0.0f},
72-
{size_t{2}, 0.0f},
73-
{size_t{3}, 0.0f},
74-
{size_t{4}, 1.0f}};
75-
76-
CATCH_SECTION("Iterative Strategy Fix") {
77-
std::vector<svs::Neighbor<size_t>> result;
78-
v::heuristic_prune_neighbors(
79-
v::IterativePruneStrategy{},
80-
2,
81-
1.3f,
82-
data,
83-
accessor,
84-
dist,
85-
size_t{5},
86-
std::span<const svs::Neighbor<size_t>>(pool),
87-
result
88-
);
89-
90-
CATCH_REQUIRE(result.size() == 2);
91-
CATCH_REQUIRE(result[0].id() == 0);
92-
CATCH_REQUIRE(result[1].id() == 4);
93-
}
94-
95-
CATCH_SECTION("Progressive Strategy Fix") {
96-
std::vector<svs::Neighbor<size_t>> result;
97-
v::heuristic_prune_neighbors(
98-
v::ProgressivePruneStrategy{},
99-
2,
100-
1.3f,
101-
data,
102-
accessor,
103-
dist,
104-
size_t{5},
105-
std::span<const svs::Neighbor<size_t>>(pool),
106-
result
107-
);
108-
109-
CATCH_REQUIRE(result.size() == 2);
110-
CATCH_REQUIRE(result[0].id() == 0);
111-
CATCH_REQUIRE(result[1].id() == 4);
112-
}
113-
}
11449
}

0 commit comments

Comments
 (0)