Add a FilteredAccessor for filtered search.#1141
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new glue::FilteredAccessor abstraction to make filtered graph search “filter-aware” at the accessor layer, and refactors multi-hop filtered search to use this new trait. It also relocates the legacy QueryLabelProvider-based filtering hook into a new graph::ext::labeled compatibility module with adapter types so existing call sites can be wrapped with minimal churn.
Changes:
- Add
Decision<T>,ExpansionKind, and a newglue::FilteredAccessortrait for filter-aware expansion and start-point handling. - Refactor
MultihopSearchto operate onFilteredAccessorand fix start-point handling (rejected start points now flow into two-hop expansion). - Add
graph::ext::labeledcompatibility adapters (strategy + accessor wrapper) and update affected crates/tests; remove distance-tweaking callback filtering test coverage and baselines.
Reviewed changes
Copilot reviewed 23 out of 25 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann/test/generated/graph/test/cases/multihop/even_filtering_grid.json | Updates baseline stats due to corrected start-point handling in multihop filtered search. |
| diskann/test/generated/graph/test/cases/multihop/callback_filtering_grid.json | Removes baseline for dropped “distance tweaking” callback-filtering behavior. |
| diskann/src/internal/counter.rs | Adds local counter value accessor used by tests to assert distance computation counts. |
| diskann/src/graph/test/provider.rs | Exposes get_vector_count() for test assertions around distance computation behavior. |
| diskann/src/graph/test/cases/multihop.rs | Refactors tests to use ext::labeled compatibility layer and updates expectations for rejected start points; removes callback-filtering tests. |
| diskann/src/graph/search/range_search.rs | Tightens SearchStrategy bounds to require SearchAccessor: SearchAccessor for range search. |
| diskann/src/graph/search/multihop_search.rs | Rewrites multihop search to use FilteredAccessor and Decision-based acceptance/rejection. |
| diskann/src/graph/search/knn_search.rs | Tightens SearchStrategy bounds to require SearchAccessor: SearchAccessor for KNN search. |
| diskann/src/graph/search/diverse_search.rs | Tightens SearchStrategy bounds and imports SearchAccessor for diverse search. |
| diskann/src/graph/mod.rs | Adds graph::ext module. |
| diskann/src/graph/index.rs | Removes QueryLabelProvider/QueryVisitDecision from graph::index and updates SearchStrategy bounds for paged search. |
| diskann/src/graph/glue.rs | Adds Decision, ExpansionKind, and FilteredAccessor; relaxes SearchStrategy::SearchAccessor associated type bound to HasId. |
| diskann/src/graph/ext/mod.rs | Introduces graph::ext module root. |
| diskann/src/graph/ext/labeled.rs | Adds QueryLabelProvider and adapters (Filtered strategy + FilteredAccessor wrapper + post-process step) for legacy filter integration. |
| diskann-providers/src/model/graph/provider/layers/betafilter.rs | Updates imports to new ext::labeled::QueryLabelProvider location. |
| diskann-providers/src/model/graph/provider/async_/inmem/spherical.rs | Adjusts SearchStrategy bounds with explicit associated-type constraint for SearchAccessor. |
| diskann-providers/src/model/graph/provider/async_/inmem/scalar.rs | Adjusts SearchStrategy bounds with explicit associated-type constraint for SearchAccessor. |
| diskann-providers/src/index/wrapped_async.rs | Updates SearchStrategy bounds and refactors internal “noawait” paged-search wiring to avoid projection issues. |
| diskann-providers/src/index/diskann_async.rs | Updates imports and generic bounds to align with updated SearchStrategy associated-type constraints. |
| diskann-garnet/src/labels.rs | Updates imports to new ext::labeled::QueryLabelProvider location. |
| diskann-benchmark/src/utils/filters.rs | Updates imports to new ext::labeled::QueryLabelProvider location. |
| diskann-benchmark/src/backend/index/benchmarks.rs | Adjusts DefaultSearchStrategy bounds with explicit associated-type constraint for SearchAccessor. |
| diskann-benchmark-core/src/search/graph/range.rs | Adds a trait bound ensuring graph::search::Range implements graph::Search for the selected strategy. |
| diskann-benchmark-core/src/search/graph/multihop.rs | Updates to new multihop API (filter moved into strategy wrapper) and import path changes. |
| diskann-benchmark-core/src/search/graph/knn.rs | Adds a trait bound ensuring graph::search::Knn implements graph::Search for the selected strategy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| queries: Arc<Matrix<T>>, | ||
| strategy: Strategy<S>, | ||
| labels: Arc<[Arc<dyn graph::index::QueryLabelProvider<DP::InternalId>>]>, | ||
| labels: Arc<[Arc<dyn labeled::QueryLabelProvider<DP::InternalId>>]>, |
There was a problem hiding this comment.
We are still using QueryLabelProvider with DP::InternalId, are we planning to remove it in a follow-up PR?
There was a problem hiding this comment.
I wasn't planning on making this change. This PR allows filtered search implementations without using QueryLabelProvider whatsoever. A decent chunk of internal code assumes internal ID on the QueryLabelProvider (though I'm hoping in the near future all notion of an "external" vs "internal" ID will go away). The intent is to enable providers like the disk index to hook into filtering as they see fit without resorting to trait gymnastics on the QueryLabelProvider.
| fn num_starting_points(&self) -> impl std::future::Future<Output = ANNResult<usize>> + Send { | ||
| self.starting_points() | ||
| .map(|result: ANNResult<_>| result.map(|v: Vec<_>| v.len())) | ||
| } |
There was a problem hiding this comment.
Similarly, do we really need all these special functions on starting points? They don't seem to really save many lines of code, and they're not any special optimizations either. Could we consider getting rid of num_starting_points and maybe even start_point_distances?
There was a problem hiding this comment.
Perhaps. One thing to consider is that implementing these as provided definitions doesn't really cost much. With hooking into start_points_distances, there is a bit of a chicken and egg problem: knowing the number of start points helps when sizing the neighbor priority queue, which is then pushed into from start_point_distances. I'm in favor of paring these down, but perhaps as a different PR?
| Accept(T), | ||
| /// The item does not satisfy the filter criteria. | ||
| Reject(T), | ||
| } |
There was a problem hiding this comment.
Just thinking this through--one thing folks in MSRI have discussed experimenting with is algorithms that use "partial" matches (e.g. maybe one label in an AND expression matches but the other doesn't). This could be modified in the future to return some kind of match score without breaking too much down the line, right?
There was a problem hiding this comment.
mmmmmm. Do you know what that would look like from an algorithmic side?
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1141 +/- ##
==========================================
- Coverage 89.43% 89.43% -0.01%
==========================================
Files 484 485 +1
Lines 91495 91539 +44
==========================================
+ Hits 81829 81868 +39
- Misses 9666 9671 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Introduce a
FilteredAccessoras a filter aware version ofSearchAccessor.Motivation
Our current hook for performing filtered search relies on passing a
QueryLabelProviderexternally to all call sites. This has the following disadvantages:QueryLabelProvideris hard-coded to use the accessor's internal ID, which may not always be the best fit depending on context.QueryLabelProviderevaluations always live behind a trait object and are just queried one at a time.How
This PR boils down to the following enums and trait:
The existing multi-hop search is rewritten to use
FilteredAccessorand #1131 can be easily rewritten this way as well. This makes theFilteredAccessor"filter-aware" and allows the stated problems to be solved (the TOCTOU potential is really pushed down to the accessor, but that's the layer that knows the concurrency story so it's in a much better place to solve it anyways).To reduce code churn, I added a new module
diskann::graph::ext::labeledthat is the new home forQueryLabelProviderand introduces a wrappingFilteredAccessorand associated strategy. The functionality in this module is intended as a compatibility layer for code that is currently relying onQueryLabelProvider.Bug Fixes in MultiHop
With this change, it was clear that multi-hop search was slightly buggy with respect to start point handling: start points were treated as always fulfilling the predicate. This PR goes ahead and fixes that bug by tracking rejected start points explicitly and filtering them out in the post-process call.
What's up with
QueryVisitDecisionThis PR gets rid of
QueryVisitDecision. It is conflating a large number of unrelated decisions.diskannever bothered with it, leading to semantic issues with, for example, beta-filtered search never callingon_visit.two_hop_neighbors. To actually use it right would require a redundant interrogation of theQueryLabelProvider.Dropping this extra functionality means we can ditch the multi-hop tests specifically testing this functionality.
Suggested Review Order
SearchStrategy, this commit relaxes theSearchAccessortrait bound onSearchStrategy::SearchAccessorto allow the disjointFilteredAccessorto join the party. This commit is the one responsible for most of the churn in random parts of the code.diskann/src/graph/glue.rs: The newFilteredAccessorand associated enums.diskann/src/graph/search/multihop_search.rs: Refactored multi-hop search to use the new trait.diskann/src/graph/ext/labeled.rs: The adaptor layer that allows code currently written forQueryLabelProviderto keep working.diskann/test/generated/graph/test/cases/multihop/callback_filtering_grid.jsonwas deleted since it's corresponding test was deleted. We no longer support distance tweaking.