Skip to content

Add a FilteredAccessor for filtered search.#1141

Open
hildebrandmw wants to merge 9 commits into
mainfrom
mhildebr/filtering
Open

Add a FilteredAccessor for filtered search.#1141
hildebrandmw wants to merge 9 commits into
mainfrom
mhildebr/filtering

Conversation

@hildebrandmw

@hildebrandmw hildebrandmw commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Introduce a FilteredAccessor as a filter aware version of SearchAccessor.

Motivation

Our current hook for performing filtered search relies on passing a QueryLabelProvider externally to all call sites. This has the following disadvantages:

  • The caller must somehow know exactly the ID space of the provider.
  • The QueryLabelProvider is hard-coded to use the accessor's internal ID, which may not always be the best fit depending on context.
  • QueryLabelProvider evaluations always live behind a trait object and are just queried one at a time.
  • The search accessor has no knowledge of the filter.
  • Splitting the label check with distance computations opens a time-of-check, time-of-use window that can be tricky to resolve.

How

This PR boils down to the following enums and trait:

#[derive(Debug, Clone, Copy)]
pub enum Decision<T> {
    Accept(T),
    Reject(T),
}

#[derive(Debug, Clone, Copy)]
pub enum ExpansionKind {
    All,
    AcceptOnly,
}

pub trait FilteredAccessor: HasId + Send + Sync {
    fn starting_points(
        &self,
    ) -> impl std::future::Future<Output = ANNResult<Vec<Decision<Self::Id>>>> + Send;

    fn start_point_distances<F>(
        &mut self,
        f: F,
    ) -> impl std::future::Future<Output = ANNResult<()>> + Send
    where
        F: FnMut(Decision<Self::Id>, f32) + Send;

    fn expand_beam_filtered<Itr, P, F>(
        &mut self,
        kind: ExpansionKind,
        ids: Itr,
        pred: P,
        on_neighbors: F,
    ) -> impl std::future::Future<Output = ANNResult<()>> + Send
    where
        Itr: Iterator<Item = Self::Id> + Send,
        P: HybridPredicate<Self::Id> + Send + Sync,
        F: FnMut(Decision<Self::Id>, f32) + Send;

    fn terminate_early(&mut self) -> bool {
        false
    }

    fn num_starting_points(&self) -> impl std::future::Future<Output = ANNResult<usize>> + Send {
        // provided
    }
}

The existing multi-hop search is rewritten to use FilteredAccessor and #1131 can be easily rewritten this way as well. This makes the FilteredAccessor "filter-aware" and allows the stated problems to be solved (the TOCTOU potential is really pushed down to the accessor, but that's the layer that knows the concurrency story so it's in a much better place to solve it anyways).

To reduce code churn, I added a new module diskann::graph::ext::labeled that is the new home for QueryLabelProvider and introduces a wrapping FilteredAccessor and associated strategy. The functionality in this module is intended as a compatibility layer for code that is currently relying on QueryLabelProvider.

Bug Fixes in MultiHop

With this change, it was clear that multi-hop search was slightly buggy with respect to start point handling: start points were treated as always fulfilling the predicate. This PR goes ahead and fixes that bug by tracking rejected start points explicitly and filtering them out in the post-process call.

What's up with QueryVisitDecision

This PR gets rid of QueryVisitDecision. It is conflating a large number of unrelated decisions.

  • It was almost never used properly: no code outside of the core diskann ever bothered with it, leading to semantic issues with, for example, beta-filtered search never calling on_visit.
  • Multi-hop search wasn't even using it right when adding items from two_hop_neighbors. To actually use it right would require a redundant interrogation of the QueryLabelProvider.
  • The decision to terminate is weird - that decision belongs with the accessor.
  • Tweaking distance is sketchy and not clear if it's intended to influence navigation or simply be recorded and passed to post process. Upcoming ID refactors will help restore this functionality and make it clear what the intent is.

Dropping this extra functionality means we can ditch the multi-hop tests specifically testing this functionality.

Suggested Review Order

  • Start with the first commit. Since it seemed cleanest to still use SearchStrategy, this commit relaxes the SearchAccessor trait bound on SearchStrategy::SearchAccessor to allow the disjoint FilteredAccessor to join the party. This commit is the one responsible for most of the churn in random parts of the code.
  • diskann/src/graph/glue.rs: The new FilteredAccessor and associated enums.
  • diskann/src/graph/search/multihop_search.rs: Refactored multi-hop search to use the new trait.
  • diskann/src/graph/ext/labeled.rs: The adaptor layer that allows code currently written for QueryLabelProvider to keep working.
  • diskann/test/generated/graph/test/cases/multihop/callback_filtering_grid.json was deleted since it's corresponding test was deleted. We no longer support distance tweaking.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new glue::FilteredAccessor abstraction to make filtered graph search “filter-aware” at the accessor layer, and refactors multi-hop filtered search to use this new trait. It also relocates the legacy QueryLabelProvider-based filtering hook into a new graph::ext::labeled compatibility module with adapter types so existing call sites can be wrapped with minimal churn.

Changes:

  • Add Decision<T>, ExpansionKind, and a new glue::FilteredAccessor trait for filter-aware expansion and start-point handling.
  • Refactor MultihopSearch to operate on FilteredAccessor and fix start-point handling (rejected start points now flow into two-hop expansion).
  • Add graph::ext::labeled compatibility adapters (strategy + accessor wrapper) and update affected crates/tests; remove distance-tweaking callback filtering test coverage and baselines.

Reviewed changes

Copilot reviewed 23 out of 25 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
diskann/test/generated/graph/test/cases/multihop/even_filtering_grid.json Updates baseline stats due to corrected start-point handling in multihop filtered search.
diskann/test/generated/graph/test/cases/multihop/callback_filtering_grid.json Removes baseline for dropped “distance tweaking” callback-filtering behavior.
diskann/src/internal/counter.rs Adds local counter value accessor used by tests to assert distance computation counts.
diskann/src/graph/test/provider.rs Exposes get_vector_count() for test assertions around distance computation behavior.
diskann/src/graph/test/cases/multihop.rs Refactors tests to use ext::labeled compatibility layer and updates expectations for rejected start points; removes callback-filtering tests.
diskann/src/graph/search/range_search.rs Tightens SearchStrategy bounds to require SearchAccessor: SearchAccessor for range search.
diskann/src/graph/search/multihop_search.rs Rewrites multihop search to use FilteredAccessor and Decision-based acceptance/rejection.
diskann/src/graph/search/knn_search.rs Tightens SearchStrategy bounds to require SearchAccessor: SearchAccessor for KNN search.
diskann/src/graph/search/diverse_search.rs Tightens SearchStrategy bounds and imports SearchAccessor for diverse search.
diskann/src/graph/mod.rs Adds graph::ext module.
diskann/src/graph/index.rs Removes QueryLabelProvider/QueryVisitDecision from graph::index and updates SearchStrategy bounds for paged search.
diskann/src/graph/glue.rs Adds Decision, ExpansionKind, and FilteredAccessor; relaxes SearchStrategy::SearchAccessor associated type bound to HasId.
diskann/src/graph/ext/mod.rs Introduces graph::ext module root.
diskann/src/graph/ext/labeled.rs Adds QueryLabelProvider and adapters (Filtered strategy + FilteredAccessor wrapper + post-process step) for legacy filter integration.
diskann-providers/src/model/graph/provider/layers/betafilter.rs Updates imports to new ext::labeled::QueryLabelProvider location.
diskann-providers/src/model/graph/provider/async_/inmem/spherical.rs Adjusts SearchStrategy bounds with explicit associated-type constraint for SearchAccessor.
diskann-providers/src/model/graph/provider/async_/inmem/scalar.rs Adjusts SearchStrategy bounds with explicit associated-type constraint for SearchAccessor.
diskann-providers/src/index/wrapped_async.rs Updates SearchStrategy bounds and refactors internal “noawait” paged-search wiring to avoid projection issues.
diskann-providers/src/index/diskann_async.rs Updates imports and generic bounds to align with updated SearchStrategy associated-type constraints.
diskann-garnet/src/labels.rs Updates imports to new ext::labeled::QueryLabelProvider location.
diskann-benchmark/src/utils/filters.rs Updates imports to new ext::labeled::QueryLabelProvider location.
diskann-benchmark/src/backend/index/benchmarks.rs Adjusts DefaultSearchStrategy bounds with explicit associated-type constraint for SearchAccessor.
diskann-benchmark-core/src/search/graph/range.rs Adds a trait bound ensuring graph::search::Range implements graph::Search for the selected strategy.
diskann-benchmark-core/src/search/graph/multihop.rs Updates to new multihop API (filter moved into strategy wrapper) and import path changes.
diskann-benchmark-core/src/search/graph/knn.rs Adds a trait bound ensuring graph::search::Knn implements graph::Search for the selected strategy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diskann/src/graph/ext/labeled.rs
Comment thread diskann/src/graph/index.rs
Comment thread diskann/src/graph/glue.rs Outdated
Comment thread diskann/src/graph/ext/labeled.rs Outdated
queries: Arc<Matrix<T>>,
strategy: Strategy<S>,
labels: Arc<[Arc<dyn graph::index::QueryLabelProvider<DP::InternalId>>]>,
labels: Arc<[Arc<dyn labeled::QueryLabelProvider<DP::InternalId>>]>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still using QueryLabelProvider with DP::InternalId, are we planning to remove it in a follow-up PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't planning on making this change. This PR allows filtered search implementations without using QueryLabelProvider whatsoever. A decent chunk of internal code assumes internal ID on the QueryLabelProvider (though I'm hoping in the near future all notion of an "external" vs "internal" ID will go away). The intent is to enable providers like the disk index to hook into filtering as they see fit without resorting to trait gymnastics on the QueryLabelProvider.

Comment thread diskann/src/graph/glue.rs Outdated
Comment thread diskann/src/graph/glue.rs
fn num_starting_points(&self) -> impl std::future::Future<Output = ANNResult<usize>> + Send {
self.starting_points()
.map(|result: ANNResult<_>| result.map(|v: Vec<_>| v.len()))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, do we really need all these special functions on starting points? They don't seem to really save many lines of code, and they're not any special optimizations either. Could we consider getting rid of num_starting_points and maybe even start_point_distances?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. One thing to consider is that implementing these as provided definitions doesn't really cost much. With hooking into start_points_distances, there is a bit of a chicken and egg problem: knowing the number of start points helps when sizing the neighbor priority queue, which is then pushed into from start_point_distances. I'm in favor of paring these down, but perhaps as a different PR?

Comment thread diskann/src/graph/glue.rs
Accept(T),
/// The item does not satisfy the filter criteria.
Reject(T),
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking this through--one thing folks in MSRI have discussed experimenting with is algorithms that use "partial" matches (e.g. maybe one label in an AND expression matches but the other doesn't). This could be modified in the future to return some kind of match score without breaking too much down the line, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmmmm. Do you know what that would look like from an algorithmic side?

@codecov-commenter

codecov-commenter commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.28536% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.43%. Comparing base (9b0ae29) to head (f6c7b6b).

Files with missing lines Patch % Lines
diskann/src/graph/glue.rs 83.33% 9 Missing ⚠️
diskann-providers/src/index/wrapped_async.rs 87.80% 5 Missing ⚠️
diskann/src/graph/ext/labeled.rs 97.35% 5 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1141      +/-   ##
==========================================
- Coverage   89.43%   89.43%   -0.01%     
==========================================
  Files         484      485       +1     
  Lines       91495    91539      +44     
==========================================
+ Hits        81829    81868      +39     
- Misses       9666     9671       +5     
Flag Coverage Δ
miri 89.43% <95.28%> (-0.01%) ⬇️
unittests 89.08% <95.28%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-benchmark-core/src/search/graph/knn.rs 95.08% <ø> (ø)
...iskann-benchmark-core/src/search/graph/multihop.rs 98.75% <100.00%> (+0.02%) ⬆️
diskann-benchmark-core/src/search/graph/range.rs 93.57% <ø> (ø)
diskann-benchmark/src/backend/index/benchmarks.rs 69.26% <ø> (ø)
diskann-benchmark/src/utils/filters.rs 86.25% <ø> (ø)
diskann-garnet/src/labels.rs 100.00% <ø> (ø)
diskann-providers/src/index/diskann_async.rs 96.21% <100.00%> (+0.01%) ⬆️
...rs/src/model/graph/provider/async_/inmem/scalar.rs 93.55% <ø> (ø)
...src/model/graph/provider/async_/inmem/spherical.rs 88.58% <ø> (ø)
...ders/src/model/graph/provider/layers/betafilter.rs 93.18% <ø> (ø)
... and 10 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants