(diversity) Integrate determinant diversity in tokp full precision search and disk search#1011
(diversity) Integrate determinant diversity in tokp full precision search and disk search#1011narendatha wants to merge 50 commits into
Conversation
Replace kind()-based string equality checks with explicit is_match() and get() phase-shape helpers on plugin structs. This avoids fragile ordering assumptions and makes each plugin responsible for recognising its own phase shape.
Co-authored-by: Copilot <copilot@github.com>
…plugins # Conflicts: # diskann-benchmark/src/backend/index/benchmarks.rs # diskann-benchmark/src/backend/index/product.rs # diskann-benchmark/src/backend/index/scalar.rs # diskann-benchmark/src/backend/index/search/plugins.rs # diskann-benchmark/src/backend/index/spherical.rs # diskann-benchmark/src/inputs/graph_index.rs # diskann-benchmark/src/inputs/mod.rs
Co-authored-by: Copilot <copilot@github.com>
…ojection issue This commit explores approaches to wire real candidate vectors into async determinant-diversity post-processing. Current state: IN COMPILATION ERROR (intentional for analysis) Attempted approaches: 1. Initial shim-trait FullPrecisionVectorAccessor with async get_full_precision_vector() - Resulted in 'implementation not general enough' at search_with() call 2. Removed explicit for<'a> post_processor::DeterminantDiversity bound - Still fails - the constraint is inherent in search_with() signature itself Root cause analysis: - search_with() requires: PP: for<'a> SearchPostProcess<S::SearchAccessor<'a>, T, O> - This means post-processor must work for ANY accessor lifetime 'a - But query = queries.row(query_idx) is borrowed for specific loop iteration lifetime - These are fundamentally incompatible - a borrowed value can't satisfy for<'a> generically Compiler errors (3 total): - 'not general enough': implementation needed for or<'a> but found specific '0 - 'does not live long enough': queries lifetime too short for 'static requirement Files modified: - diskann-benchmark/src/backend/index/benchmarks.rs: * Removed explicit for<'a> post_processor::DeterminantDiversity constraint * Narrowed plugin impl to FullPrecisionProvider<f32> - diskann-benchmark/src/backend/index/post_processor/determinant_diversity.rs: * Added shim trait FullPrecisionVectorAccessor * Async method get_full_precision_vector(&mut self, id) -> impl Future<...> Next steps to investigate: - Move determinant-diversity outside search_with() as post-processing reranking - This avoids HRTB entirely by applying after candidates are returned - Benchmark impact: measure recall/QPS with external reranking vs baseline Related context: - Disk index determinant-diversity works correctly (uses real vectors, shows 51-53% QPS cost) - Shared algorithm fixed (distance-to-similarity scoring direction) - Branch already merged with origin/main
Co-authored-by: Copilot <copilot@github.com>
…educe duplication - Use for<'a, 'b> SearchStrategy bound (user-provided fix) to break HRTB lifetime projection issue in the search_with post-processor constraint - Wire FullPrecisionVectorAccessor shim trait so async det-div post-processor fetches real candidate vectors instead of placeholder distances - Populate QPS/latency metrics in async det-div benchmark path (previously all 'missing') - Extract run_topk_timed helper to eliminate ~100 lines of duplicated loop/timing/recall machinery from DeterminantDiversity::run - Update async-determinant-diversity.json example tag (async-index-build -> graph-index-build) - Fix clippy::manual_async_fn in FullPrecisionVectorAccessor shim trait
- Use for<'a, 'b> SearchStrategy bound to resolve HRTB lifetime mismatch - Wire FullPrecisionVectorAccessor shim trait for async det-div to fetch real vectors - Implement DeterminantDiversity post-processor for async graph-index path - Extract run_topk_timed helper to eliminate ~100 lines of code duplication - Wire post_processor parameter to disk-index search pipeline - Update search parameter handling and result counting for post-processed results - Add TopkPostProcessor input type and necessary imports - Populate QPS/latency metrics in async det-div benchmark path
…rosoft/DiskANN into u/narendatha/det_div_plugins
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Naren! My biggest concern is diskann-benchmark. I would really prefer if we generalized diskann-benchmark-core to handle post-processing, which would reduce bugs and be more broadly applicable.
On the implementation front, there's a bit of work to do regarding being more opinionated on the nature of inputs, establishing invariants of the diversity search with a bespoke struct (which could help with the previous point), reducing code duplication, testing the implementation, and removing allocations with a Matrix.
I'll leave reviewing the diskann-disk changes to others.
- Add 4th type parameter PP to KNN with default of () - Add with_postprocessor() constructor for post-processor support - Keep new() for standard KNN without post-processing - Add accessor methods for index, queries, strategy, post_processor - Update Search impl to work with generic KNN<DP, T, S, ()> - Maintains backward compatibility - all existing code continues to work - Provides foundation for DeterminantDiversity plugin refactoring
…diskann-providers - Add diskann-providers/src/post_processor.rs with centralized DeterminantDiversityParams - Define DeterminantDiversityError for precise error reporting - Provide unified validation: power > 0.0, eta >= 0.0 - Add accessor methods and Display impl - Import tests included - Update diskann-benchmark to use shared type from diskann-providers - Remove duplicate local definition in plugins.rs - Maintains backward compatibility with anyhow::Error conversion - Foundation for sharing post-processor logic across subsystems (benchmark, disk-index)
…process.rs - Document algorithm overview, parameters (power, eta), variants, time complexity - Clarify distinction between eta=0 (exact) and eta>0 (ridge-regularized) paths - Reference asymptotic complexity O(m^3) due to determinant computation
- test_diversity_selects_orthogonal_candidates: Verify orthogonal pair chosen over parallel - test_diversity_selects_orthogonal_candidates_with_eta: Same test for eta>0 variant - test_high_power_prefers_closer_candidates: Verify relevance weighting with high power - test_equal_distances: Verify stable behavior with equal-distance candidates - test_eta_zero_is_greedy_path: Confirm eta=0.0 routes to greedy orthogonalization Note: Diversity tests use equal distances to eliminate relevance weighting interference, making them pure tests of the geometric diversity property.
- Remove duplicate post_process_with_eta_f32 and post_process_greedy_orthogonalization_f32 - Unify into single greedy_orthogonal_select() with inv_sqrt_eta parameter - inv_sqrt_eta=1.0/sqrt(eta) for ridge-regularized path (eta>0) - inv_sqrt_eta=1.0 for exact greedy path (eta=0) - Eliminates ~80 lines of near-duplicate code - All 14 existing tests continue to pass
- Use contiguous Matrix allocation instead of separate Vec per candidate - Reduces heap allocations from O(n) to O(1) where n = num candidates - Improves cache locality during orthogonalization iteration - Access residuals via row(i) / row_mut(i) instead of indexing Vec of Vecs - All 14 tests continue to pass
- Move implementation to model/graph/provider/determinant_diversity.rs - Export via provider::mod.rs as determinant_diversity_post_process - Keep backward-compatible re-export from async_::mod.rs - Remove old async_-scoped source file - Verify build and full diskann-providers tests pass
There was a problem hiding this comment.
Thanks Naren, I am glad this feature is starting to land in the repo. Left a few small comments here but my main question is whether greedy_orthogonal_select should be written in a way that is more pluggable and generalizable -
- Allocation and format of full precision vectors. Isn't
Vec<(Id, Vec<f32>)>too restrictive? - Should we lean into the matrix API more and its optimized implementations in
diskann-quantization?
…plugins # Conflicts: # diskann-benchmark/src/backend/index/search/knn.rs # diskann-benchmark/src/inputs/graph_index.rs # diskann-benchmark/src/inputs/mod.rs
…e Positive<f32> for power
…rrors Indented blocks in /// comments are interpreted as Rust doctests by rustdoc; the formula lines (e.g. 'alpha_i = ...', 'similarity(d) = ...') were being compiled and failed in CI. Wrap them in ext fences so they remain prose.
…plugins # Conflicts: # diskann-disk/src/search/provider/disk_provider.rs
…plugins # Conflicts: # diskann-benchmark/src/inputs/graph_index.rs
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Naren, this has been a journey. The PR still has a lot of little structural issues that need to be addressed before merging. I've done my best to enumerate the issues and potential fixes in the inline comments. Please read through all the comments first as some of them are inter-related.
Note that this review focused on architecture and integration. I'd like to take one more look at the algorithm when the structural issues are resolved.
Feel free to reach out with any questions!
| pub(crate) mod postprocess; | ||
|
|
||
| // Re-export from parent module for backward compatibility. | ||
| // The algorithm is not async-specific and lives in provider::determinant_diversity. |
There was a problem hiding this comment.
What backwards compatibility are we trying to maintain? This PR is adding the feature, no?
Please don't reexport items from places that aren't proper children. I know that diskann-providers is not structured well at all, but let's not make it any worse.
| pub fn new(comparisons: u32, hops: u32) -> Self { | ||
| Self { comparisons, hops } | ||
| } | ||
| } |
There was a problem hiding this comment.
This constructor is not needed.
| @@ -23,7 +23,7 @@ use crate::{ | |||
| }; | |||
|
|
|||
| /// A built-in helper for benchmarking the K-nearest neighbors method | |||
There was a problem hiding this comment.
Here is an alternative that doesn't require duplicating the search loop.
#[derive(Debug, Clone, Copy)]
pub struct Defaulted;
pub trait AsPostProcessor<'a, S, DP, T>
where
DP: provider::DataProvider,
S: glue::SearchStrategy<'a, DP, T>,
{
type Processor: glue::SearchPostProcess<S::SearchAccessor, T, DP::ExternalId> + Send + Sync;
fn as_post_processor(&'a self, strategy: &'a S) -> Self::Processor;
}
impl<'a, S, DP, T> AsPostProcessor<'a, S, DP, T> for Defaulted
where
DP: provider::DataProvider,
S: glue::DefaultPostProcessor<'a, DP, T, DP::ExternalId>,
{
type Processor = S::Processor;
fn as_post_processor(&'a self, strategy: &'a S) -> Self::Processor {
strategy.default_post_processor()
}
}
#[derive(Debug)]
pub struct Forwarded<PP>(PP);
impl<'a, S, DP, T, PP> AsPostProcessor<'a, S, DP, T> for Forwarded<PP>
where
DP: provider::DataProvider,
S: glue::SearchStrategy<'a, DP, T>,
PP: glue::SearchPostProcess<S::SearchAccessor, T, DP::ExternalId> + Clone + AsyncFriendly,
{
type Processor = PP;
fn as_post_processor(&'a self, _strategy: &'a S) -> Self::Processor {
self.0.clone()
}
}
// ...
impl<DP, T, S, PP> Search for KNN<DP, T, S, PP>
where
DP: provider::DataProvider<Context: Default, ExternalId: search::Id>,
S: for<'a> glue::SearchStrategy<'a, DP, &'a [T]> + Clone + AsyncFriendly,
PP: for<'a> AsPostProcessor<'a, S, DP, &'a [T]> + AsyncFriendly,
T: AsyncFriendly + Clone,
{
// ...
let strategy = self.strategy.get(index)?;
let post_processor = self.post_processor.as_post_processor(&strategy);
let stats = self
.index
.search_with(
knn_search,
strategy,
post_processor,
&context,
self.queries.row(index),
buffer,
)
//...
}Would prefer an approach like this that is mostly infallible.
| pub struct DeterminantDiversityAndFilter<'a> { | ||
| filter: &'a (dyn Fn(&u32) -> bool + Send + Sync), | ||
| power: f32, | ||
| eta: f32, |
There was a problem hiding this comment.
Why not reuse DeterminantDiversityParams ?
| /// - `power <= 0.0`: invalid relevance weighting | ||
| /// - `eta < 0.0`: invalid numerical stability parameter | ||
| pub fn new(power: f32, eta: f32) -> Result<Self, DeterminantDiversityError> { | ||
| if power <= 0.0 { |
There was a problem hiding this comment.
Adding .is_finite() checks wouldn't hurt.
| vector_filter: Option<VectorFilter<Data>>, | ||
| post_processor: Option<SearchPostProcessorKind>, | ||
| is_flat_search: bool, | ||
| ) -> ANNResult<SearchResult<Data::AssociatedDataType>> { |
There was a problem hiding this comment.
There are not substantive tests for the new functionality in diskann-disk and this API is starting to get bloated with many optionals. My stake in diskann-disk is relatively low. Please ensure the maintainers of this code are comfortable with these changes.
| #[derive(Debug, Clone, Serialize, Deserialize)] | ||
| #[serde(tag = "type", rename_all = "kebab-case")] | ||
| pub(crate) enum TopkPostProcessor { | ||
| DeterminantDiversity { power: f32, eta: f32 }, |
There was a problem hiding this comment.
This would be cleaner as
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "kebab-case")]
enum RawTopkPostProcessor {
DeterminantDiversity { power: f32, eta: f32 },
}
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(try_from = "RawTopkPostProcessor`, into = "RawTopkPostProcessor")]
pub(crate) enum TopkPostProcessor {
DeterminantDiversity(DeterminantDiversityParams)
}
// Impl necessary conversionsThis has the benefit of carrying around a fully verified DeterminantDiversityParams struct and is pretty easily extensible for future variants.
| None => write_field!(f, "Search IO Limit", "none (defaults to `usize::MAX`)")?, | ||
| } | ||
| match &self.post_processor { | ||
| Some(TopkPostProcessor::DeterminantDiversity { power, eta }) => { |
There was a problem hiding this comment.
Should this not live as a Display impl on TopkPostProcessor instead?
| registry.register( | ||
| "graph-index-full-precision-f32", | ||
| FullPrecision::<f32>::new() | ||
| .search(plugins::DeterminantDiversity) |
There was a problem hiding this comment.
Why is determinant diversity first? The most common search kinds should be ordered first.
|
|
||
| /// A search plugin for determinant-diversity top-k post-processing. | ||
| #[derive(Debug, Clone, Copy)] | ||
| pub(crate) struct DeterminantDiversity; |
There was a problem hiding this comment.
After reviewing this - matching on a sub-enum of TopK feels like it's going to be super brittle. Since the plugin implementation already duplicates all the logic that a new SearchPhase enum would need anyways, let's must make determinant diversity a new SearchPhase enum and keep the one-to-one mapping.
Integrate determinant-diversity into topk full-precision search and disk search
Summary
Wires the determinant-diversity post-processing algorithm as an optional
SearchPluginfor two search paths:inmem::FullAccessor)DiskIndexSearcherpost-processor)Determinant-diversity reranks search results to increase geometric diversity among the top-k, at a configurable quality/speed tradeoff.
Motivation
Standard nearest neighbor search often returns highly similar/redundant results. For some scenarios like RAG search, it's more valuable to retrieve diverse, complementary documents that cover different aspects of the query. This PR implements a principled approach based on determinant maximization of the Gram matrix of query-scaled vectors.
Algorithm
The algorithm scales each candidate vector by
(similarity_to_query)^powerand then greedily selects vectors to maximize the determinant of the Gram matrix (equivalent to maximizing the volume of the parallelepiped spanned by the vectors).Two variants are implemented:
Both variants run in O(n k d) time where n = candidates, k = results to return, d = dimension.
Performance Characteristics
Tested on OpenAI embeddings dataset (3072 dimensions, 10K sample):
The recall reduction is expected - Det_Div search intentionally selects diverse vectors rather than the absolute nearest neighbors.
Diversity Improvement
We measure diversity using log-determinant of the Gram matrix of scaled vectors (higher = vectors span more volume = more diverse).
Parameter Tuning Guide
power(default: 2.0)Controls the trade-off between relevance and diversity:
eta(default: 0.01)Ridge regularization parameter that controls numerical robustness and relevance preference:
Recommended Settings
poweretaChanges
Core algorithm (
diskann-providers)determinant_diversity_post_process()indiskann-providers/src/model/graph/provider/async_/determinant_diversity_post_process.rseta:eta == 0: greedy orthogonalizationeta > 0: eta-weighted residual selectionAsync graph-index path (
diskann-benchmark)FullPrecisionVectorAccessorshim trait + impl forinmem::FullAccessor<f32, ...>so the post-processor can fetch real candidate vectorsDeterminantDiversityplugin that satisfiesglue::SearchPostProcessusingfor<'a, 'b> SearchStrategyHRTB bound (credit: @hildebrandmw)Disk-index path (
diskann-disk,diskann-benchmark)DeterminantDiversityAndFilterprocessor toDiskSearchPostProcessorenumsearcher.search()to accept an optionalSearchPostProcessorKind::DeterminantDiversity { power, eta }DiskSearchPhaseJSON input to acceptpost_processorresult_countfrom search stats to correctly bound post-processed resultsInput/config
TopkPostProcessorserde enum indiskann-benchmark/src/inputs/post_processor.rswith validation (power > 0,eta >= 0)async-determinant-diversity.json,disk-index-determinant-diversity.jsonBenchmark JSON examples
Notes