fix: deduplicate doc counts in term aggregation for multi-valued fields #2854
Open
nuri-yoo wants to merge 3 commits into quickwit-oss:main from
Conversation
Term aggregation was counting term occurrences instead of documents for multi-valued fields. A document with the same value appearing multiple times would inflate `doc_count`. Add `fetch_block_with_missing_unique_per_doc` to `ColumnBlockAccessor` that deduplicates `(doc_id, value)` pairs, and use it in term aggregation. Fixes quickwit-oss#2721
PSeitz reviewed on Mar 18, 2026
columnar/src/block_accessor.rs
Outdated
```rust
    missing: Option<T>,
) {
    self.fetch_block_with_missing(docs, accessor, missing);
    if !accessor.index.get_cardinality().is_full() {
```
Collaborator
It's only necessary to deduplicate for multivalue cardinality. Duplicates can only occur with multivalue columns, so the check can be narrowed from `!is_full()` to `is_multivalue()`.
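A minimal sketch of the narrowed guard, using a stand-in `Cardinality` enum rather than the real columnar type (names are illustrative): full and optional columns can never yield duplicate `(doc_id, value)` pairs, so only multivalue columns need the dedup pass.

```rust
// Stand-in for the crate's cardinality type; names are illustrative.
#[derive(Clone, Copy, PartialEq)]
enum Cardinality {
    Full,        // exactly one value per doc
    Optional,    // zero or one value per doc
    Multivalued, // zero or more values per doc
}

impl Cardinality {
    fn is_multivalue(self) -> bool {
        self == Cardinality::Multivalued
    }
}

fn needs_dedup(card: Cardinality) -> bool {
    // Narrowed check: `!is_full()` would also run a useless dedup
    // pass for optional columns, which cannot produce duplicates.
    card.is_multivalue()
}

fn main() {
    assert!(!needs_dedup(Cardinality::Full));
    assert!(!needs_dedup(Cardinality::Optional));
    assert!(needs_dedup(Cardinality::Multivalued));
    println!("ok");
}
```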
PSeitz reviewed on Mar 18, 2026
```rust
let mut write = 0;
for read in 1..self.docid_cache.len() {
    if self.docid_cache[read] != self.docid_cache[write]
        || self.val_cache[read] != self.val_cache[write]
```
Collaborator
I think we should check only for duplicate docids, not for duplicate values?
Can you extend the tests to capture this?
Collaborator
Ah, never mind. We need to check both, so that the term id only filters duplicate values on the same docid, but still handles multi-values.
Collaborator
It does not cover pairs if the values are not consecutive, e.g.:
(0, 1), (0, 2), (0, 1)
It's a bit more expensive, but I think we could:
- make groups of consecutive docids (just start and end pos)
- if the group contains more than 2 elements, sort by value
- then do the same algorithm as now
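The proposed scheme can be sketched as a free function over parallel `docs`/`vals` vectors (the name and signature are illustrative, not the actual `ColumnBlockAccessor` code): find each run of equal doc ids, sort its values so duplicates become adjacent, then do a single write-pointer dedup pass. Groups of two need no sort, since a duplicate pair of size two is already adjacent.

```rust
// Deduplicate (doc_id, value) pairs stored in parallel vectors that
// are sorted by doc_id but not necessarily by value within a doc.
fn dedup_docid_val_pairs(docs: &mut Vec<u32>, vals: &mut Vec<u64>) {
    let len = docs.len();
    if len <= 1 {
        return;
    }
    // 1. Sort values inside each run of equal doc ids, so non-adjacent
    //    duplicates like (0, 1), (0, 2), (0, 1) become adjacent.
    let mut start = 0;
    while start < len {
        let mut end = start + 1;
        while end < len && docs[end] == docs[start] {
            end += 1;
        }
        if end - start > 2 {
            vals[start..end].sort_unstable();
        }
        start = end;
    }
    // 2. Single pass removing adjacent duplicate (doc, val) pairs.
    let mut write = 0;
    for read in 1..len {
        if docs[read] != docs[write] || vals[read] != vals[write] {
            write += 1;
            docs[write] = docs[read];
            vals[write] = vals[read];
        }
    }
    docs.truncate(write + 1);
    vals.truncate(write + 1);
}

fn main() {
    // Non-adjacent duplicate inside doc 0, plus docs 1 and 2.
    let mut docs = vec![0, 0, 0, 1, 2];
    let mut vals = vec![1, 2, 1, 2, 3];
    dedup_docid_val_pairs(&mut docs, &mut vals);
    assert_eq!(docs, vec![0, 0, 1, 2]);
    assert_eq!(vals, vec![1, 2, 2, 3]);
    println!("{:?} {:?}", docs, vals);
}
```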
Author
Fixed. Values within each doc_id group are now sorted before deduplicating, and added unit tests.
Sort values within each doc_id group before deduplicating, so that non-adjacent duplicates are correctly handled. Add unit tests for `dedup_docid_val_pairs`: consecutive duplicates, non-consecutive duplicates, multi-doc groups, no duplicates, and single element.
Summary
Term aggregation was counting term occurrences instead of documents for multi-valued fields, inflating `doc_count`. Added `fetch_block_with_missing_unique_per_doc` to `ColumnBlockAccessor` that deduplicates `(doc_id, value)` pairs after fetching, and used it in term aggregation collection.

Details
The root cause is in `ColumnBlockAccessor::fetch_block`, which calls `row_ids_for_docs`. For multi-valued columns, this can return the same `(doc_id, value)` pair multiple times. `term_entry()` then increments `bucket.count` for each occurrence, inflating `doc_count`.

The fix adds a deduplication step that removes duplicate `(doc_id, value)` pairs: `row_ids_for_docs` returns entries sorted by doc_id, and values within each doc_id group are sorted before a single linear dedup pass, so non-adjacent duplicates are handled as well.

This also fixes the sub-aggregation path: `CachedSubAggs::push(bucket_id, doc)` was receiving the same doc multiple times for the same bucket.

Test plan
- `terms_aggregation_missing_multi_value` test now asserts correct doc counts (4 instead of 5, 2 instead of 3)
- `cargo +nightly fmt --all` clean

Fixes #2721