fix: deduplicate doc counts in term aggregation for multi-valued fields #2854

Open · nuri-yoo wants to merge 3 commits into quickwit-oss:main from nuri-yoo:nuri-yoo/fix-term-agg-doc-count

Conversation

@nuri-yoo

Summary

  • Term aggregation was counting term occurrences instead of documents for multi-valued fields. A document with the same value appearing multiple times would inflate doc_count.
  • Added fetch_block_with_missing_unique_per_doc to ColumnBlockAccessor that deduplicates (doc_id, value) pairs after fetching, and used it in term aggregation collection.
  • Fixed existing test assertions that were asserting the buggy counts.

Details

The root cause is in ColumnBlockAccessor::fetch_block, which calls row_ids_for_docs. For multi-valued columns, this can return the same (doc_id, value) pair multiple times. term_entry() then increments bucket.count for each occurrence, inflating doc_count.

The fix adds a deduplication step that removes consecutive duplicate (doc_id, value) pairs in O(n), which is safe because row_ids_for_docs returns entries sorted by doc_id.

This also fixes the sub-aggregation path — CachedSubAggs::push(bucket_id, doc) was receiving the same doc multiple times for the same bucket.
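
The consecutive-pair dedup described above can be sketched as a standalone function. This is a minimal illustration, not the actual tantivy code: the real implementation works on `ColumnBlockAccessor`'s internal docid/value caches, and the name and signature here are made up for the example.

```rust
// Sketch of the consecutive (doc_id, value) dedup: keep a write cursor and
// only advance it when the next pair differs from the last kept pair.
// Assumes the input is sorted by doc_id, as row_ids_for_docs guarantees.
fn dedup_consecutive_pairs(pairs: &mut Vec<(u32, u64)>) {
    if pairs.is_empty() {
        return;
    }
    let mut write = 0;
    for read in 1..pairs.len() {
        if pairs[read] != pairs[write] {
            write += 1;
            pairs[write] = pairs[read];
        }
    }
    pairs.truncate(write + 1);
}

fn main() {
    // doc 0 carries value 7 twice (e.g. ["a", "a"] in a multi-valued field),
    // which would otherwise count doc 0 twice in the "a" bucket.
    let mut pairs = vec![(0, 7), (0, 7), (1, 7), (2, 3), (2, 3), (2, 3)];
    dedup_consecutive_pairs(&mut pairs);
    assert_eq!(pairs, vec![(0, 7), (1, 7), (2, 3)]);
}
```

Note this is the same in-place compaction that `Vec::dedup` performs; writing it out makes the O(n) single-pass structure explicit.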

Test plan

  • Existing terms_aggregation_missing_multi_value test now asserts correct doc counts (4 instead of 5, 2 instead of 3)
  • All 169 aggregation tests pass
  • `cargo +nightly fmt --all` is clean

Fixes #2721

Term aggregation was counting term occurrences instead of documents
for multi-valued fields. A document with the same value appearing
multiple times would inflate doc_count.

Add `fetch_block_with_missing_unique_per_doc` to ColumnBlockAccessor
that deduplicates (doc_id, value) pairs, and use it in term aggregation.

Fixes quickwit-oss#2721

```rust
    missing: Option<T>,
) {
    self.fetch_block_with_missing(docs, accessor, missing);
    if !accessor.index.get_cardinality().is_full() {
```

@PSeitz (Collaborator), Mar 18, 2026:
It's only necessary to deduplicate for multivalue cardinality.

@nuri-yoo (Author):
Fixed, thanks for the review.

Duplicates can only occur with multivalue columns, so narrow the
check from `!is_full()` to `is_multivalue()`.
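
The reasoning behind the narrowed guard can be illustrated with a toy enum. This is a hypothetical mirror of the columnar crate's cardinality distinction, not its actual definition:

```rust
// Hypothetical cardinality enum: only the multivalued case can produce
// duplicate (doc_id, value) pairs, since full and optional columns hold
// at most one value per document.
#[derive(Clone, Copy, PartialEq)]
enum Cardinality {
    Full,        // exactly one value per doc: no duplicates possible
    Optional,    // zero or one value per doc: no duplicates possible
    Multivalued, // zero or more values per doc: dedup required
}

impl Cardinality {
    fn is_multivalue(self) -> bool {
        self == Cardinality::Multivalued
    }
}

fn main() {
    // !is_full() would also (needlessly) trigger dedup for Optional columns.
    assert!(Cardinality::Multivalued.is_multivalue());
    assert!(!Cardinality::Full.is_multivalue());
    assert!(!Cardinality::Optional.is_multivalue());
}
```

The design point: `!is_full()` is true for optional columns too, but an optional column still yields at most one value per document, so the dedup pass there is pure overhead.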

```rust
let mut write = 0;
for read in 1..self.docid_cache.len() {
    if self.docid_cache[read] != self.docid_cache[write]
        || self.val_cache[read] != self.val_cache[write]
```

@PSeitz (Collaborator):
I think we should check only for duplicate docids, not for duplicate values?

Can you extend the tests to capture this?

@PSeitz (Collaborator), Mar 18, 2026:
Ah nevermind, we need to check both, so that we only filter duplicate values on the same docid but still handle multi-values.

@PSeitz (Collaborator), Mar 18, 2026:

It does not cover pairs if the values are not consecutive, e.g. :

`(0, 1), (0, 2), (0, 1)`

It's a bit more expensive. I think we could

  • make groups of consecutive docids (just start and end pos)
  • if the group contains more than 2 elements sort by value
  • then do the same algorithm as now

@nuri-yoo (Author):

Fixed. Values within each doc_id group are now sorted before deduplicating, and I added unit tests.

Sort values within each doc_id group before deduplicating, so that
non-adjacent duplicates are correctly handled.

Add unit tests for dedup_docid_val_pairs: consecutive duplicates,
non-consecutive duplicates, multi-doc groups, no duplicates, and
single element.
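
The final group-sort-dedup approach can be sketched as a standalone function. Names and signature are illustrative; the real `dedup_docid_val_pairs` operates on the accessor's parallel docid/value cache vectors rather than a vector of tuples:

```rust
// Sketch of the final dedup: pairs arrive sorted by doc_id, but values within
// a doc may repeat non-adjacently, e.g. (0, 1), (0, 2), (0, 1). For each run
// of equal doc_ids, sort the values so duplicates become adjacent, then drop
// consecutive duplicates in one pass.
fn dedup_docid_val_pairs(pairs: &mut Vec<(u32, u64)>) {
    let mut start = 0;
    while start < pairs.len() {
        let doc = pairs[start].0;
        let mut end = start + 1;
        while end < pairs.len() && pairs[end].0 == doc {
            end += 1;
        }
        // Sorting is only needed for groups of more than two elements:
        // in a two-element group any duplicate is already adjacent.
        if end - start > 2 {
            pairs[start..end].sort_unstable_by_key(|&(_, v)| v);
        }
        start = end;
    }
    pairs.dedup();
}

fn main() {
    // Non-consecutive duplicate (0, 1) from the review example, plus a
    // consecutive duplicate for doc 2.
    let mut pairs = vec![(0, 1), (0, 2), (0, 1), (1, 2), (2, 5), (2, 5)];
    dedup_docid_val_pairs(&mut pairs);
    assert_eq!(pairs, vec![(0, 1), (0, 2), (1, 2), (2, 5)]);
}
```

Sorting happens only inside each doc_id group, so the global doc_id order that the rest of the collection path relies on is preserved.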
Linked issue: "doc_count in aggregation result is not the actual number of documents" (#2721)