fix: deduplicate doc counts in term aggregation for multi-valued fields #2854
Open
nuri-yoo wants to merge 3 commits into quickwit-oss:main from
Conversation
Term aggregation was counting term occurrences instead of documents for multi-valued fields. A document with the same value appearing multiple times would inflate `doc_count`. Add `fetch_block_with_missing_unique_per_doc` to `ColumnBlockAccessor` that deduplicates `(doc_id, value)` pairs, and use it in term aggregation. Fixes quickwit-oss#2721
PSeitz reviewed on Mar 18, 2026
columnar/src/block_accessor.rs
Outdated
```rust
    missing: Option<T>,
) {
    self.fetch_block_with_missing(docs, accessor, missing);
    if !accessor.index.get_cardinality().is_full() {
```
Collaborator
It's only necessary to deduplicate for multivalue cardinality. Duplicates can only occur with multivalue columns, so the check can be narrowed from `!is_full()` to `is_multivalue()`.
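A minimal sketch of the narrowed guard, using a stand-in `Cardinality` enum rather than the real columnar type (names are illustrative): full and optional columns can never yield duplicate `(doc_id, value)` pairs, so only multivalue columns need the dedup pass.

```rust
// Stand-in for the crate's cardinality type; names are illustrative.
#[derive(Clone, Copy, PartialEq)]
enum Cardinality {
    Full,        // exactly one value per doc
    Optional,    // zero or one value per doc
    Multivalued, // zero or more values per doc
}

impl Cardinality {
    fn is_multivalue(self) -> bool {
        self == Cardinality::Multivalued
    }
}

fn needs_dedup(card: Cardinality) -> bool {
    // Narrowed check: `!is_full()` would also run a useless dedup
    // pass for optional columns, which cannot produce duplicates.
    card.is_multivalue()
}

fn main() {
    assert!(!needs_dedup(Cardinality::Full));
    assert!(!needs_dedup(Cardinality::Optional));
    assert!(needs_dedup(Cardinality::Multivalued));
    println!("ok");
}
```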
PSeitz reviewed on Mar 18, 2026
```rust
let mut write = 0;
for read in 1..self.docid_cache.len() {
    if self.docid_cache[read] != self.docid_cache[write]
        || self.val_cache[read] != self.val_cache[write]
```
Collaborator
I think we should check only for duplicate docids, not for duplicate values?
Can you extend the tests to capture this?
Collaborator
Ah, never mind. We need to check both, so that the term id only filters duplicate values on the same docid, but still handles multi-values.
Collaborator
It does not cover pairs if the values are not consecutive, e.g.:
(0, 1), (0, 2), (0, 1)
It's a bit more expensive, but I think we could:
- make groups of consecutive docids (just start and end pos)
- if the group contains more than 2 elements, sort by value
- then do the same algorithm as now
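The proposed scheme can be sketched as a free function over parallel `docs`/`vals` vectors (the name and signature are illustrative, not the actual `ColumnBlockAccessor` code): find each run of equal doc ids, sort its values so duplicates become adjacent, then do a single write-pointer dedup pass. Groups of two need no sort, since a duplicate pair of size two is already adjacent.

```rust
// Deduplicate (doc_id, value) pairs stored in parallel vectors that
// are sorted by doc_id but not necessarily by value within a doc.
fn dedup_docid_val_pairs(docs: &mut Vec<u32>, vals: &mut Vec<u64>) {
    let len = docs.len();
    if len <= 1 {
        return;
    }
    // 1. Sort values inside each run of equal doc ids, so non-adjacent
    //    duplicates like (0, 1), (0, 2), (0, 1) become adjacent.
    let mut start = 0;
    while start < len {
        let mut end = start + 1;
        while end < len && docs[end] == docs[start] {
            end += 1;
        }
        if end - start > 2 {
            vals[start..end].sort_unstable();
        }
        start = end;
    }
    // 2. Single pass removing adjacent duplicate (doc, val) pairs.
    let mut write = 0;
    for read in 1..len {
        if docs[read] != docs[write] || vals[read] != vals[write] {
            write += 1;
            docs[write] = docs[read];
            vals[write] = vals[read];
        }
    }
    docs.truncate(write + 1);
    vals.truncate(write + 1);
}

fn main() {
    // Non-adjacent duplicate inside doc 0, plus docs 1 and 2.
    let mut docs = vec![0, 0, 0, 1, 2];
    let mut vals = vec![1, 2, 1, 2, 3];
    dedup_docid_val_pairs(&mut docs, &mut vals);
    assert_eq!(docs, vec![0, 0, 1, 2]);
    assert_eq!(vals, vec![1, 2, 2, 3]);
    println!("{:?} {:?}", docs, vals);
}
```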
Author
Fixed. Values within each doc_id group are now sorted before deduplicating, and added unit tests.
Sort values within each doc_id group before deduplicating, so that non-adjacent duplicates are correctly handled. Add unit tests for `dedup_docid_val_pairs`: consecutive duplicates, non-consecutive duplicates, multi-doc groups, no duplicates, and single element.
Summary
Term aggregation was counting term occurrences instead of documents for multi-valued fields, inflating `doc_count`. Added `fetch_block_with_missing_unique_per_doc` to `ColumnBlockAccessor` that deduplicates `(doc_id, value)` pairs after fetching, and used it in term aggregation collection.

Details
The root cause is in `ColumnBlockAccessor::fetch_block`, which calls `row_ids_for_docs`. For multi-valued columns, this can return the same `(doc_id, value)` pair multiple times. `term_entry()` then increments `bucket.count` for each occurrence, inflating `doc_count`.

The fix adds a deduplication step that removes duplicate `(doc_id, value)` pairs: `row_ids_for_docs` returns entries sorted by doc_id, and values within each doc_id group are sorted before a single linear dedup pass, so non-adjacent duplicates are handled as well.

This also fixes the sub-aggregation path: `CachedSubAggs::push(bucket_id, doc)` was receiving the same doc multiple times for the same bucket.

Test plan
- `terms_aggregation_missing_multi_value` test now asserts correct doc counts (4 instead of 5, 2 instead of 3)
- `cargo +nightly fmt --all` clean

Fixes #2721