@srlch
Contributor

@srlch srlch commented Dec 10, 2025

What I'm doing:

This PR introduces a builtin inverted index, including the entire read and write logic. It is built on the bitmap index infrastructure.

Fixes #67167

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels that determine which target branches this PR will be auto-backported to
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Adds a builtin inverted index implementation using bitmap index infrastructure with complete reader/writer/iterator and planner/runtime integration.

  • New builtin inverted index: BuiltinInvertedWriter/Reader/IndexIterator and BuiltinPlugin; simple tokenizer for English; wired into CMake
  • Storage integration: ColumnReader/Writer, Segment, SegmentIterator updated to load/use builtin inverted index via IndexReadOptions; iterator close path added; new GIN metrics (GinFilter*) and profile counters
  • Bitmap index enhancements: batch-add and de-dup in writer, dictionary predicate seek and batch read APIs in reader/iterator; new methods (add_value_with_current_rowid, incre_rowid, finish overload, seek_dictionary_by_predicate, next_batch_dictionary, read_union_bitmap(Buffer)) and minor size/memory optimizations
  • Lake connector/planner flags: pass enable_prune_column_after_index_filter and enable_gin_filter; propagate to tablet reader params
  • FE changes: allow IMP_LIB=builtin, default to builtin in shared-data mode; restrict clucene in shared-data; refine replicated storage rule when GIN present; assign index id for CN tables (non-vector)
  • Protocol changes: proto adds BUILTIN_INVERTED_INDEX and BuiltinInvertedIndexPB; thrift TLakeScanNode gains enable flags
  • Tests: extensive UTs for builtin inverted index, plugin factory/options, and bitmap index behavior
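
As a rough illustration of the write path summarized in the list above (tokenize English text, then record each term's row ids with de-dup), here is a minimal C++ sketch. It is not the PR's code: `tokenize`, `SketchInvertedWriter`, and `add_row` are hypothetical names, the tokenization rule (split on non-alphanumeric ASCII, lower-case) is an assumption, and a sorted `std::vector` stands in for the per-term bitmap.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a simple English analyzer: split on any
// non-alphanumeric ASCII character and lower-case each token.
std::vector<std::string> tokenize(std::string_view text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(std::move(current));
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(std::move(current));
    return tokens;
}

// Hypothetical writer: an ordered dictionary of term -> ascending row ids.
// The sorted vector is an assumption standing in for a per-term bitmap.
class SketchInvertedWriter {
public:
    // Index one row's text; row ids are assigned in arrival order.
    void add_row(std::string_view text) {
        assert(_next_rowid != UINT32_MAX);  // guard against row id overflow
        for (auto& token : tokenize(text)) {
            auto& rows = _postings[std::move(token)];
            // De-dup: a token repeated within the same row is recorded once.
            if (rows.empty() || rows.back() != _next_rowid) {
                rows.push_back(_next_rowid);
            }
        }
        ++_next_rowid;
    }

    const std::map<std::string, std::vector<uint32_t>>& postings() const {
        return _postings;
    }

private:
    std::map<std::string, std::vector<uint32_t>> _postings;
    uint32_t _next_rowid = 0;
};
```

With this sketch, indexing "Hello, hello world" as row 0 and "World of indexes" as row 1 leaves the term `hello` pointing at row 0 only and `world` pointing at rows 0 and 1.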

Written by Cursor Bugbot for commit e3f10ad. This will update automatically on new commits.

@srlch srlch requested review from a team as code owners December 10, 2025 05:58
@wanpengfei-git wanpengfei-git requested a review from a team December 10, 2025 05:59
@mergify mergify bot assigned srlch Dec 10, 2025
@mergify
Contributor

mergify bot commented Dec 10, 2025

🧪 CI Insights

Here's what we observed from your CI run for e3bf220.

🟢 All jobs passed!

But CI Insights is watching 👀

@alvin-celerdata
Contributor

@cursor review

@alvin-celerdata
Contributor

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata
Contributor

@cursor review

@alvin-celerdata
Contributor

@cursor review

@srlch srlch force-pushed the builtin_gin branch 2 times, most recently from 8644f5c to 97d36e4 Compare December 24, 2025 03:42
@alvin-celerdata
Contributor

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!

@alvin-celerdata
Contributor

@cursor review

@alvin-celerdata
Contributor

@cursor review

@alvin-celerdata
Contributor

@cursor review

sevev previously approved these changes Dec 26, 2025
@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 19 / 19 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/server/OlapTableFactory.java | 5 | 5 | 100.00% | [] |
| 🔵 com/starrocks/planner/OlapScanNode.java | 4 | 4 | 100.00% | [] |
| 🔵 com/starrocks/alter/SchemaChangeHandler.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/sql/analyzer/IndexAnalyzer.java | 6 | 6 | 100.00% | [] |
| 🔵 com/starrocks/common/InvertedIndexParams.java | 2 | 2 | 100.00% | [] |

@github-actions

[BE Incremental Coverage Report]

pass : 383 / 420 (91.19%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/storage/index/inverted/builtin/builtin_plugin.h | 0 | 4 | 00.00% | [23, 24, 25, 38] |
| 🔵 be/src/storage/index/inverted/inverted_plugin_factory.cpp | 0 | 2 | 00.00% | [26, 27] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_plugin.cpp | 0 | 4 | 00.00% | [22, 24, 27, 29] |
| 🔵 be/src/storage/index/inverted/inverted_index_option.cpp | 0 | 2 | 00.00% | [27, 28] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_reader.h | 1 | 7 | 14.29% | [36, 50, 52, 55, 56, 59] |
| 🔵 be/src/storage/rowset/column_reader.cpp | 7 | 10 | 70.00% | [123, 208, 573] |
| 🔵 be/src/connector/lake_connector.cpp | 6 | 8 | 75.00% | [309, 312] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_reader.cpp | 20 | 23 | 86.96% | [41, 47, 54] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_index_iterator.cpp | 131 | 140 | 93.57% | [41, 54, 55, 83, 128, 135, 137, 150, 151] |
| 🔵 be/src/storage/rowset/bitmap_index_reader.cpp | 26 | 27 | 96.30% | [119] |
| 🔵 be/src/storage/rowset/bitmap_index_writer.cpp | 55 | 56 | 98.21% | [393] |
| 🔵 be/src/storage/rowset/column_writer.cpp | 1 | 1 | 100.00% | [] |
| 🔵 be/src/storage/column_expr_predicate.cpp | 1 | 1 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_writer.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/clucene/clucene_inverted_writer.cpp | 2 | 2 | 100.00% | [] |
| 🔵 be/src/storage/lake/tablet_reader.cpp | 2 | 2 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/inverted_reader.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_index_iterator.h | 3 | 3 | 100.00% | [] |
| 🔵 be/src/storage/rowset/bitmap_index_writer.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/storage/rowset/segment.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_simple_analyzer.h | 9 | 9 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/inverted_index_iterator.h | 3 | 3 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_inverted_writer.cpp | 71 | 71 | 100.00% | [] |
| 🔵 be/src/storage/rowset/segment_iterator.cpp | 9 | 9 | 100.00% | [] |
| 🔵 be/src/storage/lake/rowset.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/storage/index/inverted/builtin/builtin_simple_analyzer.cpp | 27 | 27 | 100.00% | [] |

@alvin-celerdata
Contributor

@cursor review

return st;
}

Status BuiltinInvertedIndexIterator::_equal_query(const Slice* search_query, roaring::Roaring* bit_map) {
Contributor

If the text is tokenized at write time, can we still perform an equal query using the bitmap index?

Contributor Author

Here, "equal query" means matching a keyword. It always needs a MATCH predicate to trigger the inverted index.
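
To make the reply above concrete, here is a minimal sketch of a keyword match over a tokenized dictionary. It rests on assumptions: `Dictionary`, `Postings`, and `match_keyword` are hypothetical names, not the PR's API, and a sorted `std::vector` stands in for each term's bitmap. The point it illustrates is that the dictionary holds individual tokens, so the lookup compares the keyword against terms rather than against whole column values.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Postings = std::vector<uint32_t>;              // ascending row ids (assumed layout)
using Dictionary = std::map<std::string, Postings>;  // term -> rows containing it

// Hypothetical keyword match: an exact lookup of one term in the ordered
// dictionary. A row qualifies when any of its tokens equals the keyword.
Postings match_keyword(const Dictionary& dict, const std::string& keyword) {
    assert(!keyword.empty());  // an empty keyword is never a dictionary term
    auto it = dict.find(keyword);
    return it == dict.end() ? Postings{} : it->second;
}
```

In this sketch, a dictionary built from rows such as "the quick brown fox" contains `quick` and `brown` as separate terms, so `match_keyword(dict, "quick")` finds the row while `match_keyword(dict, "quick brown")` finds nothing, because the untokenized phrase is not a term.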

};
Slice from_value = Slice(lower.second);
size_t search_size = upper.first - lower.first;
auto res = _bitmap_itr->seek_dictionary_by_predicate(predicate, from_value, search_size);
Contributor

If the text is tokenized at write time, is it correct to apply the LIKE predicate to the dictionary?

Contributor Author

Yes, it performs LIKE on the dictionary.
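
A minimal sketch of what "LIKE on the dictionary" could look like, under stated assumptions: the function below only mirrors the shape of the `seek_dictionary_by_predicate(predicate, from_value, search_size)` call quoted in this thread, and its body is a guess at the general idea, not the PR's implementation. `Dictionary`, `Postings`, and `seek_by_predicate` are hypothetical names, and `std::set` stands in for a bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>

using Postings = std::set<uint32_t>;                 // row ids (bitmap stand-in)
using Dictionary = std::map<std::string, Postings>;  // ordered term -> rows

// Hypothetical dictionary scan: start at the first term >= from_value,
// examine at most search_size terms, and union the row ids of every term
// that satisfies the predicate.
Postings seek_by_predicate(const Dictionary& dict, const std::string& from_value,
                           std::size_t search_size,
                           const std::function<bool(const std::string&)>& predicate) {
    assert(predicate && "predicate must be callable");
    Postings result;
    auto it = dict.lower_bound(from_value);
    for (std::size_t i = 0; i < search_size && it != dict.end(); ++i, ++it) {
        if (predicate(it->first)) {
            result.insert(it->second.begin(), it->second.end());
        }
    }
    return result;
}
```

Under these assumptions, a prefix pattern such as `app%` can start the scan at `app` and test each term with a starts-with predicate, while a pattern such as `%apple%` has to scan from the beginning of the dictionary. Either way the pattern is evaluated against individual terms, so it matches rows containing a qualifying token rather than rows whose full value matches.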

@sonarqubecloud

This PR introduces a builtin inverted index, including the entire read and write logic.
It is built on the bitmap index infrastructure.

Signed-off-by: srlch <[email protected]>

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Builtin Inverted Index

6 participants