[Enhancement] reading predicate column by late materialization and sort predicate column according to predicate selectivity #64600
base: main
Conversation
@cursor review
🧪 CI Insights: all jobs passed for commit d8089ab. 🟢
✅ Bugbot reviewed your changes and found no bugs!
class SegmentIterator final : public ChunkIterator {
public:
-    SegmentIterator(std::shared_ptr<Segment> segment, Schema _schema, SegmentReadOptions options);
+    SegmentIterator(std::shared_ptr<Segment> segment, Schema _schema, const SegmentReadOptions& options);
The segment iterator now has dual code paths for normal vs. late materialization, making it harder to maintain and debug.
Consider refactoring into a strategy pattern:
class ScanStrategy {
public:
    virtual ~ScanStrategy() = default;  // virtual dtor: strategies are deleted through the base pointer
    virtual Status read(Chunk* chunk, std::vector<rowid_t>* rowids, size_t n) = 0;
    virtual Status seek(ordinal_t pos) = 0;
};
class NormalScanStrategy : public ScanStrategy { /* ... */ };
class LateMaterializationStrategy : public ScanStrategy { /* ... */ };
CONF_mInt64(max_lookup_batch_request, "8");
// if the first predicate column's selectivity bigger than this, trigger sample,
// which means if the selectivity is very good already, we don't need to sample
CONF_mDouble(tigger_sample_selectivity, "0.2");
Typo: `tigger_sample_selectivity` should be `trigger_sample_selectivity`.
This config name is also unclear; please choose a clearer, more meaningful name.
// column-expr-predicate doesn't support [begin, end] interface
ASSIGN_OR_RETURN(next_start, _filter_by_compound_and_predicates(
        chunk, rowid, chunk_start, next_start, first_column_id,
        non_expr_column_predicate_map.at(first_column_id), current_columns));
We have already checked non_expr_column_predicate_map.contains(first_column_id), so we could use operator[] instead of at().
    return count;
}

Status read_by_rowds(Column* column, const rowid_t* rowids, size_t* count) override {
Typo: read_by_rowds should be read_by_rowids.
[BE Incremental Coverage Report] ❌ fail: 296 / 446 (66.37%)
}

// No decoder available yet, conservatively return false
return false;
Plain-encoded columns missing predicate push-down check
The support_push_down_predicate function's comment says "First check dict decoder, then check current page decoder" but the implementation only checks _dict_decoder and returns false when it's null. For plain-encoded VARCHAR columns where _dict_decoder is nullptr, this incorrectly disables predicate push-down optimization even though BinaryPlainPageDecoder::support_push_down_predicate() returns true for VARCHAR and next_batch_with_filter is fully implemented. The function should also check the page decoder when the dict decoder is not available.
[FE Incremental Coverage Report] ✅ pass: 6 / 7 (85.71%)
}
size_t total = *count;
size_t read_count = 0;
[[maybe_unused]] CppType data[total];
Variable Length Array risks stack overflow
The code uses a Variable Length Array (VLA) CppType data[total] where total can be as large as the chunk size (4096 by default). For types like int128_t (16 bytes), this allocates 64KB on the stack, risking stack overflow. VLAs are also not standard C++ (they're a GCC extension). The [[maybe_unused]] attribute is incorrect since the data array is actively used in the loop and passed to append_numbers. This should use heap allocation (e.g., std::vector<CppType> or std::unique_ptr<CppType[]>).
Signed-off-by: before-Sunrise <[email protected]>
}
size_t total = *count;
size_t read_count = 0;
[[maybe_unused]] CppType data[total];
Stack-allocated VLA may overflow for large row counts
The new read_by_rowids function uses a Variable Length Array (VLA) CppType data[total] which allocates memory on the stack based on the runtime count parameter. VLAs are a non-standard C++ extension and can cause stack overflow when total is large. For a page with 16K rows of int32 type, this allocates 64KB on the stack, and larger data types or deeper call stacks could trigger a crash. Using a std::vector or heap allocation would be safer.
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)



Note
Implements late materialization for predicate columns with efficient page-level filtering and zero-copy string handling.
- `column/append_with_mask.{h,cpp}` to append rows by selection mask; replaces custom logic in `SortedStreamingAggregator` and simplifies selector handling
- `ContainerResource`, new `get_immutable_bytes()` / `_data_base()`, and lazy materialization; switch call sites across hashing, CSV/Parquet writers, string funcs, join, sort, split, etc.
- `next_batch_with_filter`, `read_by_rowids`, `reserve_col`, and `support_push_down_predicate()` to page decoders (binary_plain/dict/bitshuffle), `ParsedPage`, and `ScalarColumnIterator`; new `compound_and_predicates_evaluate` to combine predicates; optimize `binary_col != ""` by checking offsets
- `enable_predicate_col_late_materialize` through OLAP/Lake readers and `RowsetOptions`; add `PredicateTree::has_or_predicate()` and sampling config `tigger_sample_selectivity`
- `Chunk::append_column` overload by `ColumnId`; numerous sites updated to use `get_immutable_bytes()` and reserve properly

Written by Cursor Bugbot for commit 024808e. This will update automatically on new commits.