
Conversation

@before-Sunrise (Contributor) commented Oct 27, 2025

Why I'm doing:

  1. Optimize the implementation of late materialization in the internal table scan, so that selected rows in the same page can be handled in a single function call.
  2. Support reading predicate columns via late materialization: when there are many predicate columns and the earlier predicates filter out most of the data, this reduces IO and memory copies.
  3. Since predicate columns no longer need to be read all at once, we can reorder them so that the predicate with lower selectivity executes first. This is done by sampling data and checking the selectivity of every predicate.
  4. If the first predicate column is a string column, push the string predicate down to the page level, so we don't need to read a big string into the column and then filter it out when it doesn't satisfy the predicate. This is one implementation of zero-copy, since our current zero-copy doesn't support the string type.
  5. Optimize predicate evaluation speed for string_col != "" by checking only the offset column (a minimal sketch follows this list).
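
A minimal sketch of the idea behind item 5, assuming the common binary-column layout where row i's bytes live in [offsets[i], offsets[i+1]); the function and names below are illustrative, not the actual StarRocks code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Evaluate `string_col != ""` from the offsets array alone: a row is non-empty
// exactly when it occupies at least one byte, so the string bytes are never read.
void evaluate_not_empty(const std::vector<uint32_t>& offsets,
                        std::vector<uint8_t>* selection) {
    const size_t num_rows = offsets.size() - 1;
    selection->resize(num_rows);
    for (size_t i = 0; i < num_rows; ++i) {
        (*selection)[i] = offsets[i + 1] > offsets[i] ? 1 : 0;
    }
}
```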

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
  • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels for the target branches this PR will be auto-backported to
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Implements late materialization for predicate columns with efficient page-level filtering and zero-copy string handling.

  • New utility: column/append_with_mask.{h,cpp} to append rows by selection mask; replaces custom logic in SortedStreamingAggregator and simplifies selector handling (a simplified sketch of the idea follows this list)
  • BinaryColumn refactor: zero-copy via ContainerResource, new get_immutable_bytes()/_data_base(), and lazy materialization; switch call sites across hashing, CSV/Parquet writers, string funcs, join, sort, split, etc.
  • Predicate pushdown & late materialization: add next_batch_with_filter, read_by_rowids, reserve_col, and support_push_down_predicate() to page decoders (binary_plain/dict/bitshuffle), ParsedPage, and ScalarColumnIterator; new compound_and_predicates_evaluate to combine predicates; optimize binary_col != "" by checking offsets
  • Scan plumbing: propagate enable_predicate_col_late_materialize through OLAP/Lake readers and RowsetOptions; add PredicateTree::has_or_predicate() and sampling config tigger_sample_selectivity
  • APIs: Chunk::append_column overload by ColumnId; numerous sites updated to use get_immutable_bytes() and reserve properly
  • Minor cleanups/perf tweaks in string operations, logging, and case conversions
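
As referenced in the first bullet, appending rows by a selection mask boils down to the simplified sketch below; the function name and types are illustrative stand-ins, not the PR's actual column/append_with_mask interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append to `dst` every row of `src` whose mask entry is non-zero, in a single
// pass, instead of scattering per-row copies across call sites.
template <typename T>
void append_selected(std::vector<T>& dst, const std::vector<T>& src,
                     const std::vector<uint8_t>& mask) {
    for (size_t i = 0; i < src.size(); ++i) {
        if (mask[i]) {
            dst.push_back(src[i]);
        }
    }
}
```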

Written by Cursor Bugbot for commit 024808e. This will update automatically on new commits.

@alvin-celerdata (Contributor)

@cursor review

@mergify bot (Contributor) commented Oct 29, 2025

🧪 CI Insights

Here's what we observed from your CI run for d8089ab.

🟢 All jobs passed!

But CI Insights is watching 👀

@alvin-celerdata (Contributor)

@cursor review

@cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata (Contributor)

@cursor review

@alvin-celerdata (Contributor)

@cursor review

@alvin-celerdata (Contributor)

@cursor review

@alvin-celerdata (Contributor)

@cursor review

@alvin-celerdata (Contributor)

@cursor review

@cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata (Contributor)

@cursor review

@cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata (Contributor)

@cursor review

class SegmentIterator final : public ChunkIterator {
public:
SegmentIterator(std::shared_ptr<Segment> segment, Schema _schema, SegmentReadOptions options);
SegmentIterator(std::shared_ptr<Segment> segment, Schema _schema, const SegmentReadOptions& options);
Collaborator

The segment iterator now has dual code paths for normal vs late materialization, making it harder to maintain and debug.

Consider refactoring into a strategy pattern:

class ScanStrategy {
public:
    virtual ~ScanStrategy() = default;
    virtual Status read(Chunk* chunk, std::vector<rowid_t>* rowids, size_t n) = 0;
    virtual Status seek(ordinal_t pos) = 0;
};

class NormalScanStrategy : public ScanStrategy { /* ... */ };
class LateMaterializationStrategy : public ScanStrategy { /* ... */ };

CONF_mInt64(max_lookup_batch_request, "8");
// if the first predicate column's selectivity bigger than this, trigger sample,
// which means if the selectivity is very good already, we don't need to sample
CONF_mDouble(tigger_sample_selectivity, "0.2");
Collaborator

typo. trigger_sample_selectivity

this config name is unclear; please give it a clear and meaningful name.

// column-expr-predicate doesn't support [begin, end] interface
ASSIGN_OR_RETURN(next_start, _filter_by_compound_and_predicates(
chunk, rowid, chunk_start, next_start, first_column_id,
non_expr_column_predicate_map.at(first_column_id), current_columns));
Collaborator

we have already checked non_expr_column_predicate_map.contains(first_column_id), so we could use [] instead of at

return count;
}

Status read_by_rowds(Column* column, const rowid_t* rowids, size_t* count) override {
Collaborator

read_by_rowids

@github-actions

[BE Incremental Coverage Report]

fail : 296 / 446 (66.37%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 src/storage/lake/tablet_reader.cpp | 0 | 1 | 00.00% | [346] |
| 🔵 src/exprs/agg/group_concat.h | 0 | 2 | 00.00% | [216, 266] |
| 🔵 src/storage/push_utils.cpp | 0 | 1 | 00.00% | [152] |
| 🔵 src/storage/rowset/dictcode_column_iterator.h | 0 | 2 | 00.00% | [50, 51] |
| 🔵 src/exec/sorted_streaming_aggregator.cpp | 0 | 6 | 00.00% | [310, 324, 326, 329, 330, 442] |
| 🔵 src/exec/tablet_scanner.cpp | 0 | 1 | 00.00% | [150] |
| 🔵 src/column/binary_column.h | 0 | 22 | 00.00% | [70, 71, 72, 75, 76, 77, 78, 87, 88, 89, 118, 120, 121, 122, 125, 153, 154, 155, 166, 169, 270, 304] |
| 🔵 src/storage/rowset/page_decoder.h | 0 | 6 | 00.00% | [97, 98, 101, 102, 118, 119] |
| 🔵 src/storage/rowset/binary_dict_page.cpp | 1 | 4 | 25.00% | [272, 278, 279] |
| 🔵 src/storage/rowset/binary_plain_page.h | 2 | 8 | 25.00% | [221, 239, 240, 241, 242, 243] |
| 🔵 src/storage/rowset/scalar_column_iterator.cpp | 7 | 18 | 38.89% | [331, 339, 350, 351, 352, 355, 356, 357, 365, 366, 367] |
| 🔵 src/util/slice.h | 4 | 10 | 40.00% | [242, 243, 254, 255, 258, 259] |
| 🔵 src/storage/rowset/binary_dict_page.h | 2 | 3 | 66.67% | [136] |
| 🔵 src/storage/rowset/segment_iterator.cpp | 151 | 208 | 72.60% | [432, 433, 434, 435, 436, 437, 531, 536, 537, 612, 615, 619, 633, 652, 653, 654, 655, 657, 658, 659, 661, 662, 663, 665, 666, 670, 874, 1312, 1317, 1318, 1320, 1384, 1385, 1391, 1400, 1414, 1415, 1416, 1417, 1418, 1449, 1450, 1451, 1452, 1453, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1465, 1466, 1467, 1469] |
| 🔵 src/storage/rowset/parsed_page.cpp | 53 | 69 | 76.81% | [264, 265, 266, 268, 269, 270, 271, 273, 274, 277, 278, 279, 281, 361, 374, 404] |
| 🔵 src/column/binary_column.cpp | 43 | 51 | 84.31% | [99, 402, 408, 546, 565, 768, 769, 770] |
| 🔵 src/exprs/string_functions.cpp | 6 | 7 | 85.71% | [1605] |
| 🔵 src/storage/rowset/binary_plain_page.cpp | 2 | 2 | 100.00% | [] |
| 🔵 src/exprs/split.cpp | 3 | 3 | 100.00% | [] |
| 🔵 src/formats/parquet/level_builder.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/storage/lake/rowset.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/serde/column_array_serde.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/formats/csv/varbinary_converter.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/exprs/like_predicate.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/storage/rowset/scalar_column_iterator.h | 2 | 2 | 100.00% | [] |
| 🔵 src/storage/rowset/rowid_column_iterator.h | 11 | 11 | 100.00% | [] |
| 🔵 src/storage/tablet_reader.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/storage/rowset/rowset.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/formats/csv/string_converter.cpp | 2 | 2 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review

}

// No decoder available yet, conservatively return false
return false;

Plain-encoded columns missing predicate push-down check

The support_push_down_predicate function's comment says "First check dict decoder, then check current page decoder" but the implementation only checks _dict_decoder and returns false when it's null. For plain-encoded VARCHAR columns where _dict_decoder is nullptr, this incorrectly disables predicate push-down optimization even though BinaryPlainPageDecoder::support_push_down_predicate() returns true for VARCHAR and next_batch_with_filter is fully implemented. The function should also check the page decoder when the dict decoder is not available.
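
A hedged sketch of the suggested fix; the free-function form and the decoder parameter names below are stand-ins taken from the description above, not verified against the actual source:

```cpp
// Illustrative only: consult the dict decoder first, then fall back to the
// current page's decoder instead of conservatively returning false.
bool support_push_down_predicate(const PageDecoder* dict_decoder,
                                 const PageDecoder* page_decoder) {
    if (dict_decoder != nullptr) {
        return dict_decoder->support_push_down_predicate();
    }
    if (page_decoder != nullptr) {
        // Plain-encoded pages (e.g. BinaryPlainPageDecoder for VARCHAR) can
        // still evaluate predicates at the page level.
        return page_decoder->support_push_down_predicate();
    }
    // No decoder available at all: conservatively return false.
    return false;
}
```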


@github-actions

[FE Incremental Coverage Report]

pass : 6 / 7 (85.71%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/qe/SessionVariable.java | 4 | 5 | 80.00% | [5705] |
| 🔵 com/starrocks/sql/optimizer/rule/tree/AddIndexOnlyPredicateRule.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/statistic/StatisticUtils.java | 1 | 1 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review

}
size_t total = *count;
size_t read_count = 0;
[[maybe_unused]] CppType data[total];

Variable Length Array risks stack overflow

The code uses a Variable Length Array (VLA) CppType data[total] where total can be as large as the chunk size (4096 by default). For types like int128_t (16 bytes), this allocates 64KB on the stack, risking stack overflow. VLAs are also not standard C++ (they're a GCC extension). The [[maybe_unused]] attribute is incorrect since the data array is actively used in the loop and passed to append_numbers. This should use heap allocation (e.g., std::vector<CppType> or std::unique_ptr<CppType[]>).
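
A minimal sketch of the suggested remedy, keeping the `CppType`/`total` context from the snippet above: replace the stack VLA with heap-backed storage.

```cpp
// Illustrative fix only: std::vector allocates on the heap, so a large `total`
// or a wide CppType (e.g. int128_t) no longer risks overflowing the stack.
std::vector<CppType> data(total);
size_t read_count = 0;
// ... decode each requested rowid into data[read_count++] and pass
// data.data() / read_count to the column append as before ...
```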


@alvin-celerdata (Contributor)

@cursor review

}
size_t total = *count;
size_t read_count = 0;
[[maybe_unused]] CppType data[total];

Stack-allocated VLA may overflow for large row counts

The new read_by_rowids function uses a Variable Length Array (VLA) CppType data[total] which allocates memory on the stack based on the runtime count parameter. VLAs are a non-standard C++ extension and can cause stack overflow when total is large. For a page with 16K rows of int32 type, this allocates 64KB on the stack, and larger data types or deeper call stacks could trigger a crash. Using a std::vector or heap allocation would be safer.


@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)
