[WIP] feat(parquet): support pushing bitmaps down to page-level filtering (except for nested columns)#375
zhf999 wants to merge 14 commits into
    // bitmap pushdown if there is any nested field in the schema
    PAIMON_ASSIGN_OR_RAISE(target_row_groups,
                           FilterRowGroupsByBitmap(selection_bitmap.value(),
                                                   target_row_groups, has_nested_field));
PARQUET_READ_ENABLE_PAGE_INDEX_FILTER=false is not respected for bitmap page pruning.
SetReadSchema calls FilterRowGroupsByBitmap before reading PARQUET_READ_ENABLE_PAGE_INDEX_FILTER, and FilterPagesByBitmap may still load offset/page index metadata and mark row groups as partially matched. So even when page-index filtering is explicitly disabled, a bitmap that only hits part of a row group can still trigger page-level pruning.
Could we gate bitmap page-level pruning behind the same option? When disabled, bitmap filtering should only prune at row-group level. A regression test with PARQUET_READ_ENABLE_PAGE_INDEX_FILTER=false and a partial-page bitmap would be useful too.
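To make the request concrete, here is a minimal sketch of the gating I have in mind. Everything in it is a hypothetical stand-in rather than the actual Paimon code: `RowGroupInfo`, `PlannedRowGroup`, and `PlanRowGroupsByBitmap` are illustrative names, a `std::set<int64_t>` stands in for the selection bitmap, and `enable_page_index_filter` is assumed to carry the value of PARQUET_READ_ENABLE_PAGE_INDEX_FILTER.

```cpp
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-in for row-group metadata; not the actual Paimon type.
struct RowGroupInfo {
    int64_t first_row;  // global row id of the first row in the row group
    int64_t num_rows;   // number of rows in the row group
};

// Hypothetical planning result for one surviving row group.
struct PlannedRowGroup {
    int32_t index;            // row-group ordinal in the file
    bool page_level_pruning;  // true only if page-level pruning may be applied
};

// Row groups with no selected rows are always dropped. A partially hit row
// group is marked for page-level pruning only when the option allows it.
std::vector<PlannedRowGroup> PlanRowGroupsByBitmap(
    const std::set<int64_t>& selected_rows,  // stand-in for the selection bitmap
    const std::vector<RowGroupInfo>& row_groups,
    bool enable_page_index_filter) {
    std::vector<PlannedRowGroup> planned;
    for (int32_t i = 0; i < static_cast<int32_t>(row_groups.size()); ++i) {
        const RowGroupInfo& rg = row_groups[i];
        auto begin = selected_rows.lower_bound(rg.first_row);
        auto end = selected_rows.lower_bound(rg.first_row + rg.num_rows);
        const int64_t hits = std::distance(begin, end);
        if (hits == 0) {
            continue;  // row-group level pruning is independent of the option
        }
        const bool partial = hits < rg.num_rows;
        planned.push_back({i, partial && enable_page_index_filter});
    }
    return planned;
}
```

With the option off, a partially hit row group is kept whole and no offset/page index metadata needs to be loaded for it. A regression test could assert exactly that.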
            }
        }
        return -1;
    }
Why does bitmap page pruning derive RowRanges from the first arbitrary column that has an offset index?
Since the bitmap already gives us row ids, the target row ranges can be derived independently of any specific column layout. Using one column’s page layout makes the pruning granularity depend on whichever column FindColumnWithOffsetIndex happens to return. If that column has larger/coarser pages than other columns, we may mark a much wider range as matched and lose a lot of the intended page-pruning benefit.
Could we compute row ranges directly from the bitmap, or at least choose the most selective/fine-grained offset index among the projected columns instead of returning on the first available column?
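As a rough illustration of the first option, row ranges can be built from the selected row ids alone, without consulting any column's offset index. This is only a sketch under assumptions: `RowRange` and `RowRangesFromSortedIds` are hypothetical names, the bitmap is assumed to be iterable as ascending global row ids (modelled here as a sorted vector), and ranges are inclusive and local to the row group.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Inclusive [first, second] range of row positions local to one row group.
// Hypothetical alias; the PR's RowRanges type may look different.
using RowRange = std::pair<int64_t, int64_t>;

// Collapses ascending global row ids into contiguous local ranges for the row
// group covering [rg_first_row, rg_first_row + rg_num_rows).
std::vector<RowRange> RowRangesFromSortedIds(const std::vector<int64_t>& sorted_row_ids,
                                             int64_t rg_first_row, int64_t rg_num_rows) {
    std::vector<RowRange> ranges;
    const int64_t rg_end = rg_first_row + rg_num_rows;  // exclusive upper bound
    for (int64_t id : sorted_row_ids) {
        if (id < rg_first_row) {
            continue;  // belongs to an earlier row group
        }
        if (id >= rg_end) {
            break;  // ids are ascending, so nothing later can match
        }
        const int64_t local = id - rg_first_row;
        if (!ranges.empty() && local <= ranges.back().second + 1) {
            ranges.back().second = local;  // adjacent or duplicate id: extend the range
        } else {
            ranges.emplace_back(local, local);  // gap found: start a new range
        }
    }
    return ranges;
}
```

Ranges derived this way do not depend on which column is inspected. Each projected column's own offset index would then decide which of its pages overlap them, so a coarse-paged column would no longer widen the match for the others.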
Purpose
This PR follows PR#371.
TargetRowGroup type to carry row-group index, partial-match status, row ranges, and read-range exclusion flags consistently across planning and reading phases.
Changes
src/paimon/format/parquet/target_row_group.h
- TargetRowGroup class and TargetRowGroups alias.
- row_ranges.h.
src/paimon/format/parquet/parquet_file_batch_reader.h
- TargetRowGroups instead of raw std::vector<int32_t>.
- FindColumnWithOffsetIndex and FilterPagesByBitmap.
src/paimon/format/parquet/parquet_file_batch_reader.cpp
- SetReadSchema, switched row-group candidate list to TargetRowGroups and applied bitmap filter before page-index filtering.
- FilterRowGroupsByPageIndex to intersect page ranges when RGs are already partially matched by bitmap.
Tests
- BitmapAllPagesSomeRowGroups
- BitmapPartialPagesSingleRowGroup
- BitmapAllAndPartialPagesMixed
- BitmapAllPagesWithPredicate
- BitmapPartialPagesWithPredicate
- BitmapMixedWithPredicate
- NestedMapBitmapFallback
- NestedListBitmapFallback
- NestedStructBitmapFallback
Datasets
Multiple performance tests were conducted on two datasets.
Test Methodology
To validate bitmap behavior, we organize tests into six groups based on bitmap distribution patterns. Each group contains five distinct datasets, and the final result reported is the average of these five runs.
Test Cases
Test Results
On Dataset#1
On Dataset#2
API and Format
Documentation
Generative AI tooling