
Conversation

@ShubhamChaurasia
Member

Description

Avoid redundant dictionary reads. Previously, the dictionary page was read once at row group pruning time (in PredicateUtils) and then read again during the actual file reading (in ParquetReader). This PR removes that redundant work by moving the dictionary-based row group pruning logic into ParquetReader, so the dictionary page only needs to be read once (a rough sketch of the idea follows below).

Would like to thank @pettyjamesm for his initial suggestions on this.
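
To illustrate the idea, here is a minimal, hypothetical sketch (class and method names are illustrative only, not the actual Trino classes): the dictionary page is decoded at most once, and the same decoded dictionary is shared by the row group pruning check and the subsequent value decoding.

```java
// Illustrative sketch only - names and structure are hypothetical, not Trino's actual code.
// The idea: decode the dictionary page at most once and share it between the
// row-group pruning check and the subsequent value decoding.
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class DictionaryAwareColumnChunkReader<T>
{
    private Optional<List<T>> decodedDictionary = Optional.empty();

    // Pruning: if no dictionary entry satisfies the predicate (and the whole column
    // chunk is dictionary-encoded), the row group cannot contain a match and can be skipped.
    boolean rowGroupMayMatch(Predicate<T> predicate)
    {
        return dictionary().stream().anyMatch(predicate);
    }

    // Reading: reuse the already decoded dictionary instead of reading the page again.
    List<T> dictionaryForDecoding()
    {
        return dictionary();
    }

    private List<T> dictionary()
    {
        if (decodedDictionary.isEmpty()) {
            decodedDictionary = Optional.of(decodeDictionaryPage());
        }
        return decodedDictionary.get();
    }

    private List<T> decodeDictionaryPage()
    {
        // Placeholder for the actual Parquet dictionary page decoding.
        throw new UnsupportedOperationException("sketch only");
    }
}
```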

Some numbers with the patch

These queries were run on 1TB TPC-DS data (Parquet, ZSTD compressed). Each result line is the CLI summary: elapsed time in seconds, [rows processed, input data read], and [rows/s, MiB/s].

select count(*) from store_sales where ss_store_sk=163;

Without Patch - 26.14 [2.88B rows, 2.08GiB] [110M rows/s, 81.4MiB/s]
With Patch - 21.09 [2.88B rows, 1.09GiB] [137M rows/s, 52.9MiB/s]


select count(*) from store_sales where ss_hdemo_sk < 5226;

Without Patch - 22.23 [2.88B rows, 2.4GiB] [130M rows/s, 111MiB/s]
With Patch - 17.17 [2.88B rows, 1.25GiB] [168M rows/s, 74.6MiB/s]


select count(*) from store_sales where ss_store_sk=163 and ss_hdemo_sk < 5226;

Without Patch - 28.37 [2.88B rows, 5.07GiB] [102M rows/s, 183MiB/s]
With Patch - 18.98 [2.88B rows, 2.94GiB] [152M rows/s, 158MiB/s]


select ss_cdemo_sk, count(*) from store_sales where ss_store_sk=163 and ss_hdemo_sk < 5226 group by ss_cdemo_sk;

Without Patch - 28.87 [2.88B rows, 7.36GiB] [99.8M rows/s, 261MiB/s]
With Patch - 24.23 [2.88B rows, 5.23GiB] [119M rows/s, 221MiB/s]

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

cla-bot added the cla-signed label on Nov 21, 2025
github-actions added the redshift (Redshift connector) label on Nov 21, 2025
@ShubhamChaurasia force-pushed the avoid-duplicate-dictionary-read-2 branch from 918ffd5 to 0c1761c on November 21, 2025 at 19:43

starburstdata-automation commented Nov 22, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_part.

Building Trino finished with status: success
Benchmark finished with status: failure
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: Found regressions for:
(presto/tpcds, q09, totalCpuTime, over by 16.8%)
(presto/tpcds, q88, totalCpuTime, over by 38.9%)
Benchmark Comparison to the closest run from Master: Report


starburstdata-automation commented Nov 22, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_unpart.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: NO Regression found.
Benchmark Comparison to the closest run from Master: Report

Comment on lines +219 to +220
private static final PredicateMatchResult FALSE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(false, null, null);
private static final PredicateMatchResult TRUE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(true, null, null);
Member


nit: Can we avoid null here? Pass TupleDomain#all and an empty set?
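
For reference, a sketch of what that suggestion might look like, mirroring the quoted lines and assuming the second constructor argument is a TupleDomain and the third a Set (the actual parameter types of PredicateMatchResult are defined in the PR, not shown here):

```java
// Hypothetical sketch of the suggested change; assumes the constructor takes a
// TupleDomain and a Set. Requires io.trino.spi.predicate.TupleDomain and
// com.google.common.collect.ImmutableSet.
private static final PredicateMatchResult FALSE_WITH_DICT_MATCHING_NOT_NEEDED =
        new PredicateMatchResult(false, TupleDomain.all(), ImmutableSet.of());
private static final PredicateMatchResult TRUE_WITH_DICT_MATCHING_NOT_NEEDED =
        new PredicateMatchResult(true, TupleDomain.all(), ImmutableSet.of());
```

TupleDomain.all() represents a domain with no constraints and ImmutableSet.of() an empty immutable set, which would make the "no dictionary matching needed" intent explicit without null sentinels.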
