
Conversation

@ShubhamChaurasia
Member

Description

Avoid redundant dictionary reads. Previously, the dictionary page was read once at row group pruning time (in PredicateUtils) and then read again during the actual file reading (in ParquetReader). This PR removes that redundant work by moving the dictionary-based row group pruning logic into ParquetReader, so the dictionary page only needs to be read once (a rough sketch of the idea follows below).

Would like to thank @pettyjamesm for his initial suggestions on this.
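
To illustrate the idea, here is a minimal, hypothetical sketch (class and method names are illustrative only, not the actual Trino classes): the dictionary page is decoded at most once, and the same decoded dictionary is shared by the row group pruning check and the subsequent value decoding.

```java
// Illustrative sketch only - names and structure are hypothetical, not Trino's actual code.
// The idea: decode the dictionary page at most once and share it between the
// row-group pruning check and the subsequent value decoding.
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class DictionaryAwareColumnChunkReader<T>
{
    private Optional<List<T>> decodedDictionary = Optional.empty();

    // Pruning: if no dictionary entry satisfies the predicate (and the whole column
    // chunk is dictionary-encoded), the row group cannot contain a match and can be skipped.
    boolean rowGroupMayMatch(Predicate<T> predicate)
    {
        return dictionary().stream().anyMatch(predicate);
    }

    // Reading: reuse the already decoded dictionary instead of reading the page again.
    List<T> dictionaryForDecoding()
    {
        return dictionary();
    }

    private List<T> dictionary()
    {
        if (decodedDictionary.isEmpty()) {
            decodedDictionary = Optional.of(decodeDictionaryPage());
        }
        return decodedDictionary.get();
    }

    private List<T> decodeDictionaryPage()
    {
        // Placeholder for the actual Parquet dictionary page decoding.
        throw new UnsupportedOperationException("sketch only");
    }
}
```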

Some numbers with the patch

These queries were run on 1TB TPC-DS data (Parquet, ZSTD compressed). Each result line is the CLI summary: elapsed time in seconds, [rows processed, input data read], and [rows/s, MiB/s].

select count(*) from store_sales where ss_store_sk=163;

Without Patch - 26.14 [2.88B rows, 2.08GiB] [110M rows/s, 81.4MiB/s]
With Patch - 21.09 [2.88B rows, 1.09GiB] [137M rows/s, 52.9MiB/s]


select count(*) from store_sales where ss_hdemo_sk < 5226;

Without Patch - 22.23 [2.88B rows, 2.4GiB] [130M rows/s, 111MiB/s]
With Patch - 17.17 [2.88B rows, 1.25GiB] [168M rows/s, 74.6MiB/s]


select count(*) from store_sales where ss_store_sk=163 and ss_hdemo_sk < 5226;

Without Patch - 28.37 [2.88B rows, 5.07GiB] [102M rows/s, 183MiB/s]
With Patch - 18.98 [2.88B rows, 2.94GiB] [152M rows/s, 158MiB/s]


select ss_cdemo_sk, count(*) from store_sales where ss_store_sk=163 and ss_hdemo_sk < 5226 group by ss_cdemo_sk;

Without Patch - 28.87 [2.88B rows, 7.36GiB] [99.8M rows/s, 261MiB/s]
With Patch - 24.23 [2.88B rows, 5.23GiB] [119M rows/s, 221MiB/s]

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

cla-bot added the cla-signed label on Nov 21, 2025
github-actions added the redshift (Redshift connector) label on Nov 21, 2025
@ShubhamChaurasia force-pushed the avoid-duplicate-dictionary-read-2 branch from 918ffd5 to 0c1761c on November 21, 2025 at 19:43

starburstdata-automation commented Nov 22, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_part.

Building Trino finished with status: success
Benchmark finished with status: failure
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: Found regressions for:
(presto/tpcds, q09, totalCpuTime, over by 16.8%)
(presto/tpcds, q88, totalCpuTime, over by 38.9%)
Benchmark Comparison to the closest run from Master: Report


starburstdata-automation commented Nov 22, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_unpart.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: NO Regression found.
Benchmark Comparison to the closest run from Master: Report

Comment on lines +219 to +220
private static final PredicateMatchResult FALSE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(false, null, null);
private static final PredicateMatchResult TRUE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(true, null, null);
Member


nit: Can we avoid null here? Pass TupleDomain#all and an empty set?
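
For reference, a sketch of what that suggestion might look like, mirroring the quoted lines and assuming the second constructor argument is a TupleDomain and the third a Set (the actual parameter types of PredicateMatchResult are defined in the PR, not shown here):

```java
// Hypothetical sketch of the suggested change; assumes the constructor takes a
// TupleDomain and a Set. Requires io.trino.spi.predicate.TupleDomain and
// com.google.common.collect.ImmutableSet.
private static final PredicateMatchResult FALSE_WITH_DICT_MATCHING_NOT_NEEDED =
        new PredicateMatchResult(false, TupleDomain.all(), ImmutableSet.of());
private static final PredicateMatchResult TRUE_WITH_DICT_MATCHING_NOT_NEEDED =
        new PredicateMatchResult(true, TupleDomain.all(), ImmutableSet.of());
```

TupleDomain.all() represents a domain with no constraints and ImmutableSet.of() an empty immutable set, which would make the "no dictionary matching needed" intent explicit without null sentinels.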
