
GH-43574: [Python] do not add partition columns from file path when reading single file#49853

Open
bkurtz wants to merge 1 commit into apache:main from bkurtz:fix-single-file-parquet-read

Conversation

@bkurtz bkurtz commented Apr 23, 2026

Fixes #43574 (I think; fixes the aspect of it that I ran into, but haven't checked the OP's issue).

Reverts a small portion of bd44410

Rationale for this change

This reverts a change made in pyarrow 17 that causes reading a single file to return different results when the file happens to live in a path containing x=y segments (i.e. segments that look like hive partition columns). Because some higher-level calls wrap this functionality differently, e.g. by opening a file before passing it to ParquetDataset, this can produce confusing, environment-dependent results. For example, for single-file local reads, pandas.read_parquet opens a file handle and passes it to pyarrow, while for remote reads it passes a single-file path plus a filesystem; the same code then behaves differently when tested against a local filesystem than against the deployed cloud filesystem.

The original change was introduced in #39438, where it was debated in a discussion thread (sorry; GitHub's links to resolved discussions don't always work well!). The gist of that discussion seems to be that the PR author thought this code path was unused, when in fact the subsequent issue shows that it is used.

Screenshot of the original discussion thread to help you find it: [screenshot image]

What changes are included in this PR?

Restores the special "single file" handling for single-file paths passed to the ParquetDataset constructor, analogous to the existing handling for an open file handle.

This results in the loaded dataset not parsing the full file path for hive partition columns, which results in a different set of columns.

Are these changes tested?

Added a new unit test. Verified that it fixes the issue I'd been observing, and which I'd commented on in #43574, though I don't have a working reproduction to verify that it fixes the original issue there.

Are there any user-facing changes?

This PR includes breaking changes to public APIs. In particular, it changes the columns returned by single-file calls to pyarrow.parquet.read_table(...), bringing the results back in line with pyarrow<17.

While this is technically a breaking change, note that the original PR that introduced the behavior in pyarrow 17 did not call it out as breaking either. That said, some time has passed since then, and it's plausible that some applications have come to depend on the current behavior.

@github-actions

⚠️ GitHub issue #43574 has been automatically assigned in GitHub to PR creator.

…when reading single file

Reverts a small portion of bd44410
@bkurtz bkurtz force-pushed the fix-single-file-parquet-read branch from 320823a to 8333a5c on April 24, 2026 at 17:46


Development

Successfully merging this pull request may close these issues.

[Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0

1 participant