
GH-43574: [Python] do not add partition columns from file path when reading single file#49853

Open
bkurtz wants to merge 1 commit into apache:main from bkurtz:fix-single-file-parquet-read

Conversation

@bkurtz bkurtz commented Apr 23, 2026

Fixes #43574 (I think; fixes the aspect of it that I ran into, but haven't checked the OP's issue).

Reverts a small portion of bd44410

Rationale for this change

This reverts a change made in pyarrow 17 that causes reading a single file to return different results when the file happens to live in a path containing x=y segments (i.e. segments that look like hive partition columns). Because some higher-level calls wrap this functionality differently, e.g. by opening a file before passing it to ParquetDataset, this can produce confusing, environment-dependent results. For example, for single-file local reads, pandas.read_parquet opens a file handle and passes it to pyarrow, while for remote reads it passes a single-file path plus a filesystem; the same code then behaves differently when tested against a local filesystem than against the deployed cloud filesystem.

The original change was introduced in #39438, where it was debated in a discussion thread (sorry; GitHub's links to resolved discussions don't always work well!). The gist of that discussion seems to be that the PR author thought this code path was unused, when in fact the subsequent issue shows that it is used.

Screenshot of the original discussion thread to help you find it: [screenshot image]

What changes are included in this PR?

Restores the special "single file" handling for single-file paths passed to the ParquetDataset constructor, analogous to the existing handling for an open file handle.

This results in the loaded dataset not parsing the full file path for hive partition columns, which results in a different set of columns.

Are these changes tested?

Added a new unit test. Verified that it fixes the issue I'd been observing, and which I'd commented on in #43574, though I don't have a working reproduction to verify that it fixes the original issue there.

Are there any user-facing changes?

This PR includes breaking changes to public APIs. In particular, it changes the columns returned by single-file calls to pyarrow.parquet.read_table(...), bringing the results back in line with pyarrow<17.

While this is technically a breaking change, note that the original PR that introduced the behavior in pyarrow 17 did not call it out as breaking either. That said, some time has passed since then, and it's plausible that some applications have come to depend on the current behavior.

@github-actions

⚠️ GitHub issue #43574 has been automatically assigned in GitHub to PR creator.

…when reading single file

Reverts a small portion of bd44410
@bkurtz bkurtz force-pushed the fix-single-file-parquet-read branch from 320823a to 8333a5c on April 24, 2026 at 17:46


Development

Successfully merging this pull request may close these issues.

[Python] Accessing parquet files with parquet.read_table in google cloud storage fails, but works with dataset, works in 16.1.0 fails in 17.0.0

1 participant