Skip to content

GH-46374: [Python][Docs] Adds type checks for source in read_table #46330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Mottl
Copy link

@Mottl Mottl commented May 6, 2025

Rationale for this change

pyarrow.parquet.read_table actually supports source parameter as a list of strings

Are these changes tested?

No, an issue to test those has been opened as this code path is currently not being tested. See:

Are there any user-facing changes?

no, only docstring

@Mottl Mottl requested review from AlenkaF, raulcd and rok as code owners May 6, 2025 07:44
Copy link

github-actions bot commented May 6, 2025

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@Mottl Mottl changed the title Fix docstring for pyarrow.parquet.read_table [Docs][Python] Fix docstring for read_table May 6, 2025
@Mottl Mottl changed the title [Docs][Python] Fix docstring for read_table MINOR: [Docs][Python] Fix docstring for read_table May 6, 2025
Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky because if the dataset module is enabled, yes we can use a list of strings but if we've built pyarrow without dataset a list of strings won't work because instead of using ParquetDataset->source we will be using ParquetFile->source.
I don't think we want to mislead users thinking a list of strings can be used when it might not be the case. cc @AlenkaF

@rok
Copy link
Member

rok commented May 6, 2025

@raulcd I suppose we want to document this conditional behavior?

@raulcd
Copy link
Member

raulcd commented May 6, 2025

I suppose we want to document this conditional behavior?

That makes sense, then something in the lines:
If the dataset module is enabled a list of strings can also be used which correspond to a list of files.
or something in the lines? @rok ?

@AlenkaF
Copy link
Member

AlenkaF commented May 6, 2025

I agree that the conditional behavior is worth documenting.

If the dataset module is enabled a list of strings can also be used which correspond to a list of files.

This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing.

One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change.

@rok
Copy link
Member

rok commented May 6, 2025

Agreed with the proposed text (as amended by Alenka).

For the dataset marked test - might as well be a separate PR?

@Mottl
Copy link
Author

Mottl commented May 6, 2025

Can we just concat_tables from multiple ParquetFiles in case ParquetDataset is unavailable? Reading many files at once is a really handy feature.

@AlenkaF
Copy link
Member

AlenkaF commented May 6, 2025

I don't think we want to go in that direction—concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches is a good example). Instead, we could consider adding a warning and suggest the use of the dataset module when passing files as a list (or a dict) here:

# TODO test that source is not a directory or a list

@Mottl
Copy link
Author

Mottl commented May 6, 2025

Okay, close this PR as soon as you finish with the discussion. Thanks all!

@raulcd
Copy link
Member

raulcd commented May 6, 2025

Hi @Mottl thanks for your PR and contribution, I think the discussion is closed. Would you like to update this PR with docstring clarification based on the feedback and add a test as @AlenkaF commented? If you can't or don't have time we can add an issue to the issue tracker

@Mottl Mottl changed the title MINOR: [Docs][Python] Fix docstring for read_table WIP: [Docs][Python] Fix read_table May 6, 2025
@Mottl Mottl changed the title WIP: [Docs][Python] Fix read_table [Python][Docs] Adds type checks for source in read_table May 6, 2025
@Mottl
Copy link
Author

Mottl commented May 6, 2025

Have a look please

@Mottl
Copy link
Author

Mottl commented May 9, 2025

@raulcd

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 9, 2025
@raulcd raulcd changed the title [Python][Docs] Adds type checks for source in read_table GH-46374: [Python][Docs] Adds type checks for source in read_table May 9, 2025
Copy link

github-actions bot commented May 9, 2025

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

1 similar comment
Copy link

github-actions bot commented May 9, 2025

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlenkaF @rok I've opened an issue to test this code path as is not tested currently, let me know if that makes sense to you.
The current checks and docstring look good to me.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 9, 2025
@@ -1825,7 +1826,15 @@ def read_table(source, *, columns=None, use_threads=True,
filesystem, path = _resolve_filesystem_and_path(source, filesystem)
if filesystem is not None:
source = filesystem.open_input_file(path)
# TODO test that source is not a directory or a list
if not (
(isinstance(source, str) and not os.path.isdir(source))
Copy link
Member

@rok rok May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if source is a folder in an S3 bucket? os.path.isdir
will return True for existing folder. Will it check through the s3 fs? That would add a network call. Can we just check if the string is a valid folder path and not if it also exists?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants