GH-46374: [Python][Docs] Adds type checks for source in read_table #46330

Mottl · 2025-05-06T07:44:38Z

Rationale for this change

pyarrow.parquet.read_table actually accepts a list of strings for its source parameter, which the docstring does not currently mention.

Are these changes tested?

No. This code path is not currently tested, so an issue has been opened to add tests. See:

[Python] Exercise fallback case on tests for parquet.read_table in case dataset is not available #46373

Are there any user-facing changes?

No, only the docstring.

GitHub Issue: [Python][Doc] Improve docs to specify that source argument on parquet.read_table can also be a list of strings #46374

github-actions · 2025-05-06T07:45:03Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

raulcd

This is tricky. If the dataset module is enabled, a list of strings works. If pyarrow was built without dataset, a list of strings won't work, because source is then passed to ParquetFile instead of ParquetDataset.
I don't think we want to mislead users into thinking a list of strings can be used when that might not be the case. cc @AlenkaF
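
The conditional behavior described here can be modelled with a small sketch. This is not pyarrow code: `pick_reader` and its `dataset_available` flag are hypothetical names used only to illustrate which reader would receive `source` in each build, under the assumption that the ParquetFile path handles exactly one file.

```python
from typing import Any


def pick_reader(source: Any, dataset_available: bool) -> str:
    """Return the name of the reader a read_table-like function would use.

    Hypothetical model of the behavior discussed in this thread,
    not the actual pyarrow implementation.
    """
    if dataset_available:
        # With the dataset module, source goes to ParquetDataset,
        # which can take a list of file paths.
        return "ParquetDataset"
    if isinstance(source, (list, tuple)):
        # Without the dataset module, source goes to ParquetFile,
        # so a list of paths cannot be honored.
        raise ValueError(
            "a list of paths requires the dataset module; "
            "pass a single file instead"
        )
    return "ParquetFile"
```

In this model the same call with a list succeeds or fails depending only on how pyarrow was built, which is the ambiguity the docstring wording needs to spell out.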

rok · 2025-05-06T08:45:06Z

@raulcd I suppose we want to document this conditional behavior?

raulcd · 2025-05-06T08:49:20Z

I suppose we want to document this conditional behavior?

That makes sense. Then something along the lines of:
If the dataset module is enabled a list of strings can also be used which correspond to a list of files.
or something along those lines? @rok?

AlenkaF · 2025-05-06T09:10:22Z

I agree that the conditional behavior is worth documenting.

If the dataset module is enabled a list of strings can also be used which correspond to a list of files.

This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing.

One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change.

rok · 2025-05-06T09:54:34Z

Agreed with the proposed text (as amended by Alenka).

As for the dataset-marked test, that might as well be a separate PR?

Mottl · 2025-05-06T10:39:30Z

Can we just use concat_tables on the tables read from multiple ParquetFiles when ParquetDataset is unavailable? Reading many files at once is a really handy feature.

AlenkaF · 2025-05-06T11:17:23Z

I don't think we want to go in that direction. Concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches are a good example). Instead, we could consider adding a warning that suggests using the dataset module when files are passed as a list (or a dict) here:

arrow/python/pyarrow/parquet/core.py

Line 1828 in a2e6136

# TODO test that source is not a directory or a list
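
A minimal sketch of the warning idea, assuming a hypothetical helper `warn_if_multi_source` and a hypothetical `dataset_available` flag (neither exists in pyarrow, and the message text is only illustrative):

```python
import warnings


def warn_if_multi_source(source, dataset_available):
    """Warn when several files are passed but the dataset module is missing.

    Hypothetical helper for illustration only. Returns True if a warning
    was issued, False otherwise.
    """
    if not dataset_available and isinstance(source, (list, dict)):
        warnings.warn(
            "Reading multiple files requires the pyarrow.dataset module; "
            "install or build pyarrow with dataset support, "
            "or pass a single file.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return True
    return False
```

A warning keeps the fallback path free of multi-file logic, while still telling the user why a list is not handled there.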

Mottl · 2025-05-06T12:24:36Z

Okay, close this PR as soon as you finish with the discussion. Thanks all!

raulcd · 2025-05-06T12:42:02Z

Hi @Mottl, thanks for your PR and contribution. I think the discussion is closed. Would you like to update this PR with the docstring clarification based on the feedback and add a test, as @AlenkaF commented? If you can't or don't have time, we can add an issue to the issue tracker.

Mottl · 2025-05-06T14:17:49Z

Have a look please

Mottl · 2025-05-09T01:39:28Z

@raulcd

python/pyarrow/parquet/core.py

Co-authored-by: Raúl Cumplido <[email protected]>

github-actions · 2025-05-09T12:22:52Z

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

github-actions · 2025-05-09T12:23:32Z

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

raulcd

@AlenkaF @rok I've opened an issue to test this code path, as it is not currently tested. Let me know if that makes sense to you.
The current checks and docstring look good to me.

rok · 2025-05-09T13:15:28Z

python/pyarrow/parquet/core.py

@@ -1825,7 +1826,15 @@ def read_table(source, *, columns=None, use_threads=True,
        filesystem, path = _resolve_filesystem_and_path(source, filesystem)
        if filesystem is not None:
            source = filesystem.open_input_file(path)
-        # TODO test that source is not a directory or a list
+        if not (
+            (isinstance(source, str) and not os.path.isdir(source))


What if source is a folder in an S3 bucket? os.path.isdir will return True for an existing folder. Will it check through the S3 filesystem? That would add a network call. Can we just check whether the string is a valid folder path, without also checking that it exists?
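
For reference, a small sketch of the two kinds of check being compared. `os.path.isdir` only consults the local filesystem, so it makes no network call for an S3 URI string and returns False unless a local path of that name happens to exist. `looks_like_directory` is a hypothetical helper showing one possible purely lexical check; it assumes directories are written with a trailing separator and cannot recognize a folder path written without one.

```python
import os


def looks_like_directory(path: str) -> bool:
    """Lexical check only: True when the string ends with a path separator.

    Hypothetical helper; it never touches the filesystem or the network.
    """
    return path.endswith(("/", os.sep))


# os.path.isdir stats the local filesystem only, so this does not
# contact S3; it just looks for a local path with this name.
is_local_dir = os.path.isdir("s3://bucket/folder")

# The lexical check depends solely on how the string is written.
has_trailing_separator = looks_like_directory("s3://bucket/folder/")
```

Which of the two is appropriate depends on whether the check should reflect what exists on disk or only the shape of the string.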

Fix docstring for pyarrow.parquet.read_table

457fb44

Mottl requested review from AlenkaF, raulcd and rok as code owners May 6, 2025 07:44

github-actions bot added Component: Python awaiting review Awaiting review labels May 6, 2025

Mottl changed the title ~~Fix docstring for pyarrow.parquet.read_table~~ [Docs][Python] Fix docstring for read_table May 6, 2025

Mottl changed the title ~~[Docs][Python] Fix docstring for read_table~~ MINOR: [Docs][Python] Fix docstring for read_table May 6, 2025

raulcd reviewed May 6, 2025

Fix docstring

22b4a85

Mottl changed the title ~~MINOR: [Docs][Python] Fix docstring for read_table~~ WIP: [Docs][Python] Fix read_table May 6, 2025

Add type checks for source argument in read_table

db59178

Mottl changed the title ~~WIP: [Docs][Python] Fix read_table~~ [Python][Docs] Adds type checks for source in read_table May 6, 2025

raulcd reviewed May 9, 2025

Update python/pyarrow/parquet/core.py

e93d845

Co-authored-by: Raúl Cumplido <[email protected]>

github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 9, 2025

raulcd mentioned this pull request May 9, 2025

[Python] Exercise fallback case on tests for parquet.read_table in case dataset is not available #46373

Open

Mottl added 2 commits May 9, 2025 18:17

Add directory check

06ba561

linting

a781890

raulcd changed the title ~~[Python][Docs] Adds type checks for source in read_table~~ GH-46374: [Python][Docs] Adds type checks for source in read_table May 9, 2025

raulcd approved these changes May 9, 2025

github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 9, 2025

rok reviewed May 9, 2025
