GH-46374: [Python][Docs] Adds type checks for source in read_table #46330
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
This is tricky: if the dataset module is enabled, yes, we can use a list of strings, but if we've built pyarrow without dataset, a list of strings won't work, because instead of using ParquetDataset we will be using ParquetFile.
I don't think we want to mislead users into thinking a list of strings can be used when that might not be the case. cc @AlenkaF
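For illustration, a minimal sketch of the conditional behavior described above, assuming the dataset module's availability can be probed by importing pyarrow.dataset and that the example file paths exist:

import pyarrow.parquet as pq

paths = ["data/part-0.parquet", "data/part-1.parquet"]  # hypothetical files

try:
    # pyarrow.dataset is only importable when pyarrow was built with dataset support
    import pyarrow.dataset  # noqa: F401
    table = pq.read_table(paths)  # the list of paths is handled via ParquetDataset
except ImportError:
    # Without the dataset module, read_table goes through ParquetFile,
    # which accepts only a single path or file-like object.
    table = pq.read_table(paths[0])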
@raulcd I suppose we want to document this conditional behavior?
That makes sense, then something along the lines of:
I agree that the conditional behavior is worth documenting.
This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing. One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change.
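For reference, a hypothetical sketch of how the amended source parameter description might read; the sentences below are an approximation based on this thread, not the merged docstring:

SOURCE_PARAM_DOC = """\
source : str, pyarrow.NativeFile, or file-like object
    If a string, it can be a single file name or a directory name. For
    file-like objects, only a single file is read. A list of file paths
    can also be passed, but only if the pyarrow dataset module is enabled.
"""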
Agreed with the proposed text (as amended by Alenka). For the dataset-marked test: might as well be a separate PR?
Can we just
I don't think we want to go in that direction: concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches are a good example). Instead, we could consider adding a warning and suggesting the use of the dataset module when passing files as a list (or a dict) here: arrow/python/pyarrow/parquet/core.py, line 1828 (in a2e6136)
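A hedged sketch of the warning suggested here; the helper name, the dataset_available flag, and the message are assumptions, not code from core.py:

import warnings


def _warn_if_list_without_dataset(source, dataset_available):
    # When pyarrow is built without the dataset module, a list (or dict) of
    # paths cannot be handled by ParquetFile, so point users at the dataset
    # module instead of letting them hit a confusing error later.
    if isinstance(source, (list, dict)) and not dataset_available:
        warnings.warn(
            "Passing a list of files to read_table requires the pyarrow "
            "dataset module; build pyarrow with dataset support or read "
            "the files one by one.",
            UserWarning,
        )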
Okay, close this PR as soon as you finish with the discussion. Thanks all! |
Have a look please |
Co-authored-by: Raúl Cumplido <[email protected]>
@@ -1825,7 +1826,15 @@ def read_table(source, *, columns=None, use_threads=True,
     filesystem, path = _resolve_filesystem_and_path(source, filesystem)
     if filesystem is not None:
         source = filesystem.open_input_file(path)
+    # TODO test that source is not a directory or a list
+    if not (
+        (isinstance(source, str) and not os.path.isdir(source))
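The condition above is truncated in the diff view; a self-contained sketch of a guard along these lines, where the accepted types beyond a plain string and the error message are assumptions rather than the PR's actual code:

import io
import os
import pathlib


def _require_single_file_source(source):
    # Accept a non-directory string path, a non-directory pathlib.Path,
    # or a readable file-like object; reject directories, lists and dicts.
    if isinstance(source, str) and not os.path.isdir(source):
        return
    if isinstance(source, pathlib.Path) and not source.is_dir():
        return
    if isinstance(source, io.IOBase) or hasattr(source, "read"):
        return
    raise TypeError(
        "read_table without the dataset module expects a single file path "
        "or file-like object, not a directory or a list of files."
    )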
What if source is a folder in an S3 bucket? os.path.isdir will return True for an existing folder. Will it check through the S3 filesystem? That would add a network call. Can we just check whether the string is a valid folder path, and not whether it also exists?
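Illustrative only: os.path.isdir consults the local filesystem, so an S3 URI will not be reported as a directory there, and checking through an S3 filesystem (for example with pyarrow.fs.S3FileSystem.get_file_info) would cost a network round trip. A purely syntactic check, as suggested above, could look at the string itself; the heuristic below is an assumption, not existing pyarrow behavior:

import os


def _looks_like_directory_path(source: str) -> bool:
    # Treat a trailing separator or a missing file extension as a hint that
    # the string names a directory, without touching any filesystem.
    return source.endswith(("/", os.sep)) or os.path.splitext(source)[1] == ""


print(os.path.isdir("s3://bucket/folder"))                # False: checked on the local filesystem
print(_looks_like_directory_path("s3://bucket/folder/"))  # True, and no network call is made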
Rationale for this change
pyarrow.parquet.read_table actually supports the source parameter as a list of strings.
Are these changes tested?
No, an issue to test those has been opened as this code path is currently not being tested. See:
Are there any user-facing changes?
No, only docstring changes.