GH-46374: [Python][Docs] Adds type checks for source in read_table #46330

Open: wants to merge 6 commits into base: main
17 changes: 13 additions & 4 deletions python/pyarrow/parquet/core.py
@@ -1627,9 +1627,10 @@ def partitioning(self):
Parameters
----------
source : str, pyarrow.NativeFile, or file-like object
- If a string passed, can be a single file name or directory name. For
- file-like objects, only read a single file. Use pyarrow.BufferReader to
- read a file contained in a bytes or buffer-like object.
+ If a string is passed, it should be single file name.
+ If the dataset module is enabled, you can also pass a directory name or a list
+ of file names.
+ Use pyarrow.BufferReader to read a file contained in a bytes or buffer-like object.
columns : list
If not None, only these columns will be read from the file. A column
name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
@@ -1825,7 +1826,15 @@ def read_table(source, *, columns=None, use_threads=True,
filesystem, path = _resolve_filesystem_and_path(source, filesystem)
if filesystem is not None:
source = filesystem.open_input_file(path)
- # TODO test that source is not a directory or a list
+ if not (
+ (isinstance(source, str) and not os.path.isdir(source))
@rok (Member) commented on May 9, 2025:

What if source is a folder in an S3 bucket? os.path.isdir will return True for an existing folder. Will it check through the S3 filesystem? That would add a network call. Can we just check whether the string is a valid folder path, and not whether it also exists?
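
To make the question concrete: `os.path.isdir` consults only the local filesystem, so it performs no network call and reports False for a URI string that does not name a local directory. A purely syntactic check, as suggested above, could look like the sketch below. The helper name `looks_like_directory`, its trailing-separator heuristic, and the `s3://bucket/folder/` URI are assumptions made for illustration; none of them come from the PR or from pyarrow.

```python
import os
import tempfile


def looks_like_directory(path: str) -> bool:
    """Hypothetical syntactic check: no filesystem or network access.

    Treats a path as a directory only if it ends with a separator.
    This is a heuristic: a directory path without a trailing
    separator is not detected.
    """
    return path.endswith("/") or path.endswith(os.sep)


# os.path.isdir inspects the local filesystem only.
with tempfile.TemporaryDirectory() as tmp:
    local_dir_exists = os.path.isdir(tmp)  # an existing local directory

# A remote URI is treated as an ordinary (non-existent) local path.
remote_uri_is_dir = os.path.isdir("s3://bucket/folder/")

# The syntactic check never touches any filesystem.
syntactic_remote = looks_like_directory("s3://bucket/folder/")
syntactic_file = looks_like_directory("data/part-0.parquet")
```

The trade-off is that a syntactic check is cheap and works the same for local and remote paths, but it cannot tell a directory from a file when the string carries no trailing separator.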

+ or isinstance(source, pa.NativeFile)
+ or hasattr(source, "read")
+ ):
+ raise ValueError(
+ "source should be a file name, a pyarrow.NativeFile or a file-like "
+ "object when the pyarrow.dataset module is not available"
+ )
dataset = ParquetFile(
source, read_dictionary=read_dictionary,
memory_map=memory_map, buffer_size=buffer_size,
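
The behaviour of the added check can be tried in isolation with a rough stand-in. The sketch below is not the PR's code: the function names `validate_single_file_source` and `is_rejected` are hypothetical, the `pa.NativeFile` branch is left out so that it runs without pyarrow, and the error message is shortened accordingly.

```python
import io
import os
import tempfile


def validate_single_file_source(source):
    """Hypothetical stand-in for the check added to read_table.

    Accepts a string that is not an existing local directory, or any
    object with a ``read`` attribute. The ``pa.NativeFile`` branch from
    the diff is omitted so the sketch needs only the standard library.
    """
    is_file_name = isinstance(source, str) and not os.path.isdir(source)
    is_file_like = hasattr(source, "read")
    if not (is_file_name or is_file_like):
        raise ValueError("source should be a file name or a file-like object")
    return source


def is_rejected(source):
    """Return True if the stand-in check raises ValueError for source."""
    try:
        validate_single_file_source(source)
    except ValueError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    directory_rejected = is_rejected(tmp)  # existing local directory
list_rejected = is_rejected(["a.parquet", "b.parquet"])  # list of names
file_name_rejected = is_rejected("a.parquet")  # plain file name
buffer_rejected = is_rejected(io.BytesIO(b""))  # file-like object
```

Under these assumptions, a directory and a list are rejected, while a file name and an in-memory buffer pass. A string that names a directory which does not exist locally would also pass, which is the case the review comment above is concerned with.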