
ReferenceFileSystem: streaming support for _open #1771

Closed
@skshetry

Hi, I am using ReferenceFileSystem as a sort of virtual filesystem, similar to how you described it on Stack Overflow in "Does fsspec support virtual filesystems such as pyfilesystem".

It works great for my use-case, but I have encountered an issue - the _open API reads the entire file instead of streaming it.

def _open(self, path, mode="rb", block_size=None, cache_options=None, **kwargs):
    data = self.cat_file(path)  # load whole chunk into memory
    return io.BytesIO(data)

This behaviour is expected and is documented as such:

This FileSystem is read-only. It is designed to be used with async
targets (for now). This FileSystem only allows whole-file access, no
``open``. We do not get original file details from the target FS.

I’m curious if there’s a specific reason _open was implemented to load the entire file instead of allowing for streaming access. Could it be that I’m misusing ReferenceFileSystem? If not, I’d be happy to work on a PR to implement streaming support. Let me know if this would be useful!
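To make the streaming idea concrete, here is a minimal stdlib-only sketch of what lazy access to a referenced byte range could look like. It is not fsspec code: `RangeReader` and `open_reference` are hypothetical names, the target is assumed to be a seekable local file, and the `offset`/`size` arguments only mimic the shape of a `[url, offset, size]` reference entry.

```python
import io
import os
import tempfile


class RangeReader(io.RawIOBase):
    """Hypothetical helper: read-only view of bytes [start, start + size) of a file."""

    def __init__(self, raw, start, size):
        self._raw = raw      # underlying seekable binary file object
        self._start = start  # absolute offset of the range in the target
        self._size = size    # length of the range in bytes
        self._pos = 0        # current position, relative to the range start

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}
        if whence not in base:
            raise ValueError(f"invalid whence: {whence!r}")
        new_pos = base[whence] + offset
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, b):
        # Only the bytes actually requested are read from the target.
        remaining = self._size - self._pos
        if remaining <= 0:
            return 0
        self._raw.seek(self._start + self._pos)
        data = self._raw.read(min(len(b), remaining))
        b[: len(data)] = data
        self._pos += len(data)
        return len(data)

    def close(self):
        if not self.closed:
            self._raw.close()
        super().close()


def open_reference(target_path, offset=0, size=None, block_size=io.DEFAULT_BUFFER_SIZE):
    """Return a buffered, seekable file object over one byte range (assumed local file)."""
    raw = open(target_path, "rb", buffering=0)
    if size is None:
        size = os.path.getsize(target_path) - offset
    return io.BufferedReader(RangeReader(raw, offset, size), buffer_size=block_size)


# Demo: read pieces of a 20-byte range without materialising the whole file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

with open_reference(demo_path, offset=5, size=20) as f:
    first = f.read(4)
    f.seek(-3, io.SEEK_END)
    tail = f.read()

os.remove(demo_path)
```

Compared with wrapping `cat_file` output in `io.BytesIO`, memory use here is bounded by `block_size` rather than by the size of the referenced range. A real implementation would presumably delegate to the target filesystem's own file objects instead of the built-in `open`.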

EDIT: I'm basically using it as follows, so that pyarrow preserves the partitioning format it infers from the file path.

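from fsspec.implementations.reference import ReferenceFileSystem
from pyarrow.dataset import dataset

path = "s3://bucket/parquets/first_name=Alice/5c6de-0.parquet"
fs = ReferenceFileSystem(fo={path: ["/path/to/a/local/cache"]})
ds = dataset(path, filesystem=fs, **kwargs)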
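For readers unfamiliar with why the original path matters, the sketch below shows roughly what Hive-style partition inference extracts from it. This is an illustration only, not pyarrow's implementation; `hive_partitions` is a hypothetical helper, and it assumes Hive-style partitioning is what gets requested through `**kwargs`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def hive_partitions(path):
    """Return {key: value} for every `key=value` directory segment of a path or URL."""
    directories = PurePosixPath(urlsplit(path).path).parent.parts
    return dict(part.split("=", 1) for part in directories if "=" in part)


partitions = hive_partitions("s3://bucket/parquets/first_name=Alice/5c6de-0.parquet")
print(partitions)
```

Reading the cached local copy directly would lose the `first_name=Alice` segment, which is why the reference mapping keeps the original S3-style path as the key.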
