ReferenceFileSystem: streaming support for _open #1771

Closed
@skshetry

Description

Hi, I am using ReferenceFileSystem as a sort of virtual filesystem, similar to how you described it on Stack Overflow: "Does fsspec support virtual filesystems such as pyfilesystem".

It works great for my use-case, but I have encountered an issue - the _open API reads the entire file instead of streaming it.

def _open(self, path, mode="rb", block_size=None, cache_options=None, **kwargs):
    data = self.cat_file(path)  # load whole chunk into memory
    return io.BytesIO(data)

This behaviour is expected and is documented as such:

This FileSystem is read-only. It is designed to be used with async
targets (for now). This FileSystem only allows whole-file access, no
``open``. We do not get original file details from the target FS.

I’m curious if there’s a specific reason _open was implemented to load the entire file instead of allowing for streaming access. Could it be that I’m misusing ReferenceFileSystem? If not, I’d be happy to work on a PR to implement streaming support. Let me know if this would be useful!
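
To illustrate what streaming access could look like, here is a minimal, standard-library-only sketch. It is not fsspec's implementation: `RangeReader` and `fake_fetch` are hypothetical names, and `fetch(start, end)` stands in for whatever ranged read the target filesystem offers (for example, something along the lines of `cat_file` called with `start` and `end`, assuming the target supports ranged reads).

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object that fetches byte ranges on demand."""

    def __init__(self, fetch, size):
        # fetch(start, end) -> bytes is a hypothetical callable standing in
        # for a ranged read against the real storage backend.
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buf):
        # Fetch only as many bytes as the caller's buffer can hold.
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0  # EOF
        chunk = self._fetch(self._pos, end)
        n = len(chunk)
        buf[:n] = chunk
        self._pos += n
        return n


# Stand-in for remote data; records which ranges were requested.
payload = bytes(range(256)) * 4
requests = []


def fake_fetch(start, end):
    requests.append((start, end))
    return payload[start:end]


f = io.BufferedReader(RangeReader(fake_fetch, len(payload)), buffer_size=64)
f.seek(100)
head = f.read(10)  # triggers a small ranged fetch, not a full download
```

Wrapping the raw reader in `io.BufferedReader` keeps the number of small fetches down while still reading far less than the whole payload for a short read; a real implementation would also need to obtain the file size and handle references that point at byte ranges inside a larger target file.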

EDIT: I'm basically using it as follows, so that pyarrow preserves the partitioning scheme it infers from the file path.

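from fsspec.implementations.reference import ReferenceFileSystem
from pyarrow.dataset import dataset

path = "s3://bucket/parquets/first_name=Alice/5c6de-0.parquet"
fs = ReferenceFileSystem(fo={path: ["/path/to/a/local/cache"]})
ds = dataset(path, filesystem=fs, **kwargs)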
