potential performance improvement for GSPath globbing capabilities #513

Open
@fafnirZ

Hey, I've been using GSPath's globbing capabilities to glob over a fairly large GCS bucket (a couple of GB) and have noticed that it takes a lot longer than an equivalent google-cloud-storage implementation:

list_blobs(match_glob="**/version_1/**")

Furthermore, with Task Manager open while globbing the bucket, I observe a significantly higher network footprint with GSPath than with the list_blobs implementation.

My guess is that cloudpathlib may be sending more network requests than necessary (correct me if I'm wrong).

Any reasons why we don't just leverage the match_glob arg for GSPath's glob capabilities?

GCloud SDK list_blobs(match_glob="") reference below:
https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1407

I believe a GSPath("/path/to/folder/").glob("**/version_1/**") call can be translated to list_blobs as follows:
list_blobs(prefix="/path/to/folder", match_glob="**/version_1/**")
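For illustration, here is a minimal sketch of that translation. The helper names `glob_to_list_blobs_args` and `glob_blob_names` and the `gs://my-bucket/...` URL are hypothetical and are not part of cloudpathlib or google-cloud-storage. It assumes a google-cloud-storage version recent enough that `Client.list_blobs` accepts `match_glob`. It also assumes the glob is matched against the full object name, so the sketch prepends the folder prefix to the pattern. That assumption is worth verifying against the GCS documentation.

```python
from urllib.parse import urlparse


def glob_to_list_blobs_args(gs_url, pattern):
    """Translate a gs:// folder URL plus a glob pattern into list_blobs arguments.

    Hypothetical helper: returns (bucket_name, kwargs for list_blobs).
    """
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs":
        raise ValueError(f"expected a gs:// URL, got {gs_url!r}")

    bucket_name = parsed.netloc
    # Object names in GCS have no leading slash; keep a trailing one so the
    # prefix only covers objects inside the folder.
    folder = parsed.path.strip("/")
    prefix = folder + "/" if folder else ""

    # Assumption: match_glob is applied to the full object name, so the
    # pattern is anchored under the folder prefix.
    return bucket_name, {
        "prefix": prefix or None,
        "match_glob": prefix + pattern,
    }


def glob_blob_names(gs_url, pattern):
    """List matching object names with a single server-side filtered listing."""
    # Imported lazily so the pure helper above has no third-party dependency.
    from google.cloud import storage

    bucket_name, kwargs = glob_to_list_blobs_args(gs_url, pattern)
    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket_name, **kwargs)]
```

With this sketch, `glob_blob_names("gs://my-bucket/path/to/folder/", "**/version_1/**")` would issue one paginated listing filtered on the server side. Note that GCS glob semantics may not match pathlib-style globbing exactly, so results would need to be compared against the current GSPath behaviour.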

Happy to submit something if you would like this change incorporated :)
