Description
Hey, I've been using `GSPath`'s globbing capabilities to glob over a fairly large GCS bucket (a couple of GB) and have noticed that it takes a lot longer to process than an equivalent `google-cloud-storage` implementation:
list_blobs(match_glob="**/version_1/**")
Furthermore, with Task Manager open while globbing the bucket, I observe a significantly higher network footprint with `GSPath` than with the `list_blobs` implementation.
My guess is that cloudpathlib may be sending more network requests than necessary (correct me if I'm wrong).
Is there any reason why we don't just leverage the `match_glob` argument for `GSPath`'s glob capabilities?
Reference for the `match_glob` argument of `list_blobs` in the google-cloud-storage client:
https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1407
I believe a `GSPath("/path/to/folder/").glob("**/version_1/**")` call can be translated to `list_blobs` as follows:
list_blobs(prefix="/path/to/folder", match_glob="**/version_1/**")
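To make the proposed translation a bit more concrete, here is a rough sketch of how it could look. `glob_list_kwargs` and `server_side_glob` are hypothetical helper names, not part of cloudpathlib, and the `gs://my-bucket/...` URI in the usage note is a placeholder. I'm assuming that `match_glob` is matched against the full object name, which is why the prefix is prepended to the pattern. I'm also assuming the prefix itself contains no glob characters.

```python
from urllib.parse import urlparse


def glob_list_kwargs(gs_uri, pattern):
    """Translate a gs:// directory URI plus a glob pattern into list_blobs arguments.

    Hypothetical helper for illustration only; not part of cloudpathlib.
    """
    parsed = urlparse(gs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"expected a gs://bucket/... URI, got {gs_uri!r}")

    # Object names do not start with "/", so normalise the directory part
    # to "path/to/folder/" (or "" when globbing from the bucket root).
    prefix = parsed.path.strip("/")
    if prefix:
        prefix += "/"

    return {
        "bucket_or_name": parsed.netloc,
        "prefix": prefix,
        # Assumption: match_glob is applied to the full object name,
        # so the pattern is anchored under the prefix.
        "match_glob": prefix + pattern,
    }


def server_side_glob(client, gs_uri, pattern):
    """Yield gs:// URIs of objects matching the pattern, filtered by the server.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    kwargs = glob_list_kwargs(gs_uri, pattern)
    bucket = kwargs.pop("bucket_or_name")
    for blob in client.list_blobs(bucket, **kwargs):
        yield f"gs://{bucket}/{blob.name}"
```

With a real client this would be called as `server_side_glob(storage.Client(), "gs://my-bucket/path/to/folder/", "**/version_1/**")`. One caveat I'm not sure about: `list_blobs` only returns objects, so directory-like matches that `GSPath.glob` yields today would likely need separate handling, and the server-side glob syntax may not match `pathlib`-style semantics in every case.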
Happy to submit a PR if you would like this change incorporated :)