[FEAT] Create obstore store in fsspec on demand #198
Conversation
Construct store with from_url using protocol and bucket name
obstore/python/obstore/fsspec.py
Outdated
@@ -45,6 +47,9 @@ def __init__(
    self,
    store: obs.store.ObjectStore,
    *args,
    config: dict[str, Any] = {},
If we allow these, store should be optional?
And before merge we should enable typing overloads for better typing. You can see how from_url is implemented
I use store here for deciding the store interface (whether it is S3Store, GCSStore, ...), so that in AsyncFsspecStore we don't need to decide the interface based on the protocol.
Maybe there's a better way of deciding the store interface?
obstore_fs: AsyncFsspecStore = fsspec.filesystem(
"s3",
store=S3Store,
config={
"endpoint": "http://localhost:30002",
"access_key_id": "minio",
"secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contains bucket name
},
client_options={"timeout": "99999s", "allow_http": "true"},
retry_config={
"max_retries": 2,
"backoff": {
"base": 2,
"init_backoff": timedelta(seconds=2),
"max_backoff": timedelta(seconds=16),
},
"retry_timeout": timedelta(minutes=3),
},
)
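One alternative (a hypothetical sketch, not part of this PR) would be a lookup table from fsspec protocol to store class, so callers would not need to pass store= explicitly. The string values below are stand-ins for the real obstore store classes:

```python
# Hypothetical sketch: resolve the store interface from the fsspec protocol
# instead of requiring the caller to pass store=S3Store explicitly.
# String values stand in for the real obstore store classes.
PROTOCOL_TO_STORE = {
    "s3": "S3Store",
    "gs": "GCSStore",
    "az": "AzureStore",
}

def store_class_for(protocol: str) -> str:
    """Return the store class (stand-in) for an fsspec protocol."""
    try:
        return PROTOCOL_TO_STORE[protocol]
    except KeyError:
        raise ValueError(f"Unsupported protocol: {protocol}") from None

print(store_class_for("gs"))  # → GCSStore
```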
I'll have a look at the typing later on
Oh that's confusing because store is the type of the class and not an instance.
We should be able to use the from_url top level function directly here?
obstore/python/obstore/fsspec.py
Outdated
file_path = "/".join(path_li[1:])
return (bucket, file_path)

@lru_cache(maxsize=10)
It would be nice if this cache size could be user specified but we can come back to it
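One way to make the cache size user-specified (a sketch under the assumption that the cached helper is an instance method; all names here are illustrative, not from the PR) is to apply lru_cache at construction time rather than as a decorator:

```python
from functools import lru_cache

class BucketStoreCache:
    """Illustrative: a per-instance cache whose size the user chooses."""

    def __init__(self, cache_size: int = 10):
        # Wrap the uncached helper here so maxsize comes from the caller,
        # instead of being fixed by a @lru_cache(maxsize=10) decorator.
        self._construct_store = lru_cache(maxsize=cache_size)(
            self._construct_store_uncached
        )

    def _construct_store_uncached(self, bucket: str) -> str:
        # Stand-in for building a real obstore store for this bucket.
        return f"store-for-{bucket}"

cache = BucketStoreCache(cache_size=2)
cache._construct_store("a")
cache._construct_store("b")
cache._construct_store("a")  # served from cache
print(cache._construct_store.cache_info().hits)  # → 1
```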
Would there be one fsspec instance per cloud provider? So if you wanted to use s3 and gcs you'd make two separate instances?
Based on what I know, to use fsspec, we will do:

fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)

Each will have its own AsyncFsspecStore instance already. To configure, we can use (based on my current implementation):

s3_fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    store=S3Store,
    config={...},
)
gcs_fs: AsyncFsspecStore = fsspec.filesystem(
    "gs",
    store=GCSStore,
    config={...},
)
It would be nice to take out the
Force-pushed from 34f79f0 to 29464a7
Specify protocol s3, gs, and abfs
I use obstore/obstore/python/obstore/fsspec.py Lines 272 to 279 in a0d9e1d
Is it true that a single fsspec class can't be associated with more than one protocol? E.g. Azure has three different protocols
The latest PRs allow you to access the
I think we can if those protocols refer to the same object instance. s3fs does have 2 protocols ("s3", "s3a"), see: https://github.com/fsspec/s3fs/blob/023aecf00b5c6243ff5f8a016dac8b6af3913c6b/s3fs/core.py#L277

I think abfs, adlfs, and az have different implementations, so they export different classes. If we use them in obstore, I think we can define a class with protocol (abfs, adlfs, az), but we need to test if they all work.
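For reference, fsspec's registry is essentially a mapping from protocol name to implementing class, so one class declaring protocol = ("s3", "s3a") can be registered under both names. A minimal stand-in below uses a plain dict instead of fsspec's real registry:

```python
# Stand-in for fsspec's protocol registry: a plain dict mapping each
# protocol name to the implementing class. One class can back several names.
registry: dict = {}

class DemoFS:
    protocol = ("s3", "s3a")

    def __init__(self, **storage_options):
        self.storage_options = storage_options

# Register the same class once per protocol name it declares.
for name in DemoFS.protocol:
    registry[name] = DemoFS

print(registry["s3"] is registry["s3a"])  # → True
```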
obstore/python/obstore/fsspec.py
Outdated
@@ -104,6 +104,12 @@ def _split_path(self, path: str) -> Tuple[str, str]:
    # no bucket name in path
    return "", path

if path.startswith(self.protocol + "://"):
Assuming that this function will always receive a URL like s3://mybucket/path/to/file, I'm inclined for this function to use urlparse instead of manually handling the parts of the URL.
It will not always be s3://mybucket/path/to/file; it may come without a protocol, like mybucket/path/to/file.
I use urlparse like this here, which works for both s3://mybucket/path/to/file and mybucket/path/to/file:

obstore/obstore/python/obstore/fsspec.py Lines 108 to 112 in 75c738e
res = urlparse(path)
if res.scheme:
    if res.scheme != self.protocol:
        raise ValueError(f"Expect protocol to be {self.protocol}. Got {res.scheme}")
path = res.netloc + res.path
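To illustrate the snippet above, this is how the standard library's urlparse splits the two path forms, and why netloc + path recovers the bucket-relative path in both cases:

```python
from urllib.parse import urlparse

# With a scheme, the bucket lands in netloc and the key in path...
res = urlparse("s3://mybucket/path/to/file")
print(res.scheme, res.netloc, res.path)  # → s3 mybucket /path/to/file

# ...without a scheme, everything lands in path and netloc is empty,
# so netloc + path yields "mybucket/path/to/file" either way.
bare = urlparse("mybucket/path/to/file")
print(repr(bare.scheme), repr(bare.netloc), bare.path)

print(res.netloc + res.path == bare.netloc + bare.path)  # → True
```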
Oh cool! That seems to indicate that we could have a single class that defines supported protocols as:

protocol = ("s3", "s3a", "gs", "az", "abfs", "adlfs")

Because the fsspec class used for each is the same? It's just custom kwargs that would need to be passed down for each?
I don't think we can put all the protocols together into one class, as when using obstore/obstore/python/obstore/fsspec.py Lines 122 to 124 in 6614906

I think the better way is to create
I did a quick look through your PR; it's really good progress but a few thoughts:
Thanks for the suggestion! I just added the ruff linter and removed the error for path. I also added a check validating that the two bucket names from the two paths are the same.

For

Yes! I will update the test in the next few days
Check that AsyncFsspecStore is registered and test invalid types passed into register
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, super().info, path, **_kwargs)
Why are you calling super().info? You can call super()._info, which is async, and not need to touch the running event loop at all.
This is because _info in AsyncFileSystem is NotImplemented, see https://github.com/fsspec/filesystem_spec/blob/56054c0a30ceedab4c0e6a0f7e429666773baf6d/fsspec/asyn.py#L701-L702
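A minimal self-contained sketch of the pattern under discussion (sync_info is a dummy standing in for the synchronous AbstractFileSystem.info): when the async variant is NotImplemented upstream, the sync version can be delegated to the default executor so the event loop is not blocked:

```python
import asyncio

def sync_info(path: str) -> dict:
    # Stand-in for the synchronous AbstractFileSystem.info implementation.
    return {"name": path, "type": "file"}

async def _info(path: str) -> dict:
    # Run the blocking call in the default thread-pool executor so the
    # event loop stays free while it executes.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, sync_info, path)

result = asyncio.run(_info("mybucket/key"))
print(result)  # → {'name': 'mybucket/key', 'type': 'file'}
```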
I'd rather not. I'd rather just rely on the static type checker (at least for the
This is because the protocol is set as a class variable in register. I did this because obstore/obstore/python/obstore/fsspec.py Lines 521 to 532 in 5ea2ba8
I just updated the code; now the following works:

store = AsyncFsspecStore(protocol="s3", config={"skip_signature": True, "region": "us-west-2"})
path = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A"
print(store.ls(path))
Force-pushed from 1d4e786 to 45c4020
I fixed some of the comments above, and here are the links to the comments I replied to:

I will keep working from this comment: #198 (comment) tomorrow
Thank you! Ping me again when it's ready for a review!
As the runtime type check was removed from register for simplification, this test is no longer needed
Hi @kylebarron, I fixed all the review comments above. Please have a look and let me know if any more fixes are needed. Thanks
Thank you! obstore 0.5 will be published whenever the upstream
Construct the obstore store instance on demand in fsspec when calling methods. This allows automatic store creation for reads/writes across different buckets, aligning usage with fsspec conventions.
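The on-demand pattern this PR describes can be sketched roughly as follows. This is a stand-in, not the PR's code: construct_store returns a token where the real implementation would call something like obstore's from_url with the configured options, and split_path mirrors the bucket-splitting helper discussed above.

```python
from functools import lru_cache

@lru_cache(maxsize=10)
def construct_store(protocol: str, bucket: str) -> str:
    # A real implementation would build and cache an obstore store here;
    # this stand-in just returns a token so the flow can be demonstrated.
    return f"{protocol}://{bucket}"

def split_path(path: str) -> tuple:
    # Split "bucket/key..." into (bucket, key), as fsspec methods would.
    bucket, _, rest = path.partition("/")
    return bucket, rest

# Each filesystem method resolves the store for its own bucket, so a single
# fsspec instance can read and write across different buckets.
for path in ("bucket-a/data/file1", "bucket-b/data/file2", "bucket-a/other"):
    bucket, key = split_path(path)
    store = construct_store("s3", bucket)
    print(store, key)

print(construct_store.cache_info().hits)  # → 1 (bucket-a reused)
```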