🪣 Support for Blob Storage #955
base: main
Conversation
```python
directory = path_from_uri(data_uri)
directory.mkdir(parents=True, exist_ok=True)
store = zarr.storage.DirectoryStore(str(directory))
if storage.bucket:
```
In the old model, the adapter was handed multiple storage options and had to pick the one it wanted. Now, the caller in tiled.catalog.adapter
picks one storage option and passes just that one in.
`tiled/catalog/adapter.py`, lines 672 to 695 at `f3331ef`:
```python
adapter = STORAGE_ADAPTERS_BY_MIMETYPE[data_source.mimetype]
# Choose writable storage. Use the first writable storage item
# with a scheme that is supported by this adapter. For
# back-compat, if an adapter does not declare `supported_storage`,
# assume it supports file-based storage only.
supported_storage = getattr(
    adapter, "supported_storage", {FileStorage}
)
for storage in self.context.writable_storage:
    if isinstance(storage, tuple(supported_storage)):
        break
else:
    raise RuntimeError(
        f"The adapter {adapter} supports storage types "
        f"{[cls.__name__ for cls in supported_storage]} "
        "but the only available storage types "
        f"are {self.context.writable_storage}."
    )
data_source = await ensure_awaitable(
    adapter.init_storage,
    storage,
    data_source,
    self.segments + [key],
)
```
So, the task here is to check `isinstance(storage, BucketStorage)` versus `FileStorage`. Additionally, the `supported_storage` attribute on this class should be extended to include `BucketStorage`. This tells the caller to offer `BucketStorage` if that is the highest-priority item in `writable_storage`.
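A minimal sketch of what that could look like. The class and attribute names below (`FileStorage`, `BucketStorage`, `ZarrArrayAdapter`, the `init_storage` body) are illustrative stand-ins, not tiled's actual implementations:

```python
from dataclasses import dataclass


# Hypothetical storage dataclasses standing in for tiled's real ones.
@dataclass(frozen=True)
class FileStorage:
    uri: str


@dataclass(frozen=True)
class BucketStorage:
    uri: str


class ZarrArrayAdapter:
    # Declaring both types tells the caller in tiled.catalog.adapter that it
    # may offer BucketStorage when that is the highest-priority item in
    # writable_storage; omitting it falls back to file-based storage only.
    supported_storage = {FileStorage, BucketStorage}

    @staticmethod
    def init_storage(storage):
        # Branch on the single storage option the caller selected for us.
        if isinstance(storage, BucketStorage):
            return "bucket"  # the real adapter would open an fsspec-backed store here
        return "directory"   # the real adapter would open a local DirectoryStore here
```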
```python
zarr_obj = zarr.open(
    path_from_uri(data_source.assets[0].data_uri)
)  # Group or Array
parsed = urlparse(data_source.assets[0].data_uri)
```
Take a look at what `SQLAdapter` does here. If this is bucket storage, we need to get the credentials for this bucket, similarly to how `SQLAdapter` does. The credentials are not in the database, but they are in the config, and we need to match the URL of the bucket store with one in the config.
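One way that lookup could be sketched, assuming configured buckets arrive as dicts carrying `uri`, `key`, and `secret`. The function name, the dict shape, and the prefix-matching strategy are all assumptions for illustration, not tiled's API:

```python
from urllib.parse import urlparse


def lookup_bucket_credentials(data_uri, configured_buckets):
    """Return the configured bucket entry whose URI is a prefix of data_uri.

    configured_buckets: list of dicts like
        {"uri": "s3://bucket/prefix", "key": ..., "secret": ...}
    """
    parsed = urlparse(data_uri)
    for bucket in configured_buckets:
        configured = urlparse(bucket["uri"])
        # Match host (bucket name) exactly and path by prefix.
        if parsed.netloc == configured.netloc and parsed.path.startswith(
            configured.path
        ):
            return bucket  # carries the key/secret from config
    raise LookupError(f"No configured credentials match {data_uri}")
```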
Lines 71 to 74 in f3331ef
```python
storage = parse_storage(data_uri)
if isinstance(storage, SQLStorage):
    # Obtain credentials
    data_uri = cast(SQLStorage, get_storage(data_uri)).authenticated_uri
```
```diff
@@ -733,7 +735,8 @@ def path_from_uri(uri) -> Path:
         path = Path(parsed.path[1:])
     else:
         raise ValueError(
             "Supported schemes are 'file', 'sqlite', and 'duckdb'. "
```
No change to this function. Now, unlike when we started this PR, `path_from_uri` is only ever called on local filesystem paths.
```diff
@@ -92,6 +109,8 @@ def parse_storage(item: Union[Path, str]) -> Storage:
         result = FileStorage(item)
     elif scheme == "postgresql":
         result = SQLStorage(item)
+    elif scheme == "bucket":
```
There won't be a `bucket` scheme. Instead, this function must be extended to accept `dict` (a YAML object from config). While we're here, we might accept a dict for PostgreSQL as well as for Blob/Bucket.

We currently accept SQL creds like this only:

```yaml
- postgresql://username:password@host:port/database
```

but we could additionally accept a more structured input:

```yaml
- uri: postgresql://host:port/database
  username: username
  password: password
```

And buckets would of course be similar.
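A hedged sketch of that extension: `parse_storage` accepting either a URI string (back-compat) or a structured dict from YAML config. The dataclass fields and the assumption that bucket URIs use an `s3` scheme are illustrative; tiled's real classes differ:

```python
from dataclasses import dataclass
from typing import Optional, Union
from urllib.parse import urlparse


@dataclass(frozen=True)
class SQLStorage:
    uri: str
    username: Optional[str] = None
    password: Optional[str] = None


@dataclass(frozen=True)
class BucketStorage:
    uri: str
    key: Optional[str] = None
    secret: Optional[str] = None


def parse_storage(item: Union[str, dict]):
    """Accept a plain URI string or a structured dict from config."""
    if isinstance(item, dict):
        uri = item["uri"]
        scheme = urlparse(uri).scheme
        if scheme == "postgresql":
            return SQLStorage(uri, item.get("username"), item.get("password"))
        if scheme == "s3":  # assumed scheme for blob/bucket storage
            return BucketStorage(uri, item.get("key"), item.get("secret"))
        raise ValueError(f"Unsupported scheme in config dict: {scheme!r}")
    # Back-compat: bare URI string with inline credentials.
    if urlparse(item).scheme == "postgresql":
        return SQLStorage(item)
    raise ValueError(f"Unsupported storage: {item!r}")
```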
Created to resolve #905
Blob (Binary Large Object) storage is a cloud-native storage model with many advantages over traditional file systems, allowing it to handle larger volumes of data in containers with less stringent size limitations.

Zarr has added support for blob storage through fsspec and, in turn, through s3fs, a library that implements cloud-based connections more succinctly.

This PR creates a `blob` storage dataclass for tiled, which alongside `filesystem` and `sql` will provide a location where data can be written to and read from. Unlike the presently available storage options, blobs require `uri`, `key`, and `secret` in order to connect to a datasource, omitting the last two parameters if said datasource is public.