
🪣 Support for Blob Storage #955


Draft · wants to merge 23 commits into main
Conversation

@Kezzsim (Contributor) commented Apr 11, 2025

Created to resolve #905
Blob (Binary Large Object) storage is a cloud-native storage model with several advantages over traditional file systems: it can handle larger volumes of data in containers (buckets) with less stringent size limitations.

Zarr has added support for blob storage through fsspec and, more directly, through s3fs, a library that implements cloud-based connections more succinctly.
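For illustration only, this is roughly how a Zarr store on a bucket can be opened through s3fs with zarr's v2-style API (as used elsewhere in this PR); the bucket path and credentials are placeholders, not values from this PR:

import s3fs
import zarr

# Pass key/secret for a private bucket; use anon=True for a public one.
fs = s3fs.S3FileSystem(key="KEY", secret="SECRET")
store = s3fs.S3Map(root="example-bucket/path/to/group", s3=fs)
zarr_obj = zarr.open(store)  # Group or Array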

This PR creates a blob storage dataclass for tiled which, alongside the filesystem and SQL options, provides a location where data can be written and read. Unlike the presently available storage options, blob storage requires a uri, key, and secret in order to connect to a data source; the last two parameters may be omitted if the data source is public.
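A minimal sketch of such a dataclass, assuming the field names suggested above (the actual definition in this PR may differ):

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BucketStorage:
    uri: str
    key: Optional[str] = None     # may be omitted for public buckets
    secret: Optional[str] = None  # may be omitted for public buckets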

@Kezzsim changed the title from 🧊 Support for Blob Storage to 🪣 Support for Blob Storage on Apr 17, 2025
directory = path_from_uri(data_uri)
directory.mkdir(parents=True, exist_ok=True)
store = zarr.storage.DirectoryStore(str(directory))
if storage.bucket:
Member commented:

In the old model, the adapter was handed multiple storage options and had to pick the one it wanted. Now, the caller in tiled.catalog.adapter picks one storage option and passes just that one in.

adapter = STORAGE_ADAPTERS_BY_MIMETYPE[data_source.mimetype]
# Choose writable storage. Use the first writable storage item
# with a scheme that is supported by this adapter. For
# back-compat, if an adapter does not declare `supported_storage`,
# assume it supports file-based storage only.
supported_storage = getattr(
    adapter, "supported_storage", {FileStorage}
)
for storage in self.context.writable_storage:
    if isinstance(storage, tuple(supported_storage)):
        break
else:
    raise RuntimeError(
        f"The adapter {adapter} supports storage types "
        f"{[cls.__name__ for cls in supported_storage]} "
        "but the only available storage types "
        f"are {self.context.writable_storage}."
    )
data_source = await ensure_awaitable(
    adapter.init_storage,
    storage,
    data_source,
    self.segments + [key],
)

So, the task here is to check isinstance(storage, BucketStorage) versus FileStorage. Additionally, the supported_storage attribute on this class should be extended to include BucketStorage. This tells the caller to offer BucketStorage if that is the highest-priority item in writable_storage.
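A hedged sketch of the adapter-side change described above (ZarrAdapter stands in for the real class; import paths and attribute names are assumptions, not code from this PR):

from tiled.storage import FileStorage  # assumed import paths
from tiled.utils import path_from_uri

# BucketStorage: the new dataclass proposed in this PR

class ZarrAdapter:
    # Declaring both types tells the caller it may offer BucketStorage
    # when that is the highest-priority item in writable_storage.
    supported_storage = {FileStorage, BucketStorage}

    @classmethod
    def init_storage(cls, storage, data_source, path_parts):
        if isinstance(storage, BucketStorage):
            ...  # open a Zarr store on the bucket, e.g. via s3fs
        else:  # FileStorage, the back-compat default
            directory = path_from_uri(storage.uri).joinpath(*path_parts)
            directory.mkdir(parents=True, exist_ok=True)
            ...  # create a zarr.storage.DirectoryStore, as before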

zarr_obj = zarr.open(
    path_from_uri(data_source.assets[0].data_uri)
)  # Group or Array
parsed = urlparse(data_source.assets[0].data_uri)
Member commented:

Take a look at what SQLAdapter does here. If this is bucket storage, we need to get the credentials for this bucket, similarly to how SQLAdapter does. The credentials are not in the database, but they are in the config, and we need to match the URL of the bucket store with one in the config.

Member commented:

storage = parse_storage(data_uri)
if isinstance(storage, SQLStorage):
    # Obtain credentials
    data_uri = cast(SQLStorage, get_storage(data_uri)).authenticated_uri
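A hypothetical bucket analogue of the same pattern (BucketStorage and its credential attributes mirror SQLStorage here and are assumptions, not code from this PR):

from typing import cast

import s3fs

storage = parse_storage(data_uri)
if isinstance(storage, BucketStorage):
    # Obtain credentials from the matching entry in the config
    bucket = cast(BucketStorage, get_storage(data_uri))
    fs = s3fs.S3FileSystem(key=bucket.key, secret=bucket.secret)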

@@ -733,7 +735,8 @@ def path_from_uri(uri) -> Path:
        path = Path(parsed.path[1:])
    else:
        raise ValueError(
            "Supported schemes are 'file', 'sqlite', and 'duckdb'. "
Member commented:

No change to this function. Now, unlike when we started this PR, path_from_uri is only ever called on local filesystem paths.

@@ -92,6 +109,8 @@ def parse_storage(item: Union[Path, str]) -> Storage:
        result = FileStorage(item)
    elif scheme == "postgresql":
        result = SQLStorage(item)
    elif scheme == "bucket":
Member commented:

There won't be a bucket scheme. Instead, this function must be extended to accept a dict (a YAML object from config). While we're here, we might accept a dict for PostgreSQL as well as for Blob/Bucket.

We currently accept SQL credentials in this form only:

- postgresql://username:password@host:port/database

but we could additionally accept a more structured input:

- uri: postgresql://host:port/database
  username: username
  password: password

And buckets would of course be similar.
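A hedged sketch of what the dict-accepting branch could look like (the constructor keyword arguments are illustrative; SQLStorage and BucketStorage may take credentials differently):

from pathlib import Path
from typing import Union
from urllib.parse import urlparse

def parse_storage(item: Union[Path, str, dict]) -> Storage:
    if isinstance(item, dict):
        # Structured form from config: a uri plus optional credentials.
        uri = item["uri"]
        scheme = urlparse(uri).scheme
        if scheme == "postgresql":
            return SQLStorage(uri, username=item.get("username"),
                              password=item.get("password"))
        # Otherwise treat it as bucket (blob) storage.
        return BucketStorage(uri, key=item.get("key"),
                             secret=item.get("secret"))
    ...  # existing Path/str handling unchanged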
