🪣 Support for Blob Storage #955


Draft: wants to merge 23 commits into main

Commits (23)
b9f402c
✍️ Define Blob (minimum)
Kezzsim Mar 28, 2025
bafff0c
✅ add `ensure_dict` to validate blob storage inputs
Kezzsim Apr 11, 2025
21a2457
Merge branch 'bluesky:main' into blob-tsar
Kezzsim Apr 11, 2025
e683956
🛃 `key` and `secret` not required for public datasets
Kezzsim Apr 11, 2025
54797ae
Merge branch 'bluesky:main' into blob-tsar
Kezzsim Apr 17, 2025
f6535fc
💥 Zarr `storage` class breakout condition
Kezzsim Apr 17, 2025
b5c6a0f
📂 `FsspecStore` implementation from Zarr docs
Kezzsim Apr 17, 2025
6f1488e
🪣 Zarr 2.8 doesn't have `FSspec` yet, use `s3fs`
Kezzsim Apr 17, 2025
decdc81
🪣 Add example config, rename `blob` to `bucket` in most places.
Kezzsim Apr 17, 2025
cd8f110
🧹 Various suggestions
Kezzsim Apr 21, 2025
e885a0d
🧵 Handle scenarios where `readable_storage` is not `str`
Kezzsim Apr 21, 2025
0c87737
✍️ Successfully write data!
Kezzsim Apr 22, 2025
3518072
🛅 Require preexisting bucket to be defined
Kezzsim Apr 23, 2025
e6d6f07
Merge branch 'bluesky:main' into blob-tsar
Kezzsim Apr 28, 2025
9f6d773
🚧 Prepare to rebase
Kezzsim Apr 29, 2025
de65c4d
Merge branch 'bluesky:main' into blob-tsar
Kezzsim Apr 29, 2025
99c1215
Merge branch 'bluesky:main' into blob-tsar
Kezzsim May 1, 2025
909ed03
🐳 Add `docker-compose` to example, write a `README.md`
Kezzsim May 1, 2025
e35e6ac
🪀 Make rebasing possible
Kezzsim May 5, 2025
3b95d80
🎏 Overwrite with upstream
Kezzsim May 5, 2025
643d39e
Merge branch 'main' into blob-tsar
Kezzsim May 5, 2025
511cf00
🎯 Put my changes back
Kezzsim May 5, 2025
2837cb4
Merge branch 'bluesky:main' into blob-tsar
Kezzsim May 9, 2025
18 changes: 18 additions & 0 deletions example_configs/bucket_storage/README.md
@@ -0,0 +1,18 @@
# Create a local bucket for testing access to BLOBs

In this example there are:
- A `docker-compose.yml` file capable of instantiating and running a [Minio](https://min.io/) container.
- A configuration file, `s3_style_storage.yml`, which contains the information Tiled needs to authenticate with the bucket storage system and write/read Binary Large Objects (BLOBs) through the Zarr adapter.

## How to run this example:
1. In one terminal window, navigate to the directory containing `docker-compose.yml` and `s3_style_storage.yml`.
2. Run `docker compose up` with adequate permissions.
3. Create a `storage` directory in `example_configs/bucket_storage`; the SQLite catalog database will be written there.
4. Open another terminal window in the same location and run `tiled serve config s3_style_storage.yml --api-key secret`.
5. Start an `ipython` session and run the following commands to write array data as a BLOB in the bucket:
```python
from tiled.client import from_uri
c = from_uri('http://localhost:8000', api_key='secret')
c.write_array([1,2,3])
```
6. You will be able to see the written data in the bucket if you log in to the Minio console, exposed on your machine at `http://localhost:9001/login`. Use the testing credential `minioadmin` for both fields.
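
To verify the round trip from the client side, you can read the array back through Tiled. A minimal sketch (it assumes, per the Tiled client API, that `write_array` returns a client node whose `read()` fetches the stored data):

```python
from tiled.client import from_uri

c = from_uri('http://localhost:8000', api_key='secret')
node = c.write_array([1, 2, 3])  # stored as a BLOB in the bucket
assert (node.read() == [1, 2, 3]).all()  # round-trips through Minio
```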
31 changes: 31 additions & 0 deletions example_configs/bucket_storage/docker-compose.yml
@@ -0,0 +1,31 @@
version: "3.2"
services:
  minio:
    image: minio/minio:latest
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - minio-data:/data
    environment:
      MINIO_ROOT_USER: "minioadmin"
      MINIO_ROOT_PASSWORD: "minioadmin"
    command: server /data --console-address :9001
    restart: unless-stopped

  create-bucket:
    image: minio/mc:latest
    environment:
      MC_HOST_minio: http://minioadmin:minioadmin@minio:9000
    entrypoint:
      - sh
      - -c
      - |
        until mc ls minio > /dev/null 2>&1; do
          sleep 0.5
        done

        mc mb --ignore-existing minio/buck

volumes:
  minio-data:
13 changes: 13 additions & 0 deletions example_configs/bucket_storage/s3_style_storage.yml
@@ -0,0 +1,13 @@
authentication:
  allow_anonymous_access: false
trees:
  - path: /
    tree: catalog
    args:
      uri: "sqlite:///storage/catalog.db"
      writable_storage:
        bucket:
          uri: "http://localhost:9000/buck"
          key: "minioadmin"
          secret: "minioadmin"
      init_if_not_exists: true
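
For reference, the `bucket` entry above maps onto the `BucketStorage` dataclass added in `tiled/storage.py` later in this diff. A sketch of the correspondence (assuming the structured-dict parsing discussed in the review comments below):

```python
from tiled.storage import BucketStorage  # added in this PR

# Equivalent of the `bucket` block in s3_style_storage.yml:
storage = BucketStorage(
    uri="http://localhost:9000/buck",
    key="minioadmin",
    secret="minioadmin",
)
```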
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -93,6 +93,7 @@ all = [
"python-jose[cryptography]",
"python-multipart",
"rich",
"s3fs",
"sparse",
"sqlalchemy[asyncio] >=2",
"stamina",
@@ -199,6 +200,7 @@ minimal-server = [
"python-dateutil",
"python-jose[cryptography]",
"python-multipart",
"s3fs",
"sqlalchemy[asyncio] >=2",
"starlette >=0.38.0",
"uvicorn[standard]",
@@ -242,6 +244,7 @@ server = [
"python-dateutil",
"python-jose[cryptography]",
"python-multipart",
"s3fs",
"sparse",
"sqlalchemy[asyncio] >=2",
"starlette >=0.38.0",
38 changes: 27 additions & 11 deletions tiled/adapters/zarr.py
@@ -3,8 +3,9 @@
import os
from collections.abc import Mapping
from typing import Any, Iterator, List, Optional, Tuple, Union, cast
from urllib.parse import quote_plus
from urllib.parse import quote_plus, urlparse

import s3fs
import zarr.core
import zarr.hierarchy
import zarr.storage
@@ -49,16 +50,28 @@ def init_storage(

"""
data_source = copy.deepcopy(data_source) # Do not mutate caller input.
data_uri = storage.uri + "".join(
f"/{quote_plus(segment)}" for segment in path_parts
)
# Zarr requires evenly-sized chunks within each dimension.
# Use the first chunk along each dimension.
zarr_chunks = tuple(dim[0] for dim in data_source.structure.chunks)
shape = tuple(dim[0] * len(dim) for dim in data_source.structure.chunks)
directory = path_from_uri(data_uri)
directory.mkdir(parents=True, exist_ok=True)
store = zarr.storage.DirectoryStore(str(directory))
if storage.bucket:
Review comment (Member):
In the old model, the adapter was handed multiple storage options and had to pick the one it wanted. Now, the caller in `tiled.catalog.adapter` picks one storage option and passes just that one in.

```python
adapter = STORAGE_ADAPTERS_BY_MIMETYPE[data_source.mimetype]
# Choose writable storage. Use the first writable storage item
# with a scheme that is supported by this adapter. For
# back-compat, if an adapter does not declare `supported_storage`,
# assume it supports file-based storage only.
supported_storage = getattr(
    adapter, "supported_storage", {FileStorage}
)
for storage in self.context.writable_storage:
    if isinstance(storage, tuple(supported_storage)):
        break
else:
    raise RuntimeError(
        f"The adapter {adapter} supports storage types "
        f"{[cls.__name__ for cls in supported_storage]} "
        "but the only available storage types "
        f"are {self.context.writable_storage}."
    )
data_source = await ensure_awaitable(
    adapter.init_storage,
    storage,
    data_source,
    self.segments + [key],
)
```

So, the task here is to check `isinstance(storage, BucketStorage)` versus `FileStorage`. Additionally, the `supported_storage` attribute on this class should be extended to include `BucketStorage`. This tells the caller to offer `BucketStorage` if that is the highest-priority item in `writable_storage`.
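
A minimal sketch of the adapter side of that contract (the class body and `init_storage` signature here are illustrative, not the PR's final code):

```python
from tiled.storage import BucketStorage, FileStorage


class ZarrArrayAdapter:
    # Declaring both storage types tells the caller it may pass this
    # adapter BucketStorage when that is the highest-priority item in
    # writable_storage.
    supported_storage = {FileStorage, BucketStorage}

    def init_storage(self, storage, data_source, path_parts):
        if isinstance(storage, BucketStorage):
            ...  # build an s3fs-backed store, as in the diff below
        else:
            ...  # make a local directory and a zarr DirectoryStore
```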

        data_uri = storage.bucket.uri
        s3 = s3fs.S3FileSystem(
            client_kwargs={"endpoint_url": data_uri},
            key=storage.bucket.key,
            secret=storage.bucket.secret,
            use_ssl=False,
        )
        store = s3fs.S3Map(
            root="".join(f"/{quote_plus(segment)}" for segment in path_parts), s3=s3
        )
    else:
        data_uri = storage.get("filesystem") + "".join(
            f"/{quote_plus(segment)}" for segment in path_parts
        )
        directory = path_from_uri(data_uri)
        directory.mkdir(parents=True, exist_ok=True)
        store = zarr.storage.DirectoryStore(str(directory))
    zarr.storage.init_array(
        store,
        shape=shape,
@@ -365,9 +378,12 @@ def from_catalog(
    /,
    **kwargs: Optional[Any],
) -> Union[ZarrGroupAdapter, ArrayAdapter]:
    zarr_obj = zarr.open(
        path_from_uri(data_source.assets[0].data_uri)
    )  # Group or Array
    parsed = urlparse(data_source.assets[0].data_uri)
Review comment (Member):
Take a look at what `SQLAdapter` does here. If this is bucket storage, we need to get the credentials for this bucket, similarly to how `SQLAdapter` does. The credentials are not in the database, but they are in the config, and we need to match the URL of the bucket store with one in the config.

Review comment (Member):
```python
storage = parse_storage(data_uri)
if isinstance(storage, SQLStorage):
    # Obtain credentials
    data_uri = cast(SQLStorage, get_storage(data_uri)).authenticated_uri
```
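
An analogous lookup for bucket storage might look like this (a sketch; it assumes `data_uri` from the surrounding code, and that the bucket's credentials were registered from the server config and can be matched by URI through `get_storage`):

```python
storage = parse_storage(data_uri)
if isinstance(storage, BucketStorage):
    # The key/secret are not in the catalog database; recover them from
    # the registered storage whose URI matches this bucket.
    bucket = cast(BucketStorage, get_storage(data_uri))
    key, secret = bucket.key, bucket.secret
```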

    if parsed.scheme in {"http", "https", "s3"}:
        uri = data_source.assets[0].data_uri
    else:
        uri = path_from_uri(data_source.assets[0].data_uri)
    zarr_obj = zarr.open(uri)  # Group or Array
    if node.structure_family == StructureFamily.container:
        return ZarrGroupAdapter(
            zarr_obj,
@@ -394,4 +410,4 @@ def from_uris(
        return ZarrGroupAdapter(zarr_obj, **kwargs)
    else:
        structure = ArrayStructure.from_array(zarr_obj)
        return ZarrArrayAdapter(zarr_obj, structure=structure, **kwargs)
21 changes: 20 additions & 1 deletion tiled/storage.py
@@ -9,6 +9,7 @@
__all__ = [
    "EmbeddedSQLStorage",
    "FileStorage",
    "BucketStorage",
    "SQLStorage",
    "Storage",
    "get_storage",
@@ -19,6 +20,7 @@
@dataclasses.dataclass(frozen=True)
class Storage:
    "Base class for representing storage location"

    uri: str

    def __post_init__(self):
@@ -34,6 +36,20 @@ def path(self):
        return path_from_uri(self.uri)

@dataclasses.dataclass(frozen=True)
class BucketStorage:
    "Bucket storage location for BLOBs"

    uri: str
    key: Optional[str]
    secret: Optional[str]

    def __post_init__(self):
        object.__setattr__(self, "uri", ensure_uri(self.uri))
        parsed_uri = urlparse(self.uri)
        if not parsed_uri.path or parsed_uri.path == "/":
            raise ValueError(f"URI must contain a path attribute: {self.uri}")
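
For instance, the `__post_init__` validation above accepts a URI whose path names a bucket and rejects one without a path (a sketch of the behavior, not a test from this PR):

```python
from tiled.storage import BucketStorage

# Accepted: the path names the bucket ("buck").
BucketStorage("http://localhost:9000/buck", key=None, secret=None)

# Rejected: no path, so __post_init__ raises ValueError.
BucketStorage("http://localhost:9000", key=None, secret=None)
```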


@dataclasses.dataclass(frozen=True)
class EmbeddedSQLStorage(Storage):
    "File-based SQL database storage location"
@@ -42,6 +58,7 @@ class EmbeddedSQLStorage(Storage):
@dataclasses.dataclass(frozen=True)
class SQLStorage(Storage):
    "File-based SQL database storage location"

    username: Optional[str] = None
    password: Optional[str] = None

@@ -92,6 +109,8 @@ def parse_storage(item: Union[Path, str]) -> Storage:
        result = FileStorage(item)
    elif scheme == "postgresql":
        result = SQLStorage(item)
    elif scheme == "bucket":
Review comment (Member):
There won't be a `bucket` scheme. Instead, this function must be extended to accept a dict (a YAML object from config). While we're here, we might accept a dict for PostgreSQL as well as for Blob/Bucket.

We currently accept SQL credentials like this only:

```yaml
- postgresql://username:password@host:port/database
```

but we could additionally accept a more structured input:

```yaml
- uri: postgresql://host:port/database
  username: username
  password: password
```

And buckets would of course be similar.
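
A sketch of how `parse_storage` could be extended to accept that structured form, written as if inside `tiled/storage.py` (the dict branch is illustrative and relies on the `BucketStorage` and `SQLStorage` constructors shown in this diff):

```python
def parse_storage(item: Union[Path, str, dict]) -> Storage:
    if isinstance(item, dict):
        # Structured config input, e.g. {"uri": ..., "key": ..., "secret": ...}
        uri = ensure_uri(item["uri"])
        scheme = urlparse(uri).scheme
        if scheme in {"http", "https", "s3"}:
            return BucketStorage(
                uri, key=item.get("key"), secret=item.get("secret")
            )
        if scheme == "postgresql":
            return SQLStorage(
                uri, username=item.get("username"), password=item.get("password")
            )
        raise ValueError(f"Unsupported scheme in storage config: {scheme!r}")
    # ... fall through to the existing URI-string handling ...
```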

        result = BucketStorage(item)
    elif scheme in {"sqlite", "duckdb"}:
        result = EmbeddedSQLStorage(item)
    else:
@@ -112,4 +131,4 @@

def get_storage(uri: str) -> Storage:
    "Look up Storage by URI."
    return _STORAGE[uri]
5 changes: 4 additions & 1 deletion tiled/utils.py
@@ -574,11 +574,13 @@ class UnsupportedQueryType(TypeError):

class Conflicts(Exception):
    "Prompts the server to send 409 Conflicts with message"

    pass


class BrokenLink(Exception):
    "Prompts the server to send 410 Gone with message"

    pass


@@ -733,7 +735,8 @@ def path_from_uri(uri) -> Path:
        path = Path(parsed.path[1:])
    else:
        raise ValueError(
            "Supported schemes are 'file', 'sqlite', and 'duckdb'. "
Review comment (Member):
No change to this function. Now, unlike when we started this PR, `path_from_uri` is only ever called on local filesystem paths.

"Supported schemes are 'file', 'sqlite', and 'duckdb'."
"For bucket storage, 'http', 'https', and 's3' are supported."
f"Did not recognize scheme {parsed.scheme!r}"
)
return path