@genematx (Contributor) commented Mar 24, 2025:

This adds the ability to create and download "virtual" datasets from the contents of Consolidated containers. The data is collected server-side and presented to the client as an Xarray; no new nodes or data sources are created in the process.

Usage example:

import numpy as np
import pandas as pd

X = c.create_composite("test")

X.write_array(np.arange(10), key="time", specs=["xarray_coord"], dims=["time"])
X.write_array(np.random.randn(10), key="arr1", dims=["time"])
X.write_array(np.random.randn(10), key="arr2", dims=["time"])
df = pd.DataFrame({"colA": np.random.randn(10),
                   "colB": np.random.randint(0, 10, 10),
                   "colC": np.random.choice(["a", "b", "c", "d", "e"], 10)})
X.write_dataframe(df, key="tab1", metadata={"rows_dim": "time"})

Then, to create a dataset from all arrays in c['test'], one can use the .to_dataset() method and call .read() on the resulting DatasetClient to fetch the data as an xarray:

In: X
Out: <Composite {'time', 'arr1', 'arr2', 'colA', 'colB', 'colC'}>

In: ds = X.to_dataset()

In: ds
Out: <DatasetClient ['time', 'arr1', 'arr2', 'colB', 'colA', 'colC']>

In: ds.read()
Out:
<xarray.Dataset> Size: 480B
Dimensions:  (time: 10)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
Data variables:
    arr1     (time) float64 80B 0.787 0.4187 0.4231 ... 0.5161 1.222 -0.2083
    arr2     (time) float64 80B -1.1 0.5923 -1.155 -1.166 ... -1.562 1.416 2.133
    colB     (time) int64 80B 0 5 1 6 8 0 0 5 7 8
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    colC     (time) object 80B 'b' 'a' 'b' 'e' 'a' 'a' 'd' 'b' 'c' 'e'

To build a dataset from a subset of the consolidated contents, pass their keys to the .to_dataset() method.
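
For example, to build a dataset from only some of the keys written above (ds_sub is just an illustrative name):

ds_sub = X.to_dataset("time", "arr1", "colA")
ds_sub.read()   # an xarray.Dataset containing only these variables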

By default, all arrays included in the dataset are treated as xarray data variables; to mark any of them as a coordinate, set its specs to ["xarray_coord"], as done for the "time" dimension above. Since data arrays stored as table columns cannot be assigned specs individually, marking a column as a dataset coordinate is instead accomplished by setting the "column_specs" key in the table metadata, e.g. "column_specs": {"colA": ["xarray_coord"]}.
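
For instance, a sketch of writing the same table with one of its columns marked as a coordinate, combining the two metadata keys shown above (the key "tab2" is arbitrary):

X.write_dataframe(df, key="tab2", metadata={"rows_dim": "time",
                                            "column_specs": {"colA": ["xarray_coord"]}})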

Virtual datasets can include variables with multiple dimensions and dimensions with non-matching sizes, for example:

X.write_array(np.linspace(0, 1, 20), key="time_2x", specs=["xarray_coord"], dims=["time_2x"])
X.write_array(np.random.randn(20), key="arr3", dims=["time_2x"])
X.write_array(np.random.randn(20, 13, 17), key="img1", dims=["time_2x", "x", "y"])

In: X.to_dataset('time', 'time_2x', 'colA', 'arr3', 'img1').read()
Out:
<xarray.Dataset> Size: 36kB
Dimensions:  (time: 10, time_2x: 20, x: 13, y: 17)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * time_2x  (time_2x) float64 160B 0.0 0.05263 0.1053 ... 0.8947 0.9474 1.0
Dimensions without coordinates: x, y
Data variables:
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    arr3     (time_2x) float64 160B 0.4221 0.02674 0.3115 ... -0.8132 -0.1114
    img1     (time_2x, x, y) float64 35kB 2.165 -0.5372 ... -0.1032 -1.558

Finally, it is possible to align the variable sizes along the leftmost dimension (align="resample" or align="zip_shortest") when building the dataset, i.e. before downloading the xarray.

ds = X.to_dataset('time', 'time_2x', 'colA', 'arr3', 'img1', align='resample')

In: ds.read()
Out:
<xarray.Dataset> Size: 18kB
Dimensions:  (time: 10, time_2x: 10, x: 13, y: 17)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * time_2x  (time_2x) float64 80B 0.0 0.1053 0.2105 ... 0.7368 0.8421 0.9474
Dimensions without coordinates: x, y
Data variables:
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    arr3     (time_2x) float64 80B 0.4221 0.3115 -1.195 ... 0.8859 -0.8132
    img1     (time_2x, x, y) float64 18kB 2.165 -0.5372 -0.5396 ... -1.05 0.709

In: ds['arr3']
Out: <ArrayClient shape=(10,) chunks=((10,),) dtype=float64 dims=('time_2x',)>
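
Since ds['arr3'] is an ordinary ArrayClient, a single variable can also be read on its own, without downloading the whole dataset:

arr3 = ds['arr3'].read()   # a NumPy array of shape (10,)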

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

def links_for_composite(structure_family, structure, base_url, path_str):
    links = {}
    links["full"] = f"{base_url}/composite/full/{path_str}"
    links["meta"] = f"{base_url}/composite/meta/{path_str}"
Contributor Author:

Is there possibly a better name than /composite/meta/? Also, I'm not 100% sure that /composite/full/ (which returns a DatasetClient) is consistent with /container/full/.

("colD", tab2["colD"]),
("colE", tab2["colE"]),
("colF", tab2["colF"]),
# ("colG", tab3["colG"]),
Member:

Is this running into trouble due to types? If so, #941 will fix it, once that's in.

Contributor Author:

No, due to setting up a different type of storage in tests.

# ("colG", tab3["colG"]),
# ("colH", tab3["colH"]),
# ("colI", tab3["colI"]),
# ("sps1", sps1.todense()),
Member:

Do these work?

Contributor Author:

Not yet.

x = from_context(context)["x"]
ds = x.to_dataset()
actual = ds[name].read()
assert np.array_equal(actual, expected.squeeze())
Member:

Where does the inconsistency arise that makes squeeze() necessary?

Contributor Author:

It arises for arrays with shapes (10, 1) and (20, 1): when creating an xarray, I assumed that trailing ones in the shape could be dropped (to avoid declaring another coordinate).
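
For example (illustrative only):

import numpy as np

# An array stored with shape (10, 1): dropping the trailing singleton axis
# leaves shape (10,), matching the squeeze() in the test above.
arr = np.random.randn(10, 1)
assert np.squeeze(arr, axis=-1).shape == (10,)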

assert set(xarr.dims) == {"time_1x", "time_2x", "x", "y"}
assert set(xarr.data_vars) == {"arr1", "arr3", "img1", "img2", "colA", "colD"}

# Revert the metadata changes
Member:

Maybe this should live in a finally block.

Contributor Author:

I like the idea!

query_dict = {"align": align} if align else {}
# Encode the parts in the query string
if parts is not None:
    query_dict["code"] = await entry.encode_keys(parts)
Member:

This encoding is severely opaque, and seems unusual in a JSON API. Could we just forgo the link altogether? The links are a convenience, but not a requirement. We don't provide them for the POST endpoints, for example.


        return sorted(all_keys)

    async def encode_keys(self, keys):
Member:

I'm suspicious about this. I'll comment where it's used, below. :-D

    Name of the first (leftmost) dimension. Default is 'time'.
align : str, optional
    If not None, align the arrays in the dataset. Options are:
    - 'zip_shortest': Trim all arrays to the length of the shortest one.
Member:

I still feel some reservations about signing the server up to do this much computing. Partly for the load, which can be mitigated by limiting this feature to smaller datasets, but partly for the complexity of the API. Of these three, zip_shortest is unambiguous, but it seems to me that padding and resampling invite future parameterization: "Pad with what?" and "Resample with what options?"

Would it be acceptable to start with only zip_shortest and take a little more time to consider the implications of the others?
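
For reference, 'zip_shortest' as documented above amounts to roughly the following sketch (aligning along the leftmost dimension only; the names here are made up):

import numpy as np

arrays = {"arr3": np.random.randn(20), "img1": np.random.randn(20, 13, 17)}
# Trim every array to the shortest length along its first axis.
n = min(a.shape[0] for a in arrays.values())
aligned = {key: a[:n, ...] for key, a in arrays.items()}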

keys=None,
return_adapters=False,
adapter_keys=None,
default_dim0="time",
Member:

This is a bluesky assumption. The bluesky-tiled-plugins can provide this; Tiled should not elevate any particular string as a default.

@danielballan (Member) commented:

We had considered this critical for representing Bluesky data. We concluded that we can use a simpler approach, so this is not needed.

We do not rule out adding support for something like this in the future, but we are not currently pursuing it.
