@genematx (Contributor) commented Mar 24, 2025:

This adds the ability to create and download "virtual" datasets from the contents of Consolidated containers. The data is collected server-side and presented to the client as an Xarray; no new nodes or data sources are created in the process.

Usage example:

import numpy as np
import pandas as pd

X = c.create_composite("test")

X.write_array(np.arange(10), key="time", specs=["xarray_coord"], dims=["time"])
X.write_array(np.random.randn(10), key="arr1", dims=["time"])
X.write_array(np.random.randn(10), key="arr2", dims=["time"])
df = pd.DataFrame({"colA": np.random.randn(10),
                   "colB": np.random.randint(0, 10, 10),
                   "colC": np.random.choice(["a", "b", "c", "d", "e"], 10)})
X.write_dataframe(df, key="tab1", metadata={"rows_dim": "time"})

Then, to create a dataset from all arrays in c['test'], one can use the .to_dataset() method and call .read() on the resulting DatasetClient to fetch the data as an xarray:

In: X
Out: <Composite {'time', 'arr1', 'arr2', 'colA', 'colB', 'colC'}>

In: ds = X.to_dataset()

In: ds
Out: <DatasetClient ['time', 'arr1', 'arr2', 'colB', 'colA', 'colC']>

In: ds.read()
Out:
<xarray.Dataset> Size: 480B
Dimensions:  (time: 10)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
Data variables:
    arr1     (time) float64 80B 0.787 0.4187 0.4231 ... 0.5161 1.222 -0.2083
    arr2     (time) float64 80B -1.1 0.5923 -1.155 -1.166 ... -1.562 1.416 2.133
    colB     (time) int64 80B 0 5 1 6 8 0 0 5 7 8
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    colC     (time) object 80B 'b' 'a' 'b' 'e' 'a' 'a' 'd' 'b' 'c' 'e'

To build a dataset from a subset of the consolidated contents, pass their keys to the .to_dataset() method.
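
For example, to build a dataset from only some of the keys written above (ds_sub is just an illustrative name):

ds_sub = X.to_dataset("time", "arr1", "colA")
ds_sub.read()   # an xarray.Dataset containing only these variables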

By default, all arrays included in the dataset are treated as xarray data variables; to mark any of them as a coordinate, set its specs to ["xarray_coord"], as done for the "time" dimension above. Since data arrays stored as table columns cannot be assigned specs individually, marking a column as a dataset coordinate is instead accomplished by setting the "column_specs" key in the table metadata, e.g. "column_specs": {"colA": ["xarray_coord"]}.
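
For instance, a sketch of writing the same table with one of its columns marked as a coordinate, combining the two metadata keys shown above (the key "tab2" is arbitrary):

X.write_dataframe(df, key="tab2", metadata={"rows_dim": "time",
                                            "column_specs": {"colA": ["xarray_coord"]}})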

Virtual datasets can include variables with multiple dimensions and dimensions with non-matching sizes, for example:

X.write_array(np.linspace(0, 1, 20), key="time_2x", specs=["xarray_coord"], dims=["time_2x"])
X.write_array(np.random.randn(20), key="arr3", dims=["time_2x"])
X.write_array(np.random.randn(20, 13, 17), key="img1", dims=["time_2x", "x", "y"])

In: X.to_dataset('time', 'time_2x', 'colA', 'arr3', 'img1').read()
Out:
<xarray.Dataset> Size: 36kB
Dimensions:  (time: 10, time_2x: 20, x: 13, y: 17)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * time_2x  (time_2x) float64 160B 0.0 0.05263 0.1053 ... 0.8947 0.9474 1.0
Dimensions without coordinates: x, y
Data variables:
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    arr3     (time_2x) float64 160B 0.4221 0.02674 0.3115 ... -0.8132 -0.1114
    img1     (time_2x, x, y) float64 35kB 2.165 -0.5372 ... -0.1032 -1.558

Finally, it is possible to align the variable sizes along the leftmost dimension (align="resample" or align="zip_shortest") when building the dataset, i.e. before downloading the xarray.

ds = X.to_dataset('time', 'time_2x', 'colA', 'arr3', 'img1', align='resample')

In: ds.read()
Out:
<xarray.Dataset> Size: 18kB
Dimensions:  (time: 10, time_2x: 10, x: 13, y: 17)
Coordinates:
  * time     (time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * time_2x  (time_2x) float64 80B 0.0 0.1053 0.2105 ... 0.7368 0.8421 0.9474
Dimensions without coordinates: x, y
Data variables:
    colA     (time) float64 80B 1.816 0.5406 -0.2252 ... -0.03061 -0.5295 -1.278
    arr3     (time_2x) float64 80B 0.4221 0.3115 -1.195 ... 0.8859 -0.8132
    img1     (time_2x, x, y) float64 18kB 2.165 -0.5372 -0.5396 ... -1.05 0.709

In: ds['arr3']
Out: <ArrayClient shape=(10,) chunks=((10,),) dtype=float64 dims=('time_2x',)>
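
Since ds['arr3'] is an ordinary ArrayClient, a single variable can also be read on its own, without downloading the whole dataset:

arr3 = ds['arr3'].read()   # a NumPy array of shape (10,)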

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

def links_for_composite(structure_family, structure, base_url, path_str):
    links = {}
    links["full"] = f"{base_url}/composite/full/{path_str}"
    links["meta"] = f"{base_url}/composite/meta/{path_str}"
Contributor Author:

Is there possibly a better name than /composite/meta/? Also, I'm not 100% sure that /composite/full/ (which returns a DatasetClient) is consistent with /container/full/.

("colD", tab2["colD"]),
("colE", tab2["colE"]),
("colF", tab2["colF"]),
# ("colG", tab3["colG"]),
Member:

Is this running into trouble due to types? If so, #941 will fix it, once that's in.

Contributor Author:

No, due to setting up a different type of storage in tests.

# ("colG", tab3["colG"]),
# ("colH", tab3["colH"]),
# ("colI", tab3["colI"]),
# ("sps1", sps1.todense()),
Member:

Do these work?

Contributor Author:

Not yet.

x = from_context(context)["x"]
ds = x.to_dataset()
actual = ds[name].read()
assert np.array_equal(actual, expected.squeeze())
Member:

Where does the inconsistency arise that makes squeeze() necessary?

Contributor Author:

It arises for arrays with shapes (10, 1) and (20, 1): when creating an xarray, I assumed that trailing ones in the shape could be dropped (to avoid declaring another coordinate).
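
For example (illustrative only):

import numpy as np

# An array stored with shape (10, 1): dropping the trailing singleton axis
# leaves shape (10,), matching the squeeze() in the test above.
arr = np.random.randn(10, 1)
assert np.squeeze(arr, axis=-1).shape == (10,)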

assert set(xarr.dims) == {"time_1x", "time_2x", "x", "y"}
assert set(xarr.data_vars) == {"arr1", "arr3", "img1", "img2", "colA", "colD"}

# Revert the metadata changes
Member:

Maybe this should live in a finally block.

Contributor Author:

I like the idea!

query_dict = {"align": align} if align else {}
# Encode the parts in the query string
if parts is not None:
    query_dict["code"] = await entry.encode_keys(parts)
Member:

This encoding is severely opaque, and seems unusual in a JSON API. Could we just forgo the link altogether? The links are a convenience, but not a requirement. We don't provide them for the POST endpoints, for example.


        return sorted(all_keys)

    async def encode_keys(self, keys):
Member:

I'm suspicious about this. I'll comment where it's used, below. :-D

    Name of the first (leftmost) dimension. Default is 'time'.
align : str, optional
    If not None, align the arrays in the dataset. Options are:
    - 'zip_shortest': Trim all arrays to the length of the shortest one.
Member:

I still feel some reservations about signing the server up to do this much computing. Partly for the load, which can be mitigated by limiting this feature to smaller datasets, but partly for the complexity of the API. Of these three, zip_shortest is unambiguous, but it seems to me that padding and resampling invite future parameterization: "Pad with what?" and "Resample with what options?"

Would it be acceptable to start with only zip_shortest and take a little more time to consider the implications of the others?
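
For reference, 'zip_shortest' as documented above amounts to roughly the following sketch (aligning along the leftmost dimension only; the names here are made up):

import numpy as np

arrays = {"arr3": np.random.randn(20), "img1": np.random.randn(20, 13, 17)}
# Trim every array to the shortest length along its first axis.
n = min(a.shape[0] for a in arrays.values())
aligned = {key: a[:n, ...] for key, a in arrays.items()}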

keys=None,
return_adapters=False,
adapter_keys=None,
default_dim0="time",
Member:

This is a bluesky assumption. The bluesky-tiled-plugins can provide this; Tiled should not elevate any particular string as a default.

@danielballan (Member) commented:

We had considered this critical for representing Bluesky data. We concluded that we can use a simpler approach, so this is not needed.

We do not rule out adding support for something like this in the future, but we are not currently pursuing it.
