Virtual Datasets #931
Conversation
Co-authored-by: Dan Allan <[email protected]>
```python
def links_for_composite(structure_family, structure, base_url, path_str):
    links = {}
    links["full"] = f"{base_url}/composite/full/{path_str}"
    links["meta"] = f"{base_url}/composite/meta/{path_str}"
```
Possibly a better name than /composite/meta/? Also, I'm not 100% sure that /composite/full/ (which returns a DatasetClient) is consistent with /container/full/.
| ("colD", tab2["colD"]), | ||
| ("colE", tab2["colE"]), | ||
| ("colF", tab2["colF"]), | ||
| # ("colG", tab3["colG"]), |
Is this running into trouble due to types? If so, #941 will fix it, once that's in.
No, it's due to setting up a different type of storage in the tests.
| # ("colG", tab3["colG"]), | ||
| # ("colH", tab3["colH"]), | ||
| # ("colI", tab3["colI"]), | ||
| # ("sps1", sps1.todense()), |
Do these work?
Not yet.
```python
x = from_context(context)["x"]
ds = x.to_dataset()
actual = ds[name].read()
assert np.array_equal(actual, expected.squeeze())
```
Where does the inconsistency arise that makes squeeze() necessary?
For arrays with shapes (10, 1) and (20, 1). When creating an xarray, I assumed that trailing ones in the shape could be dropped (to avoid declaring another coordinate).
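For illustration, a minimal sketch of that shape handling (array values are made up):

```python
import numpy as np

a = np.ones((10, 1))  # array with a trailing singleton dimension
b = np.ones((20, 1))

# squeeze() drops the length-1 dimension, so no extra coordinate is
# needed when the arrays are assembled into an xarray.
assert a.squeeze().shape == (10,)
assert b.squeeze().shape == (20,)
```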
```python
assert set(xarr.dims) == {"time_1x", "time_2x", "x", "y"}
assert set(xarr.data_vars) == {"arr1", "arr3", "img1", "img2", "colA", "colD"}

# Revert the metadata changes
```
Maybe this should live in a `finally` block.
I like the idea!
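A hedged sketch of the suggested pattern, assuming a Tiled client node `x` whose `update_metadata` accepts a metadata dict (the key and value here are illustrative):

```python
# Save the original metadata, then guarantee the revert runs even if
# an assertion in the test body fails.
original = dict(x.metadata)
try:
    x.update_metadata(metadata={"temp_key": "temp_value"})
    # ... assertions against the modified metadata ...
finally:
    # Revert the metadata changes
    x.update_metadata(metadata=original)
```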
tiled/server/core.py (outdated)
| query_dict = {"align": align} if align else {} | ||
| # Encode the parts in the query sring | ||
| if parts is not None: | ||
| query_dict["code"] = await entry.encode_keys(parts) |
This encoding is severely opaque and seems unusual in a JSON API. Could we just forgo the link altogether? The links are a convenience, not a requirement. We don't provide them for the POST endpoints, for example.
```python
        return sorted(all_keys)

    async def encode_keys(self, keys):
```
I'm suspicious about this. I'll comment where it's used, below. :-D
```python
    Name of the first (leftmost) dimension. Default is 'time'.
align : str, optional
    If not None, align the arrays in the dataset. Options are:
    - 'zip_shortest': Trim all arrays to the length of the shortest one.
```
I still feel some reservations about signing the server up to do this much computing. Partly for the load, which can be mitigated by limiting this feature to smaller datasets, but partly for the complexity of the API. Of these three, zip_shortest is unambiguous, but it seems to me that padding and resampling invite future parameterization: "Pad with what?" and "Resample with what options?"
Would it be acceptable to start with only zip_shortest and take a little more time to consider the implications of the others?
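For reference, a minimal sketch of what zip_shortest alignment means, per the docstring above ("trim all arrays to the length of the shortest one"); the array names and lengths are made up:

```python
import numpy as np

# Hypothetical arrays of unequal length along the leftmost dimension.
arrays = {"a": np.arange(10), "b": np.arange(7), "c": np.arange(12)}

# 'zip_shortest': trim every array to the length of the shortest one.
n = min(arr.shape[0] for arr in arrays.values())
aligned = {name: arr[:n] for name, arr in arrays.items()}
assert all(arr.shape[0] == 7 for arr in aligned.values())
```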
```python
keys=None,
return_adapters=False,
adapter_keys=None,
default_dim0="time",
```
This is a bluesky assumption. The bluesky-tiled-plugins can provide this; Tiled should not elevate any particular string as a default.
We considered making this critical for representing Bluesky data. We concluded that we can use a simpler approach, so this is not needed. We do not rule out adding support for something like this in the future, but we are not currently pursuing it. |
This adds the ability to create and download "virtual" datasets from the contents of Consolidated containers. The data is collected server-side and presented to the client as an Xarray; no new nodes or data sources are created in the process.
Usage example:
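A minimal, hypothetical setup sketch. The server URI and the way `c['test']` was populated are assumptions; only the container name and the `"time"` coordinate spec come from the description below:

```python
from tiled.client import from_uri

# Connect to a running Tiled server (URI is illustrative).
c = from_uri("http://localhost:8000")

# Assume c['test'] is a consolidated container whose entries include
# arrays, images, and table columns, and that the "time" array carries
# specs=["xarray_coord"] so it becomes a coordinate.
x = c["test"]
```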
Then, to create a dataset from all arrays in `c['test']`, one can use the `.to_dataset()` method and then `.read()` on the resulting `DatasetClient` to fetch the data as an xarray.

To build a dataset from a subset of the consolidated contents, pass their keys to the `.to_dataset()` method.

By default, all arrays included in the dataset are treated as `xarray_data_var`s; to mark any of them as a coordinate, set its specs to `specs = ["xarray_coord"]`, as for the `"time"` dimension above. Since data arrays stored as table columns cannot be assigned specs individually, if one desires to use them as dataset coordinates, this can be accomplished by setting the `"column_specs"` key in the table metadata, e.g. `"column_specs": {"colA": ["xarray_coord"]}`.

Virtual datasets can include variables with multiple dimensions and dimensions with non-matching sizes, as in the sketch below.
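A hedged sketch combining these calls; the `keys` keyword and the entry names are assumptions based on the parameters visible in the diffs above:

```python
# Fetch all entries as one xarray.Dataset.
ds = x.to_dataset()   # returns a DatasetClient; no new nodes are created
xarr = ds.read()      # downloads the data as an xarray.Dataset

# Build a dataset from a subset of the contents by passing their keys.
subset = x.to_dataset(keys=["time", "arr1", "colA"]).read()

# Variables may have different dimensionality and non-matching sizes,
# e.g. 1-D arrays over "time" alongside 3-D image stacks over
# ("time", "x", "y").

# Table columns cannot carry specs individually; to use one as a
# coordinate, set "column_specs" in the table's metadata, e.g.:
#     {"column_specs": {"colA": ["xarray_coord"]}}
```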
Finally, it is possible to align the variable sizes along the leftmost dimension ("resample" or "zip_shortest") when building the dataset (before downloading the xarray).
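A short sketch of that call, assuming the `align` keyword shown in the diff above:

```python
# Trim all variables to the shortest length along the leftmost
# dimension before the xarray is assembled server-side.
aligned = x.to_dataset(align="zip_shortest").read()
```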
Checklist