During the discussions that developed StreamResource and StreamDatum documents, the "producer" side was being built, but much of the "consumer" side remained to be developed. I think we agreed to stay open to adjustments when the time came to build out the consumer side. I hope that window is still open.
Here's a StreamResource from DLS I22, courtesy of @DiamondJoseph:
{
"name": "stream_resource",
"doc": {
"uid": "158167d9-796b-4e12-afff-5b3dd903815c",
"data_key": "waxs-sum",
"spec": "AD_HDF5_SWMR_SLICE",
"root": "/dls/i22/data/2023/cm33873-5",
"resource_path": "i22-723667-waxs-hdf.h5",
"resource_kwargs": {
"path": "/entry/instrument/NDAttributes/StatsTotal",
"multiplier": 1,
"timestamps": "/entry/instrument/NDAttributes/NDArrayTimeStamp"
},
"path_semantics": "posix",
"run_start": "617184e5-5e24-40b1-80f7-0056d51e44f8"
}
}

I can see nothing to change about these:

- uid, unique key for the document
- data_key, tells us which field (column) in the Event Stream this maps to
- run_start, tells us which BlueskyRun this belongs to

I might suggest re-evaluating the names of:

- spec, a key in a registry that tells us which reader ("handler") to use
- resource_kwargs, additional configuration for the reader

In Tiled, we use MIME types, including standard ones such as image/tiff and custom ones named like application/x-hdf5-smwr-slice, in the role of spec. Maybe it's worth considering whether we want to use mimetype here, as it is a recognized, standard way to say, "This is a format, and it tells you which I/O code to use to read it."
Taking a broader-than-Python look at the world, resource_kwargs (emphasis on "kwargs") is a bit of a Python-ism. These are JSON-serializable parameters that are needed by the code that reads the data. This code does not necessarily have to be in Python. In Tiled, the analogous entity is simply called parameters.
The existing names aren't doing any harm, so we have to weigh the pain of change against the marginal benefit of maybe-better names. If we leave them as is, I think that would be fine. Just worth considering.
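To make the naming question concrete, here is a minimal sketch of how a consumer might dispatch on a MIME type and pass the parameters through to a reader. Everything in it is an assumption for illustration: the `READERS` registry, the `register` decorator, the `open_resource` helper, and the renamed document keys `mimetype`, `uri`, and `parameters` are hypothetical and are not the API of Tiled or event-model. The reader is a placeholder that opens no file.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a MIME type to the function that reads it.
READERS: Dict[str, Callable[..., Any]] = {}


def register(mimetype: str):
    """Decorator that records a reader function under a MIME type."""
    def decorator(func):
        READERS[mimetype] = func
        return func
    return decorator


@register("application/x-hdf5-smwr-slice")
def read_hdf5_slice(uri, path, multiplier=1, timestamps=None):
    # Placeholder reader: a real one would open the file behind `uri`.
    # Here we only echo back what we were asked to read.
    return {
        "uri": uri,
        "dataset": path,
        "multiplier": multiplier,
        "timestamps": timestamps,
    }


def open_resource(doc: dict):
    """Look up the reader by MIME type and pass the parameters through."""
    reader = READERS[doc["mimetype"]]
    return reader(doc["uri"], **doc["parameters"])
```

Nothing here is Python-specific in spirit: a consumer in any language could keep the same kind of lookup table, which is part of the appeal of the names mimetype and parameters.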
The way we spell the location of the data seems more problematic:
- root, an absolute filepath
- resource_path, a relative filepath (sometimes filename)
- path_semantics, a two-member enum that is win or posix

I think we should consider replacing all of this with a URI, like file://localhost/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5.

- We foresee data being available in protocols other than file:, such as s3:.
- The partitioning of the path between root and resource_path is a guess about what parts of the path will be stable over time and what parts may change as the data moves. Different people make different guesses, and at NSLS-II this has been a mess.
- IIRC, path_semantics is completely vestigial, based on a misunderstanding that Windows required backslashes in paths. In fact, anything newer than Windows 95 is fine with forward slashes.
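For existing documents, collapsing the three fields into one URI could look roughly like the sketch below. The helper name `resource_to_uri` and the default host of `localhost` are assumptions for illustration, and any path_semantics value other than posix is treated as a Windows path.

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import quote, urlunsplit


def resource_to_uri(doc: dict, host: str = "localhost") -> str:
    """Collapse root/resource_path/path_semantics into a single file: URI.

    Hypothetical helper, not part of event-model.
    """
    if doc.get("path_semantics", "posix") == "posix":
        path = str(PurePosixPath(doc["root"]) / doc["resource_path"])
    else:
        # Windows paths are written with forward slashes in the URI,
        # with a leading slash before the drive letter.
        full = PureWindowsPath(doc["root"]) / doc["resource_path"]
        path = "/" + full.as_posix()
    # Percent-encode anything unsafe, but keep "/" and the drive colon.
    return urlunsplit(("file", host, quote(path, safe="/:"), "", ""))


stream_resource = {
    "root": "/dls/i22/data/2023/cm33873-5",
    "resource_path": "i22-723667-waxs-hdf.h5",
    "path_semantics": "posix",
}
uri = resource_to_uri(stream_resource)
```

For the I22 example above this produces the same file://localhost/... URI suggested in the text, and the split between root and resource_path no longer matters.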
In Tiled, the same logical resource may be available in multiple places. This is expressed as a one-to-many relation between logical datasets and URIs (effectively). If the StreamResource just had a URI, this would be simpler and more future-proof.
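As a rough illustration of that one-to-many relation, a consumer could hold several URIs for one logical dataset and pick the first one whose scheme it knows how to read. This is only a sketch: `pick_uri` is a hypothetical helper, the s3: URI is an invented example (bucket name included), and this is not how Tiled implements it.

```python
from urllib.parse import urlsplit


def pick_uri(uris, supported_schemes=("file",)):
    """Return the first URI whose scheme this consumer can read."""
    for uri in uris:
        if urlsplit(uri).scheme in supported_schemes:
            return uri
    raise LookupError(f"No URI with a supported scheme among {list(uris)}")


# One logical dataset, available in two places.
# The s3: location is hypothetical.
locations = [
    "s3://example-bucket/cm33873-5/i22-723667-waxs-hdf.h5",
    "file://localhost/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5",
]
local = pick_uri(locations, supported_schemes=("file",))
remote = pick_uri(locations, supported_schemes=("s3", "file"))
```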
Here is a StreamDatum:
{
"name": "stream_datum",
"doc": {
"stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
"uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
"seq_nums": {
"start": 6,
"stop": 7
},
"indices": {
"start": 5,
"stop": 6
},
"descriptor": "8ff32942-9b21-4542-aee1-16c04291950d"
}
}

The descriptor is on StreamDatum instead of on StreamResource with data_key. I think it's worth revisiting whether we really need that, because life would be simpler if the descriptor and data_key were together on the same document.
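To show what the split costs on the consumer side, here is a minimal sketch of the join needed today to learn which column a StreamDatum fills. The function name `column_for_datum` is hypothetical, and treating indices as a half-open range (start inclusive, stop exclusive) is my assumption based on the example values above.

```python
def column_for_datum(datum: dict, resources: dict):
    """Resolve (descriptor, data_key, n_rows) for one StreamDatum.

    `resources` maps StreamResource uid -> StreamResource doc, so the
    consumer has to cache every StreamResource it has seen so far.
    """
    resource = resources[datum["stream_resource"]]
    # Assumption: indices describe a half-open range [start, stop).
    n_rows = datum["indices"]["stop"] - datum["indices"]["start"]
    # descriptor comes from the datum, data_key from the resource.
    return datum["descriptor"], resource["data_key"], n_rows


resources = {
    "158167d9-796b-4e12-afff-5b3dd903815c": {
        "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
        "data_key": "waxs-sum",
    }
}
datum = {
    "stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
    "uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
    "seq_nums": {"start": 6, "stop": 7},
    "indices": {"start": 5, "stop": 6},
    "descriptor": "8ff32942-9b21-4542-aee1-16c04291950d",
}
column = column_for_datum(datum, resources)
```

If descriptor lived on StreamResource next to data_key, the column could be identified once, when the StreamResource arrives, rather than rediscovered for every StreamDatum.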