Reviewing StreamResource and StreamDatum from the storage, access perspective #296

@danielballan

During the discussions that developed StreamResource and StreamDatum documents, the "producer" side was being built, but much of the "consumer" side remained to be developed. I think we agreed to stay open to adjustments when the time came to build out the consumer side. I hope that window is still open.

Here's a StreamResource from DLS I22, courtesy of @DiamondJoseph:

  {
    "name": "stream_resource",
    "doc": {
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
      "data_key": "waxs-sum",
      "spec": "AD_HDF5_SWMR_SLICE",
      "root": "/dls/i22/data/2023/cm33873-5",
      "resource_path": "i22-723667-waxs-hdf.h5",
      "resource_kwargs": {
        "path": "/entry/instrument/NDAttributes/StatsTotal",
        "multiplier": 1,
        "timestamps": "/entry/instrument/NDAttributes/NDArrayTimeStamp"
      },
      "path_semantics": "posix",
      "run_start": "617184e5-5e24-40b1-80f7-0056d51e44f8"
    }
  }

I can see nothing to change about these:

  • uid, unique key for the document
  • data_key, tells us which field (column) in the Event Stream this maps to
  • run_start, tells us which BlueskyRun this belongs to

I might suggest re-evaluating the names of:

  • spec, a key in a registry that tells us which reader ("handler") to use
  • resource_kwargs, additional configuration for the reader

In Tiled, we use MIME types, including standard ones such as image/tiff and custom ones named like application/x-hdf5-smwr-slice, in the role of spec. Maybe it's worth considering whether we want to use mimetype here, as it is a recognized, standard way to say, "This is the format, which tells you which I/O code can read it."

Taking a broader-than-Python look at the world, resource_kwargs (emphasis on "kwargs") is a bit of a Python-ism. These are JSON-serializable parameters needed by the code that reads the data, and that code does not necessarily have to be written in Python. In Tiled, the analogous entity is simply called parameters.

The existing names aren't doing any harm, so we have to weigh the pain of change against the marginal benefit of maybe-better names. If we leave them as is, I think that would be fine. Just worth considering.
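
To make the renaming concrete, here is a minimal sketch of what the translation could look like. The key names mimetype and parameters follow the suggestion above; the SPEC_TO_MIMETYPE table and the rename_reader_keys helper are hypothetical and exist only for illustration, not in any library.

```python
# Hypothetical lookup from existing spec names to MIME types.
# The single entry pairs the spec from the example document with the
# custom MIME type mentioned above; the pairing itself is an assumption.
SPEC_TO_MIMETYPE = {
    "AD_HDF5_SWMR_SLICE": "application/x-hdf5-smwr-slice",
}


def rename_reader_keys(doc):
    """Return a copy of a StreamResource with spec -> mimetype and
    resource_kwargs -> parameters. All other keys pass through unchanged."""
    new_doc = {
        key: value
        for key, value in doc.items()
        if key not in ("spec", "resource_kwargs")
    }
    new_doc["mimetype"] = SPEC_TO_MIMETYPE[doc["spec"]]
    new_doc["parameters"] = dict(doc["resource_kwargs"])
    return new_doc


old_doc = {
    "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
    "data_key": "waxs-sum",
    "spec": "AD_HDF5_SWMR_SLICE",
    "resource_kwargs": {
        "path": "/entry/instrument/NDAttributes/StatsTotal",
        "multiplier": 1,
        "timestamps": "/entry/instrument/NDAttributes/NDArrayTimeStamp",
    },
}
new_doc = rename_reader_keys(old_doc)
```

Nothing about the content changes here, only the spelling, which is why the cost/benefit question is mostly about migration pain.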

The way we spell the location of the data seems more problematic:

  • root, an absolute filepath
  • resource_path, a relative filepath (sometimes filename)
  • path_semantics, a two-member enum that is win or posix

I think we should consider replacing all of this with a URI, like file://localhost/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5, for these reasons:

  1. We foresee data being available in protocols other than file:, such as s3:.
  2. The partitioning of the path between root and resource_path is a guess about what parts of the path will be stable over time and what parts may change as the data moves. Different people make different guesses, and at NSLS-II this has been a mess.
  3. IIRC, path_semantics is completely vestigial, based on a misunderstanding that Windows required backslashes in paths. In fact, anything newer than Windows 95 is fine with forward slashes.
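
As a rough sketch of the conversion, assuming POSIX-style paths and a default host of localhost, the existing root and resource_path could be collapsed into a single URI using only the Python standard library. The helper name to_file_uri is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, urlsplit


def to_file_uri(root, resource_path, host="localhost"):
    """Join root and resource_path and spell the result as a file: URI.

    Assumes root is an absolute POSIX-style path and resource_path is
    relative to it. Characters that are not URI-safe are percent-encoded;
    forward slashes are left alone.
    """
    full_path = PurePosixPath(root) / resource_path
    return f"file://{host}{quote(str(full_path))}"


uri = to_file_uri("/dls/i22/data/2023/cm33873-5", "i22-723667-waxs-hdf.h5")

# A consumer can pick the access protocol from the scheme alone,
# so file: and s3: locations can be handled uniformly.
parts = urlsplit(uri)
```

With this spelling there is no split between root and resource_path to guess at, and no path_semantics to carry around.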

In Tiled, the same logical resource may be available in multiple places. This is expressed as a one-to-many relation between logical datasets and URIs (effectively). If the StreamResource just had a URI, this would be simpler and more future-proof.

Here is a StreamDatum:

  {
    "name": "stream_datum",
    "doc": {
      "stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
      "seq_nums": {
        "start": 6,
        "stop": 7
      },
      "indices": {
        "start": 5,
        "stop": 6
      },
      "descriptor": "8ff32942-9b21-4542-aee1-16c04291950d"
    }
  }

The descriptor is on StreamDatum instead of on StreamResource with data_key. I think it's worth revisiting whether we really need that, because life would be simpler if the descriptor and data_key were together on the same document.
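
To illustrate the consumer-side cost, here is a minimal sketch, using trimmed copies of the two example documents above, of the join needed today to learn which column of which Event Stream a StreamDatum refers to. The helper name locate_column is hypothetical.

```python
def locate_column(stream_resource, stream_datum):
    """Return (descriptor uid, data_key) for a StreamDatum.

    The descriptor lives on the StreamDatum and the data_key lives on the
    StreamResource, so both documents are needed to answer the question.
    """
    if stream_datum["stream_resource"] != stream_resource["uid"]:
        raise ValueError("StreamDatum does not reference this StreamResource")
    return stream_datum["descriptor"], stream_resource["data_key"]


stream_resource = {
    "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
    "data_key": "waxs-sum",
}
stream_datum = {
    "stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
    "uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
    "descriptor": "8ff32942-9b21-4542-aee1-16c04291950d",
}
descriptor_uid, data_key = locate_column(stream_resource, stream_datum)
```

If descriptor sat next to data_key on the StreamResource, this lookup would need only one document.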