
[FEATURE REQUEST] Multi-tensor samples from complex data sources #79

Open
@elistevens


Is your feature request related to a problem? Please describe.
I want to know how to accomplish the following. It doesn't need to be zero-effort for the user, but there needs to be a clear Hangar best-practices path to a workable setup.

Take a source data format that is complex (e.g. DICOM, JPEG) and that is infeasible to reconstitute bit-exact from the tensor+metadata form. Each instance of the raw data produces a sample that consists of two (or more) tensors: an image tensor and a 1D tensor that encodes things like lat/long or age/sex, etc. (to be concatenated with the output of the convolutional layers prior to the fully connected layers). To be clear, this is intended as an illustrative example, not a concrete use case.

Per my reading of the docs, right now these two tensors wouldn't qualify as belonging to the same Hangar dataset (it's not clear whether that's a problem).

Let's express the above conversion as:

f_v1(raw) -> (t1, t2)

Users will need to:

  • Update the conversion function to f_v2 and repopulate t1 and t2.
  • Update the conversion function to f_v3 which outputs (t1, t2, t3).
  • Update the raw data for a sample and repopulate t1 and t2.
  • Be handed the pair of t1 and t2 for training/validation (including when training is randomized).
  • Retrieve the raw data given IDs/tags/metadata included with the training sample (for use in an external viewer, manual investigation, etc.).
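To make the list above concrete, here is a minimal, library-agnostic sketch of the behavior being asked for. Nothing here is Hangar's API: `ToyStore`, `put_raw`, `set_converter`, `shuffled`, and the toy `f_v1`/`f_v3` bodies are all hypothetical names invented for illustration, and plain Python lists stand in for real tensors.

```python
import random
from typing import Callable, Dict, Iterator, List, Tuple

Tensor = List[float]  # stand-in for a real array type (assumption)
Converter = Callable[[bytes], Tuple[Tensor, ...]]


def f_v1(raw: bytes) -> Tuple[Tensor, Tensor]:
    """Toy conversion: t1 is the byte values, t2 is a one-element length feature."""
    return [float(b) for b in raw], [float(len(raw))]


def f_v3(raw: bytes) -> Tuple[Tensor, Tensor, Tensor]:
    """Toy later version that adds a third tensor (here, the byte sum)."""
    t1, t2 = f_v1(raw)
    return t1, t2, [float(sum(raw))]


class ToyStore:
    """Hypothetical in-memory store; not Hangar's API."""

    def __init__(self, convert: Converter) -> None:
        self.convert = convert
        self.raw: Dict[str, bytes] = {}
        self.tensors: Dict[str, Tuple[Tensor, ...]] = {}

    def put_raw(self, sample_id: str, blob: bytes) -> None:
        # Adding or updating raw data recomputes that sample's tensors.
        self.raw[sample_id] = blob
        self.tensors[sample_id] = self.convert(blob)

    def set_converter(self, convert: Converter) -> None:
        # Swapping the conversion function repopulates every sample.
        self.convert = convert
        for sample_id, blob in self.raw.items():
            self.tensors[sample_id] = convert(blob)

    def shuffled(self, seed: int) -> Iterator[Tuple[str, Tuple[Tensor, ...]]]:
        # Yield (id, tensors) so a sample's tensors always travel together,
        # and the id can be used to look the raw data back up.
        ids = sorted(self.raw)
        random.Random(seed).shuffle(ids)
        for sample_id in ids:
            yield sample_id, self.tensors[sample_id]


store = ToyStore(f_v1)
store.put_raw("a", b"\x01\x02")
store.put_raw("b", b"\x03")
store.set_converter(f_v3)    # conversion function changes: all samples repopulated
store.put_raw("a", b"\x05")  # raw data changes: only sample "a" repopulated
```

The point of the sketch is that raw data, derived tensors, and the conversion function are tied together by one sample ID, so none of the five operations can leave them out of sync.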

Describe the solution you'd like
I think that changing the definition of a sample to be a tuple of binary blobs plus a tuple of tensors plus metadata would work, but I haven't considered the potential impacts of that kind of change. It seems potentially large.
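As a rough illustration of that sample definition (a sketch only; the `Sample` class and its field names are hypothetical and do not reflect how Hangar models samples), it might look like this, with lists standing in for tensors:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Tensor = List[float]  # stand-in for a real array type (assumption)


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample: raw blobs, derived tensors, and metadata together."""

    sample_id: str
    blobs: Tuple[bytes, ...]     # raw source data, kept bit-exact
    tensors: Tuple[Tensor, ...]  # e.g. (t1, t2) produced by f_v1
    metadata: Dict[str, str] = field(default_factory=dict)


sample = Sample(
    sample_id="scan-0001",
    blobs=(b"raw file bytes",),
    tensors=([0.1, 0.2, 0.3], [47.6, -122.3]),
    metadata={"converter": "f_v1"},
)
```

Recording which conversion function produced the tensors (here under a made-up `converter` key) is one way a store could tell which samples need repopulating after an update.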

Describe alternatives you've considered
Another option would be to have separate datasets for t1 and t2 and combine them manually, plus manage the binary blobs separately. That seems like a lot of infra work, and it risks drift both between the t1 and t2 samples and between the samples and the blobs.
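For comparison, the manual approach amounts to something like the sketch below, where three independent mappings are joined on a shared sample ID. The names (`join_by_id`, the three example stores) are hypothetical, and plain dicts stand in for whatever storage would actually be used.

```python
from typing import Dict, List, Set, Tuple

Tensor = List[float]  # stand-in for a real array type (assumption)


def join_by_id(
    t1_store: Dict[str, Tensor],
    t2_store: Dict[str, Tensor],
    blob_store: Dict[str, bytes],
) -> Tuple[Dict[str, Tuple[Tensor, Tensor]], Set[str]]:
    """Pair t1 and t2 by sample ID and report IDs missing from any store."""
    shared = set(t1_store) & set(t2_store) & set(blob_store)
    everything = set(t1_store) | set(t2_store) | set(blob_store)
    pairs = {sid: (t1_store[sid], t2_store[sid]) for sid in sorted(shared)}
    # IDs present in some stores but not all of them have drifted.
    return pairs, everything - shared


t1_store = {"a": [1.0, 2.0], "b": [3.0]}
t2_store = {"a": [47.6], "c": [12.0]}
blob_store = {"a": b"\x01\x02", "b": b"\x03"}

pairs, drifted = join_by_id(t1_store, t2_store, blob_store)
```

This check only catches missing IDs. It cannot tell whether a tensor was computed from an older version of the blob or of the conversion function, which is the harder part of the drift problem.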

Additional context
I suspect that I want/expect Hangar to solve a larger slice of the problem than it's intended to, but it's not clear at first glance what the intended approach would be for more complicated setups like the above.


Labels: documentation, enhancement (New feature or request), needs decision (Discussion is ongoing to determine what to do.), question (Further information is requested)
