On redundancy as validation: entities vs. metadata #91

Relates to https://github.com/bids-standard/bids-specification/issues/2155.

Having done the deep dive linked above, I've had something resembling an insight on the topic of removal vs. empowerment of the Inheritance Principle (IP; too many issues to link), and thought I'd share it in case others find it enlightening.

For data files, the "subject" and "session" entities have *redundancy*: each appears both in a parent directory name and in the file name.
For manual curation of datasets this provides some degree of *error detection*, since a mismatch between the two flags a misplaced or misnamed file.
The relationship between permissible suffixes and modality directory is more complex but serves a similar purpose (#55 would be a more direct recapitulation).
#63 (and maybe others?) would want to repeat this same kind of redundancy structure.
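
To make the error-detection point concrete, here is a minimal sketch of the kind of check that this redundancy permits. The function name `entity_mismatches` and the regular expressions are my own illustration, not taken from any validator, and it only looks at `sub` and `ses`:

```python
import re
from pathlib import PurePosixPath


def entity_mismatches(relpath: str) -> list[str]:
    """Return the entities ("sub", "ses") whose value in the file name
    disagrees with (or is missing relative to) the parent directories.

    Hypothetical helper for illustration only.
    """
    path = PurePosixPath(relpath)

    # Entities encoded as directory names, e.g. "sub-01" -> {"sub": "01"}
    dir_entities = {}
    for part in path.parts[:-1]:
        match = re.fullmatch(r"(sub|ses)-([A-Za-z0-9]+)", part)
        if match:
            dir_entities[match.group(1)] = match.group(2)

    # The same entities as encoded in the file name
    name_entities = dict(
        re.findall(r"(?:^|_)(sub|ses)-([A-Za-z0-9]+)", path.name)
    )

    # Any disagreement between the two copies is a detectable error
    return [
        key
        for key in sorted(set(dir_entities) | set(name_entities))
        if dir_entities.get(key) != name_entities.get(key)
    ]


print(entity_mismatches("sub-01/ses-02/anat/sub-01_ses-02_T1w.nii.gz"))
print(entity_mismatches("sub-01/ses-02/anat/sub-01_ses-03_T1w.nii.gz"))
```

The first path is consistent and yields an empty list; the second has a session label in the file name that disagrees with its directory, so `ses` is reported. The check is only possible because the information is stored twice.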

Now contrast this against key-value metadata. One of the aims of attempts to empower the Inheritance Principle is to elucidate the myriad complex relationships between data files, by identifying metadata that is shared across many data files and defining it just once. The location of that shared metadata file in terms of parent directory / entities (/ suffix), together with which metadata are common and which are distinct between files, communicates the nature of the relationship.

This process is therefore **explicitly removing redundancy**.

Thus far the argument for removal of the Inheritance Principle has largely rested on avoiding unnecessary complexity. I don't think I've heard anyone argue that forcing all metadata for a given data file to be defined explicitly, in association with that one data file, provides an intrinsic error detection mechanism.

If the IP is to be present in any form, the documentation should state explicitly that involving the IP in manual data curation is dangerous, and that where possible it is better to rely on automated tools to identify and remove metadata redundancy (e.g. https://github.com/Lestropie/IP-freely/issues/2).
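
As a rough sketch of what "automated removal of metadata redundancy" could look like at its simplest: given fully explicit per-file metadata, find the key-value pairs that are identical across all files and split them out. The function name `hoist_shared` and the example file names are hypothetical, and this is not how IP-freely works; it ignores all of the rules about where a shared metadata file may live and how it must be named.

```python
def hoist_shared(sidecars: dict[str, dict]) -> tuple[dict, dict[str, dict]]:
    """Split per-file metadata into (shared, residual).

    `shared` holds key-value pairs identical across every file;
    `residual` holds what remains specific to each file.
    Hypothetical helper for illustration only.
    """
    # With fewer than two files there is no redundancy to remove
    if len(sidecars) < 2:
        return {}, {name: dict(meta) for name, meta in sidecars.items()}

    metas = list(sidecars.values())
    first, rest = metas[0], metas[1:]

    # A pair is shared only if every other file has the same key AND value
    shared = {
        key: value
        for key, value in first.items()
        if all(key in meta and meta[key] == value for meta in rest)
    }

    residual = {
        name: {k: v for k, v in meta.items() if k not in shared}
        for name, meta in sidecars.items()
    }
    return shared, residual


explicit = {
    "sub-01_task-rest_bold.json": {"RepetitionTime": 2.0, "EchoTime": 0.03},
    "sub-02_task-rest_bold.json": {"RepetitionTime": 2.0, "EchoTime": 0.035},
}
shared, residual = hoist_shared(explicit)
print(shared)
print(residual)
```

Here `RepetitionTime` is hoisted into the shared set while `EchoTime` stays per-file. The point is that the curator only ever writes the explicit, redundant form, and the tool derives the deduplicated form from it.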

These complex relationships between data, based on shared vs. distinct metadata, are present in BIDS datasets regardless of whether the IP is utilised in their storage. The only distinction is whether those relationships are made more prominent in the filesystem structure through exploitation of the IP, or are only visible through either *a priori* definition of a set of entities / suffixes to wildcard or a deep interrogation of the full metadata relational graph.

That insight has nudged me away somewhat from the IP advocacy side...

