On redundancy as validation: entities vs. metadata #91

@Lestropie

Relates to bids-standard/bids-specification#2155.

I've had something resembling an insight on the topic of removal vs. empowerment of the Inheritance Principle (too many issues to link), having done the deep dive linked above, and thought I'd share in case others find it enlightening.

For data files, the "subject" and "session" entities are redundant: each appears in both a parent directory name and the file name.
For manual curation of datasets, this provides some degree of error detection, since a mismatch between the two flags a misplaced or misnamed file.
The relationship between permissible suffixes and modality directory is more complex but serves a similar purpose (#55 would be a more direct recapitulation).
#63 (and maybe others?) would want to repeat this same kind of redundancy structure.
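
To make the error-detection idea concrete, here is a minimal sketch of a check that exploits this redundancy. It assumes paths of the form `sub-<label>/[ses-<label>/]<modality>/<file>`; the function name `entity_mismatches` and the example paths are hypothetical, and this is not how any existing validator is implemented.

```python
import re
from pathlib import PurePosixPath


def entity_mismatches(relpath):
    """List (entity, directory_value, filename_value) tuples that disagree.

    Hypothetical helper: compares the sub-/ses- values found in the parent
    directories against the same entities in the file name.
    """
    path = PurePosixPath(relpath)
    # Entities encoded in the file name, e.g. "sub-01_ses-02_T1w.nii.gz".
    # Parts without a hyphen (the suffix and extension) are skipped.
    name_entities = dict(
        part.split("-", 1) for part in path.name.split("_") if "-" in part
    )
    problems = []
    for directory in path.parts[:-1]:
        match = re.fullmatch(r"(sub|ses)-(.+)", directory)
        if match is None:
            continue  # not an entity directory, e.g. a modality directory
        key, dir_value = match.groups()
        name_value = name_entities.get(key)  # None if absent from file name
        if name_value != dir_value:
            problems.append((key, dir_value, name_value))
    return problems
```

An empty list means the two copies of each entity agree; any tuple returned is exactly the kind of curation slip that the duplicated entities make detectable.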

Now contrast this against key-value metadata. One of the aims of empowering the Inheritance Principle is to elucidate the myriad complex relationships between data files, by identifying metadata that is shared across many data files and defining it just once. The location of that shared metadata file in terms of parent directory / entities (/ suffix), together with which metadata is common and which is distinct between files, communicates the nature of the relationship.

This process is therefore explicitly removing redundancy.

Thus far the argument for removal of the Inheritance Principle has largely been on the basis of avoiding unnecessary complexity. I don't think I've heard anyone argue that forcing all metadata for a given data file to be defined explicitly, in association with that one data file, provides an intrinsic error detection mechanism.

If the IP is to be present in any form, it should be explicit in the documentation that involving the IP in manual data curation is dangerous, and that, if possible, it would be better to instead rely on automated tools to identify and remove metadata redundancy (e.g. Lestropie/IP-freely#2).
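
As a rough illustration of what "automated identification and removal of metadata redundancy" could look like, the sketch below hoists key-value pairs that are identical across a set of fully explicit sidecars into one shared dictionary. The function name `hoist_shared_metadata` and the input layout (a mapping from file name to metadata dict) are assumptions made for this example; it does not describe the behaviour of IP-freely or any other existing tool, and it ignores the question of where in the directory tree the shared file would be allowed to live.

```python
def hoist_shared_metadata(sidecars):
    """Split explicit per-file metadata into a shared part and remainders.

    `sidecars` maps a (hypothetical) file name to its complete metadata dict.
    Returns (shared, remainders): `shared` holds the key-value pairs that are
    identical in every sidecar; `remainders` maps each file name to what is
    left once the shared pairs are removed.
    """
    if not sidecars:
        return {}, {}
    dicts = list(sidecars.values())
    first, rest = dicts[0], dicts[1:]
    # A pair is shared only if every other sidecar has the same key and value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }
    remainders = {
        name: {k: v for k, v in meta.items() if k not in shared}
        for name, meta in sidecars.items()
    }
    return shared, remainders
```

Because the input is the redundant, fully explicit form, a curator never has to edit the deduplicated files by hand: the explicit form stays the thing that is checked, and the shared form is derived from it.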

These complex relationships between data based on mutual vs. distinct metadata are present in BIDS datasets, regardless of whether the IP is utilised in their storage. It's only a distinction of whether those relationships are made more prominent in the filesystem structure through exploitation of the IP, or only visible through either a priori definition of a set of entities / suffixes to wildcard or a deep interrogation of the full metadata relational graph.

That insight has nudged me away somewhat from the IP advocacy side...
