Conversation

@bclenet (Contributor) commented Apr 10, 2025

This is a work-in-progress PR proposing a specification update for BEP028 BIDS-Prov.

@bclenet (Contributor, Author) commented Oct 16, 2025

Hi @rwblair,

As discussed earlier, here are the issues I'm currently facing:

  • schema | validation of provenance files: I created the schema/rules/json/prov.yaml file; is there anything else to do in this file to validate the contents of the objects in the Activities (resp. Environments, ProvEntities, Software) arrays?
  • schema | validation of provenance-related metadata fields in sidecars: should I create a schema/rules/sidecars/prov.yaml to describe that some new provenance-related metadata fields are optional in all sidecars?
  • validator | .json provenance files are considered sidecars, hence raising the error SIDECAR_WITHOUT_DATAFILE. I'm afraid I won't be able to come up with a clean modification of the validator to mark these files as standalone JSON.
  • schema | how/where do I allow for a prov/[prov-<label>] directory in the schema?
  • macros | how do I add an optional prov/[prov-<label>] to the filename templates of the Provenance Files section (using the MACROS___make_filename_template macro)?
  • CI | PDF rendering fails due to an assert_no_multiline_links check, which I don't understand.

Thanks for your help :)

@rwblair (Member) commented Oct 17, 2025

@bclenet While thinking about your first two points, on how to validate certain parts of the files, I started to realize the nature of the changes the schema will require to validate them.

Here is my understanding of the main rules this BEP wants to enforce that require information beyond what has historically been used for validation, and the issues they raise:

  • provenance.tsv "provenance_label": There exists at least one file in the dataset that uses each value in the column for its prov- entity label.
    • This is similar to the rule we have for participants.tsv, which states that any value in the participant_id column must have a corresponding file in the dataset or be referenced in the phenotype directory. For this to work we had to populate a special field in the dataset context with information about the names of the subject directories. Populating these fields is not representable in the schema presently and must be implemented by the validators.
  • Every id used in all prov files is unique with respect to the dataset.
    • When a given file is validated, we build a context for it that has all the information it needs to be validated by the schema. Typically this involves loading a sidecar or a handful of specific associated files. This type of pan-dataset assertion is not possible in the current schema.
  • *_ent.json "locatedAt": Its value exists in the dataset.
    • Doable in the current schema (see the sketch after this list).
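
For the "locatedAt" rule, a minimal sketch of what such a check could look like, assuming the schema's existing `exists()` expression function; the rule name, issue code, and selectors are invented for illustration:

```yaml
# Hypothetical check for schema/rules/checks/ -- all names illustrative only.
ProvEntityLocatedAt:
  issue:
    code: PROV_LOCATED_AT_MISSING
    message: |
      The path referenced by "locatedAt" does not exist in the dataset.
    level: error
  selectors:
    - suffix == "ent"
    - '"locatedAt" in json'
  checks:
    - exists(json.locatedAt, "dataset") > 0
```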

The following are all similar:

  • *_ent.json "generatedBy": its values only reference existing Activity ids.
  • *_act.json "AssociatedWith": its values only reference existing Software ids.
  • *_act.json "Used": its values only reference existing Environment ids or ProvEntity ids.
  • anybidsfile.json "GeneratedBy": its values only reference existing Activity ids.
  • anybidsfile.json "SidecarGeneratedBy": its values only reference existing Activity ids.
    • These suffer from the issue above: for any given prov file we are validating, we must load arbitrarily many others and check the value at a specific key inside each of their arrays. We could try to come up with new semantics for schema entries that would allow this. Another option would be to come up with a new way in the schema to aggregate values from multiple files into a single place, and then a way of running checks on the aggregated data. But even if all the, for example, Software objects were gathered in a single place, we'd have another problem...
    • The expression language is incapable of iterating through an array of objects and running a check on a specific key for each element. We could extend the expression language with a function like flatten(array: List[List | dict], key: Optional[str]). If the input is an array of arrays, we use the normal flatten semantics of putting all elements of all arrays into a single array and returning that. If the input is an array of objects, we return an array of each object's value at the specified key (a sketch follows this list).
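
A minimal sketch of such a `flatten`, following the semantics described above; the example Software objects and their `Id` values are invented for illustration:

```python
from typing import Optional, Union


def flatten(array: list[Union[list, dict]], key: Optional[str] = None) -> list:
    """Flatten a list of lists, or project a key out of a list of objects."""
    result = []
    for element in array:
        if isinstance(element, list):
            # List of lists: normal flatten semantics.
            result.extend(element)
        elif isinstance(element, dict) and key is not None:
            # List of objects: collect each object's value at `key`.
            if key in element:
                result.append(element[key])
    return result


# Example: gather every Software id from an aggregated array of objects.
software = [
    {"Id": "urn:uuid:1111", "Label": "FSL"},
    {"Id": "urn:uuid:2222", "Label": "SPM"},
]
assert flatten(software, key="Id") == ["urn:uuid:1111", "urn:uuid:2222"]
```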

One thing I like about this proposal is that each JSON file is simple enough to be immediately understood by a human. I was playing around with alternative ways of organizing the data from the examples that might be more amenable to the current expression language, and they were all much more difficult to read at a glance. The UIDs in the Ids make me think these files were not meant to be produced or consumed by humans, but I'm a sucker for looking at any JSON file that comes across my path.

Please let me know if I have misunderstood/misinterpreted any of the rules from the BEP.

@effigies Any comments on my characterization of the schema's shortcomings with respect to the above rules?

@bclenet This only sort of addresses your first two issues; for the remaining four:

  • sidecar without datafile: we do need to add a field to the schema to indicate this. I'll bring it up at the next schema hack and take ownership of adding its interpretation to the JavaScript validator (a hypothetical sketch follows this list).
  • prov subdirectories: I've got a local branch that's capable of doing this; I'll try to push it upstream when it's ready.
  • Macros and CI: I still need to look into these.
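
For the standalone-JSON field, a hypothetical sketch of what such a marker could look like in a rules entry; the field name and its placement are invented, and nothing like this exists in the schema yet:

```yaml
# Hypothetical: mark prov files as standalone JSON data files so that
# validators skip the SIDECAR_WITHOUT_DATAFILE heuristic for them.
prov:
  suffixes:
    - act
    - ent  # suffix list illustrative, based on *_act.json / *_ent.json above
  extensions:
    - .json
  standalone_json: true  # invented field name
```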

`MD5`; `SHA1`; `SHA-224`; `SHA-256`; `SHA-384`; `SHA-512`;
`SHA3-224`; `SHA3-256`; `SHA3-384`; `SHA3-512`; `BLAKE2B-256`; `BLAKE3-256`;
`SHAKE128`; `SHAKE256`. Otherwise, key MAY be an arbitrary label.
The corresponding value is the checksum as computed by the function.

Suggested change:
- The corresponding value is the checksum as computed by the function.
+ The corresponding value is the checksum as computed by the function identified by the key.
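
For illustration, a checksum object of this shape might look like the following; the parent field name `Digest` is an assumption for this sketch, and the `SHA-256` value shown is the digest of the empty string:

```json
{
  "Digest": {
    "SHA-256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "MyCustomHash": "output-of-an-arbitrary-tool"
  }
}
```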


The Resource Description Framework (RDF) is a method to describe and exchange graph data.

The terms defined in this part of the BIDS specification are based on the [W3C Prov](https://www.w3.org/TR/2013/REC-prov-o-20130430/) standard. Their relations with W3C Prov terms are defined in the [`provenance-context.json`]() file.
Review comment (Member):

`provenance-context.json` isn't mentioned anywhere else on the page. You may want to clarify that it exists as a JSON file in the specification itself, like `metaschema.json`.

Further datasets are available from
the [BIDS examples repository](https://bids-website.readthedocs.io/en/latest/datasets/examples.html#provenance).

## Overview
Review comment (Member):

I'd remove this heading and move the sentences under it into the section above (#provenance); I think the opening sentences of a page are generally understood to be an overview.


This description is based on the [W3C Prov](https://www.w3.org/TR/2013/REC-prov-o-20130430/) standard.

### General principles
Review comment (Member):

These principles could be moved into the top-level section; other sections of the standard (behavioral, phenotypic, etc.) use requirement keywords in their opening salvos. I'd also remove the newlines between the sentences.

- *sub-001_T1w_preproc.nii* is the skull-stripped image;
- the *"Brain extraction"* activity was performed using the *FSL* software within a *Linux* software environment.

Provenance objects are described as JSON objects in BIDS. They are stored inside **provenance files** (see [Provenance files](#provenance-files)). Additionally, metadata of provEntities can be stored as BIDS metadata inside sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file)) as well as in `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).

Suggested change:
- Provenance objects are described as JSON objects in BIDS. They are stored inside **provenance files** (see [Provenance files](#provenance-files)). Additionally, metadata of provEntities can be stored as BIDS metadata inside sidecar JSON files (see [Provenance of a BIDS file](#provenance-of-a-bids-file)) as well as in `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).
+ Provenance objects are described as JSON objects in BIDS. They are stored inside **provenance files** (see [Provenance files](#provenance-files)). Additionally, metadata for provEntities can be stored inside the JSON sidecar file for any BIDS data file (see [Provenance of a BIDS file](#provenance-of-a-bids-file)), as well as in `dataset_description.json` files (see [Provenance of a BIDS dataset](#provenance-of-a-bids-dataset)).
