ericearl
Collaborator

@ericearl ericearl commented May 30, 2025

The BEP leads can meet as needed to discuss this BEP PR.

Coordinate a meeting by emailing Eric Earl: [email protected].

Communicate on this PR to provide feedback otherwise.

HTML preview of this BEP

BEP036 brings best-practice guidelines for tabular phenotypic data to the BIDS specification.

  • Includes an appendix called phenotype.md
  • Includes a new AdditionalValidation key for the dataset_description.json, for which the usage is described in the modality agnostic files sections
  • Includes the new option to store session_id as the second column in the participants.tsv

Additional Links

  1. Original Google Doc
  2. Draft BIDS Validator errors and warnings
  3. BIDS Examples PR

Co-authored-by: Eric Earl [email protected] @ericearl
Co-authored-by: Samuel Guay [email protected] @SamGuay
Co-authored-by: Sebastian Urchs [email protected] @surchs
Co-authored-by: Arshitha Basavaraj [email protected] @Arshitha

ericearl and others added 4 commits May 20, 2025 08:24
Quick update before merging our PR on surchs fork
BEP036 brings guidelines for best tabular phenotypic data to the BIDS specification.

- Includes an appendix called `phenotype.md`
- Includes admonitions for the guidelines in-line with modality agnostic files sections

---------

Co-authored-by: Eric Earl <[email protected]>
Co-authored-by: Samuel Guay <[email protected]>
Co-authored-by: Sebastian Urchs <[email protected]>
Co-authored-by: Arshitha B <[email protected]>
Changed "e.g." to "for example" to follow contributing style guidelines.
each `phenotype/<measurement_tool_name>.json` data dictionary.
This improves reusability and provides clarity about the measurement tool.

### 5. Use the demographics file for common variables about participants
Contributor


Copying from https://github.com/surchs/bids-specification/pull/1/files#r2103117486

For this section, would it make sense to suggest that demo-like information be prioritized in this file rather than participants.tsv, making the latter primarily a list of subject IDs? I haven't seen this explicitly addressed anywhere, though I'm unsure if it's something we want to formalize 😬
Something like this could follow the paragraph?:

When all demographic data is stored in phenotype/demographics.tsv, participants.tsv may serve primarily as a minimal listing of subject identifiers with only the participant_id column.

Contributor


I agree. It'd be good to mention this.

Put the phenotypic and assessment data content where it belongs.
- Added in a new guideline 7 to encourage the use of participants and sessions files for different uses.
- Re-numbered old guidelines 7-9 to 8-10.
Removing excess line I forgot to remove earlier. Thanks remark CI!
Contributor


I think this is added by accident in surchs@0eba71d @ericearl

Collaborator Author


Agreed. I will try to get it out of there.

Comment on lines 39 to 51
1. Each row MUST start with `participant_id`.

1. Each TSV file MUST contain a `session_id` column when
multiple [sessions](../glossary.md#session-entities)[^1] are present
in the data set regardless of whether those sessions are in
the `phenotype/` data, `sub-<label>/` data, or a combination of the two.

1. If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.

1. To encode the acquisition time for a measurement tool’s `session_id`,
add the `session_id` to the sessions file and
include the OPTIONAL `acq_time` column.
Contributor


Suggested change
1. Each row MUST start with `participant_id`.
1. Each TSV file MUST contain a `session_id` column when
multiple [sessions](../glossary.md#session-entities)[^1] are present
in the data set regardless of whether those sessions are in
the `phenotype/` data, `sub-<label>/` data, or a combination of the two.
1. If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
1. To encode the acquisition time for a measurement tool’s `session_id`,
add the `session_id` to the sessions file and
include the OPTIONAL `acq_time` column.
a. Each row MUST start with `participant_id`.
b. Each TSV file MUST contain a `session_id` column when
multiple [sessions](../glossary.md#session-entities)[^1] are present
in the data set regardless of whether those sessions are in
the `phenotype/` data, `sub-<label>/` data, or a combination of the two.
c. If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
d. To encode the acquisition time for a measurement tool’s `session_id`,
add the `session_id` to the sessions file and
include the OPTIONAL `acq_time` column.

Or just a regular list? I wouldn't nest one top-level enumeration in another.

Comment on lines 254 to 256
Optional: Yes

An aggregated sessions file CAN be provided at the dataset root.
Contributor


It's not really optional though, right? Or rather, it's only optional in the sense that you could choose the other option and make subject-level sessions.tsv files.

As mentioned above: I would take a stance here and make one of the two options the recommended one. To me that would be the root-level file. And then we can say a word on why we recommend the option.

## Sessions file

Template:
### Option 1: Segregated sessions files
Contributor


Can't comment on the main heading for Sessions file:

There is a lot of commonality between the root-level and subject-level sessions.tsv files, namely everything about what kind of information should go in them and what they are for. How about we pull that information up under the main heading, and then only explain the differences in the two option sections?

Comment on lines 277 to 299
`sessions.json` example:

```JSON
{
    "participant_id": {
        "Description": "Participant identifier"
    },
    "session_id": {
        "Description": "Session identifier for the session",
        "Levels": {
            "ses-predrug": "session before drug administration",
            "ses-postdrug": "session after drug administration",
            "ses-followup": "follow-up session"
        }
    },
    "acq_time": {
        "Description": "Acquisition time of the session"
    },
    "systolic_blood_pressure": {
        "Description": "Systolic blood pressure measured at the beginning of the session in mmHg"
    }
}
```
Contributor


As mentioned above: this should go under the Sessions File heading. Right now, the example table of which columns to put in a sessions.tsv is listed under the "segregated" option, but the data dictionary is under the "aggregated" option.
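
For illustration only, here is a minimal Python sketch of how the `Levels` entries in the `sessions.json` example above could be compared against the session labels actually used in a dataset. The function name `undocumented_sessions` and the label `ses-baseline` are hypothetical and not part of the BEP or the BIDS validator.

```python
import json


def undocumented_sessions(sessions_json_text, session_ids):
    """Return session labels that have no entry under session_id -> Levels."""
    dictionary = json.loads(sessions_json_text)
    levels = dictionary.get("session_id", {}).get("Levels", {})
    return sorted(set(session_ids) - set(levels))


# Data dictionary trimmed to the session_id field of the example above.
example = json.dumps({
    "session_id": {
        "Description": "Session identifier for the session",
        "Levels": {
            "ses-predrug": "session before drug administration",
            "ses-postdrug": "session after drug administration",
            "ses-followup": "follow-up session",
        },
    }
})

# "ses-baseline" is a made-up label used only to show a failing case.
missing = undocumented_sessions(example, ["ses-predrug", "ses-baseline"])
```

A check like this applies equally to a root-level or a subject-level sessions file, which is one reason to describe the data dictionary once under the shared heading.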

ericearl and others added 3 commits September 25, 2025 08:55
Added in easily-agreeable suggestions in a batch.

Co-authored-by: Sebastian Urchs <[email protected]>
Attempt to address more of @surchs comments.
Thanks for catching that excess newline, remark!
Remove acq_time as a phenotype column recommendation/option, as it should go into the sessions file instead.
ericearl and others added 3 commits September 30, 2025 06:15
Remove acq_time__phenotype from columns.yaml since it was removed from the rest of the schema.
Accept Sebastian's suggestion about the phrasing of guideline 8.

Co-authored-by: Sebastian Urchs <[email protected]>
@ericearl
Collaborator Author

@effigies @rwblair Here is a blurb for the community review period to make announcements easier. If edits are needed, I will apply them directly to this comment before tomorrow.


Community Review: BEP036 - Phenotypic Data Guidelines

We are pleased to announce the community review period for BIDS Extension Proposal (BEP) 036!

BEP036 extends the BIDS standard to include an appendix with 10 tabular phenotypic data guidelines you can opt into for the BIDS validator. We have developed the extension to allow everyone to follow good practices in preparing their tabular phenotypic data. Additionally, this BEP introduces the ability to include session_id as a second column in participants files and to aggregate sessions files to the root-level, allowing you to store longitudinal tabular data about participants and sessions, respectively, inside those files.

To view the file differences in either pull request, click the "Files changed" tab.

@effigies
Collaborator

effigies commented Oct 16, 2025

Encoding the acquisition time for a measurement tool’s session_id, is RECOMMENDED. This information MUST be stored in the sessions.tsv file at the root level of the dataset in the acq_time column.

This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that.

"if anyone uses sessions, everyone uses sessions."

This is extremely difficult to do without requiring a root-level /sessions.tsv to the exclusion of subject-level sub-<label>_sessions.tsv files. The reason is that sessions columns in phenotype are analyzed on their own. If we can depend on the presence or absence of sessions.tsv as an indication of whether there are any sessions in the dataset, then when we visit a phenotype file, we can check that length(columns.session_id) > 0 iff exists('/sessions.tsv'). Similarly when visiting a subject directory, we can check that length(subject.sessions.ses_dirs) > 0 iff exists('/sessions.tsv').
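
As a rough sketch of the "if and only if" check described above, assuming the phenotype file is available as TSV text and the existence of `/sessions.tsv` is already known. The helper names `session_ids_in` and `sessions_rule_holds` are hypothetical; the actual validator expresses such checks in the schema, not in Python.

```python
import csv
import io


def session_ids_in(tsv_text):
    """Return the values of the session_id column, or [] if the column is absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if "session_id" not in (reader.fieldnames or []):
        return []
    return [row["session_id"] for row in reader]


def sessions_rule_holds(tsv_text, root_sessions_exists):
    """True when session labels are present if and only if /sessions.tsv exists."""
    return (len(session_ids_in(tsv_text)) > 0) == root_sessions_exists


with_sessions = "participant_id\tsession_id\tscore\nsub-01\tses-01\t3\n"
without_sessions = "participant_id\tscore\nsub-01\t3\n"
```

The same comparison works for a subject directory by replacing the column lookup with a listing of its `ses-` directories.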

7. Use the sessions file at the root-level

If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

The bolded text is not doable in the current schema. This would need access to all the (subject, session) pairs in /sessions.tsv and in each phenotype file. I think it's tractable, but we will need to extend the validation context and implement those changes in the validator.
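
To make the required comparison concrete, here is a minimal Python sketch that collects the (subject, session) pairs from `/sessions.tsv` and from each phenotype file and reports pairs missing from the sessions file. It assumes every file has both `participant_id` and `session_id` columns; the function names are hypothetical and this is not how the validator is implemented.

```python
import csv
import io


def pairs(tsv_text):
    """Return the set of (participant_id, session_id) pairs in a TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["participant_id"], row["session_id"]) for row in reader}


def missing_from_sessions(sessions_tsv, phenotype_tsvs):
    """Return pairs used in any phenotype file but not listed in sessions.tsv."""
    listed = pairs(sessions_tsv)
    missing = set()
    for text in phenotype_tsvs:
        missing |= pairs(text) - listed
    return sorted(missing)


sessions = "participant_id\tsession_id\nsub-01\tses-01\nsub-01\tses-02\n"
tool_a = "participant_id\tsession_id\tscore\nsub-01\tses-01\t3\n"
tool_b = "participant_id\tsession_id\tscore\nsub-02\tses-01\t5\n"
```

The point of the sketch is that the check needs all files at once, which is the extension to the validation context mentioned above.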

10. Respect participant privacy when recording acquisition times

When needed to preserve participant privacy, you SHOULD record relative acquisition times with respect to the earliest session. Relative session acquisition times MAY be listed as durations from the earliest session (baseline) in days, months, or years using the acq_time column.

Unvalidatable and ambiguous. I think this should just piggy-back off of common principles:

Dates can be shifted by a random number of days for privacy protection reasons. To distinguish real dates from shifted dates, it is RECOMMENDED to set shifted dates to the year 1925 or earlier. Note that some data formats do not support arbitrary recording dates. [...] For longitudinal studies dates MUST be shifted by the same number of days within each subject to maintain the interval information. For example: 1867-06-15T13:45:30
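
A minimal Python sketch of the date-shifting rule quoted above, assuming ISO 8601 `acq_time` strings and a single offset chosen per subject by the caller (for example at random). The function name `shift_subject_dates` is hypothetical.

```python
from datetime import datetime, timedelta


def shift_subject_dates(acq_times, offset_days):
    """Shift all of one subject's dates back by the same number of days.

    Using one offset per subject preserves the intervals between sessions.
    Raises ValueError if any shifted date falls after the year 1925.
    """
    shifted = [
        datetime.fromisoformat(t) - timedelta(days=offset_days)
        for t in acq_times
    ]
    if any(d.year > 1925 for d in shifted):
        raise ValueError("offset too small: shifted dates must be in 1925 or earlier")
    return [d.isoformat() for d in shifted]


# Two sessions 30 days apart; the offset value is arbitrary.
shifted = shift_subject_dates(
    ["2020-01-01T10:00:00", "2020-01-31T10:00:00"], offset_days=40000
)
```

Because the shifted values remain valid datetimes, they can be checked with the same rules as any other `acq_time` entry, which is the advantage over free-form relative durations.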


Aggregate participant information across all sessions into one tabular TSV file per
measurement or phenotypic assessment and store this file in the `/phenotype` directory.
Demographic information is a special case and MUST be aggregated
Member


As of right now there are suggestions of what counts as demographic data. From a validation perspective, this is hard to enforce without specific field names being listed in the schema. My interpretation is that these are then to become forbidden columns in any pheno/*.tsv? Are there any other demographic fields we'd like to enforce that on beyond sex, age, gender, race, and household_income?

measurement or phenotypic assessment and store this file in the `/phenotype` directory.
Demographic information is a special case and MUST be aggregated
in the `participants.tsv` file at the root level of the dataset.
It is RECOMMENDED to use the `age` column in the `participants.tsv` file
Member


Theoretically we could validate that the appropriate age is used in each session based on the relative acq_times, if present, but I don't think it's worth the effort. Maybe checking for monotonically increasing age, like schema.rules.checks.mri.VolumeTimingNotMonotonicallyIncreasing, would be a compromise?
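
A sketch of what such a compromise check could look like in Python, assuming the age and acquisition time for each session have already been joined into `(participant_id, acq_time, age)` tuples. The function name is hypothetical, and the sketch tests for non-decreasing rather than strictly increasing age, on the assumption that two sessions close in time may share the same recorded age.

```python
def participants_with_decreasing_age(rows):
    """Return participant IDs whose age decreases between consecutive sessions.

    rows: iterable of (participant_id, acq_time, age) tuples, where acq_time
    is an ISO 8601 string so that string order matches chronological order.
    """
    by_participant = {}
    for participant_id, acq_time, age in rows:
        by_participant.setdefault(participant_id, []).append((acq_time, age))

    offenders = []
    for participant_id, visits in by_participant.items():
        visits.sort()  # order each participant's sessions by acquisition time
        ages = [age for _, age in visits]
        if any(later < earlier for earlier, later in zip(ages, ages[1:])):
            offenders.append(participant_id)
    return sorted(offenders)


rows = [
    ("sub-01", "1900-01-01T00:00:00", 30),
    ("sub-01", "1901-01-01T00:00:00", 31),
    ("sub-02", "1900-01-01T00:00:00", 40),
    ("sub-02", "1902-01-01T00:00:00", 39),
]
```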


### 3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool

Whenever possible, it is RECOMMENDED to add `MeasurementToolMetadata` to
Member


Not an issue for this BEP: in this and in the main phenotype article, it's implied that every TSV in the phenotype directory is a "Measurement Tool", but it is never explicitly stated that this is the only kind of TSV. It gave me pause when reviewing this, but it may be obvious to everyone else.

Comment on lines +54 to +55
- If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
Member


Suggested change
- If more than one of the same measurement tool is acquired within
the same `session_id`, a `run_id` column MUST be added.
- If a measurement tool is acquired multiple times within a single session, a `run_id` column must be added to disambiguate the separate acquisitions.

Note: This MUST is implicitly enforced by the combined index columns for phenotype TSV files. If multiple results are acquired for the same subject and session with no run_id column, the index check will error out.
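
To illustrate the implicit enforcement, here is a minimal Python sketch of a combined-index uniqueness check, assuming the index is formed from whichever of `participant_id`, `session_id`, and `run_id` are present in the file. The function name `duplicate_index_rows` is hypothetical and the validator's own index check may differ in detail.

```python
import csv
import io
from collections import Counter


def duplicate_index_rows(tsv_text):
    """Return index tuples that occur more than once in a phenotype TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    present = reader.fieldnames or []
    index_cols = [
        col for col in ("participant_id", "session_id", "run_id") if col in present
    ]
    counts = Counter(tuple(row[col] for col in index_cols) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# Same tool acquired twice in one session, first without and then with run_id.
no_run = (
    "participant_id\tsession_id\tscore\n"
    "sub-01\tses-01\t3\n"
    "sub-01\tses-01\t4\n"
)
with_run = (
    "participant_id\tsession_id\trun_id\tscore\n"
    "sub-01\tses-01\trun-1\t3\n"
    "sub-01\tses-01\trun-2\t4\n"
)
```

Without `run_id` the two rows collide on the same index, so the repeated acquisition is reported; adding `run_id` makes each row unique.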

Comment on lines +57 to +59
- Encoding the acquisition time for a measurement tool’s `session_id`,
is RECOMMENDED. This information MUST be stored in the `sessions.tsv`
file at the root level of the dataset in the `acq_time` column.
Member


@effigies mentioned this as "This is logically equivalent to "the acq_time column MUST NOT appear in a phenotype TSV file", but it takes some thinking about to get there. The spec should just say that."

I agree with the explicit "MUST NOT", but it also goes a step further in enforcing a root-level sessions.tsv.

The combination of values in the `participant_id`, `session_id`, and `run_id` (if present)
columns MUST be unique for the entire tabular file.

### 5. Store demographic data in the participants file and instrument data in the phenotype directory
Member


This is mentioned in "### 1. Aggregate data across sessions". Moving the two closer together or combining them would be nice.

Create one tabular file for each instrument
in the phenotypic and assessment data directory.

### 6. Record participant properties in the participants file and session properties in the sessions file
Member


Phenotypes aren't properties of participants? ;)

https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/data-summary-files.html#sessions-file
For pathology, it states: "When different from healthy, pathology SHOULD be specified."

Should this entry be taken as overriding the main spec, and this field should go in participants.tsv instead?

Properties of participants MAY include things like
age, sex, race, or household income.
Properties of sessions MAY include things like
acquisition time, measurement device properties,
Member


As with participant properties, any way to explicitly list what counts as session-appropriate metadata in the schema will make enforcing these rules easier and allow for making strong requirement claims. There is much less consistency here, so it may not be feasible.

