Description
TLDR
In the participants.tsv file, the age and sex columns are sometimes not well defined, and this leads to (unnecessary) issues on the side of tool developers (and thus eventually the users). We should improve either the spec or the validator, or both.
cc @jasmainak @agramfort @adam2392 @hoechenberger
came up in: mne-tools/mne-bids#396
Intro
The specification says the following about the participants.tsv file:
In case of single session studies this file has one compulsory column participant_id that consists of sub-, followed by a list of optional columns describing participants.
So strictly speaking, all columns that are not participant_id are OPTIONAL, and thus SHOULD be described in an accompanying participants.json.
For optional columns that are not described, the validator currently emits a warning such as this:
1: [WARN] Tabular file contains custom columns not described in a data dictionary (code: 82 - CUSTOM_COLUMN_WITHOUT_DESCRIPTION)
./participants.tsv
Evidence: Columns: group not defined, please define in: /participants.json
Yet, the validator treats some "optional" columns differently, i.e., these columns are accepted WITHOUT warning. Examples of these are:
- age
- sex
However, the specification does not cover that these two variables are "expected optional columns". The expected behavior would be to raise a warning also for age and sex.
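To make the expected behavior concrete, here is a minimal sketch in Python. It is not the validator's actual implementation (which I could not locate); the function name `undescribed_columns` and the example data are hypothetical. The idea is simply that every column other than participant_id is checked against the data dictionary, with no special case for age or sex.

```python
import csv
import io


def undescribed_columns(tsv_text, sidecar):
    """Return TSV header columns that have no entry in the sidecar dict.

    Hypothetical helper for illustration only, not part of the validator.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # participant_id is the only compulsory column; every other column
    # is optional and therefore needs a description in participants.json.
    return [col for col in header
            if col != "participant_id" and col not in sidecar]


# Made-up example data: only "group" is described in the data dictionary.
tsv_text = "participant_id\tage\tsex\tgroup\nsub-05\t25\tfem\tcontrol\n"
sidecar = {"group": {"Description": "experimental group"}}

missing = undescribed_columns(tsv_text, sidecar)
print(missing)  # ['age', 'sex']
```

With this logic, age and sex would trigger the same CUSTOM_COLUMN_WITHOUT_DESCRIPTION warning as any other undescribed column.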
I could not pin down the exact part of the validator that is responsible for this behavior, but it may be this line:
perhaps @nellh or @rwblair can help
The problem
The issue that arises from this (apart from inconsistency) is that users define their own levels for the sex column, and are NOT reminded by the validator to define those levels in a participants.json.
As a result, these values are hard (or impossible) to parse by software.
E.g., we may have the following participants.tsv:
participant_id age sex
sub-05 25 fem
sub-06 30 ma
sub-07 26 ma
What's fem? What's ma?
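For comparison, this is roughly what a participants.json describing these columns could look like, written here as a Python dict and serialized with the standard json module. The "Description", "Units" and "Levels" keys follow the BIDS convention for data dictionaries; the meanings assigned to fem and ma are my guesses, which is exactly the problem when no such file exists.

```python
import json

# Sketch of a data dictionary for the table above.
# The level meanings are ASSUMED for illustration; the dataset author
# is the only one who actually knows what "fem" and "ma" stand for.
sidecar = {
    "age": {
        "Description": "age of the participant",
        "Units": "years",
    },
    "sex": {
        "Description": "sex of the participant",
        "Levels": {
            "fem": "female",  # assumed meaning
            "ma": "male",     # assumed meaning
        },
    },
}

sidecar_text = json.dumps(sidecar, indent=4)
print(sidecar_text)

# With the levels defined, software can resolve the raw values.
resolved = sidecar["sex"]["Levels"]["ma"]
```

Once the levels are spelled out like this, a tool no longer has to guess how to interpret the raw strings in the sex column.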
How to fix?
I think we should do one of the following:
- fix the validator so that it emits a warning if age and sex are columns in participants.tsv but have no description in an accompanying participants.json
OR
- Amend the participants.tsv part of the specification and explicitly say that age and sex are "to-be-expected" columns ... and then also define the expected inputs:
  - age MUST be a float (years since birth)
    - if a user wants to specify age differently, they must make their own custom column, e.g. age_in_months
  - sex MUST be a string (here we need to discuss which strings we accept. Most straightforward would perhaps be "male", "female", "undefined", "other", but I would like somebody with a bit more experience in inclusive language to make a suggestion here.)
    - again: if a user wants to define their own sex column, they can make their own custom column with a wide range of acceptable factor levels
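If we went with the second option, a check along these lines would become possible. This is only a sketch under assumptions: `check_row` and `ALLOWED_SEX` are hypothetical names, the set of accepted strings is the placeholder list from above and still up for discussion, and I treat "n/a" as a missing value that is skipped.

```python
# Placeholder levels taken from the proposal above; NOT a settled list.
ALLOWED_SEX = {"male", "female", "undefined", "other"}


def check_row(row):
    """Return a list of problems for one participants.tsv row (a dict).

    Hypothetical helper sketching the proposed rules, not validator code.
    """
    problems = []

    age = row.get("age", "n/a")
    if age != "n/a":  # missing values are skipped
        try:
            float(age)  # proposal: age MUST be a float (years since birth)
        except ValueError:
            problems.append(f"age {age!r} is not a float")

    sex = row.get("sex", "n/a")
    if sex != "n/a" and sex not in ALLOWED_SEX:
        problems.append(f"sex {sex!r} is not an accepted level")

    return problems


# The first row of the example table fails on sex but passes on age.
print(check_row({"participant_id": "sub-05", "age": "25", "sex": "fem"}))
print(check_row({"participant_id": "sub-08", "age": "26.5", "sex": "male"}))
```

Custom columns such as age_in_months would not be covered by this check and would instead fall under the usual "describe it in participants.json" rule.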