Improve spec/validation of participants.tsv #458

Open
@sappelhoff

TLDR

In the participants.tsv file, the age and sex columns are sometimes not well defined, and this leads to (unnecessary) issues on the side of tool developers (and thus eventually the users). We should improve either the spec or the validator, or both.

cc @jasmainak @agramfort @adam2392 @hoechenberger

came up in: mne-tools/mne-bids#396

Intro

The specification says the following about the participants.tsv file:

In case of single session studies this file has one compulsory column participant_id that consists of sub-, followed by a list of optional columns describing participants.

So, strictly speaking, all columns other than participant_id are OPTIONAL, and thus SHOULD be described in an accompanying participants.json.

For optional columns that are not described, the validator currently emits a warning such as this:

1: [WARN] Tabular file contains custom columns not described in a data dictionary (code: 82 - CUSTOM_COLUMN_WITHOUT_DESCRIPTION)
  ./participants.tsv
    Evidence: Columns: group not defined, please define in: /participants.json

Yet, the validator treats some "optional" columns differently, i.e., these columns are accepted WITHOUT warning. Examples of these are:

  • age
  • sex

However, the specification does not state anywhere that these two columns are "expected optional columns". The expected behavior would therefore be to raise the same warning for age and sex when they are not described.
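To make the inconsistency concrete, here is a minimal Python sketch of the check as I understand it. This is NOT the validator's actual code (the validator is written in JavaScript, and I have not located the relevant logic); the function name `undescribed_columns` and the `exempt` parameter are made up for illustration.

```python
import csv
import io


def undescribed_columns(tsv_text, data_dictionary, exempt=frozenset()):
    """Return TSV header columns that have no entry in the data dictionary.

    `exempt` is a hypothetical stand-in for columns that are silently accepted.
    """
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return [
        col
        for col in header
        if col != "participant_id"       # the one compulsory column
        and col not in exempt            # columns accepted without warning
        and col not in data_dictionary   # columns described in participants.json
    ]


tsv = "participant_id\tage\tsex\tgroup\nsub-05\t25\tfem\tcontrol\n"

# Behavior described in this issue: age and sex slip through silently.
current = undescribed_columns(tsv, {}, exempt={"age", "sex"})

# Behavior the spec text implies: every undescribed optional column is reported.
expected = undescribed_columns(tsv, {})
```

With an empty participants.json, `current` only contains `group`, whereas `expected` also contains `age` and `sex`.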

I could not pin down the exact part of the validator that is responsible for this behavior, but it may be this line:

https://github.com/bids-standard/bids-validator/blob/dfabbfb058daca406ed1d0897c3a25be059a5ad6/bids-validator/utils/summary/collectSubjectMetadata.js#L31

perhaps @nellh or @rwblair can help

The problem

The issue that arises from this (apart from inconsistency) is that users define their own levels for the sex column, and are NOT reminded by the validator to define these levels in a participants.json.

As a result, these values are hard (or impossible) to parse by software.

E.g., we may have the following participants.tsv:

participant_id	age	sex
sub-05	25	fem
sub-06	30	ma
sub-07	26	ma

what's fem? what's ma?
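For comparison, this is roughly what a participants.json would need to contain so that software can decode those values. A minimal sketch in Python: the `Description`, `Units`, and `Levels` keys are the data dictionary fields from the spec, but the meanings I assign to `fem` and `ma` are only my guess (which is exactly the problem), and `decode_sex` is a hypothetical helper.

```python
import json

# Assumed content of participants.json; the level meanings are guesses.
sidecar = {
    "age": {"Description": "Age of the participant", "Units": "years"},
    "sex": {
        "Description": "Sex of the participant",
        "Levels": {"fem": "female", "ma": "male"},
    },
}

# Text that would be written to participants.json.
participants_json = json.dumps(sidecar, indent=2)


def decode_sex(value, sidecar):
    """Look up a raw TSV value in the Levels mapping; None if it is undefined."""
    return sidecar["sex"]["Levels"].get(value)
```

Without such a file, a tool has nothing to look `fem` up in and can only guess.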

How to fix?

I think we should do one of the following:

  1. fix the validator so that it emits a warning if age and sex are columns in participants.tsv but have no description in an accompanying participants.json

OR

  2. Amend the participants.tsv part of the specification and explicitly say that age and sex are "to-be-expected" columns ... and then also define the expected inputs:
  • age MUST be a float (years since birth)
    • if a user wants to specify age differently, they must make their own custom column, e.g. age_in_months
  • sex MUST be a string (here we need to discuss which strings we accept. The most straightforward choice would perhaps be "male", "female", "undefined", "other", but I would like somebody with more experience in inclusive language to make a suggestion here)
    • again: if a user wants to do their own sex column they can make their own custom column with a wide range of acceptable factor levels
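If we went with the second option, the check could look something like the sketch below. It is only an illustration in Python, not a proposal for the actual validator code: `ALLOWED_SEX` uses the placeholder strings from above (still up for discussion), `check_participants` is a made-up name, and I assume that `n/a` stays allowed as the marker for missing values.

```python
import csv
import io

# Placeholder set taken from the suggestion above; the final list is undecided.
ALLOWED_SEX = {"male", "female", "undefined", "other"}
MISSING = "n/a"  # assumed to remain valid for missing values


def check_participants(tsv_text):
    """Return a list of human-readable problems found in the age and sex columns."""
    problems = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        pid = row.get("participant_id", "?")

        age = row.get("age")
        if age is not None and age != MISSING:
            try:
                float(age)  # age MUST be a float (years since birth)
            except ValueError:
                problems.append(f"{pid}: age {age!r} is not a number")

        sex = row.get("sex")
        if sex is not None and sex != MISSING and sex not in ALLOWED_SEX:
            problems.append(f"{pid}: sex {sex!r} is not an accepted value")
    return problems


example = "participant_id\tage\tsex\nsub-05\t25\tfem\nsub-06\t30\tma\nsub-07\t26\tma\n"
problems = check_participants(example)
```

Run on the example file from "The problem", this reports one problem per row for the sex column and none for age.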
