Description
Your idea
BIDS has generally followed the convention of adopting human-readable or widely-adopted standards for its files. At 1.0, we used .tsv for all tabular files except physiological and stimulus recordings, which use a headerless .tsv.gz format. In 1.9, we added a headerless motion.tsv file, which is quite large. The eye-tracking BEP (#1128) is underway, which is having to cope with some limitations in the TSV options.
As of 2024, the Apache Parquet format has seen over a decade of development. The format specification is open, and there is a project (Arrow) that includes native libraries or bindings for Python, MATLAB, R, Julia, Java, JavaScript and C, among others.
For data that do not benefit from human readability (TSV files > ~1k lines), Parquet offers advantages such as typed columns, chunked compression, and no round-trips between floating-point and ASCII decimal representations.
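To make the round-trip point concrete, here is a minimal sketch using only the Python standard library. It does not use Parquet itself: packing a value as a raw 8-byte double stands in for any binary column format, and the six-decimal formatting is an assumed, typical choice for a TSV cell rather than anything BIDS prescribes.

```python
import struct


def text_round_trip(value: float, decimals: int = 6) -> float:
    """Format a float with a fixed number of decimals (like a TSV cell) and parse it back."""
    return float(f"{value:.{decimals}f}")


def binary_round_trip(value: float) -> float:
    """Store a float as 8 raw little-endian bytes (IEEE 754 double) and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


sample = 0.1 + 0.2  # stored as 0.30000000000000004 in float64

# Truncated decimal text does not reproduce the original value.
print(text_round_trip(sample) == sample)    # False

# The binary representation reproduces it bit for bit.
print(binary_round_trip(sample) == sample)  # True

# Text can be made lossless, but only by writing every significant digit.
print(float(repr(sample)) == sample)        # True
print(len(repr(sample)), "ASCII bytes versus 8 binary bytes")
```

Lossless text is possible, as the `repr` line shows, but it costs more than twice the bytes per value before compression, which is the trade-off the paragraph above describes.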
I propose the following:
- Allow .parquet files anywhere that a TSV or TSV-GZ file is currently permitted.
- RECOMMEND to use .tsv for high-level metadata tables, such as participants.tsv, *_sessions.tsv and *_scans.tsv, as well as *_channels.tsv, *_electrodes.tsv and similar metadata files.
- Requirements on column orderings, types, uniqueness should be unchanged.
This is pulled out of #197, which is about N-dimensional data. I am excerpting the relevant recent posts here:
it may be good to revive this discussion as i'm seeing a few upcoming use cases that will require a more sophisticated consideration for many things that are now in TSVs.
here is a temporary proposal to narrow down the conversation.
- apache parquet for table like formats (the reason i'm separating this out is that there are significant efficiencies in not considering this a subset of n-d array).
...
I am +1 for parquet to be adopted for any TSV data files (physio, stim, motion, blood). It's an open spec with broad implementation and readily available command-line tools for inspection. I think it should probably be discouraged if not prohibited for metadata files (participants.tsv, samples.tsv, sessions.tsv and scans.tsv, electrodes.tsv, channels.tsv), which benefit from human readability. I think it will often be a poor choice for events.tsv, but I wouldn't rule it out.
I am not sure that there is an actual "to-do" here for N-dimensional named arrays except to adopt them in principle so that a BEP that needs this structure can use it. I do not think there is any call to allow an events.zarr file with 2D onsets or 3D durations. HDF5 and Zarr are both already present in NWB, SNIRF and OME-Zarr.
Another +1 for the usage of parquet for tabular data, e.g. physio, stim, motion, etc. I like to call these types of data "measurements" and call e.g. participants.tsv, electrodes.tsv, etc. "records." The current TSVs have some problems that are limiting for measurements:
- you need to truncate decimals which means you lose precision
- they are very space-inefficient
- you don't have direct/random access to data
Gzipping the TSVs doesn't really solve any of these issues. Parquet is more performant in read, write, and storage volume, and is an open standard with large cross-platform support.
We are looking at adopting BIDS for neurophysiology applications. Without a binary-style filetype option, we would need to convert our efficient data storage solutions into TSV which is a much less efficient/performant file type than the current solution. Being able to use parquet for physio etc. would make me much more comfortable with adopting BIDS.