How to split large datasets #935

Open
@mariehbourget

While working on the Microscopy BEP, it was brought to our attention that some very large microscopy datasets sometimes need to be split across different folders, for example because of size limits or performance issues with large files or a large number of files in a single repository.

I was wondering if this issue has come up in BIDS in the past and if there is an official mechanism for dealing with such situations?

Here is an example to illustrate my thoughts.
In this example, one subject (sub-01) has 2000 samples (sample-0001 to sample-2000), and each sample has 20 chunks (chunk-01 to chunk-20), as illustrated below:

dataset
└── sub-01
     └── microscopy
            ├── sub-01_sample-0001_chunk-01_BF.tif
            ├── sub-01_sample-0001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-0001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-2000_chunk-01_BF.tif
            ├── sub-01_sample-2000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-2000_chunk-20_BF.tif

Let’s say that the dataset needs to be split in two. I would suggest putting the first 1000 samples in one dataset (dataset-01) and samples 1001 to 2000 in another dataset (dataset-02), as follows:

dataset-01
└── sub-01
     └── microscopy
            ├── sub-01_sample-0001_chunk-01_BF.tif
            ├── sub-01_sample-0001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-0001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-1000_chunk-01_BF.tif
            ├── sub-01_sample-1000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-1000_chunk-20_BF.tif

dataset-02
└── sub-01
     └── microscopy
            ├── sub-01_sample-1001_chunk-01_BF.tif
            ├── sub-01_sample-1001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-1001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-2000_chunk-01_BF.tif
            ├── sub-01_sample-2000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-2000_chunk-20_BF.tif
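
To make the proposed rule concrete, here is a minimal Python sketch of how files could be assigned to the two datasets. It is only an illustration of the idea above, not an existing BIDS tool: the function names `target_dataset` and `split_plan`, the `samples_per_dataset` parameter, and the `dataset-XX` naming are assumptions made for this example.

```python
import re

# Matches the numeric index of the "sample" entity, e.g. "_sample-0001_".
SAMPLE_RE = re.compile(r"_sample-(\d+)_")
# Matches the subject label at the start of the file name, e.g. "sub-01".
SUBJECT_RE = re.compile(r"^(sub-[^_]+)_")


def target_dataset(filename: str, samples_per_dataset: int = 1000) -> str:
    """Return the (hypothetical) dataset folder name for one file."""
    match = SAMPLE_RE.search(filename)
    if match is None:
        raise ValueError(f"no sample entity in {filename!r}")
    sample = int(match.group(1))
    # Samples 1..1000 go to dataset-01, 1001..2000 to dataset-02, and so on.
    index = (sample - 1) // samples_per_dataset + 1
    return f"dataset-{index:02d}"


def split_plan(filenames, samples_per_dataset: int = 1000) -> dict:
    """Group file names by target dataset, keeping the sub-XX/microscopy layout."""
    plan = {}
    for name in filenames:
        subject = SUBJECT_RE.match(name)
        if subject is None:
            raise ValueError(f"no subject entity in {name!r}")
        dataset = target_dataset(name, samples_per_dataset)
        path = f"{dataset}/{subject.group(1)}/microscopy/{name}"
        plan.setdefault(dataset, []).append(path)
    return plan


if __name__ == "__main__":
    files = [
        "sub-01_sample-0001_chunk-01_BF.tif",
        "sub-01_sample-1000_chunk-20_BF.tif",
        "sub-01_sample-1001_chunk-01_BF.tif",
        "sub-01_sample-2000_chunk-20_BF.tif",
    ]
    for dataset, paths in split_plan(files).items():
        print(dataset, paths)
```

The sketch only computes a plan (target paths); actually moving or copying the files, and keeping all chunks of one sample together, follows from the fact that the assignment depends on the sample index alone.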

Would that splitting method make sense with BIDS?
And in a case like this, is there a way to "link" the two datasets together, in dataset_description.json for example?
Thank you!
