How to split large datasets

While working on the Microscopy BEP, it was brought to our attention that some very large microscopy datasets need to be split across different folders, for example because of limitations or performance issues with large files or a large number of files in a single repository.

I was wondering if this issue has come up in BIDS in the past and if there is an official mechanism for dealing with such situations?

Here is an example to illustrate my thoughts.
In this example, one subject (`sub-01`) has 2000 samples (`sample-0001` to `sample-2000`), and each sample has 20 chunks (`chunk-01` to `chunk-20`), as illustrated below:
```
dataset
└── sub-01
     └── microscopy
            ├── sub-01_sample-0001_chunk-01_BF.tif
            ├── sub-01_sample-0001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-0001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-2000_chunk-01_BF.tif
            ├── sub-01_sample-2000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-2000_chunk-20_BF.tif
```
Let’s say that the dataset needs to be split in two. I would suggest putting the first 1000 samples in one dataset (`dataset-01`) and samples 1001 to 2000 in another dataset (`dataset-02`), as follows:

```
dataset-01
└── sub-01
     └── microscopy
            ├── sub-01_sample-0001_chunk-01_BF.tif
            ├── sub-01_sample-0001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-0001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-1000_chunk-01_BF.tif
            ├── sub-01_sample-1000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-1000_chunk-20_BF.tif

dataset-02
└── sub-01
     └── microscopy
            ├── sub-01_sample-1001_chunk-01_BF.tif
            ├── sub-01_sample-1001_chunk-02_BF.tif
            ├── ...
            ├── sub-01_sample-1001_chunk-20_BF.tif
            ├── ...
            ├── sub-01_sample-2000_chunk-01_BF.tif
            ├── sub-01_sample-2000_chunk-02_BF.tif
            ├── ...
            └── sub-01_sample-2000_chunk-20_BF.tif
```
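To make the rule explicit, here is a minimal sketch of how files could be assigned to the split datasets by sample index. The helper names (`assign_dataset`, `plan_split`) and the `samples_per_dataset` parameter are hypothetical and only for illustration; this is not an existing BIDS tool, and it only plans the split without moving any files.

```python
import re
from collections import defaultdict

# Matches the sample entity in file names like the ones in the example above.
SAMPLE_RE = re.compile(r"_sample-(\d+)_")


def assign_dataset(filename, samples_per_dataset=1000):
    """Return the target dataset folder name (e.g. 'dataset-01') for a file.

    Hypothetical helper: the 'dataset-XX' naming follows the example tree.
    """
    match = SAMPLE_RE.search(filename)
    if match is None:
        raise ValueError(f"no sample entity in {filename!r}")
    sample_index = int(match.group(1))
    # Samples 1..1000 go to dataset 1, samples 1001..2000 to dataset 2, etc.
    dataset_index = (sample_index - 1) // samples_per_dataset + 1
    return f"dataset-{dataset_index:02d}"


def plan_split(filenames, samples_per_dataset=1000):
    """Group file names by target dataset without touching the file system."""
    plan = defaultdict(list)
    for name in filenames:
        plan[assign_dataset(name, samples_per_dataset)].append(name)
    return dict(plan)


# A few file names from the example, covering both sides of the boundary.
files = [
    f"sub-01_sample-{sample:04d}_chunk-{chunk:02d}_BF.tif"
    for sample in (1, 1000, 1001, 2000)
    for chunk in (1, 20)
]
plan = plan_split(files)
```

All chunks of a given sample end up in the same dataset, since the assignment depends only on the sample index.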
Would that splitting method make sense with BIDS?
And in a case like this, is there a way to "link" the two datasets together, in `dataset_description.json` for example?
Thank you!

How to split large datasets #935
