Code repository for the paper "Across-subject ensemble-learning alleviates the need for large samples for fMRI decoding" by Himanshu Aggarwal, Liza Al-Shikhley and Bertrand Thirion.
A preprint version of this work can be found here: https://hal.science/hal-04636523
Clone this repository:

```bash
git clone git@github.com:man-shu/ensemble-fmri.git
cd ensemble-fmri
```

Download the downsampled 3 mm event-wise GLM effect-size maps of the fMRI datasets. It is only ~7 GB in size. Yes, it is indeed that small:

```bash
wget https://zenodo.org/records/12204275/files/data.zip
unzip data.zip -d data
```

Create a new conda environment with the dependencies:
```bash
conda env create -f env/ensemblepy.yml
conda activate ensemblepy
```

We used a conditional permutation importance method to calculate the importance scores, as provided in this package and explained in this paper. I have modified the code to work with the ensemble setting. To install the package, run:
```bash
git clone git@github.com:man-shu/Variable_Importance.git
cd Variable_Importance/BBI_package
pip install .
cd ../..
```
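For intuition, the snippet below sketches plain (non-conditional) permutation importance with scikit-learn on synthetic data: shuffle one feature at a time and measure the drop in test accuracy. The BBI package installed above implements the conditional variant actually used here, so treat this only as an illustration of the general idea.

```python
# Minimal sketch of (non-conditional) permutation importance with scikit-learn.
# The BBI package above implements the conditional variant used in the paper;
# this toy example only illustrates the general idea on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(dual="auto").fit(X_train, y_train)

# Shuffle each feature in turn and record the drop in test accuracy.
result = permutation_importance(clf, X_test, y_test, n_repeats=30, random_state=0)
print(result.importances_mean)
```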
We also compared the ensemble approach with the conventional Graph Convolutional Network (GCN) classifier. The GCN implementation used is inspired by this tutorial. It needs the following dependencies:

```bash
pip install torch torch-geometric
```
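For orientation, here is a minimal two-layer GCN classifier built with `torch-geometric`'s `GCNConv`. This is only a generic sketch in the spirit of that tutorial; the exact architecture and graph construction are in `scripts/compare_with_gcn.py`.

```python
# Generic two-layer GCN classifier with torch-geometric (a sketch, not the exact
# architecture used in scripts/compare_with_gcn.py).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool


class SimpleGCN(torch.nn.Module):
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(n_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.lin = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        # Two rounds of graph convolution over the connectivity graph,
        # then average node embeddings into one vector per sample.
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)
        return self.lin(x)


# Example forward pass on a toy graph with 10 nodes and 5 features per node.
x = torch.randn(10, 5)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]], dtype=torch.long)
batch = torch.zeros(10, dtype=torch.long)
model = SimpleGCN(n_features=5, n_classes=3)
logits = model(x, edge_index, batch)
```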
All the scripts expect event-wise GLM effect-size maps in the `data` directory. The data structure should be as follows:

```
data/neuromod
└── 3mm
    ├── sub-01.nii.gz
    ├── sub-01_labels.csv
    ├── sub-01_runs.csv
    ├── sub-02.nii.gz
    ├── sub-02_labels.csv
    ├── sub-02_runs.csv
    ├── sub-03.nii.gz
    ├── sub-03_labels.csv
    ├── sub-03_runs.csv
    ├── sub-05.nii.gz
    ├── sub-05_labels.csv
    └── sub-05_runs.csv
```
- Under `data`, you need to create a sub-directory with the name of your dataset, e.g. `neuromod` in this case.
- This sub-directory should contain another sub-directory named after the resolution of the data, e.g. `3mm` in this case. Note that we downsampled all our data to 3 mm resolution because it is computationally expensive to work with full-resolution data.
- The `3mm` directory should contain the effect-size maps for each subject, in NIfTI format. The filenames should follow the format `<subject_id>.nii.gz`, e.g. `sub-01.nii.gz` in this case. The subject_id itself can be anything as long as it is unique for each file. Each volume in this 4D NIfTI file should correspond to an event in the task, with all runs concatenated in this one file.
- The `3mm` directory should also contain the labels for each subject, in a CSV file named `<subject_id>_labels.csv`, e.g. `sub-01_labels.csv` in this case. The CSV file should have one column without any header, containing the label for each event/volume in the NIfTI file (a sanity-check sketch follows this list).
- Finally, you just need to add the name of your dataset (the sub-directory you just created under `data`) in all the scripts. For example, in `scripts/vary_train_size.py`, you need to add your dataset's name to the `datas` list as follows:

  ```python
  # in lines 42-48
  datas = [
      "neuromod",
      "forrest",
      "rsvp",
      "bold",
      "aomic_anticipation",
      "your_dataset_name",
  ]
  ```
You can even remove the other datasets if you are only interested in your dataset.
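Before running the scripts, you can sanity-check that each subject's 4D map and labels file line up. The snippet below is only a suggestion; it assumes `nibabel` and `pandas` are available, and `your_dataset_name` is a hypothetical dataset directory.

```python
# Optional sanity check: every subject's 4D NIfTI should have exactly as many
# volumes as there are rows in its labels CSV. "your_dataset_name" is a
# hypothetical dataset directory; adjust the paths to your own layout.
from pathlib import Path

import nibabel as nib
import pandas as pd

data_dir = Path("data") / "your_dataset_name" / "3mm"

for nifti_path in sorted(data_dir.glob("*.nii.gz")):
    subject_id = nifti_path.name.replace(".nii.gz", "")
    labels_path = data_dir / f"{subject_id}_labels.csv"

    n_volumes = nib.load(nifti_path).shape[-1]
    n_labels = len(pd.read_csv(labels_path, header=None))

    status = "OK" if n_volumes == n_labels else "MISMATCH"
    print(f"{subject_id}: {n_volumes} volumes, {n_labels} labels -> {status}")
```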
The section Reproduce the results in the paper explains what each script does and how to run it.
If you want to know how to get event-wise GLM effect size maps, you can follow this tutorial. Note that the method to compute event-wise GLM effect size maps is referred to as Least Squares Separation (LSS) in the tutorial.
We did the same for the AOMIC dataset using the `scripts/glm/glm_anticipation.py` script. So once you have downloaded the AOMIC dataset from here, you can update the paths in the script and run it to get the event-wise GLM effect-size maps. Note that you only need the files matching the `*task-anticipation*` wildcard from the AOMIC dataset.
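To make the LSS idea concrete, here is a rough nilearn sketch: each trial is fit in its own GLM, with that trial as one condition and all remaining trials lumped into another, and the trial's effect-size map is kept. Names such as `fmri_img`, `events`, and `t_r` are placeholders; the actual AOMIC pipeline is in `scripts/glm/glm_anticipation.py`.

```python
# Rough Least Squares Separation (LSS) sketch with nilearn: one GLM per trial,
# where the trial of interest gets its own regressor and all other trials share
# one. `fmri_img`, `events` (columns: onset, duration, trial_type) and `t_r`
# are placeholders for your own data.
from nilearn.glm.first_level import FirstLevelModel


def lss_effect_maps(fmri_img, events, t_r):
    effect_maps = []
    for trial in range(len(events)):
        lss_events = events.reset_index(drop=True).copy()
        lss_events["trial_type"] = "other"
        lss_events.loc[trial, "trial_type"] = "trial_of_interest"

        model = FirstLevelModel(t_r=t_r)
        model.fit(fmri_img, events=lss_events)

        # Effect-size (beta) map for this single trial.
        effect_maps.append(
            model.compute_contrast("trial_of_interest", output_type="effect_size")
        )
    return effect_maps
```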
To generate the numbers plotted in Fig 2 and Fig 3 (over varying training sizes) in the paper:

- using the data in the `data` directory, saving the results in the `results` directory, with 20 parallel jobs and DiFuMo features (see the sketch after this list), run:

  ```bash
  python scripts/vary_train_size.py data results 20 difumo
  ```

- or with full voxel-space features, run:

  ```bash
  python scripts/vary_train_size.py data results 20 voxels
  ```
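For context, the `difumo` option reduces each effect-size map to coefficients on the DiFuMo probabilistic atlas. The snippet below is a generic nilearn sketch of that reduction; the atlas dimension shown (1024) and the masker settings are assumptions, so check the scripts for the configuration actually used.

```python
# Generic sketch of reducing 4D effect-size maps to DiFuMo features with nilearn.
# The atlas dimension (1024) and masker settings are assumptions; check the
# scripts for the configuration actually used in the paper.
from nilearn.datasets import fetch_atlas_difumo
from nilearn.maskers import NiftiMapsMasker

difumo = fetch_atlas_difumo(dimension=1024, resolution_mm=3)
masker = NiftiMapsMasker(maps_img=difumo.maps).fit()

# X has shape (n_events, 1024): one row per event-wise effect-size map.
X = masker.transform("data/neuromod/3mm/sub-01.nii.gz")
```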
To generate the numbers plotted in Fig 4 (over varying numbers of subjects in the ensemble) in the paper:

- using the data in the `data` directory, saving the results in the `results` directory, with 20 parallel jobs and DiFuMo features, run:

  ```bash
  python scripts/vary_n_subs.py data results 20 difumo
  ```

- or with full voxel-space features, run:

  ```bash
  python scripts/vary_n_subs.py data results 20 voxels
  ```
For computing the importance scores in the ensemble setting, using DiFuMo features and the RandomForest classifier (plotted in Supplementary Fig. 2 in the paper), run:

```bash
python scripts/feat_imp.py data results 20
```

To use the pretraining strategy with an L1-penalized Linear SVC (the results plotted in Supplementary Figure 2), you can change line 175 under the `pretrain` function in `utils.py` from:

```python
clf = LinearSVC(dual="auto")
```

to:

```python
clf = LinearSVC(dual="auto", penalty="l1")
```

For comparing the ensemble approach with the GCN approach, run:
```bash
python scripts/compare_with_gcn.py data results 20 gcn/param_grid.json
```

To plot Figures 2 and 3 in the paper, run:

```bash
python scripts/plotting/plot_fig2_fig3.py
```

To plot Figure 4 in the paper, run:

```bash
python scripts/plotting/plot_fig4.py
```

To plot the feature importance scores, run the following command. This will plot the feature importance scores on surfaces and glass brains for each subject in each dataset. The plots will be saved in the `plots/feat_imp` directory:

```bash
python scripts/plotting/plot_suppfig1.py
```

The plots will look like this:

To plot the comparison with GCN, run the following command. This will create a plot similar to Fig 2, but with the GCN results:

```bash
python scripts/plotting/plot_gcn.py
```

We ran the experiments on a CPU-based cluster with 72 nodes and 376 GB of RAM, but only used 20 parallel jobs. The OS was Ubuntu 18.04.6 LTS.
The time taken for each experiment is as follows:

```bash
# command
time python scripts/vary_train_size.py data results 20 difumo
# output
264026.23s user 4271.26s system 1606% cpu 4:38:23.80 total

# command
time python scripts/vary_train_size.py data results 20 voxels
# output
541518.33s user 20061.83s system 1715% cpu 9:05:44.40 total

# command
time python scripts/vary_n_subs.py data results 20 difumo
# output
264768.95s user 644.23s system 1804% cpu 4:05:10.79 total

# command
time python scripts/vary_n_subs.py data results 20 voxels
# output
361548.14s user 5086.45s system 1828% cpu 5:34:16.29 total

# command
time python scripts/feat_imp.py data results 20
# output
2596918.56s user 65902.53s system 2082% cpu 35:31:34.59 total
```

If you don't have conda installed, you can install it from here.
In case you face any issues, please feel free to open an issue on this repository.



