This repository provides step-by-step instructions for training different variations of the U-Net architecture for Medical Image Semantic Segmentation (using PyTorch). As of now, the process is limited to Brain Tumor Segmentation, but it will later be extended to other anatomies.
Clone this repository using `git clone` and then `cd` into it. This documentation is written under the assumption that the user is working on an SGE-based HPC cluster (hence, the terminology used subsequently is influenced by this assumption) with sufficient GPU and CPU memory. The CPU and GPU memory requirements are further elaborated in the documentation in the respective folders.
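For example (a sketch; `<repo_url>` and `<repo_name>` are placeholders for this repository's actual URL and folder name):

```bash
git clone <repo_url>   # replace <repo_url> with this repository's URL
cd <repo_name>         # replace <repo_name> with the name of the cloned folder
```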
The PyTorch framework is used. For use on the cbica-cluster, the required modules can be seen at the beginning of each trainer script, but they are mentioned here anyway:

```bash
module load pytorch/1.0.1 # Always needed
module load gcc/5.2.0     # Sometimes needed, because the pandas package throws errors if it is not loaded
```

Set of instructions on how to preprocess the raw BraTS data:
- Follow the instructions on this page to download the BraTS training data.
- Create a new folder called `Original_data` and, within it, a folder called `train`:

```bash
cd ${brats_data} # the data was downloaded and extracted in this location
mkdir Original_data
cd Original_data
mkdir train
```

- Copy/Move all the patients from the `HGG` and `LGG` folders into the `train` folder mentioned above (wherever it may be located):

```bash
cd ${brats_data}
mv HGG/* Original_data/train/
mv LGG/* Original_data/train/
```

- All the scripts (whichever are relevant) are written with respect to the data folder structure of the BraTS dataset.
- It is important to note that if one is not using the BraTS data and/or is using different/additional data, it must comply with the BraTS dataset folder structure, which is described in the subsequent points.
- Let's take the case of the data in `Original_data/train/` with `n` patients. The `n` patients correspond to `n` folders in `Original_data/train/`.
- The name of each of these folders is the `patient_ID` of that particular patient.
- The `patient_ID` can be any alphanumeric sequence.
- Each `patient_ID` folder consists of 5 `*.nii.gz` files with the following names: `patient_ID_t1.nii.gz`, `patient_ID_t2.nii.gz`, `patient_ID_t1ce.nii.gz`, `patient_ID_flair.nii.gz` and `patient_ID_seg.nii.gz`, corresponding to 4 imaging modalities and 1 ground truth segmentation mask.
So, in short, whatever data you use is expected to be in the folder structure explained in the points above, as illustrated below.
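For reference, the expected layout looks like this (the patient IDs `Patient_001` and `Patient_002` are purely illustrative):

```
Original_data/train/
├── Patient_001/
│   ├── Patient_001_t1.nii.gz
│   ├── Patient_001_t2.nii.gz
│   ├── Patient_001_t1ce.nii.gz
│   ├── Patient_001_flair.nii.gz
│   └── Patient_001_seg.nii.gz
├── Patient_002/
│   └── ...
└── ...
```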
- Open `${repo_location}/Preprocess_Data/pp.py` with your favorite editor and change the variable `path_data` to `${brats_data}/Original_data/train/`, as mentioned in point 2 above (do not forget the `/` at the end; it is assumed to be present during file parsing).
- Change the `path` variable in `${repo_location}/Preprocess_Data/pp.py` to the folder where you wish to save the preprocessed data, preferably something understandable such as `${brats_data}/Preprocessed_data/train/` (this location needs to exist before the script runs).
- Run `pp.py` using `python ${repo_location}/Preprocess_Data/pp.py` after making sure that all the dependencies [numpy, math, nibabel, tqdm] are installed. Doing this will preprocess the raw data and write it to the location specified in `path`.
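For example, assuming the paths above, the shell steps would be:

```bash
# the output location must exist before pp.py runs
mkdir -p ${brats_data}/Preprocessed_data/train
# preprocess the raw data; it is written to the location specified in `path`
python ${repo_location}/Preprocess_Data/pp.py
```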
- Open `${repo_location}/csv_all/ccsv.py` with your favorite editor and change the variable `train_path` to the path defined by the `path` variable in the previous section, i.e., `${brats_data}/Preprocessed_data/train/`.
- Run `ccsv.py` using `python ${repo_location}/csv_all/ccsv.py`, again after making sure that the necessary dependencies [csv, pandas] are installed.
- The training process uses 5-fold cross-validation, hence 10 CSV files are generated in the location defined by `train_path`: 5 each for the training and validation folds.
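As a quick sanity check (a sketch; it assumes `ccsv.py` writes the CSVs directly into `train_path` with a `.csv` extension):

```bash
ls ${brats_data}/Preprocessed_data/train/*.csv | wc -l   # should print 10
```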
Open the `${repo_location}/train_parameters.cfg` file and change the training hyperparameters, such as the number of epochs (`num_epochs`), optimizer (`opt`), loss function (`which_loss`), batch size, and so on. The description of each hyperparameter is documented in the file itself.
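An illustrative excerpt (the values below are hypothetical, and the exact syntax should be checked against the file itself; only the parameter names `num_epochs`, `opt`, `which_loss`, `model_path` and `save_best` come from this documentation):

```
num_epochs = 100              # number of training epochs
opt        = adam             # optimizer
which_loss = dice             # loss function
model_path = /path/to/models  # folder where trained weights are saved
save_best  = 5                # number of best models saved per fold
```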
TODO : ADD VARIABLE FOR PAUSE-RESUME TRAINING
- The training script `${repo_location}/submission_scripts/trainer.py` takes in 2 command line arguments: the first is the path to the training `csv` file of a given fold, and the second is the path to the validation `csv` file of the same fold (the respective pairs are generated in the CSV preparation step above).
- `cd` into the `${repo_location}/submission_scripts` folder. There are 5 submission scripts (one for each fold).
- Edit each of the submission scripts to make sure that the correct paths to the training and validation CSV files are passed as arguments to `trainer.py` (these are generated in the CSV file section).
- Run each of the submission scripts (`trainer_f*.sh`) either by `bash script_name.sh` or `qsub script_name.sh` (if you are using an SGE computing cluster).
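For instance, to submit all 5 folds at once on an SGE cluster:

```bash
cd ${repo_location}/submission_scripts
for f in trainer_f*.sh; do
    qsub "$f"   # use `bash "$f"` instead on a machine without SGE
done
```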
- The weights (models) are saved as `*.pt` files.
- All the models will be saved in the folder that you specified in the `model_path` parameter of the `train_parameters.cfg` file.
- The saved models follow a specific naming scheme: each model is named in the form `modXYYY.pt`, where `X` tells us which fold the model is from (i.e., 1-5) and `YYY` tells us the epoch number at which the best model weights were obtained (for example, `mod2045.pt` would be the fold-2 model whose best weights were obtained at epoch 45).
- Depending on the value of the parameter `save_best` in the `train_parameters.cfg` file, `$save_best` models will be saved per fold. Hence, the total number of weight files saved is `$save_best * number_of_folds` (e.g., 25 files for `save_best = 5` and 5 folds).
`cd` into the `gen_seg` folder, which, in short, stands for "generate segmentations". After this, you can `cd` into either `gen_train`, `gen_validation`, or `gen_test`, according to which dataset's segmentations you want to generate.
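For example, to generate segmentations for the validation set (assuming `gen_seg` sits at the repository root):

```bash
cd ${repo_location}/gen_seg/gen_validation   # or gen_train / gen_test
```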