A refinement of https://github.com/lifanchen-simm/transformerCPI/ for large datasets and multi-GPU training.
To set up the environment, run `conda env create -f py36_tCPI.yml`.
First, run `sh script/generate_map.sh` to generate `protein_map.pkl` and `smiles_map.pkl`, which are the mappings from SMILES strings to SMILES features and from protein sequences to protein sequence features.
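The exact layout of these pickles is defined by the repo's feature extraction code; assuming each one is a plain Python dict keyed by the raw SMILES string or protein sequence (an assumption, not something this README specifies), they can be inspected like this:

```python
import pickle

# Load the precomputed feature maps produced by script/generate_map.sh.
# Assumption: each pickle is a dict mapping the raw string (SMILES or
# protein sequence) to its precomputed feature array.
with open("smiles_map.pkl", "rb") as f:
    smiles_map = pickle.load(f)
with open("protein_map.pkl", "rb") as f:
    protein_map = pickle.load(f)

print(len(smiles_map), "SMILES entries")
print(len(protein_map), "protein entries")

# Look up the feature for one arbitrary key to check its type and shape.
example_smiles = next(iter(smiles_map))
print(type(smiles_map[example_smiles]))
```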
Then run `sh script/main.sh` to start training.
If you want to stop the training process, run `sh script/stop.sh`.
Comparison to the original repo: https://github.com/lifanchen-simm/transformerCPI
Advantages:
- You can use `torch.nn.DataParallel` (along with `torch.cuda.amp`) to accelerate the training process (see the sketch after this list).
- For large-scale datasets, using `DTADataset` in `DataUtil.py` together with `torch.utils.data.DataLoader` accelerates data loading and tremendously reduces memory usage.
- The code is changed to solve a regression problem instead of the classification problem in the original paper.
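As a rough illustration of how these pieces fit together (not this repo's exact code: the model call signature, the assumed `(compound, protein, label)` batch layout, and all hyperparameters below are placeholders), a mixed-precision, multi-GPU training loop with a `DataLoader` might look like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Placeholder imports: the real names live in this repo's code.
# from DataUtil import DTADataset, collate_fn
# from model import Predictor

def train_one_epoch(model, dataset, device, collate_fn=None):
    # Wrap the model so batches are split across all visible GPUs.
    model = nn.DataParallel(model).to(device)
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=4, pin_memory=True,
                        collate_fn=collate_fn)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()   # loss scaling for mixed precision
    criterion = nn.MSELoss()               # regression instead of classification

    model.train()
    for compounds, proteins, labels in loader:   # assumed batch layout
        compounds = compounds.to(device)
        proteins = proteins.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # run the forward pass in fp16 where safe
            preds = model(compounds, proteins)
            loss = criterion(preds.squeeze(-1), labels.float())
        scaler.scale(loss).backward()            # scaled backward to avoid fp16 underflow
        scaler.step(optimizer)
        scaler.update()
```

With `DataParallel` the batch is split across GPUs automatically, and a map-style dataset such as `DTADataset` served through `DataLoader` workers keeps only the current batches in memory instead of the whole dataset.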
Thanks to the authors for their brilliant work in this paper: https://doi.org/10.1093/bioinformatics/btaa524