MolPMoFiT

Implementation of Inductive transfer learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT

Molecular Prediction Model Fine-Tuning (MolPMoFiT) is a transfer learning method for QSPR/QSAR modeling that combines self-supervised pre-training with task-specific fine-tuning.

MolPMoFiT is adapted from ULMFiT and implemented with PyTorch and fastai v1. A large-scale molecular structure prediction model is pre-trained on one million unlabeled molecules from ChEMBL in a self-supervised manner, and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints.
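The self-supervised objective can be illustrated with a small sketch. It assumes that, as in ULMFiT, pre-training is next-token prediction over tokenized SMILES strings; the function name `next_token_pairs` and the `<bos>`/`<eos>` markers are hypothetical and are not taken from the notebooks.

```python
from typing import List, Sequence, Tuple

# Hypothetical sequence markers; the notebooks may use different special tokens.
BOS, EOS = "<bos>", "<eos>"


def next_token_pairs(tokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Build (input, target) sequences for next-token prediction.

    The target is the input shifted left by one position, so the model
    learns to predict token i+1 from tokens 0..i. No labels are needed,
    which is what makes the pre-training self-supervised.
    """
    seq = [BOS] + list(tokens) + [EOS]
    return seq[:-1], seq[1:]


if __name__ == "__main__":
    inputs, targets = next_token_pairs(["C", "C", "O"])  # ethanol, already tokenized
    for x, y in zip(inputs, targets):
        print(f"{x!r} -> {y!r}")
```

Because the targets come from the molecules themselves, any unlabeled SMILES collection can serve as pre-training data under this assumption.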

Environment

We recommend building the environment with Conda.

conda env create -f molpmofit.yml

Datasets

We provide all the datasets needed to reproduce the experiments in the data folder.

data/MSPM contains the dataset to train the general domain molecular structure prediction model.
data/QSAR contains the datasets for QSAR tasks.

Experiments

The code is provided as Jupyter notebooks in the notebooks folder. All the code was developed on an Ubuntu 18.04 workstation with 2 Quadro P4000 GPUs.

01_MSPM_Pretraining.ipynb: Training the general domain molecular structure prediction model (MSPM).
02_MSPM_TS_finetuning.ipynb: (1) Fine-tuning the general MSPM on a target dataset to generate a task-specific MSPM. (2) Fine-tuning the task-specific MSPM to train a QSAR model.
03_QSAR_Classifcation.ipynb: Fine-tuning the general domain MSPM to train a classification model.
04_QSAR_Regression.ipynb: Fine-tuning the general domain MSPM to train a regression model.

Pre-trained Models Download

Download ChEMBL_1M_atom. See notebooks/05_Pretrained_Models.ipynb for usage instructions.
- This model is trained on 1M ChEMBL molecules with the atomwise tokenization method (original MolPMoFiT).
Download ChEMBL_1M_SPE. See notebooks/06_SPE_Pretrained_Models.ipynb for usage instructions.
- This model is trained on 1M ChEMBL molecules with the SMILES pair encoding tokenization method.
- SMILES Pair Encoding (SmilesPE) is a data-driven substructure tokenization algorithm for deep learning.
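To make the difference between the two tokenization methods more concrete, here is a minimal sketch of atomwise tokenization. It uses a regular expression of the kind widely used for atom-level SMILES splitting; the exact pattern and the function name `atomwise_tokenize` are assumptions, and the notebooks may use a different pattern or add special tokens. SMILES pair encoding would start from tokens like these and merge frequent adjacent pairs into larger substructure tokens.

```python
import re
from typing import List

# Atom-level SMILES pattern (an assumption, not necessarily the one used in
# the notebooks). Bracket atoms such as [N+] are kept as single tokens, and
# two-letter halogens (Cl, Br) are matched before single letters.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atomwise_tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens.

    Raises ValueError if some characters are not covered by the pattern,
    so that unsupported input is not silently truncated.
    """
    tokens = SMILES_ATOM_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES completely: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(atomwise_tokenize("CC(=O)Cl"))              # acetyl chloride
    print(atomwise_tokenize("c1ccccc1[N+](=O)[O-]"))  # nitrobenzene
```

Checking that the joined tokens reproduce the input is a deliberate choice here: `re.findall` skips unmatched characters, so without the check an unusual symbol would disappear from the token sequence unnoticed.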
