A dual-stream neural network combining 1D-CNN (for SMILES sequences) and Graph Convolutional Networks (for molecular graphs) to predict multiple molecular properties simultaneously.
This project uses 5 MoleculeNet datasets:
- BBBP - Blood-Brain Barrier Permeability (Classification)
- Lipophilicity - Lipophilicity prediction (Regression)
- SAMPL (FreeSolv) - Solvation free energy (Regression)
- ESOL (Delaney) - Aqueous solubility (Regression)
- Tox21 - 12 toxicity endpoints (Multi-task Classification)
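The dataset list above can be written down as a small task registry, which is one way a multi-task pipeline might decide which loss and how many outputs each prediction head needs. This is a minimal sketch, not the project's actual configuration: `TaskSpec`, `TASKS`, and the helper functions are hypothetical names, and one output per single-task dataset is an assumption. Only the task types and the 12 Tox21 endpoints come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Describes one dataset's prediction head (hypothetical structure)."""
    task_type: str    # "classification" or "regression"
    num_outputs: int  # values predicted per molecule


# Task types and the Tox21 endpoint count follow the list above.
# One output for each single-task dataset is an assumption.
TASKS = {
    "BBBP": TaskSpec("classification", 1),
    "Lipophilicity": TaskSpec("regression", 1),
    "SAMPL": TaskSpec("regression", 1),
    "ESOL": TaskSpec("regression", 1),
    "Tox21": TaskSpec("classification", 12),
}


def total_outputs(tasks):
    """Total number of values predicted across all heads."""
    return sum(spec.num_outputs for spec in tasks.values())


def tasks_of_type(tasks, task_type):
    """Dataset names whose head has the given task type, in insertion order."""
    return [name for name, spec in tasks.items() if spec.task_type == task_type]
```

A registry like this keeps the per-dataset choice of loss (for example, cross-entropy for classification and a squared-error loss for regression) in one place instead of scattering it through the training code.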
sMLles/
├── smiles_gcn/ # Main package
│ ├── models/ # Model architectures (CNN, GCN, fusion, prediction heads)
│ ├── data/ # Data loading and preprocessing
│ ├── training/ # Training pipeline (trainer, losses, metrics)
│ └── utils/ # Utility functions (config, visualization, chemistry)
├── data/ # MoleculeNet datasets (BBBP, Lipophilicity, SAMPL, ESOL, Tox21)
├── configs/ # Configuration files
├── scripts/ # Training scripts
│ ├── train.py # Main training script
│ └── test_data_loading.py # Data loading validation
├── sMLles_training/ # Training outputs
│ ├── checkpoints/ # Saved model checkpoints
│ └── results/ # Training curves and metrics
├── train_on_colab.ipynb # Colab notebook for training
└── requirements.txt # Python dependencies
# Create conda environment
conda create -n smiles-gcn python=3.9
conda activate smiles-gcn
# Install PyTorch for Apple Silicon
pip install torch torchvision
# Install PyTorch Geometric
pip install torch-geometric
# Install RDKit
conda install -c conda-forge rdkit
# Install other dependencies
pip install -r requirements.txt
# Train on local machine (requires GPU or Apple Silicon)
python scripts/train.py --config configs/default_config.yaml
# Or use the Colab notebook for cloud GPU training
# Open train_on_colab.ipynb in Google Colab
The training script automatically:
- Loads and preprocesses all 5 MoleculeNet datasets
- Validates SMILES and builds molecular graphs
- Trains the dual-stream model with multi-task learning
- Evaluates on validation set after each epoch
- Saves best model checkpoint based on validation performance
- Generates training curves and metrics visualizations
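The "save the best checkpoint" step in the list above comes down to comparing each epoch's validation metric with the best value seen so far. The sketch below shows that logic in isolation. It is an illustration, not the code in `scripts/train.py`: `BestCheckpointTracker` is a hypothetical helper, and the validation scores are made-up numbers used only to exercise it.

```python
class BestCheckpointTracker:
    """Tracks the best validation metric seen so far (hypothetical helper)."""

    def __init__(self, higher_is_better=True):
        # True for metrics such as ROC-AUC, False for errors such as RMSE.
        self.higher_is_better = higher_is_better
        self.best_value = None
        self.best_epoch = None

    def update(self, epoch, value):
        """Return True when `value` beats the best seen so far."""
        if self.best_value is None:
            improved = True
        elif self.higher_is_better:
            improved = value > self.best_value
        else:
            improved = value < self.best_value
        if improved:
            self.best_value = value
            self.best_epoch = epoch
        return improved


# Made-up validation scores, for illustration only.
val_scores = [0.71, 0.78, 0.76, 0.81, 0.80]
tracker = BestCheckpointTracker(higher_is_better=True)

# In a real loop, a True return is where the checkpoint would be written.
saved_epochs = [
    epoch
    for epoch, score in enumerate(val_scores, start=1)
    if tracker.update(epoch, score)
]
```

With a mix of classification and regression tasks, the direction flag matters: a single combined validation score has to be defined so that "better" is unambiguous before it can drive checkpoint selection.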
SMILES Input    → CNN Stream (1D Conv layers) ─┐
                                               ├→ Fusion Layer → Multi-Task Heads → Predictions
Molecular Graph → GCN Stream (Graph Conv)     ─┘
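Before the CNN stream in the diagram can apply 1D convolutions, each SMILES string has to become a fixed-length sequence of integer token IDs. The README does not say how this project tokenizes SMILES, so the following is only a sketch of one common approach: a regex that keeps bracket atoms and the two-letter halogens `Cl` and `Br` together, followed by padding. The function names and the `PAD`/`UNK` index convention are assumptions.

```python
import re

PAD, UNK = 0, 1  # assumed reserved indices for padding and unknown tokens

# Match bracket atoms first, then two-letter halogens, then any single character.
_TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return _TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map each token seen in the data to an integer ID starting at 2."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 2 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a list of IDs, truncated or padded to max_len."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(smiles)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["CCO", "CC(=O)Cl"])
encoded = encode("CCO", vocab, max_len=5)
```

The resulting ID sequences would typically pass through an embedding layer before the convolutions. The graph stream needs a separate featurization (atoms as nodes, bonds as edges), which is usually derived with RDKit rather than from the raw string.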
python scripts/train.py --config configs/default_config.yaml
python scripts/evaluate.py --checkpoint checkpoints/best_model.pt