This project implements a deep learning workflow for predicting molecular docking scores, incorporating an active learning strategy. The goal is to efficiently identify promising molecular candidates from a large unlabeled pool by iteratively training a model on a small labeled dataset and strategically selecting the most informative samples from the unlabeled pool to be labeled and added to the training data.
This approach reduces the need for extensive experimental labeling while improving model performance on relevant data points (e.g., those with low docking scores).
Use Case
This workflow is particularly useful in drug discovery and computational chemistry where obtaining experimental docking scores for a vast library of molecules is expensive and time‑consuming. By using active learning, we can prioritize which molecules to synthesize and test, focusing on those predicted to have strong binding affinities (low docking scores) or those where the model is most uncertain about its prediction.
The provided Python script (combined_docking_script.py) consolidates the following steps, structured into logical modules:
- Data Handling – Efficiently loads and processes a large compound library, including shuffling and splitting data without loading everything into memory.
- Label Retrieval – Fetches or merges docking scores (labels) for selected compounds.
- Docking Score Prediction – Trains a deep learning model to predict docking scores based on molecular features.
- Metrics Calculation – Evaluates model performance using various metrics relevant to regression and virtual screening.
- Acquisition Strategy – Implements different strategies to select the most informative compounds from the unlabeled pool for labeling.
- Active Learning Orchestration – Manages the iterative active‑learning loop, coordinating data updates, model retraining, and evaluation.
- Visualization – Provides plots to monitor training progress and model performance.
This workflow combines the principles of MolPAL (Molecular Pool‑based Active Learning) for intelligent sample selection with Deep Docking’s approach of using fast QSAR/deep‑learning models to accelerate the scoring process.
- Figure A – Conceptual Docking: The standard, brute‑force molecular docking process, in which every molecule in a large set is docked individually against a target protein to find the best binding poses and scores.
- Figure B – MolPAL Approach: Shows a cycle involving prediction (surrogate model), selection (acquisition function), and docking (labeling the selected samples), which are then used to retrain the model, closing the active‑learning loop.
- Figure C – Deep Docking Speed‑up: Demonstrates how Deep Docking can screen an ultra‑large database ~50× faster than standard docking by using a QSAR surrogate.
- Figure D – Molecular Docking Breakdown: Decomposes a classical docking engine into the search algorithm component (e.g., Monte‑Carlo, genetic algorithm) and the scoring function component (e.g., force‑field, empirical, ML‑based).
| Layer | What it Does | Key Classes / Files |
|---|---|---|
| Data Structure | Streams the 2.1 M‑compound library from multiple text files, shuffles on the fly, and exposes PyTorch‑style train/val/test splits that live on disk (so nothing ever has to fit in RAM). | CompoundDataset, ShardedLoader |
| Label Retriever | Queries your external docking pipeline (or merges pre‑computed labels) and caches results. | DockingLabelFetcher |
| Surrogate Model | Generates features (Morgan FP, RDKit descriptors, or graphs) and trains a model (MLP, Transformer, or Message‑Passing Network) with hyper‑parameter search. | DockingRegressor |
| Metrics | ROC, RMSE/MAE, Kendall‑τ, “library‑coverage” (percentage of the pool explored). | MetricSuite |
| Acquisition | Implements six strategies: Random, Greedy (μ), UCB (μ + βσ), Thompson Sampling, Expected Improvement (EI), Probability of Improvement (PI). | AcquisitionStrategy |
| Orchestrator | Runs the active‑learning loop, handles early stopping, logging (Weights & Biases), and plotting. | DockingALRunner |
MolPAL contributes the acquisition‑function toolbox that optimises which molecules we label next, while Deep Docking contributes the idea of using a fast QSAR/NN surrogate to shortcut classical docking. Together they let us screen libraries ≈ 50 × faster while still focusing on the most informative or highest‑affinity compounds.
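The Data Structure row above mentions shuffling on the fly without holding the library in RAM. One common way to do that is a bounded shuffle buffer over a line stream. The sketch below is an illustration only: `stream_lines` and `shuffle_buffer` are hypothetical helper names, not part of `CompoundDataset` or `ShardedLoader`, and the result is an approximate shuffle rather than a uniform one.

```python
import random
from typing import Iterable, Iterator, List


def stream_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield non-empty lines from each file in turn, one line at a time."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def shuffle_buffer(items: Iterable[str], buffer_size: int, seed: int = 0) -> Iterator[str]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[str] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered item and put the incoming one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger `buffer_size` gives a better shuffle at the cost of memory; chaining it as `shuffle_buffer(stream_lines(paths), 100_000)` keeps memory use bounded regardless of library size.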
Below is the iterative active‑learning loop. Steps 3–9 (Inference through Validate) repeat for N_ACTIVE_LEARNING_ITERATIONS iterations or until the change in validation loss falls below ε.
0. Environment Setup & Dependencies – Run `conda create -n dds python=3.10 && conda activate dds`, then `pip install -r requirements.txt` (installs torchsparse, wandb, rdkit-pypi, etc.).
1. Seed Acquisition – Randomly acquire K molecules to form the initial labelled dataset `(X_seed, y_seed)`.
2. Initial Training – Train the `DockingRegressor` on the seed set with early stopping.
3. Inference – Predict μ(x) and uncertainty σ(x) for all remaining unlabeled molecules.
4. Select – Apply an acquisition strategy (e.g., Greedy, UCB, EI) to choose S molecules from the pool.
5. Label – Retrieve the true docking scores for the selected molecules via your external pipeline.
6. Analyze – Compute metrics (e.g., MSE for the new batch) and log them.
7. Update – Append `(X_new, y_new)` to the training set and optionally remove them from the pool.
8. Retrain – Warm‑start and retrain the model on the enlarged training set.
9. Validate – Evaluate on a held‑out validation split and plot loss curves.
10. Terminate – Stop when the validation loss plateaus or the max number of iterations is reached.
Suggested hyper‑parameters
K = 10 000 (seed set size), S = 10 000 (molecules selected per iteration), N = 15 (active‑learning iterations).
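The loop above can be sketched compactly in NumPy. This is a minimal illustration under stated assumptions, not the script's implementation: a bootstrap ensemble of ridge regressors stands in for `DockingRegressor`, the `oracle` callable stands in for the external docking pipeline, selection is greedy, and warm‑starting and the validation step are omitted. All function names here are hypothetical.

```python
import numpy as np


def fit_ensemble(X, y, rng, n_models=5, l2=1e-2):
    """Fit ridge regressors on bootstrap resamples (stand-in for the deep surrogate)."""
    d = X.shape[1]
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        Xb, yb = X[idx], y[idx]
        weights.append(np.linalg.solve(Xb.T @ Xb + l2 * np.eye(d), Xb.T @ yb))
    return np.stack(weights)  # shape (n_models, d)


def predict(weights, X):
    """Return the ensemble mean (mu) and spread (sigma) for each row of X."""
    preds = X @ weights.T  # shape (n_samples, n_models)
    return preds.mean(axis=1), preds.std(axis=1)


def run_active_learning(X_pool, oracle, k_seed, s_batch, n_iters, seed=0):
    """Greedy pool-based active learning; oracle(indices) returns true scores."""
    rng = np.random.default_rng(seed)
    # Seed acquisition: label K random molecules.
    labeled = rng.choice(len(X_pool), size=k_seed, replace=False)
    y_labeled = oracle(labeled)
    in_pool = np.ones(len(X_pool), dtype=bool)
    in_pool[labeled] = False
    batch_mse = []
    for _ in range(n_iters):
        pool_idx = np.flatnonzero(in_pool)
        if pool_idx.size == 0:
            break
        # (Re)train on everything labeled so far.
        weights = fit_ensemble(X_pool[labeled], y_labeled, rng)
        # Inference on the remaining pool (sigma is unused by greedy selection).
        mu, _sigma = predict(weights, X_pool[pool_idx])
        # Select: lowest predicted score first.
        order = np.argsort(mu)[:s_batch]
        chosen = pool_idx[order]
        # Label with the oracle, then analyze the new batch.
        y_new = oracle(chosen)
        batch_mse.append(float(np.mean((mu[order] - y_new) ** 2)))
        # Update the labeled set and remove the batch from the pool.
        labeled = np.concatenate([labeled, chosen])
        y_labeled = np.concatenate([y_labeled, y_new])
        in_pool[chosen] = False
    return labeled, y_labeled, batch_mse
```

For a quick dry run, scaled‑down values such as `k_seed=50`, `s_batch=20`, `n_iters=5` on a few hundred synthetic rows are enough to check the bookkeeping before moving to the suggested K, S, and N.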
| Name | Formula | Notes |
|---|---|---|
| Random | x ∼ Uniform(pool) | Pure exploration; samples uniformly from the unlabeled pool |
| Greedy | x = argmin μ̂(x) | Exploitation – selects samples with the lowest predicted score |
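| UCB | argmin (μ̂(x) − β σ̂(x)) | β ≈ 1–3. Balances exploration & exploitation; requires uncertainty estimate. Because lower scores are better, the uncertainty bonus is subtracted (equivalent to maximizing μ + βσ on the negated score) |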
| Thompson Sampling | Sample f ∼ N(μ̂(x), σ̂²(x)) and pick argmin f(x) | Stochastic; needs σ̂ |
| Expected Improvement (EI) | EI(x) = (f* − μ̂(x)) Φ(z) + σ̂(x) ϕ(z), where z = (f* − μ̂(x)) / σ̂(x) and f* is the best (lowest) score observed so far | Favors high potential for improvement; needs σ̂ |
| Probability of Improvement (PI) | PI(x) = Φ(z), with the same z = (f* − μ̂(x)) / σ̂(x) | More conservative than EI; needs σ̂ |
(Φ and ϕ denote the CDF and PDF of the standard normal distribution.)
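The strategies in the table can be written as scoring functions over the predicted mean and uncertainty. The sketch below is an assumption‑laden illustration, not the `AcquisitionStrategy` class itself: `acquisition_scores` and `select_batch` are hypothetical names, every score is oriented so that higher means "label this first", and the UCB line uses the optimistic bound for a minimized score (μ − βσ), which is an assumption about the intended sign convention.

```python
import math

import numpy as np

# Standard normal CDF and PDF, vectorized without a SciPy dependency.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))), otypes=[float])


def _norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)


def acquisition_scores(name, mu, sigma, f_best=None, beta=2.0, rng=None):
    """Utility per candidate; higher is better (lower docking score is the goal)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if rng is None:
        rng = np.random.default_rng()
    if name == "random":
        return rng.random(mu.shape)
    if name == "greedy":
        return -mu
    if name == "ucb":
        # Negative of (mu - beta * sigma): optimistic for a minimized score.
        return -mu + beta * sigma
    if name == "thompson":
        # One posterior draw per candidate; lowest draw gets the highest utility.
        return -rng.normal(mu, sigma)
    if name in ("ei", "pi"):
        if f_best is None:
            raise ValueError("f_best (best score observed so far) is required")
        sigma_safe = np.maximum(sigma, 1e-12)  # avoid division by zero
        z = (f_best - mu) / sigma_safe
        if name == "pi":
            return _norm_cdf(z)
        return (f_best - mu) * _norm_cdf(z) + sigma_safe * _norm_pdf(z)
    raise ValueError(f"unknown strategy: {name}")


def select_batch(scores, batch_size):
    """Indices of the batch_size highest-utility candidates."""
    return np.argsort(-np.asarray(scores))[:batch_size]
```

Keeping a single "higher is better" orientation means `select_batch` works unchanged for every strategy, so swapping strategies only changes the `name` argument.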
- Sparse fingerprints: Consider storing fingerprints as `torch.sparse_csr_tensor` for memory efficiency.
- torchsparse/torchscatter: Useful for graph representations.
- Early stopping: Use a patience of ~10 epochs.
- WandB logging: Track wall‑clock time per acquisition iteration to compare strategies.
- Batch size: Try acquisition sizes of 1 k, 5 k, and 10 k; larger batches don’t always accelerate discovery.
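The early‑stopping tip can be implemented with a small patience counter. This is a generic sketch, not code taken from the script; the `EarlyStopping` class name and the `min_delta` parameter are assumptions introduced for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, calling `stopper.step(val_loss)` once per epoch and breaking when it returns `True` is enough; saving a checkpoint whenever `stopper.bad_epochs == 0` keeps the best weights.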
- Clone or save the script (`combined_docking_script.py`).
- Install dependencies: `conda create -n dds python=3.10 && conda activate dds`, then `pip install pandas numpy torch scikit-learn tqdm matplotlib rdkit-pypi wandb`.
- Prepare your data:
  - Labeled data: a zip containing a CSV with a `SMILES` column and a `r_i_docking_score` column.
  - Unlabeled data: a CSV with `SMILES` (and ideally `ZINCID`).
- Update constants in the script: paths, learning‑rate, batch‑size, etc.
- Run: `python combined_docking_script.py`

The script:
- Sets up logging and chooses CPU/GPU.
- Loads/featurizes data and builds DataLoaders.
- Trains the initial model and plots losses.
- Enters the active‑learning loop for `N_ACTIVE_LEARNING_ITERATIONS`.
- Logs every iteration (WandB is strongly recommended).
- Saves the best model to `best_model.pt` and writes plots to the `logs/` directory.
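For the labeled‑data format described in the data preparation step (a zip containing a CSV with `SMILES` and `r_i_docking_score` columns), a standard‑library reader might look like the sketch below. The function name `load_labeled_zip` is hypothetical, and taking the first CSV in the archive and skipping rows with unparseable scores are assumptions rather than documented behavior of the script.

```python
import csv
import io
import zipfile
from typing import List, Tuple


def load_labeled_zip(
    zip_path,
    smiles_col: str = "SMILES",
    score_col: str = "r_i_docking_score",
) -> List[Tuple[str, float]]:
    """Read (SMILES, docking score) pairs from the first CSV inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        with archive.open(csv_names[0]) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            header = reader.fieldnames or []
            for col in (smiles_col, score_col):
                if col not in header:
                    raise ValueError(f"missing column: {col}")
            rows = []
            for row in reader:
                try:
                    rows.append((row[smiles_col], float(row[score_col])))
                except (TypeError, ValueError):
                    # Skip rows whose score is missing or not numeric.
                    continue
    return rows
```

Failing loudly on a missing column while skipping individual bad rows is a design choice: a wrong header usually means the wrong file, whereas a few failed dockings are expected in large libraries.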
- Acquisition function: Pass `'greedy'`, `'mc_dropout'`, `'random'`, or `'EDL'` when constructing `DockingModelActiveLearning`.
- Model architecture: Edit `DockingModel` to experiment with layers, activations, and dropout.
- Fingerprint parameters: `get_morgan(radius=2, n_bits=1024)`; both values can be tuned.
- Multiprocessing: `DataProcessor` uses `multiprocessing` with a worker count equal to the number of CPU cores.
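The `'mc_dropout'` option relies on keeping dropout active at inference time and reading the spread of repeated predictions as an uncertainty estimate. The NumPy sketch below illustrates that idea on a hypothetical two‑layer network with caller‑supplied weights; it is not the `DockingModel` code, and `mc_dropout_predict` and its parameters are assumptions made for illustration.

```python
import numpy as np


def mc_dropout_predict(X, W1, b1, W2, b2, p_drop=0.2, n_passes=30, seed=0):
    """Monte Carlo dropout: mean and std over stochastic forward passes."""
    rng = np.random.default_rng(seed)
    keep = 1.0 - p_drop
    outputs = []
    for _ in range(n_passes):
        hidden = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer
        # Fresh dropout mask per pass; divide by keep to preserve the expected scale.
        mask = rng.random(hidden.shape) < keep
        hidden = hidden * mask / keep
        outputs.append(hidden @ W2 + b2)
    outputs = np.stack(outputs)  # shape (n_passes, n_samples)
    return outputs.mean(axis=0), outputs.std(axis=0)
```

The returned mean and standard deviation play the roles of μ̂(x) and σ̂(x) in the acquisition formulas; more passes give a smoother σ̂ at a proportional inference cost.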
- Logs: Console and `logs/logfile.log`.
- Plots: Training/validation losses and true vs predicted scatter.
- WandB: Rich dashboards for every run & iteration.
- Additional acquisition functions (e.g., Expected Model Change, QBC).
- Alternate molecular representations (ECFP, descriptors, GNNs).
- Containerize with Docker for reproducible deployment.
- Robust error‑handling & input validation.
- Smarter data splitting & CV strategies.