Deep Docking with Active Learning

Project Description

This project implements a deep learning workflow for predicting molecular docking scores, incorporating an active learning strategy. The goal is to efficiently identify promising molecular candidates from a large unlabeled pool by iteratively training a model on a small labeled dataset and strategically selecting the most informative samples from the unlabeled pool to be labeled and added to the training data.
This approach reduces the number of expensive docking calculations needed to label data while improving model performance on relevant data points (e.g., those with low docking scores).

Use Case
This workflow is particularly useful in drug discovery and computational chemistry, where computing docking scores for a vast library of molecules is expensive and time-consuming. By using active learning, we can prioritize which molecules to dock (and, downstream, synthesize and test), focusing on those predicted to have strong binding affinities (low docking scores) or those where the model is most uncertain about its prediction.

Workflow Overview

The provided Python script (combined_docking_script.py) consolidates the following steps, structured into logical modules:

Data Handling – Efficiently loads and processes a large compound library, including shuffling and splitting data without loading everything into memory.
Label Retrieval – Fetches or merges docking scores (labels) for selected compounds.
Docking Score Prediction – Trains a deep learning model to predict docking scores based on molecular features.
Metrics Calculation – Evaluates model performance using various metrics relevant to regression and virtual screening.
Acquisition Strategy – Implements different strategies to select the most informative compounds from the unlabeled pool for labeling.
Active Learning Orchestration – Manages the iterative active‑learning loop, coordinating data updates, model retraining, and evaluation.
Visualization – Provides plots to monitor training progress and model performance.
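The Data Handling step can be approximated without holding the library in memory by deriving each compound's split from a hash of its identifier. The sketch below is an assumption about one way to do this, not the script's actual implementation; `assign_split` and `stream_split` are hypothetical names, and the 80/10/10 split fractions are illustrative.

```python
import hashlib
from typing import Iterable, Iterator


def assign_split(key: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map an identifier (e.g. a SMILES string) to a split, deterministically."""
    # Hash the key to a pseudo-random number in [0, 1). The same key always
    # lands in the same split, so no global index or in-memory shuffle is needed.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def stream_split(lines: Iterable[str], split: str) -> Iterator[str]:
    """Lazily yield the non-empty lines that belong to the requested split."""
    for line in lines:
        smiles = line.strip()
        if smiles and assign_split(smiles) == split:
            yield smiles
```

Passing an open file handle as `lines` reads one line at a time. Note that this randomizes split membership only; it does not shuffle the order in which compounds are read.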

This workflow combines the principles of MolPAL (Molecular Pool‑based Active Learning) for intelligent sample selection with Deep Docking’s approach of using fast QSAR/deep‑learning models to accelerate the scoring process.

Visualizing the Workflow and Concepts

Figure A – Conceptual Docking: The standard, brute‑force molecular docking process, where each molecule in a large set is individually docked against a target protein to find the best binding poses and scores.
Figure B – MolPAL Approach: Shows a cycle involving prediction (surrogate model), selection (acquisition function), and docking (labeling the selected samples), which are then used to retrain the model, closing the active‑learning loop.
Figure C – Deep Docking Speed‑up: Demonstrates how Deep Docking can screen an ultra‑large database ~50× faster than standard docking by using a QSAR surrogate.
Figure D – Molecular Docking Breakdown: Decomposes a classical docking engine into the search algorithm component (e.g., Monte‑Carlo, genetic algorithm) and the scoring function component (e.g., force‑field, empirical, ML‑based).

High‑Level Architecture

Layer	What it Does	Key Classes / Files
Data Structure	Streams the 2.1 M‑compound library from multiple text files, shuffles on the fly, and exposes PyTorch‑style train/val/test splits that live on disk (so nothing ever has to fit in RAM).	`CompoundDataset`, `ShardedLoader`
Label Retriever	Queries your external docking pipeline (or merges pre‑computed labels) and caches results.	`DockingLabelFetcher`
Surrogate Model	Generates features (Morgan FP, RDKit descriptors, or graphs) and trains a model (MLP, Transformer, or Message‑Passing Network) with hyper‑parameter search.	`DockingRegressor`
Metrics	ROC, RMSE/MAE, Kendall‑τ, “library‑coverage” (percentage of the pool explored).	`MetricSuite`
Acquisition	Implements six strategies: Random, Greedy (μ), UCB (μ + βσ), Thompson Sampling, Expected Improvement (EI), Probability of Improvement (PI).	`AcquisitionStrategy`
Orchestrator	Runs the active‑learning loop, handles early stopping, logging (Weights & Biases), and plotting.	`DockingALRunner`
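Several of the metrics named in the table are simple enough to write out directly. The following is a minimal sketch in plain Python, assuming predictions and labels arrive as equal-length sequences of floats; the function names are hypothetical and are not taken from `MetricSuite`. The Kendall‑τ shown is the tau‑a variant (ties count as neither concordant nor discordant) and is quadratic in the number of samples, so it is meant for small evaluation batches.

```python
import math
from itertools import combinations
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def kendall_tau(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        sign = (t1 - t2) * (p1 - p2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(y_true) * (len(y_true) - 1) // 2
    return (concordant - discordant) / n_pairs


def library_coverage(n_labelled: int, pool_size: int) -> float:
    """Percentage of the pool that has been labelled so far."""
    return 100.0 * n_labelled / pool_size
```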

Why Combine MolPAL & Deep Docking?

MolPAL contributes the acquisition‑function toolbox that optimises which molecules we label next, while Deep Docking contributes the idea of using a fast QSAR/NN surrogate to shortcut classical docking. Together they let us screen libraries ≈ 50 × faster while still focusing on the most informative or highest‑affinity compounds.

End‑to‑End Workflow

Below is the iterative active‑learning loop. The steps from Inference through Validate are repeated for N_ACTIVE_LEARNING_ITERATIONS iterations or until Δ‑validation‑loss < ε.

Environment Setup & Dependencies

conda create -n dds python=3.10 && conda activate dds
pip install -r requirements.txt        # torchsparse wandb rdkit‑pypi etc.

Seed Acquisition – Randomly acquire K molecules to form the initial labelled dataset (X_seed, y_seed).
Initial Training – Train the DockingRegressor on the seed set with early stopping.
Inference – Predict μ(x) and uncertainty σ(x) for all remaining unlabeled molecules.
Select – Apply an acquisition strategy (e.g., Greedy, UCB, EI) to choose S molecules from the pool.
Label – Retrieve the true docking scores for the selected molecules via your external pipeline.
Analyze – Compute metrics (e.g., MSE for the new batch) and log them.
Update – Append (X_new, y_new) to the training set and optionally remove them from the pool.
Retrain – Warm‑start and retrain the model on the enlarged training set.
Validate – Evaluate on a held‑out validation split and plot loss curves.
Terminate – Stop when the validation loss plateaus or the max number of iterations is reached.
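The loop above can be sketched end to end with a toy surrogate. This is an illustration under stated assumptions, not the script's code: `knn_predict` is a hypothetical stand-in for the deep model (a k-nearest-neighbour average, so "retraining" is implicit in the growing training list), `oracle` stands in for the external docking pipeline, selection is greedy, and validation and early termination are omitted for brevity.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Labelled = List[Tuple[Point, float]]


def knn_predict(train: Labelled, x: Point, k: int = 3) -> Tuple[float, float]:
    """Toy surrogate: mean and std of the labels of the k nearest labelled points."""
    nearest = sorted(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))[:k]
    ys = [y for _, y in nearest]
    mu = sum(ys) / len(ys)
    sigma = (sum((y - mu) ** 2 for y in ys) / len(ys)) ** 0.5
    return mu, sigma


def active_learning(pool: Sequence[Point], oracle: Callable[[Point], float],
                    k_seed: int, batch: int, n_iter: int, seed: int = 0) -> Labelled:
    rng = random.Random(seed)
    unlabelled = list(range(len(pool)))
    rng.shuffle(unlabelled)
    # Seed acquisition: label k_seed random molecules.
    seed_ids, unlabelled = unlabelled[:k_seed], unlabelled[k_seed:]
    train = [(pool[i], oracle(pool[i])) for i in seed_ids]
    for _ in range(n_iter):
        if not unlabelled:
            break
        # Inference: predicted score for every remaining pool member.
        mu = {i: knn_predict(train, pool[i])[0] for i in unlabelled}
        # Select: greedy, i.e. the `batch` lowest predicted scores.
        chosen = sorted(unlabelled, key=mu.__getitem__)[:batch]
        # Label and update: query the oracle, grow the training set, shrink the pool.
        train.extend([(pool[i], oracle(pool[i])) for i in chosen])
        chosen_set = set(chosen)
        unlabelled = [i for i in unlabelled if i not in chosen_set]
    return train
```

Swapping the greedy line for another acquisition rule, or `knn_predict` for a trained network, leaves the loop structure unchanged.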

Suggested hyper‑parameters
K = 10 000 (seed set size), S = 10 000 (molecules selected per iteration), N = 15 (active‑learning iterations).

Acquisition‑Function Cheat‑Sheet (Regression)

Name	Formula	Notes
Random	x ∼ Uniform(pool)	Pure exploration
Greedy	argmin μ̂(x)	Exploitation; selects samples with the lowest predicted score
UCB	argmin (μ̂(x) − β σ̂(x))	β ≈ 1–3. Because lower scores are better, the optimistic bound subtracts βσ̂ (equivalently, argmax of −μ̂ + βσ̂). Balances exploration & exploitation; requires uncertainty estimate
Thompson Sampling	Sample f(x) ∼ N(μ̂(x), σ̂²(x)) and pick argmin f(x)	Stochastic; needs σ̂
Expected Improvement (EI)	EI(x) = γ(x) Φ(z) + σ̂(x) ϕ(z), where γ(x) = f* − μ̂(x), z = γ(x) / σ̂(x), and f* is the best (lowest) score labelled so far	Favors high potential for improvement; needs σ̂
Probability of Improvement (PI)	PI(x) = Φ(z), with the same γ(x) and z as for EI	More conservative than EI; needs σ̂

(Φ and ϕ denote the CDF and PDF of the standard normal distribution.)
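As a concrete reading of the cheat-sheet, the sketch below scores candidates under each rule for a minimization problem (lower docking score is better). It is an assumption about how such rules are commonly written, not the `AcquisitionStrategy` implementation: every rule is turned into a utility where higher means "label this next", the confidence-bound rule uses the optimistic bound μ̂ − βσ̂, and for EI and PI the improvement is γ = f* − μ̂ with z = γ / σ̂, where `f_best` (f*) is the lowest score labelled so far. The names `acquisition_scores` and `select_top` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def acquisition_scores(mu: Sequence[float], sigma: Sequence[float], strategy: str,
                       f_best: Optional[float] = None, beta: float = 2.0,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Return one utility per candidate; higher utility = more worth labelling."""
    rng = rng or random.Random(0)
    if strategy == "random":
        return [rng.random() for _ in mu]
    if strategy == "greedy":
        return [-m for m in mu]
    if strategy == "ucb":
        # Optimistic bound for minimization: prefer low mean and high uncertainty.
        return [-(m - beta * s) for m, s in zip(mu, sigma)]
    if strategy == "thompson":
        return [-rng.gauss(m, s) for m, s in zip(mu, sigma)]
    if strategy in ("ei", "pi"):
        if f_best is None:
            raise ValueError("f_best is required for EI and PI")
        out = []
        for m, s in zip(mu, sigma):
            gamma = f_best - m  # improvement if the prediction were exact
            if s <= 0.0:  # no uncertainty: fall back to the deterministic limit
                out.append(max(gamma, 0.0) if strategy == "ei" else float(gamma > 0.0))
                continue
            z = gamma / s
            out.append(gamma * norm_cdf(z) + s * norm_pdf(z) if strategy == "ei" else norm_cdf(z))
        return out
    raise ValueError(f"unknown strategy: {strategy}")


def select_top(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k candidates with the highest utility."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

With a tight, well-scoring candidate and a mediocre but very uncertain one, the greedy rule picks the former while the confidence-bound rule can prefer the latter; that difference is the exploration term at work.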

Practical Tips & TODOs

Sparse fingerprints: Consider storing fingerprints as torch.sparse_csr_tensor for memory efficiency.
torch-sparse / torch-scatter: Useful for graph representations.
Early stopping: Use a patience of ~10 epochs.
WandB logging: Track wall‑clock time per acquisition iteration to compare strategies.
Batch size: Try acquisition sizes of 1 k, 5 k, and 10 k; larger batches don’t always accelerate discovery.
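The early-stopping tip can be made concrete with a small helper. This is a minimal, framework-agnostic sketch; the class name `EarlyStopping` and the `min_delta` parameter are assumptions for illustration and do not come from the script.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, call `stopper.step(val_loss)` once per epoch and break when it returns `True`; saving a checkpoint whenever `stopper.bad_epochs == 0` keeps the best weights.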

Setup & Installation

Clone or save the script (combined_docking_script.py).

Install dependencies:

conda create -n dds python=3.10 && conda activate dds
pip install pandas numpy torch scikit-learn tqdm matplotlib rdkit-pypi wandb

Prepare your data:
- Labeled data: a zip containing a CSV with a SMILES column and a r_i_docking_score column.
- Unlabeled data: a CSV with SMILES (and ideally ZINCID).
Update constants in the script: paths, learning‑rate, batch‑size, etc.
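The labelled-data layout described above (a zip holding a CSV with `SMILES` and `r_i_docking_score` columns) can be read with the standard library alone. The sketch below is one possible loader, not the script's own; `load_labelled_zip` is a hypothetical name, and taking the first CSV in the archive and skipping rows with missing or non-numeric scores are assumptions about sensible defaults.

```python
import csv
import io
import zipfile
from typing import List, Tuple


def load_labelled_zip(path_or_file, smiles_col: str = "SMILES",
                      score_col: str = "r_i_docking_score") -> List[Tuple[str, float]]:
    """Read (SMILES, score) pairs from the first CSV found inside a zip archive."""
    pairs: List[Tuple[str, float]] = []
    with zipfile.ZipFile(path_or_file) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            header = reader.fieldnames or []
            if smiles_col not in header or score_col not in header:
                raise ValueError(f"expected columns {smiles_col!r} and {score_col!r}")
            for row in reader:
                smiles = (row.get(smiles_col) or "").strip()
                score = (row.get(score_col) or "").strip()
                if not smiles or not score:
                    continue  # skip rows with a missing SMILES or score
                try:
                    pairs.append((smiles, float(score)))
                except ValueError:
                    continue  # skip rows whose score is not numeric
    return pairs
```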

Running the Script

python combined_docking_script.py

The script:

Sets up logging and chooses CPU/GPU.
Loads/featurizes data and builds DataLoaders.
Trains the initial model and plots losses.
Enters the active‑learning loop for N_ACTIVE_LEARNING_ITERATIONS.
Logs every iteration (WandB is strongly recommended).
Saves the best model to best_model.pt and writes plots to the logs/ directory.

Configuration & Customization

Acquisition function: Pass 'greedy', 'mc_dropout', 'random', or 'EDL' when constructing DockingModelActiveLearning.
Model architecture: Edit DockingModel to experiment with layers, activations, and dropout.
Fingerprint parameters: get_morgan(radius=2, n_bits=1024) sets the Morgan fingerprint radius and bit-vector length; both can be tuned.
Multiprocessing: DataProcessor uses one worker process per CPU core.
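The `'mc_dropout'` option presumably refers to Monte Carlo dropout, where dropout stays active at inference time and the spread over repeated forward passes serves as the uncertainty estimate. The sketch below illustrates that idea on a toy model in plain Python; it is an assumption about the technique, not the code inside `DockingModelActiveLearning`, and both function names are hypothetical.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def make_dropout_linear(weights: Sequence[float], p_drop: float = 0.2,
                        seed: int = 0) -> Callable[[Sequence[float]], float]:
    """Toy model: a linear layer with inverted dropout on its inputs, left on at inference."""
    rng = random.Random(seed)
    keep = 1.0 - p_drop

    def forward(x: Sequence[float]) -> float:
        # Each input is kept with probability `keep` and rescaled by 1/keep,
        # so the expected output matches the dropout-free model.
        return sum(w * xi / keep for w, xi in zip(weights, x) if rng.random() < keep)

    return forward


def mc_dropout_predict(forward: Callable[[Sequence[float]], float],
                       x: Sequence[float], n_passes: int = 20) -> Tuple[float, float]:
    """Run several stochastic forward passes; return (mean, std) of the outputs."""
    samples = [forward(x) for _ in range(n_passes)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

The returned mean and standard deviation play the roles of μ̂(x) and σ̂(x) in uncertainty-aware acquisition rules.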

Interpreting Output

Logs: Console and logs/logfile.log.
Plots: Training/validation losses and true vs predicted scatter.
WandB: Rich dashboards for every run & iteration.

Potential Improvements

Additional acquisition functions (e.g., Expected Model Change, QBC).
Alternate molecular representations (ECFP, descriptors, GNNs).
Containerize with Docker for reproducible deployment.
Robust error‑handling & input validation.
Smarter data splitting & CV strategies.
