DiffDock Fine Tuning

This is a fork of DiffDock-L with performance optimizations and wrapper scripts designed to make fine-tuning easier.

The original DiffDock-L source can be found here.

System Requirements

The wrapper scripts in this repo are designed to run on an HPC using Slurm for job management and Singularity for container management. The Slurm installation must have a cuda module and an openbabel module.

The HPC these scripts were tested on have general compute nodes with 48+ CPU cores, and GPU nodes with 4 GPUs. Both types of nodes have 120+GB of general RAM available, and each GPU in the GPU nodes have 80GB of VRAM. If your HPC's general compute nodes have fewer CPUs per node, it may be worth editing cpus-per-task in preprocess.sbatch and num_dataloader_workers in preprocess.sh accordingly. If your HPC's GPU nodes have a different number of attached GPUs, it may be necessary to edit tasks and gres in train_diffdock.sbatch accordingly.

Obtaining Pretrained Model

This repository is designed for fine-tuning an existing DiffDock-L model rather than training a whole new model from scratch, which can be very computationally expensive. The original authors of DiffDock-L have kindly published their model here, which is suitable for this purpose.

Environmental Setup

DiffDock requires a Singularity container with an appropriate Conda environment installed.

For instructions to setup a Conda environment inside of a Singularity container, see here, or check with your HPC administrator.

From inside the newly created Conda Singularity container, DiffDock's dependencies can be installed by running

conda env create --file environment.yml
conda activate diffdock

If using wandb.ai to track training progress, wandb can be set up by following just step 1 of the directions here.

Dataset Preparation

The dataset used for training and testing is a curated merge between the KLIFS and Modi and Dunbrack dataset. It can be found on Zenodo at doi:10.5281/zenodo.17373044

These scripts have been written assuming a rather specific layout of the dataset being used for both training and testing. We assume that both training and testing data are co-located in the same directory, and the distrinction between training and testing is managed through "split files", which are just plain text documents containing lists of complexes. Complexes need to be named in the form {PDB ID}_{CHAIN ID}_{LIGAND NAME}_{OPTIONAL SUFFIX}. The dataset directory is assumed to contain sub-directories of each complex. Each sub-directory should contain 2 files: a protein file, and a ligand file. The protein files should contain the chain of the apo protein that actually binds to the ligand, while the ligand file should contain all the HETATM entries consisting of the ligand. The protein file should be named using the scheme {PDB ID}_protein.pdb, while the ligand file should be named like {PDB ID}_ligand.pdb. Each complex sub-directory should only contain one protein chain and one ligand, so higher order complexes need to be broken up accordingly. An example dataset directory might look like this:

my_dataset/8ssp_A_627_510/8ssp_protein.pdb
my_dataset/8ssp_A_627_510/8ssp_ligand.pdb
my_dataset/8ssp_A_PDO_507/8ssp_protein.pdb
my_dataset/8ssp_A_PDO_507/8ssp_ligand.pdb

ESM Embedding Preparation

It's still necessary to generate ESM embeddings for all the proteins in the dataset. In order to do this, simply follow the instructions outlined in the original DiffDock repo, but skip the very last step, where it says to run esm_embeddings_to_pt.py. This last step collates all the ESM embeddings in the dataset into one big file, which we have noticed can sometimes cause out of memory issues. Instead, we have modified the dataloader to process the ESM embeddings as individual files to keep the memory footprint down.

Configuration

The last manual step in setting up DiffDock is to edit the configuration file, config.sh. This file contains a number of configuration variables (e.g. run names, input data paths, etc.) needed by downstream scripts. See the comments in config.sh for more details on what each variable means.

Fine Tuning

Preprocessing

While the training code can theoretically be run with no preprocessing step and will simply perform preprocessing on the fly in the dataloader thread, in practice, some ligands with a large number of rotatable bonds will take excessive amounts of CPU time in the conformer matching stage. In order to make the most efficient use of GPU node time, we've shuffled much of the CPU expensive operations into a discrete preprocessing step that can be run on a general compute node.

Our preprocessing scripts also take care of the problem of hydrogenation. Because most crystal structures lack hydrogens, there are some ambiguities in how the ligand file should be loaded, and RDKit, the library DiffDock uses for this purpose, often makes mistakes. In particular, RDKit doesn't seem readily able to distinguish between aromatic and cycloalkane rings in PDB files, which can have a profound impact on model performance. Our preprocessing scripts take care of this problem by explicitly adding hydrogens to the ligand and converting it to an SDF file. See preprocess_ligs.py for details on the algorithm.

Once config.sh is properly set up, preprocessing should be as easy as running

sbatch preprocess.sbatch

Note that this will attempt to launch a job on a 48 core general compute node. Please adjust cpus-per-task in preprocess.sbatch if this isn't feasible on your HPC.

Training

After preprocessing is complete, training can be started by running

sbatch train_diffdock.sbatch

Note that this will attempt to launch a job on a GPU node with 4 open GPUs that lasts 2 days. Please adjust tasks and gres in train_diffdock.sbatch if your HPC does not have GPU nodes with 4 GPUs. Both variables must match the number of GPUs. Note that reducing this number from 4 will likely make training take longer.

Testing

Running The Model

Unlike the preprocessing and training scripts, the model eval script is designed to be run as an array job, since model evalutation is easily parallelizable in a way that doesn't require communication between workers. To evaluate the model's performance on the test dataset, simply run

sbatch --array=0-<number of workers> eval_model.sbatch

Statistical Summary

The eval script will create a "results" directory (configured in config.sh) containing PDB files depicting all the reverse diffusion steps of each docking run. It will also contain accuracy statistics from each eval worker, which can be aggregated using

sbatch stats.sbatch

Name		Name	Last commit message	Last commit date
Latest commit History 154 Commits
app		app
confidence		confidence
data		data
datasets		datasets
examples		examples
models		models
spyrmsd		spyrmsd
utils		utils
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
config.sh		config.sh
default_confidence_args.yaml		default_confidence_args.yaml
default_inference_args.yaml		default_inference_args.yaml
diffdock-s_inference_args.yaml		diffdock-s_inference_args.yaml
environment.yml		environment.yml
eval_model.sbatch		eval_model.sbatch
eval_model.sh		eval_model.sh
evaluate.py		evaluate.py
evaluate_alloset.py		evaluate_alloset.py
gen_cache.py		gen_cache.py
infer_confidence.py		infer_confidence.py
inference.py		inference.py
overview.png		overview.png
preprocess.sbatch		preprocess.sbatch
preprocess.sh		preprocess.sh
preprocess_ligs.py		preprocess_ligs.py
requirements.txt		requirements.txt
stats.py		stats.py
stats.sbatch		stats.sbatch
stats.sh		stats.sh
train.py		train.py
train_diffdock.sbatch		train_diffdock.sbatch
train_diffdock.sh		train_diffdock.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DiffDock Fine Tuning

System Requirements

Obtaining Pretrained Model

Environmental Setup

Dataset Preparation

ESM Embedding Preparation

Configuration

Fine Tuning

Preprocessing

Training

Testing

Running The Model

Statistical Summary

About

Uh oh!

Releases

Packages

Languages

License

echen1214/DiffDock-Fine-Tune

Folders and files

Latest commit

History

Repository files navigation

DiffDock Fine Tuning

System Requirements

Obtaining Pretrained Model

Environmental Setup

Dataset Preparation

ESM Embedding Preparation

Configuration

Fine Tuning

Preprocessing

Training

Testing

Running The Model

Statistical Summary

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages