A Snakemake-based RBFE end-to-end workflow using BioSimSpace and sire, with modular minimisation and equilibration staging and support for multiple engines.

The included example is a 3-ligand subset of the TYK2 dataset, with minimisation/equilibration based on the protocol described in [Daniel R. Roe, Bernard R. Brooks; A protocol for preparing explicitly solvated systems for stable molecular dynamics simulations. J. Chem. Phys. 7 August 2020; 153 (5): 054123](https://doi.org/10.1063/5.0013849). The full rulegraph and DAG of the workflow are:
First, ensure that you have a conda environment with all required packages installed. An example environment is included in `workflow/envs/RBFE_workflow_environment.yml`. This can be used to create an environment by running:

```shell
mamba env create --file workflow/envs/RBFE_workflow_environment.yml
mamba activate openbiosim
```
Once this environment has been created, the workflow can be run in a few ways. To run locally, use the command:

```shell
snakemake --cores all --resources gpu=1 {extra options}
```

See the Snakemake docs for a full list of available command line options.

Note that this assumes that only a single GPU is available on your machine. If you have more than one GPU available, `--resources gpu=X` can be scaled up; however, this workflow does not contain any internal GPU management, meaning that all simulations will end up being run on GPU 0.
For multi-GPU machines or clusters this workflow should be run via SLURM, with accounting enabled. This requires an additional package: https://github.com/snakemake/snakemake-executor-plugin-slurm. It is also advised to use a SLURM-specific profile, which defines the per-job resources. A basic profile is included in `profiles/config.yaml`, but this will need to be updated for your machine-specific requirements.
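As a rough illustration, a SLURM profile of this kind might look like the sketch below. The partition and account names are placeholders, and the exact keys and values should be taken from the shipped `profiles/config.yaml` rather than from here:

```yaml
# profiles/config.yaml — illustrative sketch only; adapt to your cluster
executor: slurm
jobs: 100
default-resources:
  slurm_partition: "gpu"       # placeholder partition name
  slurm_account: "my_account"  # placeholder accounting group
  runtime: 720                 # per-job wall time, in minutes
  mem_mb: 16000
```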
Once this is all set up, the workflow can be run using something like:

```shell
snakemake --executor slurm --profile profiles --use-conda --jobs 100 --resources gpu={max number GPUs} {extra options}
```

with `--jobs` defining the maximum number of jobs that Snakemake will place in the SLURM queue at any given time, and `--resources gpu` capping the number of GPUs those jobs may occupy concurrently.
Managing resources within molecular dynamics workflows is hardware and engine dependent, and should be adjusted on a machine-to-machine basis. Within this workflow some options are provided, namely the `simulation_threads` config option and the `cpus_per_gpu` option within `profiles/config.yaml`.

The `simulation_threads` option defines the number of threads assigned to each rule which involves running one or more simulations. This option can have a large impact on performance, especially when using GROMACS, and as such careful consideration should be given to finding an optimal value for your specific hardware. The `cpus_per_gpu` option is specific to SLURM and should be set to a value greater than or equal to `simulation_threads` when running the workflow using SLURM.
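For example, on a node where each simulation performs best with 8 threads, the two settings might be kept in step as follows (illustrative values only; `simulation_threads` lives in `config/config.yaml` and `cpus_per_gpu` in `profiles/config.yaml`):

```yaml
# config/config.yaml (fragment) — illustrative value
simulation_threads: 8
```

```yaml
# profiles/config.yaml (fragment) — illustrative value
default-resources:
  cpus_per_gpu: 8   # should be >= simulation_threads
```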
All settings for each stage of this workflow are defined within `config/config.yaml`; this includes settings for network preparation, solvation, minimisation, equilibration and production. Full descriptions of each available setting can be found in `doc/workflow_settings.md`.
All ligands should be placed in `inputs/ligands`; these can be in any BioSimSpace-supported format, and currently are assumed to NOT be parameterised. The protein of interest should be placed in the `inputs/protein` folder, and is assumed to be parameterised. All ligands should be in their correct docking pose.
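For reference, an input layout consistent with the description above might look like this (the ligand and protein file names are placeholders; any BioSimSpace-supported ligand format is fine):

```
inputs/
├── ligands/
│   ├── lig_a.sdf
│   ├── lig_b.sdf
│   └── lig_c.sdf
└── protein/
    └── protein.pdb
```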
The workflow includes rules and scripts for the preparation of the network, namely `scripts/prepare_network.py`. This script is part of the workflow, and will be run as the first step within it. If you wish to run this script in isolation in order to fine-tune your network, it can be run prior to running the workflow: if a folder called `network` is detected, and it has a file named `network.dat` within it, then the `prep_network` rule will be skipped by the workflow. If you wish to create your own network, simply create the `network` folder and include an appropriately formatted `network.dat` file.
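The authoritative reference for the `network.dat` format is a file generated by `scripts/prepare_network.py` itself; as a rough sketch only, each line describes one perturbation between a pair of ligands (the ligand names below are purely illustrative, and any additional per-edge columns should be copied from a generated file):

```
# network/network.dat — illustrative sketch only
lig_a lig_b
lig_b lig_c
lig_a lig_c
```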
If network preparation is run as part of the workflow, its settings can be defined in the `network_settings` portion of the config.
The second step in the workflow is the preparation of the free and bound systems. Included in this step is ligand parameterisation, creation of the protein-ligand bound system, and solvation, defined by the settings given within `config.yaml`.
Minimisation and equilibration steps can be added and removed according to user requirements, and will execute in the order defined in the config file, regardless of naming. Take for example this simple two-step protocol:
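A sketch of what such a two-step protocol could look like in the config is given below. The stage names and keys here are illustrative assumptions, not the exact schema; consult `doc/workflow_settings.md` for the real setting names:

```yaml
# config/config.yaml (fragment) — illustrative sketch only
equilibration:
  01_nvt:
    ensemble: nvt
    runtime: 100ps
    temperature: 300K
  02_npt:
    ensemble: npt
    runtime: 200ps
    temperature: 300K
    pressure: 1atm
```

Because stages run in the order they appear in the config regardless of naming, reordering the entries reorders the protocol.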
If additional equilibration is needed, for instance at a higher timestep or in a different regime, additional steps can be added to the config and will subsequently become part of the workflow. Here we've added an additional NPT step to the bound leg equilibration:
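A sketch of how such an addition might look (again, the stage names and keys are illustrative assumptions rather than the exact schema; see `doc/workflow_settings.md`):

```yaml
# config/config.yaml (fragment) — illustrative sketch only
equilibration_bound:
  01_nvt:
    ensemble: nvt
    runtime: 100ps
  02_npt:
    ensemble: npt
    runtime: 200ps
  03_npt_long:          # newly added NPT stage
    ensemble: npt
    runtime: 500ps
    timestep: 4fs       # e.g. a higher timestep regime
```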
Once all systems have been created and the end states minimised and equilibrated, the next step is to create the merged systems. This takes place in the `fep_prep` step. Each perturbation defined within `network/network.dat` will be generated for both free and bound legs. Settings for this stage (outside of those defined in `network.dat`) are taken directly from the `production-settings` section of `config.yaml`.
Now that everything is prepared, production can be run, with all settings for production runs defined by the `production-settings` section of `config.yaml`; most of these settings can be defined independently for bound and free legs. Each perturbation defined within `network/network.dat` will be run `num-replicas` times for both bound and free legs.
Once production runs are finished, automated analysis will be run on all successful legs. This analysis can be configured within the `analysis-settings` section of `config.yaml`. By default the workflow will attempt to use openFE-cinnabar for network plotting, but if this is not found it will fall back to a native method.
Note that any experimental data is assumed to be given in the same layout as the provided example, that is, three columns: `ligand`, `experimental DG`, `experimental error`. Experimental values must be given in one of three forms: `kcal/mol`, `kj/mol`, or `Ki(μM)`.
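An experimental data file in that three-column layout might look like the sketch below. The ligand names and values are placeholders, and the delimiter should match whatever the provided example file uses (shown here as comma-separated):

```
ligand,experimental DG,experimental error
lig_a,-9.2,0.4
lig_b,-8.7,0.3
lig_c,-10.1,0.5
```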
Results for the example cycle, run using the SOMD engine, are as follows: