GitHub - GBLille/MassiveFold

MassiveFold is a tool that massively expands the sampling of structure predictions by improving the computation of AlphaFold-based predictions.

It optimizes the parallelization of structure inference by running the alignments on CPU, automatically running batches of structure predictions on GPU, and gathering the results in one global output directory, with a global ranking and a variety of plots.

MassiveFold uses AFmassive, ColabFold or AlphaFold3 as inference engine; AFmassive is an updated version of Björn Wallner's AFsample that offers additional diversity parameters for massive sampling.

MassiveFold: parallelize protein structure prediction

MassiveFold's design (see schematic below) is optimized for GPU cluster usage. It enables fast massive sampling by automatically splitting a large run of many predictions into several jobs. Each job is computed on a single GPU node, and the results are then gathered into a single output in which every prediction is ranked globally rather than within its individual job.

This automatic splitting is also convenient for massive sampling on a single GPU server, where it helps manage job priorities.
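
The splitting idea can be illustrated with a few lines of Python. This is only a sketch of the principle, not MassiveFold's actual code: the function name `split_into_batches` and the `batch_size` parameter are assumptions introduced here for illustration.

```python
def split_into_batches(n_predictions: int, batch_size: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) prediction indices for each batch.

    Hypothetical helper: names and signature are illustrative only.
    """
    if n_predictions < 0 or batch_size <= 0:
        raise ValueError("n_predictions must be >= 0 and batch_size must be > 0")
    batches = []
    for start in range(0, n_predictions, batch_size):
        # The last batch may be smaller than batch_size.
        end = min(start + batch_size, n_predictions) - 1
        batches.append((start, end))
    return batches


if __name__ == "__main__":
    # 67 predictions in batches of at most 25 give three batches.
    for batch_id, (start, end) in enumerate(split_into_batches(67, 25)):
        print(f"batch {batch_id}: predictions {start} to {end}")
```

Each resulting batch could then be submitted as an independent GPU job, which is what lets a scheduler interleave the batches with other work.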

MassiveFold is optimized for the SLURM workload manager (Simple Linux Utility for Resource Management), as it relies heavily on SLURM features such as job arrays and job dependencies. It can still be used sequentially, with no batches running in parallel (see documentation).
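
To show how job arrays and job dependencies can chain a CPU job, a set of GPU batches and a final CPU job, here is a minimal Python sketch that only builds the `sbatch` command lines. The `--parsable`, `--array` and `--dependency=afterok:` options are standard SLURM options; the script names, the helper functions and the job IDs are hypothetical and do not reflect the files MassiveFold actually generates.

```python
import shlex


def alignment_command(script: str) -> list[str]:
    # --parsable makes sbatch print the job ID in a machine-readable form,
    # so it can be passed to the next submission.
    return ["sbatch", "--parsable", script]


def prediction_array_command(script: str, n_batches: int, alignment_job_id: int) -> list[str]:
    # One array task per batch; the array starts only if the alignment job succeeded.
    return [
        "sbatch",
        "--parsable",
        f"--array=0-{n_batches - 1}",
        f"--dependency=afterok:{alignment_job_id}",
        script,
    ]


def gathering_command(script: str, array_job_id: int) -> list[str]:
    # Depending on the array job ID waits for every task of the array.
    return ["sbatch", "--parsable", f"--dependency=afterok:{array_job_id}", script]


if __name__ == "__main__":
    # Placeholder job IDs; in practice each ID comes from the previous sbatch output.
    commands = [
        alignment_command("alignment.slurm"),
        prediction_array_command("predict_batch.slurm", n_batches=3, alignment_job_id=1000),
        gathering_command("gather.slurm", array_job_id=1001),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```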

A run is composed of three steps:

alignment: on CPU, sequence alignment is the first step (it can be skipped if the alignments are already computed).
structure prediction: on GPU, structure predictions follow the massive sampling principle. The total number of predictions is divided into smaller batches, each of which runs on a single GPU. Unless the user provides the alignments, these jobs wait for the alignment job to finish.
post_treatment: on CPU, it finishes the run by gathering the outputs of all batches and producing plots with the plots module to visualize the run's performance. This job runs only once all the structure predictions are finished.
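
The gathering step can be pictured as merging per-batch scores into one ranking. The sketch below assumes each batch reports a mapping from prediction name to a confidence score where higher is better; the data layout, the names and the scores are illustrative assumptions, not MassiveFold's real output format.

```python
def global_ranking(batch_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Merge per-batch scores and rank all predictions together.

    Hypothetical helper: input layout and naming are illustrative only.
    """
    merged = []
    for batch_name, scores in batch_scores.items():
        for prediction_name, score in scores.items():
            # Prefix with the batch name so predictions stay unique after merging.
            merged.append((f"{batch_name}/{prediction_name}", score))
    # Highest score first, across all batches.
    return sorted(merged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example = {
        "batch_0": {"pred_0": 0.71, "pred_1": 0.85},
        "batch_1": {"pred_0": 0.90},
    }
    for rank, (name, score) in enumerate(global_ranking(example), start=1):
        print(rank, name, score)
```

Ranking after the merge, rather than per batch, is what makes the best prediction of the whole run appear first regardless of which batch produced it.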

Installation

MassiveFold was developed to run massive sampling with AFmassive, ColabFold and AlphaFold3, and relies on them for its installation.

Follow the MassiveFold installation guide.
It details each step of the MassiveFold installation.

Uninstallation

To uninstall MassiveFold, remove the four conda environments (massivefold-1.8.2, mf-afmassive-1.1.10, mf-colabfold-1.6.1 and mf-alphafold3-1.1.0) and remove the MassiveFold folder you cloned. Before doing so, copy any files and folders you want to keep from the output and log directories to another location.

Usage

Running MassiveFold

This section shows the simplest way to run MassiveFold, with examples. For more detail on how it works and on other use cases, see the usage documentation.

First, activate the conda environment:

conda activate massivefold-1.8.2

Then launch MassiveFold:

massivefold run -s <SEQUENCE_PATH> -r <RUN_NAME> -p <NUMBER_OF_PREDICTIONS_PER_MODEL> -f <JSON_PARAMETERS_FILE> -t <TOOL>

Example for AFmassive:

massivefold run -s input/H1140.fasta -r afm_default -p 5 -f AFmassive_params.json

Example for ColabFold:

massivefold run -s input/H1140.fasta -r cf_default -p 5 -f ColabFold_params.json

Example for AlphaFold3:

massivefold run -s input/H1140.fasta -r af3_default -p 5 -f AlphaFold3_params.json
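
When several runs need to be scripted, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options shown above (`-s`, `-r`, `-p`, `-f`, `-t`); the helper `build_run_command` is hypothetical and not part of MassiveFold, and the sketch only prints the commands rather than launching them.

```python
import shlex
from typing import Optional


def build_run_command(
    sequence_path: str,
    run_name: str,
    predictions_per_model: int,
    params_file: str,
    tool: Optional[str] = None,
) -> list[str]:
    """Assemble a `massivefold run` command as an argument list (hypothetical helper)."""
    cmd = [
        "massivefold", "run",
        "-s", sequence_path,
        "-r", run_name,
        "-p", str(predictions_per_model),
        "-f", params_file,
    ]
    if tool is not None:
        # -t is only added when a tool is given, mirroring the examples above.
        cmd += ["-t", tool]
    return cmd


if __name__ == "__main__":
    runs = [
        ("afm_default", "AFmassive_params.json"),
        ("cf_default", "ColabFold_params.json"),
        ("af3_default", "AlphaFold3_params.json"),
    ]
    for run_name, params_file in runs:
        cmd = build_run_command("input/H1140.fasta", run_name, 5, params_file)
        # To launch for real, pass cmd to subprocess.run(cmd, check=True)
        # from within the activated conda environment.
        print(shlex.join(cmd))
```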

Screening a receptor with ligands

First, activate the conda environment:

conda activate massivefold-1.8.2

To screen a single protein receptor with multiple ligands, launch:

massivefold screen -s <receptor_fasta_file> -l <ligand_list_csv> -f <AlphaFold3_params.json>

Here, -p, the number of seeds (for AF3) or predictions per model (for AF2), defaults to 1.
See documentation for further details.

Discover PPI between receptors and ligands

First, activate the conda environment:

conda activate massivefold-1.8.2

To launch a PPI discovery round between a set of protein (or DNA or RNA) receptors and a set of protein (or DNA or RNA) ligands, run:

massivefold ppi --receptors <receptor_list> --ligands <ligand_list> -f <AlphaFold3_params.json>

You can also screen each potential PPI (receptor-ligand pair) with a defined list of small molecules by using --context <ligand_list_csv> (same usage as screening).

Here, -p, the number of seeds (for AF3) or predictions per model (for AF2), defaults to 1.
See documentation for further details.
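
To get a feel for how many predictions such a round implies, the candidate combinations can be enumerated. This sketch assumes that every receptor is paired with every ligand and that, when a context list is given, each pair is combined with each small molecule; that pairing rule, the function name and the identifiers are assumptions made for illustration, so check the documentation for the actual behavior.

```python
from itertools import product
from typing import Optional, Sequence


def candidate_combinations(
    receptors: Sequence[str],
    ligands: Sequence[str],
    context: Optional[Sequence[str]] = None,
) -> list[tuple[str, ...]]:
    """Enumerate receptor-ligand pairs, optionally crossed with small molecules.

    Hypothetical helper illustrating an assumed all-against-all pairing.
    """
    if context:
        # Each receptor-ligand pair is combined with each small molecule.
        return list(product(receptors, ligands, context))
    return list(product(receptors, ligands))


if __name__ == "__main__":
    pairs = candidate_combinations(["R1", "R2"], ["L1", "L2", "L3"])
    print(len(pairs), "receptor-ligand pairs")
    with_context = candidate_combinations(["R1", "R2"], ["L1", "L2", "L3"], ["mol_a", "mol_b"])
    print(len(with_context), "combinations with small molecules")
```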

Launch several runs

First, activate the conda environment:

conda activate massivefold-1.8.2

To launch multiple runs for a single sequence combination, use the multirun pipeline with the following command:

massivefold multirun -s <SEQUENCE_PATH> --setup <JSON_SETUP_FILE>

Example:

massivefold multirun -s input/H1140.fasta --setup ../src/massivefold/examples/multirun_setup.json

massivefold_plots: output representation

In addition to configuring the plot parameters inside the MassiveFold JSON parameter file, the plots module can also be used on an existing MassiveFold (or AlphaFold2) output to evaluate its predictions visually.

For more details on this usage, see MassiveFold plots documentation.

Troubleshooting

Some known issues have been identified and can be prevented by following the steps described in the troubleshooting documentation.

Citation

If you use MassiveFold in your work, please cite:

Raouraoua N. et al. MassiveFold: unveiling AlphaFold’s hidden potential with optimized and parallelized massive sampling. Nature Computational Science, 2024. DOI: 10.1038/s43588-024-00714-4
https://www.nature.com/articles/s43588-024-00714-4

Authors

Nessim Raouraoua (UGSF - UMR 8576, France)
Claudio Mirabello (NBIS, Sweden)
Thibaut Véry (IDRIS, France)
Christophe Blanchet (IFB, France)
Björn Wallner (Linköping University, Sweden)
Marc F Lensink (UGSF - UMR 8576, France)
Guillaume Brysbaert (UGSF - UMR 8576, France)

This work was carried out as part of Work Package 4 of the MUDIS4LS project led by the French Bioinformatics Institute (IFB). It was initiated at the IDRIS Open Hackathon, part of the Open Hackathons program. The authors would like to acknowledge OpenACC-Standard.org for their support.

