MassiveFold is a tool that massively expands the sampling of structure predictions by improving the computation of AlphaFold-based predictions.
It optimizes the parallelization of structure inference by running the alignments on CPU, automatically running batches of structure predictions on GPU, and gathering the results in one global output directory, with a global ranking and a variety of plots.
MassiveFold uses AFmassive, ColabFold or AlphaFold3 as its inference engine; AFmassive is an updated version of Björn Wallner's AFsample that offers additional diversity parameters for massive sampling.
MassiveFold's design (see schematic below) is optimized for GPU cluster usage. It allows fast computation for massive sampling by automatically splitting a large run of numerous predictions into several jobs. Each of these individual jobs is computed on a single GPU node, and their results are then gathered into a single output in which every prediction is ranked globally rather than within its individual job.
This automatic splitting is also convenient for massive sampling on a single GPU server, where it helps manage job priorities.
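The splitting and global ranking described above can be illustrated with a minimal Python sketch. It is not MassiveFold's actual implementation: the `Prediction` class, its `score` field (assumed to be a confidence value where higher is better) and both function names are hypothetical, chosen only to show the principle.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    batch_id: int   # batch (job) that produced this prediction
    name: str       # hypothetical prediction label
    score: float    # hypothetical confidence score, higher is better


def split_into_batches(total: int, batch_size: int) -> list[range]:
    """Split prediction indices 0..total-1 into consecutive batches."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    return [range(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]


def global_ranking(batches: list[list[Prediction]]) -> list[Prediction]:
    """Merge the results of all batches and rank every prediction together."""
    merged = [p for batch in batches for p in batch]
    return sorted(merged, key=lambda p: p.score, reverse=True)


if __name__ == "__main__":
    # 25 predictions in batches of at most 10 give three batches.
    for i, batch in enumerate(split_into_batches(total=25, batch_size=10)):
        print(f"batch {i}: predictions {batch.start}..{batch.stop - 1}")
```

With 25 predictions and a batch size of 10, the sketch yields three batches of 10, 10 and 5 predictions; ranking is then done once over the merged results, so a prediction from the last batch can still be ranked first overall.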
MassiveFold is optimized for the SLURM workload manager (Simple Linux Utility for Resource Management), as it relies heavily on SLURM features such as job arrays and job dependencies. It can still be used sequentially, with no batches running in parallel (see documentation).
A run is composed of three steps:
- alignment: on CPU, sequence alignment is the first step (it can be skipped if the alignments are already computed).
- structure prediction: on GPU, structure predictions follow the massive sampling principle. The total number of predictions is divided into smaller batches, and each batch is run on a single GPU. If the alignments are not provided by the user, these jobs wait for the alignment job to finish.
- post_treatment: on CPU, this step finishes the run by gathering the outputs of all batches and producing plots with the plots module to visualize the run's performance. This job is executed only once all the structure predictions are finished.
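The ordering of these three steps maps naturally onto SLURM job arrays and job dependencies. The sketch below only builds the command lines, without submitting anything. It is an illustration, not the commands MassiveFold actually generates: the script names (`alignment.sh`, `prediction.sh`, `post_treatment.sh`) and the job ID placeholders are hypothetical, while `--parsable`, `--array` and `--dependency=afterok:` are standard `sbatch` options.

```python
def build_sbatch_commands(num_batches: int,
                          skip_alignment: bool = False) -> list[list[str]]:
    """Build sbatch command lines for alignment, prediction and post-treatment.

    Script names and job ID placeholders are hypothetical.
    """
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")

    commands = []
    if not skip_alignment:
        # Step 1: a single CPU job computing the alignments.
        commands.append(["sbatch", "--parsable", "alignment.sh"])

    # Step 2: one array task per batch, each running on one GPU.
    predict = ["sbatch", "--parsable", f"--array=0-{num_batches - 1}"]
    if not skip_alignment:
        # Wait until the alignment job has completed successfully.
        predict.append("--dependency=afterok:<ALIGNMENT_JOB_ID>")
    predict.append("prediction.sh")
    commands.append(predict)

    # Step 3: gather and plot once every array task has succeeded.
    commands.append(["sbatch", "--parsable",
                     "--dependency=afterok:<PREDICTION_ARRAY_JOB_ID>",
                     "post_treatment.sh"])
    return commands


if __name__ == "__main__":
    for cmd in build_sbatch_commands(num_batches=4):
        print(" ".join(cmd))
```

In a real submission, each placeholder would be replaced by the job ID that the previous `sbatch --parsable` call prints. Passing `skip_alignment=True` drops the first job and the dependency on it, which mirrors the case where the alignments are provided by the user.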
MassiveFold was developed to run massive sampling with AFmassive, ColabFold and AlphaFold3, and relies on them for its installation.
Follow the MassiveFold installation guide, which details each step of the MassiveFold installation.
To uninstall MassiveFold, remove the four conda environments (massivefold-1.8.2, mf-afmassive-1.1.10, mf-colabfold-1.6.1 and
mf-alphafold3-1.1.0) and remove the MassiveFold folder you cloned. Before doing so, copy any files and folders you want
to keep from the output and log directories to another location.
This usage section shows the simplest way to run MassiveFold, with examples. For more details on how it works and on other use cases, see the usage documentation.
First, activate the conda environment:
    conda activate massivefold-1.8.2
Then launch MassiveFold:
    massivefold run -s <SEQUENCE_PATH> -r <RUN_NAME> -p <NUMBER_OF_PREDICTIONS_PER_MODEL> -f <JSON_PARAMETERS_FILE> -t <TOOL>
Example for AFmassive:
    massivefold run -s input/H1140.fasta -r afm_default -p 5 -f AFmassive_params.json
Example for ColabFold:
    massivefold run -s input/H1140.fasta -r cf_default -p 5 -f ColabFold_params.json
Example for AlphaFold3:
    massivefold run -s input/H1140.fasta -r af3_default -p 5 -f AlphaFold3_params.json
First, activate the conda environment:
    conda activate massivefold-1.8.2
To screen a single protein receptor with multiple ligands, launch:
    massivefold screen -s <receptor_fasta_file> -l <ligand_list_csv> -f <AlphaFold3_params.json>
Here, the default value of -p, which is the number of seeds (for AF3) or predictions per model (for AF2), is 1.
See documentation for further details.
First, activate the conda environment:
    conda activate massivefold-1.8.2
To launch a PPI discovery round between a set of protein (or DNA or RNA) receptors and a set of protein (or DNA or RNA) ligands, run:
    massivefold ppi --receptors <receptor_list> --ligands <ligand_list> -f <AlphaFold3_params.json>
You can also screen each potential PPI (receptor-ligand pair) with a defined list of small molecules by using --context <ligand_list_csv> (same usage as screening).
Here, the default value of -p, which is the number of seeds (for AF3) or predictions per model (for AF2), is 1.
See documentation for further details.
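To get a feel for how quickly the search space grows in a PPI discovery round, the receptor-ligand pairing can be sketched in Python. This is only an illustration under stated assumptions: the function name is hypothetical, every receptor is assumed to be paired with every ligand, and each small molecule from the context list is assumed to be tried separately with each pair. The source does not specify how MassiveFold combines them internally.

```python
from itertools import product
from typing import Optional, Sequence


def enumerate_ppi_jobs(
    receptors: Sequence[str],
    ligands: Sequence[str],
    context: Optional[Sequence[str]] = None,
) -> list[tuple[str, str, Optional[str]]]:
    """List (receptor, ligand, small_molecule) combinations to evaluate.

    Without a context list, the small molecule slot is None.
    """
    pairs = product(receptors, ligands)
    if not context:
        return [(r, l, None) for r, l in pairs]
    # Assumption: each small molecule is screened separately per pair.
    return [(r, l, c) for (r, l) in pairs for c in context]


if __name__ == "__main__":
    jobs = enumerate_ppi_jobs(["recA", "recB"], ["lig1", "lig2", "lig3"])
    print(len(jobs), "receptor-ligand pairs")
```

Under these assumptions, 2 receptors and 3 ligands give 6 pairs, and adding a context list of 2 small molecules raises this to 12 combinations, each of which is then predicted -p times.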
First, activate the conda environment:
    conda activate massivefold-1.8.2
To launch multiple runs for a single sequence combination, use the multirun pipeline with the following command:
    massivefold multirun -s <SEQUENCE_PATH> --setup <JSON_SETUP_FILE>
Example:
    massivefold multirun -s input/H1140.fasta --setup ../src/massivefold/examples/multirun_setup.json
In addition to configuring the plot parameters in the MassiveFold JSON parameter file, the plots module can also be run on an existing MassiveFold (or AlphaFold2) output to visually evaluate its predictions.
For more details on this usage, see the MassiveFold plots documentation.
Some known issues have been identified and can be prevented by following the steps described in the troubleshooting documentation.
If you use MassiveFold in your work, please cite:
Raouraoua N. et al. MassiveFold: unveiling AlphaFold’s hidden potential with optimized and parallelized massive
sampling. 2024. Nature Computational Science, DOI: 10.1038/s43588-024-00714-4,
https://www.nature.com/articles/s43588-024-00714-4
Nessim Raouraoua (UGSF - UMR 8576, France)
Claudio Mirabello (NBIS, Sweden)
Thibaut Véry (IDRIS, France)
Christophe Blanchet (IFB, France)
Björn Wallner (Linköping University, Sweden)
Marc F Lensink (UGSF - UMR 8576, France)
Guillaume Brysbaert (UGSF - UMR 8576, France)
This work was carried out as part of Work Package 4 of the MUDIS4LS project led by the French Bioinformatics Institute (IFB). It was initiated at the IDRIS Open Hackathon, part of the Open Hackathons program. The authors would like to acknowledge OpenACC-Standard.org for their support.