Commit bb3f51e

Merge pull request #405 from aqlaboratory/multimer

Full multimer merge

2 parents ce21136 + c33a0bd

106 files changed (+316810 −2014 lines)

.gitignore (+2)

@@ -1,4 +1,5 @@
 .vscode/
+.idea/
 __pycache__/
 *.egg-info
 build
@@ -8,3 +9,4 @@ dist
 data
 openfold/resources/
 tests/test_data/
+cutlass/

README.md (+115 −24)

@@ -7,13 +7,31 @@ _Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental s
 A faithful but trainable PyTorch reproduction of DeepMind's
 [AlphaFold 2](https://github.com/deepmind/alphafold).
 
+## Contents
+
+- [OpenFold](#openfold)
+  - [Contents](#contents)
+  - [Features](#features)
+  - [Installation (Linux)](#installation-linux)
+  - [Download Alignment Databases](#download-alignment-databases)
+  - [Inference](#inference)
+    - [Monomer inference](#monomer-inference)
+    - [Multimer Inference](#multimer-inference)
+    - [Soloseq Inference](#soloseq-inference)
+  - [Training](#training)
+  - [Testing](#testing)
+  - [Building and Using the Docker Container](#building-and-using-the-docker-container)
+  - [Copyright Notice](#copyright-notice)
+  - [Contributing](#contributing)
+  - [Citing this Work](#citing-this-work)
+
 ## Features
 
 OpenFold carefully reproduces (almost) all of the features of the original open
-source inference code (v2.0.1). The sole exception is model ensembling, which
-fared poorly in DeepMind's own ablation testing and is being phased out in future
-DeepMind experiments. It is omitted here for the sake of reducing clutter. In
-cases where the *Nature* paper differs from the source, we always defer to the
+source monomer (v2.0.1) and multimer (v2.3.2) inference code. The sole exception is
+model ensembling, which fared poorly in DeepMind's own ablation testing and is being
+phased out in future DeepMind experiments. It is omitted here for the sake of reducing
+clutter. In cases where the *Nature* paper differs from the source, we always defer to the
 latter.
 
 OpenFold is trainable in full precision, half precision, or `bfloat16` with or without DeepSpeed,
@@ -63,7 +81,7 @@ To install:
 For some systems, it may help to append the Conda environment library path to `$LD_LIBRARY_PATH`. The `install_third_party_dependencies.sh` script does this once, but you may need to repeat it for each new bash instance.
 
 
-## Usage
+## Download Alignment Databases
 
 If you intend to generate your own alignments, e.g. for inference, you have two
 choices for downloading protein databases, depending on whether you want to use
@@ -112,7 +130,16 @@ DeepMind's pretrained parameters, you will only be able to make changes that
 do not affect the shapes of model parameters. For an example of initializing
 the model, consult `run_pretrained_openfold.py`.
 
-### Inference
+## Inference
+
+OpenFold now supports three inference modes:
+- [Monomer Inference](#monomer-inference): OpenFold's reproduction of AlphaFold2. Inference is available with either DeepMind's pretrained parameters or OpenFold's trained parameters.
+- [Multimer Inference](#multimer-inference): OpenFold's reproduction of AlphaFold-Multimer. Inference is available with DeepMind's pretrained parameters.
+- [Single Sequence Inference (SoloSeq)](#soloseq-inference): language-model-based structure prediction using [ESM-1b](https://github.com/facebookresearch/esm) embeddings.
+
+Instructions for each inference mode are provided below.
+
+### Monomer inference
 
 To run inference on a sequence or multiple sequences using a set of DeepMind's
 pretrained parameters, first download the OpenFold weights e.g.:
@@ -131,14 +158,14 @@ python3 run_pretrained_openfold.py \
     --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
     --pdb70_database_path data/pdb70/pdb70 \
     --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
-    --output_dir ./ \
     --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
-    --model_device "cuda:0" \
     --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
     --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
     --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
     --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
    --config_preset "model_1_ptm" \
+    --model_device "cuda:0" \
+    --output_dir ./ \
     --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_ptm_2.pt
 ```
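The command above reads its query sequences from a directory of FASTA files. As a quick sketch of preparing such a directory (the file name and sequence below are invented for illustration, not taken from this repository):

```shell
# Hypothetical input directory for inference: one FASTA file per query.
mkdir -p /tmp/fasta_dir
printf '>query_1\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n' > /tmp/fasta_dir/query_1.fasta

# List the prepared queries.
ls /tmp/fasta_dir
```

The directory path would then be passed as the first positional argument of `run_pretrained_openfold.py`.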
 
@@ -176,13 +203,6 @@ To enable it, add `--trace_model` to the inference command.
 To get a speedup during inference, enable [FlashAttention](https://github.com/HazyResearch/flash-attention)
 in the config. Note that it appears to work best for sequences with < 1000 residues.
 
-Input FASTA files containing multiple sequences are treated as complexes. In
-this case, the inference script runs AlphaFold-Gap, a hack proposed
-[here](https://twitter.com/minkbaek/status/1417538291709071362?lang=en), using
-the specified stock AlphaFold/OpenFold parameters (NOT AlphaFold-Multimer). To
-run inference with AlphaFold-Multimer, use the (experimental) `multimer` branch
-instead.
-
 To minimize memory usage during inference on long sequences, consider the
 following changes:
 
@@ -221,7 +241,78 @@ efficient AlphaFold-Multimer more than double the time. Use the
 at once. The `run_pretrained_openfold.py` script can enable this config option with the
 `--long_sequence_inference` command line option.
 
-#### SoloSeq Inference
+Input FASTA files containing multiple sequences are treated as complexes. In
+this case, the inference script runs AlphaFold-Gap, a hack proposed
+[here](https://twitter.com/minkbaek/status/1417538291709071362?lang=en), using
+the specified stock AlphaFold/OpenFold parameters (NOT AlphaFold-Multimer).
+
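To illustrate the input convention described above (a sketch only, not OpenFold code): any FASTA file containing more than one record is folded as a complex, one chain per record. The sequences and file path below are made up.

```shell
# Made-up two-record FASTA; AlphaFold-Gap treats it as a two-chain complex.
cat > /tmp/complex.fasta << 'EOF'
>chain_A
MKTAYIAKQR
>chain_B
GSHMSTNPKP
EOF

# One record per chain: the record count is the chain count.
grep -c '^>' /tmp/complex.fasta
```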
+### Multimer Inference
+
+To run inference on a complex or multiple complexes using a set of DeepMind's pretrained parameters, run e.g.:
+
+```bash
+python3 run_pretrained_openfold.py \
+    fasta_dir \
+    data/pdb_mmcif/mmcif_files/ \
+    --uniref90_database_path data/uniref90/uniref90.fasta \
+    --mgnify_database_path data/mgnify/mgy_clusters_2022_05.fa \
+    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt \
+    --uniref30_database_path data/uniref30/UniRef30_2021_03 \
+    --uniprot_database_path data/uniprot/uniprot.fasta \
+    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
+    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
+    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
+    --hmmsearch_binary_path lib/conda/envs/openfold_venv/bin/hmmsearch \
+    --hmmbuild_binary_path lib/conda/envs/openfold_venv/bin/hmmbuild \
+    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
+    --config_preset "model_1_multimer_v3" \
+    --model_device "cuda:0" \
+    --output_dir ./
+```
+
+As with monomer inference, if you've already computed alignments for the query, you can use
+the `--use_precomputed_alignments` option. Note that template searching in the multimer pipeline
+uses HMMSearch with the PDB SeqRes database, replacing the HHSearch and PDB70 used in the monomer pipeline.
+
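For `--use_precomputed_alignments`, the script expects one subdirectory per query holding that query's alignment files. The sketch below only shows the general shape; the directory and file names are assumptions for illustration, not necessarily the exact names OpenFold's alignment runner emits.

```shell
# Illustrative precomputed-alignment layout: one subdirectory per query.
mkdir -p /tmp/alignments/query_1
touch /tmp/alignments/query_1/uniref90_hits.sto \
      /tmp/alignments/query_1/mgnify_hits.sto

# Inspect the layout for one query.
ls /tmp/alignments/query_1
```

The parent directory (here `/tmp/alignments`) is what would be passed to `--use_precomputed_alignments`.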
+**Upgrade from an existing OpenFold installation**
+
+The above command requires several upgrades to an existing OpenFold installation.
+
+1. Re-download the AlphaFold parameters to get the latest
+   AlphaFold-Multimer v3 weights:
+
+   ```bash
+   bash scripts/download_alphafold_params.sh openfold/resources
+   ```
+
+2. Download the [UniProt](https://www.uniprot.org/uniprotkb/)
+   and [PDB SeqRes](https://www.rcsb.org/) databases:
+
+   ```bash
+   bash scripts/download_uniprot.sh data/
+   ```
+
+   The PDB SeqRes and PDB databases must be from the same date to avoid potential
+   errors during template searching. Remove the existing `data/pdb_mmcif` directory
+   and download both databases:
+
+   ```bash
+   bash scripts/download_pdb_mmcif.sh data/
+   bash scripts/download_pdb_seqres.sh data/
+   ```
+
+3. Additionally, AlphaFold-Multimer uses upgraded versions of the [MGnify](https://www.ebi.ac.uk/metagenomics)
+   and [UniRef30](https://uniclust.mmseqs.com/) (previously UniClust30) databases. To download the upgraded databases, run:
+
+   ```bash
+   bash scripts/download_uniref30.sh data/
+   bash scripts/download_mgnify.sh data/
+   ```
+
+Multimer inference can also run with the older database versions if desired.
+
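After running the download steps above, it can help to confirm the new databases landed where the multimer example command expects them. This is an illustrative check only; `DATA_DIR` and the demo files below stand in for a real download root.

```shell
# Demo stand-in for a real download root (illustrative paths only).
DATA_DIR=/tmp/demo_data
mkdir -p "$DATA_DIR/uniprot" "$DATA_DIR/pdb_seqres"
touch "$DATA_DIR/uniprot/uniprot.fasta" "$DATA_DIR/pdb_seqres/pdb_seqres.txt"

# Report whether each file referenced by the multimer command is present.
for f in uniprot/uniprot.fasta pdb_seqres/pdb_seqres.txt; do
  if [ -e "$DATA_DIR/$f" ]; then
    echo "found: $f"
  else
    echo "MISSING: $f"
  fi
done
```

Pointing `DATA_DIR` at the actual `data/` directory would make the same loop a quick pre-flight check.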
+### Soloseq Inference
 
 To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk or generate them during inference.
 
 For generating ESM-1b embeddings in bulk, use the provided script `scripts/precompute_embeddings.py`. The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the same format and directory structure as required by SoloSeq. Following is an example command to use the script:
@@ -260,7 +351,7 @@ python3 run_pretrained_openfold.py \
     --output_dir ./ \
     --model_device "cuda:0" \
     --config_preset "seq_model_esm1b_ptm" \
-    --openfold_checkpoint_path openfold/resources/openfold_params/seq_model_esm1b_ptm.pt \
+    --openfold_checkpoint_path openfold/resources/openfold_soloseq_params/seq_model_esm1b_ptm.pt \
     --uniref90_database_path data/uniref90/uniref90.fasta \
     --pdb70_database_path data/pdb70/pdb70 \
     --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
@@ -274,7 +365,7 @@ SoloSeq allows you to use the same flags and optimizations as the MSA-based Open
 
 **NOTE:** Due to the nature of the ESM-1b embeddings, the sequence length for inference using the SoloSeq model is limited to 1022 residues. Sequences longer than that will be truncated.
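The 1022-residue cap can be emulated outside the model; this sketch (with a made-up all-methionine sequence) shows the head truncation that SoloSeq applies to overlong inputs.

```shell
# Made-up 1500-residue sequence on a single line.
long_seq=$(printf 'M%.0s' $(seq 1 1500))

# Emulate the ESM-1b limit: keep only the first 1022 residues.
short_seq=$(printf '%s' "$long_seq" | cut -c1-1022)

# Count the residues that remain after truncation.
printf '%s' "$short_seq" | wc -c
```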
 
-### Training
+## Training
 
 To train the model, you will first need to precompute protein alignments.
 
@@ -412,17 +503,17 @@ environment. These run components of AlphaFold and OpenFold side by side and
 ensure that output activations are adequately similar. For most modules, we
 target a maximum pointwise difference of `1e-4`.
 
-## Building and using the docker container
+## Building and Using the Docker Container
 
-### Building the docker image
+**Building the Docker Image**
 
 OpenFold can be built as a Docker container using the included Dockerfile. To build it, run the following command from the root of this repository:
 
 ```bash
 docker build -t openfold .
 ```
 
-### Running the docker container
+**Running the Docker Container**
 
 The built container contains both `run_pretrained_openfold.py` and `train_openfold.py`, as well as all necessary software dependencies. It does not contain the model parameters or the sequence and structural databases; these should be downloaded to the host machine following the instructions in the Download Alignment Databases section above.
@@ -462,7 +553,7 @@ python3 /opt/openfold/run_pretrained_openfold.py \
     --openfold_checkpoint_path /database/openfold_params/finetuning_ptm_2.pt
 ```
 
-## Copyright notice
+## Copyright Notice
 
 While AlphaFold's and, by extension, OpenFold's source code is licensed under
 the permissive Apache License, Version 2.0, DeepMind's pretrained parameters
@@ -475,7 +566,7 @@ replaces the original, more restrictive CC BY-NC 4.0 license as of January 2022.
 If you encounter problems using OpenFold, feel free to create an issue! We also
 welcome pull requests from the community.
 
-## Citing this work
+## Citing this Work
 
 Please cite our paper:
 
@@ -504,4 +595,4 @@ If you use OpenProteinSet, please also cite:
     primaryClass={q-bio.BM}
 }
 ```
-Any work that cites OpenFold should also cite AlphaFold.
+Any work that cites OpenFold should also cite [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) and, if applicable, [AlphaFold-Multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1).

environment.yml (+1)

@@ -14,6 +14,7 @@ dependencies:
   - pytorch-lightning==1.5.10
   - biopython==1.79
   - numpy==1.21
+  - pandas==2.0
   - PyYAML==5.4.1
   - requests
   - scipy==1.7
