Commit d21d46d

ok: Merge pull request #3067 from jnwei/patch-1
Revise OpenFold3 dataset description and citation
2 parents 4d3389f + 4ce4c68 commit d21d46d

1 file changed

Lines changed: 17 additions & 9 deletions

datasets/openfold3.yaml
@@ -1,8 +1,16 @@
 Name: OpenFold3 Training Data
-Description: This dataset contains MSAs and predicted structures for 13 million long (sequence length >= 200 amino acids) monomers from the MGNIFY database. These MSAs were generated using the AF3 protocol, and were used to predict structures with AlphaFold2. This data serves as the long monomer distillation set for Openfold3, an open-source, all-atom ligand, RNA and protein structure prediction software.
-Documentation: https://github.com/aqlaboratory/openfold3-training-data-RODA/tree/main
-Contact: https://github.com/aqlaboratory/openfold3-training-data-RODA/issues
-ManagedBy: OpenFold
+Description: |
+This dataset contains MSAs and predicted structures used to train OpenFold3 preview, an open-source, all-atom ligand, RNA and protein structure prediction software. This includes -
+- PDB - 245k structures and alignments from the RCSB Protein Data Bank - https://www.rcsb.org/
+- Long monomer distillation set - ~13 million long (sequence length >= 200 amino acids) monomers from the MGNIFY database - https://www.ebi.ac.uk/metagenomics/.
+- Short monomer distillation set - 400k short (sequence length < 200 amino acid) monomers from the MGNIFY database - https://www.ebi.ac.uk/metagenomics/.
+- Disordered set - AF2-predicted structures for unresolved segments missing from the PDB
+- RNA - OF3p2-predicted RNA monomer structures generated from a clustered version of RFAM (current version)
+For the distillation sets MSAs were generated using the AF3 protocol, and were used to predict structures with AlphaFold2, more details can be found in our whitepaper - https://portal.openfold.omsf.io/reports/of3p2_technical_report.pdf
+For a full description and an interactive data explorer, please visit https://portal.openfold.omsf.io/datasets
+Documentation: https://portal.openfold.omsf.io/datasets
+Contact: https://github.com/aqlaboratory/openfold-3/issues
+ManagedBy: OpenFold Consortium
 UpdateFrequency: Never
 Tags:
 - openfold
@@ -14,7 +22,7 @@ Tags:
 - life sciences
 - aws-pds
 License: "[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)"
-Citation: "Additionally, please cite [our prior manuscript](https://www.nature.com/articles/s41592-024-02272-z)."
+Citation: "Additionally, please cite [our repository](https://doi.org/10.5281/zenodo.19001000) and [our prior manuscript](https://www.nature.com/articles/s41592-024-02272-z)."
 Resources:
 - Description: A repository of MSAs and 3D protein structural coordinates used to train OpenFold3.
 ARN: arn:aws:s3:::openfold3-data
@@ -27,9 +35,9 @@ DataAtWork:
 AuthorName: Glòria Macià
 AuthorURL: https://www.linkedin.com/in/gloriamacia/
 Publications:
-- Title: "OpenProteinSet: Training data for structural biology at scale"
-URL: "https://arxiv.org/abs/2308.05326"
-AuthorName: Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian, et al
+- Title: OpenFold3-preview2 Technical Report
+URL: https://portal.openfold.omsf.io/reports/of3p2_technical_report.pdf
+AuthorName: The OpenFold3 Team
 - Title: "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization"
-URL: "https://www.nature.com/articles/s41592-024-02272-z"
+URL: https://www.nature.com/articles/s41592-024-02272-z
 AuthorName: Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J, et al