Each Rosetta dataset has its own directory containing the following:
- The dataset in different formats (.tsv, .db, and .h5 files)
- A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
- A splits directory containing train, validation, and test splits
- Standardization parameters computed on the train set (in the splits directory)
To create a Rosetta dataset for pretraining METL, you must first run Rosetta simulations using our metl-sim repository. A successful run produces a database containing protein variants and their corresponding Rosetta energies. We provide a sample database, avgfp_rosettafy_sample.db, so you can test the code in this repository without running the Rosetta simulations yourself. The sample database contains 10,000 GFP variants. It can be used as a starting point to create a processed Rosetta dataset, which can in turn be used to pretrain sample GFP METL-Local models.
- If you would like to access our full raw Rosetta databases, see the metl-sim repository.
- If you would like to access our processed Rosetta datasets, including the datasets we used to pretrain METL-Local and METL-Global models, see the metl-pub repository.
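
Before processing a database, it can help to check what it contains. The following is a minimal sketch, assuming the .db file is a SQLite database; the helper name `summarize_database` is hypothetical and not part of this repository, and the table names it reports depend on how metl-sim wrote the file.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def summarize_database(db_path):
    """Return {table_name: row_count} for every table in a SQLite file.

    Hypothetical helper for illustration; assumes db_path is a SQLite database.
    """
    # sqlite3.connect() silently creates a missing file, so check first.
    if not Path(db_path).is_file():
        raise FileNotFoundError(db_path)

    with closing(sqlite3.connect(db_path)) as conn:
        # sqlite_master is SQLite's built-in catalog of tables, indexes, etc.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are read from the database itself, then quoted as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


# Example call (the path is an assumption; point it at your copy of the sample):
# summarize_database("avgfp_rosettafy_sample.db")
```

The function only reads the catalog and row counts, so it makes no assumptions about column names in the database.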
A database acquired from the metl-sim repository, including the sample database provided here, must first be processed into a Rosetta dataset before it can be used to pretrain METL.
- Create a Rosetta dataset using the script parse_rosetta_data.py to parse and format the Rosetta data generated by our metl-sim repository. This step is necessary to remove duplicates, handle NaN values, and remove outliers.
- Create train, validation, and test splits for the Rosetta dataset using split_dataset.py.
- Compute standardization parameters from the train set using compute_rosetta_standardization.py. The standardization parameters are needed during training and evaluation to standardize the various Rosetta energies so that they are on similar scales with mean 0 and standard deviation 1.
- Make sure all PDB files in the Rosetta dataset are listed in pdb_index.csv.
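
To illustrate why the standardization parameters are computed from the train set only, here is a minimal sketch of the split and standardization steps. It is not the implementation in split_dataset.py or compute_rosetta_standardization.py. The function names, the split fractions, the use of the population standard deviation, and the example energy column names are all assumptions made for illustration.

```python
import random
import statistics


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle row indices and cut them into train/val/test (fractions are assumptions)."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return {
        "train": idxs[:n_train],
        "val": idxs[n_train:n_train + n_val],
        "test": idxs[n_train + n_val:],
    }


def standardization_params(rows, train_idxs, columns):
    """Compute per-column mean and std using train rows only.

    Validation and test rows are excluded so their statistics do not leak
    into training. Population std is used here as an assumption.
    """
    params = {}
    for col in columns:
        values = [rows[i][col] for i in train_idxs]
        params[col] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),
        }
    return params


def standardize(row, params):
    """Rescale each energy so the train set has mean 0 and std 1.

    Assumes every column has nonzero std on the train set.
    """
    return {col: (row[col] - p["mean"]) / p["std"] for col, p in params.items()}


# Toy data; the column names are illustrative, not the dataset's actual schema.
rows = [{"total_score": float(i), "fa_atr": 2.0 * i + 1.0} for i in range(10)]
splits = split_indices(len(rows))
params = standardization_params(rows, splits["train"], ["total_score", "fa_atr"])
standardized_test = [standardize(rows[i], params) for i in splits["test"]]
```

The same train-derived parameters are applied to the validation and test rows, so those rows will generally not have exactly mean 0 and standard deviation 1.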
See the notebook generate_rosetta_dataset.ipynb for a complete example of how to perform the above steps. This example uses the sample database avgfp_rosettafy_sample.db to generate a processed Rosetta dataset, which is also provided here in the avgfp directory.