This directory is for storing raw experimental data downloaded from databases, paper supplements, etc. This data is processed into a uniform format that can be used to train models with our codebase. The processed data is contained in the data/dms_data directory. The script we used to process the data is provided for reference, but you will need to modify it to work with your own data.
For purposes of this repository, we are only including the avgfp dataset as an example.
| Dataset | Reference | First Author | Year | Acquired From | Link |
|---|---|---|---|---|---|
| avgfp | Local fitness landscape of the green fluorescent protein | Sarkisyan | 2016 | Associated data on figshare, amino_acid_genotypes_to_brightness.tsv | Figshare, Direct download |
You can process this raw dataset by running the following command from the root of the repository.
Note you may have to delete the avgfp.tsv in data/dms_data/avgfp if it already exists, in order to get this script to run.
python code/parse_raw_dms_data.py avgfp