This repository contains all the scripts and data to reproduce the results of:
D. K. Sydykova, C. O. Wilke (2018). Theory of measurement for site-specific evolutionary rates in amino-acid sequences
- `analytical_rates` contains rates that were calculated using analytical derivations. The files in this directory are:
  - `all_sites_aa.csv` contains site-wise rates for every site in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `all_sites_aa.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a mutation-selection model and the inference model is Jukes-Cantor (equations 3-5). This file was generated with the command `python analytical_rate_aa.py -m 125 -q q_matrix/amino_acid/ -o all_sites_aa.csv`.
  - `ten_sites_aa.csv` contains site-wise rates for the first ten sites in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `ten_sites_aa.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a mutation-selection model and the inference model is Jukes-Cantor (equations 3-5). This file was generated with the command `python analytical_rate_aa.py -m 10 -q q_matrix/amino_acid/ -o ten_sites_aa.csv`.
  - `ten_sites_aa_true_JC.csv` contains site-wise rates for ten sites, calculated under the assumption that both the true model and the inference model are Jukes-Cantor. This file was generated with the command `python analytical_rate_aa_true_JC.py -m 10 -o ten_sites_aa_true_JC.csv`.
  - `ten_sites_aa_QM.csv` contains site-wise rates for the case where the rate is measured with an arbitrary QM matrix and for the case where it is measured with a Jukes-Cantor matrix (equation 1).
  - `ten_sites_codon.csv` contains site-wise rates for the first ten sites in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `ten_sites_codon.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a codon mutation-selection model and the inference model is an amino acid Jukes-Cantor model (equation 6 and equations 22S and 24S). This file was generated with the command `python analytical_rate_codon.py -m 10 -q q_matrix/codon/ -o ten_sites_codon.csv`.
- `inferred_rates` contains files with site-wise rates inferred with HyPhy. There are two directories in `inferred_rates`: `raw_rates` and `processed_rates`. `raw_rates` contains individual files for the simulated alignments (one file per alignment), and `processed_rates` contains concatenated files from `raw_rates`. The directories contained in `raw_rates` are:
  - `JC`: inferred rates when the true and the inference models are both Jukes-Cantor-like.
  - `all_sites`: inferred rates when the true model is MutSel and the inference model is either Jukes-Cantor-like (JC), WAG, JTT, or LG.
  - `site_dupl`: rates inferred with JC for alignments with different numbers of site duplicates.
  - `ten_sites`: inferred rates when the true model is MutSel and the inference model is JC.
  - `translated`: inferred rates when the true model is a codon MutSel model and the inference model is amino acid JC.
- `q_matrix` contains the site-wise substitution matrices Q used for simulating alignments and for calculating site-wise rates. There are two directories in `q_matrix`: `amino_acid` and `codon`. `amino_acid` contains amino acid substitution matrices, and `codon` contains codon substitution matrices.

  Files that start with `132L_A` are substitution matrices that were calculated using data from Echave et al. (2015) for egg white lysozyme (PDB ID: 132L). For example, the file `132L_A_site79_q_matrix.npy` holds the substitution matrix calculated for site 79. The site positions here correspond to the site positions given in the file `132L_A_foldx_ddG.txt`, which was copied directly from the git repository for Echave et al. (2015), https://github.com/wilkelab/therm_constraints_rate_variation. These matrices were calculated according to the mutation-selection (MutSel) theory of Halpern and Bruno (1998). The script to calculate amino acid MutSel matrices is `src/calculate_aa_mutsel_Q.py`, and the script to calculate codon MutSel matrices is `src/calculate_codon_mutsel_Q.py`.

  Files that start with `site0_JC` are substitution matrices Q defined as Q = r(k) Q<sub>JC</sub>. Here, Q<sub>JC</sub> is the Jukes-Cantor-like matrix, and r(k) is the true rate at site k. True rates were generated by the script `src/analytical_rate_aa_true_JC.py` and stored in `analytical_rates/ten_sites_aa_true_JC.csv`. The script to calculate amino acid substitution matrices Q defined as Q = r(k) Q<sub>JC</sub> is `src/calculate_aa_true_JC_Q.py`.
- `trees` contains the tree files that were used for simulating alignments. Each file name encodes the number of branches (n) in the tree and the branch length (bl). For example, the file `n2_bl0.005.tre` describes a tree with two branches, each of length 0.005.
- `hyphy` contains all scripts and files needed to run HyPhy.
- `plots` contains the plots generated for the manuscript.
- `src` contains the code to run the analysis.
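
As a rough illustration of the `site0_JC` matrices described under `q_matrix`, the sketch below builds Q = r(k) Q<sub>JC</sub> for the 20 amino acids with NumPy. It is not the code in `src/calculate_aa_true_JC_Q.py`. The function names `jc_like_matrix` and `scaled_site_matrix` are hypothetical, and the normalization of the Jukes-Cantor-like matrix (off-diagonal entries 1/19, diagonal entries -1) is an assumption made for this example; the repository may scale the matrix differently.

```python
import numpy as np

N_AA = 20  # number of amino acid states


def jc_like_matrix(n_states: int = N_AA) -> np.ndarray:
    """Build a Jukes-Cantor-like rate matrix with equal off-diagonal rates.

    Assumption for this sketch: every state is left at total rate 1, so the
    off-diagonal entries are 1/(n_states - 1) and the diagonal entries are -1.
    """
    q = np.full((n_states, n_states), 1.0 / (n_states - 1))
    np.fill_diagonal(q, -1.0)  # each row now sums to zero
    return q


def scaled_site_matrix(rate: float, n_states: int = N_AA) -> np.ndarray:
    """Return Q = r(k) * Q_JC for a site whose true rate r(k) is `rate`."""
    return rate * jc_like_matrix(n_states)
```

Calling `scaled_site_matrix(0.5)`, where 0.5 is an arbitrary illustrative rate, returns a 20 x 20 array whose rows sum to zero. Such an array could be written to a `.npy` file with `np.save` and read back with `np.load`.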