SARS2-synonymous-mut-rate

This repository has the code for carrying out the analyses in the paper "SARS-CoV-2's mutation rate is highly variable between sites and is influenced by sequence context, genomic region, and RNA structure."

Organization of repository

SARS2-mut-fitness/ is a submodule of the Bloom and Neher pipeline with data files that we use as input for our analysis
data/ contains additional input data
notebooks/ contains Jupyter notebooks used to analyze the data
src/ contains additional Python scripts used to analyze the data
results/ contains outputs from the above notebooks and scripts
environment.yml encodes the environment used to run the notebooks and scripts

Key results files

results/curated_mut_counts.csv is a file with curated site-specific mutation counts generated by notebooks/curate_counts.ipynb, as described below. Key columns include:
- nt_mutation: the wildtype nucleotide, site, and mutant nucleotide for a given mutation
- synonymous: a bool indicating whether a mutation is synonymous
- ss_prediction: indicates whether a site is paired or unpaired in the RNA secondary structure of the SARS-CoV-2 genome, as predicted by Lan et al.; see data/lan_2022/41467_2022_28603_MOESM11_ESM.txt.
- motif: the 3mer sequence motif centered on a site
- actual_count: the counts of the mutation along the branches of the UShER tree
- count_terminal and count_non_terminal: counts on terminal or non-terminal branches, respectively
- actual_count_pre_omicron and actual_count_omicron: counts in pre-Omicron clades or Omicron clades, respectively
results/exploratory_figures/ contains figures generated by notebooks/analyze_counts.ipynb
The PDF files in results/ are figures generated by the Python scripts in src/
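
As a rough illustration of how the columns above fit together, the sketch below computes the mean count of synonymous mutations for each ss_prediction value. It uses only the Python standard library and a small invented table in place of results/curated_mut_counts.csv. The encodings assumed here (synonymous stored as the strings True/False, ss_prediction stored as paired/unpaired) are assumptions and should be checked against the real file.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for results/curated_mut_counts.csv. All values are invented,
# and the True/False and paired/unpaired encodings are assumptions.
TOY_CSV = """nt_mutation,synonymous,ss_prediction,motif,actual_count
C241T,True,paired,TCG,120
A1000G,True,unpaired,GAT,15
G2000T,False,unpaired,AGC,3
C3000T,True,unpaired,ACA,200
"""


def mean_synonymous_count_by_pairing(handle):
    """Return the mean actual_count of synonymous mutations per ss_prediction value."""
    totals = defaultdict(int)
    n_mutations = defaultdict(int)
    for row in csv.DictReader(handle):
        if row["synonymous"] != "True":
            continue  # keep only synonymous mutations
        totals[row["ss_prediction"]] += int(row["actual_count"])
        n_mutations[row["ss_prediction"]] += 1
    return {state: totals[state] / n_mutations[state] for state in totals}


means = mean_synonymous_count_by_pairing(io.StringIO(TOY_CSV))
print(means)
```

To apply the same function to the real file, pass an open file handle for results/curated_mut_counts.csv instead of the in-memory string.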

Summary of Jupyter notebooks

notebooks/curate_counts.ipynb
- This notebook generates the file with curated site-specific mutation counts.
- As input, the notebook takes the results of running the Bloom and Neher pipeline on an UShER tree with all sequences in GISAID as of 2024-04-24. The pipeline generates a file that reports the counts of each possible nucleotide mutation across the genome along the branches of the tree. In doing so, the pipeline divides the tree into several different clades and separately reports mutational counts for each clade, only reporting counts for mutations away from a given clade's founder sequence.
- The notebook curates these raw counts data as follows:
  - First, we identified all sites in the genome where the nucleotide identities at that site, the site's codon, and the site's 5' and 3' nucleotides are conserved in all clade founder sequences, including the Wuhan-Hu-1 sequence (note: we ignore the codon requirement for noncoding sites).
  - Next, we filtered out mutations at sites that: i) did not meet the above conservation criteria, ii) were masked in the UShER tree in any clade, or iii) were identified as being error-prone. We also filtered out the set of error-prone sites identified by De Maio et al.
  - Next, for the remaining mutations, we summed the counts of each mutation across all clades (using the counts in the actual_count column, and only summing rows where the subset column equals all, as opposed to England or USA), resulting in the site-specific mutation counts that we use in our analyses.
  - To compute counts for terminal or non-terminal branches, we summed counts in the columns count_terminal or count_non_terminal, respectively. To compute counts for pre-Omicron vs. Omicron clades, we summed counts over the relevant set of clades.
  - We wrote the curated counts to the file: results/curated_mut_counts.csv
notebooks/analyze_counts.ipynb
- This notebook reads in the curated counts from above and generates many of the plots that explore patterns in synonymous mutation counts between sites.
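
The filtering and summation steps described for notebooks/curate_counts.ipynb can be sketched as follows. This is a minimal illustration on invented rows, not the notebook's actual code. The column names clade and nt_site, the EXCLUDED_SITES set (standing in for sites that fail the conservation, masking, or error-prone filters), and the OMICRON_CLADES set are assumptions made for this example.

```python
from collections import defaultdict

# Invented per-clade rows mimicking the pipeline output described above.
# The "clade" and "nt_site" column names are assumptions for this sketch.
RAW_ROWS = [
    {"clade": "20A", "subset": "all", "nt_site": 100, "nt_mutation": "C100T",
     "actual_count": 5, "count_terminal": 3, "count_non_terminal": 2},
    {"clade": "21K", "subset": "all", "nt_site": 100, "nt_mutation": "C100T",
     "actual_count": 7, "count_terminal": 4, "count_non_terminal": 3},
    {"clade": "21K", "subset": "USA", "nt_site": 100, "nt_mutation": "C100T",
     "actual_count": 2, "count_terminal": 1, "count_non_terminal": 1},
    {"clade": "20A", "subset": "all", "nt_site": 200, "nt_mutation": "A200G",
     "actual_count": 9, "count_terminal": 9, "count_non_terminal": 0},
]

# Hypothetical inputs: sites removed by the filters, and clades treated as Omicron.
EXCLUDED_SITES = {200}
OMICRON_CLADES = {"21K"}


def curate(rows, excluded_sites, omicron_clades):
    """Sum per-clade counts for each mutation, keeping only subset == 'all'."""
    curated = defaultdict(lambda: defaultdict(int))
    for row in rows:
        if row["subset"] != "all" or row["nt_site"] in excluded_sites:
            continue  # skip regional subsets and filtered-out sites
        entry = curated[row["nt_mutation"]]
        for column in ("actual_count", "count_terminal", "count_non_terminal"):
            entry[column] += row[column]
        era = "omicron" if row["clade"] in omicron_clades else "pre_omicron"
        entry[f"actual_count_{era}"] += row["actual_count"]
    return {mutation: dict(entry) for mutation, entry in curated.items()}


curated_counts = curate(RAW_ROWS, EXCLUDED_SITES, OMICRON_CLADES)
print(curated_counts)
```

In this toy input, the USA row is ignored because its subset is not all, and the mutation at site 200 is dropped because the site is in the excluded set.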

Summary of Python scripts in `src/`

helper.py is a helper script used by the other scripts in the directory.
The remaining Python scripts each generate a single figure from the paper and depend on the dataframe of curated counts generated by notebooks/curate_counts.ipynb.
These scripts should only be run after running notebooks/curate_counts.ipynb.

Running this pipeline with alternative UShER trees

In the notebook notebooks/curate_counts.ipynb, the variable fitness_results_dir points to a directory in the SARS2-mut-fitness submodule. This directory was generated by running the Bloom/Neher pipeline on a specific UShER tree (with all sequences in GISAID as of 2024-04-24). However, the submodule also has similar directories generated by running the Bloom/Neher pipeline on other UShER trees, including ones that are publicly available. To run our analysis using results from a different tree, update the fitness_results_dir variable to point to the directory of interest. Then rerun the code in this repository in the order described above: first notebooks/curate_counts.ipynb, then notebooks/analyze_counts.ipynb and the scripts in src/.
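
Before editing fitness_results_dir, it can help to confirm that the target directory exists in the submodule. The helper below is one possible way to do that. The function name resolve_results_dir is hypothetical, and the directory name in the commented usage line is a placeholder rather than a real path in SARS2-mut-fitness.

```python
from pathlib import Path


def resolve_results_dir(submodule_root, results_subdir):
    """Return the path to a results directory, or raise listing the alternatives."""
    root = Path(submodule_root)
    candidate = root / results_subdir
    if not candidate.is_dir():
        # List sibling directories so a typo in the name is easy to spot.
        available = sorted(p.name for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
        raise FileNotFoundError(
            f"{candidate} is not a directory; available directories: {available}"
        )
    return candidate


# Hypothetical usage inside notebooks/curate_counts.ipynb (placeholder directory name):
# fitness_results_dir = resolve_results_dir("../SARS2-mut-fitness", "results_for_other_tree")
```

If the directory is missing, the error message lists the directories that are present, which should include the results directories for the other UShER trees.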

matsengrp/SARS2-synonymous-mut-rate
