Skip to content

Latest commit

 

History

History
69 lines (44 loc) · 5.24 KB

README.md

File metadata and controls

69 lines (44 loc) · 5.24 KB

juniper-analyses

Code to run analyses and generate figures for JUNIPER: Reconstructing Transmission Events from Next-Generation Sequencing Data at Scale by Specht et al. (2025).

Setup

Large File Merging

Before running code in this repository, please download the file juniper-analyses-oversize.zip from the Zenodo repository 10.5281/zenodo.14841603 and merge the contents with the juniper-analyses directory stored here. For example, the large file juniper-analyses-oversize/h5n1/res.RData from the Zenodo repository should be moved to juniper-analyses/h5n1/res.RData in this repository. The files deposited in the Zenodo repository were too large to host on this GitHub.

Packages

The analyses in this repository require the following R packages: outbreaker2, Rcpp, ape, ggplot2, igraph, readxl, reshape2, lubridate, coda, and cowplot. Additionally, synthetic data generation requires our custom R package simulatR. To install it, first install the devtools package, and then run devtools::install_github("broadinstitute/simulatR").

BEAST

Our comparison to the BadTrIP outbreak reconstruction tool requires the BEAST software, which may be downloaded and installed from beast2.org. We use BEAST version 2.7.5 for Mac OS X.

IQTree and TreeTime

Our comparison to the TransPhylo outbreak reconstruction tool requires the IQ-TREE and TreeTime software packages, which may be downloaded and installed from iqtree.org and treetime.readthedocs.io/en/latest/installation.html. We use IQ-TREE version 2.3.6 and TreeTime version 0.11.4 for Mac OS X.

De Novo iSNV Frequency Model

The code used to validate our model for the frequencies of de novo iSNVs can be found in isnv_frequency_model.R, and executed using

Rscript isnv_frequency_model.R

The estimate of the parameter r will be printed in the terminal, and a figure will be saved to the figs subdirectory.

Validation of JUNIPER on Synthetic Data

Data Generation and Model Execution

The directories experiment_1 through experiment_23, which contain the synthetic outbreaks used to validate JUNIPER and reconstructions thereof using JUNIPER, outbreaker2, BadTRIP, and TransPhylo, can be generated using the script synthetic_generation.R. This script may be run from the command line using five arguments, the first being the index of the synthetic outbreak to generate and reconstruction, and the second through fifth being the number of global iterations at which to run JUNIPER, outbreaker2, TransPhylo, and BadTrip. Note that the same number of global iterations for different methods may have drastically different runtimes and effective sample sizes. On the synthetic datasets, to achieve reasonable effective sample sizes, we recommend a default of 10,000 iterations each for JUNIPER and outbreaker2, 100,000 for BadTrIP, and 1,000,000 for TransPhylo.

As an example, to generate the directory experiment_1 with the above defaults, first navigate to the juniper-analyses directory. Then, from the command line, run

Rscript synthetic_generation.R 1 10000 10000 100000 1000000

Note that if the directory experiment_1 already exists, the script will prompt the user to delete it and try again. Additionally, running different versions of BEAST, IQ-TREE, and TreeTime and/or using different operating systems may require the user to modify the calls to said software packages, located in the functions compare_badtrip and compare_outbreaker within synthetic_generation.R.

This GitHub reflects the juniper-analyses repository after the synthetic_generation.R script has already been run for all 23 synthetic datasets, so that the user may explore the results without needing to regenerate them (which is time consuming).

Data Processing and Visualization

All performance statistics and figures for the synthetic data validation section of the JUNIPER manuscript were generated using the script synthetic_analysis.R. It can be run from the command line as follows:

Rscript synthetic_analysis.R

Summary statistics will be printed in the terminal, and figures will be saved to the figs subdirectory.

Validation of JUNIPER on Real Data

The script brsv-and-south-africa.R runs JUNIPER, outbreaker2, BadTRIP, and TransPhylo on the Bovine RSV and South Africa SARS-CoV-2 datasets and analyzes the results. As regenerating the model runs may be time consuming, we saved the pre-computed JUNIPER and BadTrIP runs for the bovine RSV outbreak, and pre-computed runs of all models for the SARS-CoV-2 outbreaks. To run analyses and generate figures from these pre-computed model outputs, run

Rscript brsv-and-south-africa.R

To regenerate all JUNIPER, outbreaker2, and TransPhylo outputs, run

Rscript brsv-and-south-africa.R -r

Note that BadTrIP is heavily time-consuming on both datasets due to the presence of ambiguous sites. To avoid re-running BadTrIP, we reused earlier BadTrIP output on both datasets from an earlier paper, titled "Inferring Viral Transmission Pathways from Within-Host Variation" (Specht et al. 2023). BadTrIP scrips from that project are available at github.com/broadinstitute/transmission-inference.

Application of JUNIPER to Large-Scale H5N1 and SARS-CoV-2 Datasets