Code to run analyses and generate figures for JUNIPER: Reconstructing Transmission Events from Next-Generation Sequencing Data at Scale by Specht et al. (2025).
Before running code in this repository, please download the file juniper-analyses-oversize.zip from the Zenodo repository 10.5281/zenodo.14841603 and merge the contents with the juniper-analyses directory stored here. For example, the large file juniper-analyses-oversize/h5n1/res.RData from the Zenodo repository should be moved to juniper-analyses/h5n1/res.RData in this repository. The files deposited in the Zenodo repository were too large to host on this GitHub.
The analyses in this repository require the following R packages: outbreaker2
, Rcpp
, ape
,
ggplot2
, igraph
, readxl
, reshape2
, lubridate
, coda
, and cowplot
. Additionally, synthetic data generation requires our custom R package simulatR
. To install it, first install the devtools
package, and then run devtools::install_github("broadinstitute/simulatR")
.
Our comparison to the BadTrIP
outbreak reconstruction tool requires the BEAST
software, which may be downloaded and installed from beast2.org
. We use BEAST
version 2.7.5 for Mac OS X.
Our comparison to the TransPhylo
outbreak reconstruction tool requires the IQ-TREE
and TreeTime
software packages, which may be downloaded and installed from iqtree.org
and treetime.readthedocs.io/en/latest/installation.html
. We use IQ-TREE
version 2.3.6 and TreeTime
version 0.11.4 for Mac OS X.
The code used to validate our model for the frequencies of de novo iSNVs can be found in isnv_frequency_model.R
, and executed using
Rscript isnv_frequency_model.R
The estimate of the parameter r will be printed in the terminal, and a figure will be saved to the figs
subdirectory.
The directories experiment_1
through experiment_23
, which contain the synthetic outbreaks used to validate JUNIPER
and reconstructions thereof using JUNIPER
, outbreaker2
, BadTRIP
, and TransPhylo
, can be generated using the script synthetic_generation.R
. This script may be run from the command line using five arguments, the first being the index of the synthetic outbreak to generate and reconstruction, and the second through fifth being the number of global iterations at which to run JUNIPER
, outbreaker2
, TransPhylo
, and BadTrip
. Note that the same number of global iterations for different methods may have drastically different runtimes and effective sample sizes. On the synthetic datasets, to achieve reasonable effective sample sizes, we recommend a default of 10,000 iterations each for JUNIPER
and outbreaker2
, 100,000 for BadTrIP
, and 1,000,000 for TransPhylo
.
As an example, to generate the directory experiment_1
with the above defaults, first navigate to the juniper-analyses
directory. Then, from the command line, run
Rscript synthetic_generation.R 1 10000 10000 100000 1000000
Note that if the directory experiment_1
already exists, the script will prompt the user to delete it and try again. Additionally, running different versions of BEAST
, IQ-TREE
, and TreeTime
and/or using different operating systems may require the user to modify the calls to said software packages, located in the functions compare_badtrip
and compare_outbreaker
within synthetic_generation.R
.
This GitHub reflects the juniper-analyses
repository after the synthetic_generation.R
script has already been run for all 23 synthetic datasets, so that the user may explore the results without needing to regenerate them (which is time consuming).
All performance statistics and figures for the synthetic data validation section of the JUNIPER
manuscript were generated using the script synthetic_analysis.R
. It can be run from the command line as follows:
Rscript synthetic_analysis.R
Summary statistics will be printed in the terminal, and figures will be saved to the figs
subdirectory.
The script brsv-and-south-africa.R
runs JUNIPER
, outbreaker2
, BadTRIP
, and TransPhylo
on the Bovine RSV and South Africa SARS-CoV-2 datasets and analyzes the results. As regenerating the model runs may be time consuming, we saved the pre-computed JUNIPER and BadTrIP runs for the bovine RSV outbreak, and pre-computed runs of all models for the SARS-CoV-2 outbreaks. To run analyses and generate figures from these pre-computed model outputs, run
Rscript brsv-and-south-africa.R
To regenerate all JUNIPER
, outbreaker2
, and TransPhylo
outputs, run
Rscript brsv-and-south-africa.R -r
Note that BadTrIP is heavily time-consuming on both datasets due to the presence of ambiguous sites. To avoid re-running BadTrIP, we reused earlier BadTrIP output on both datasets from an earlier paper, titled "Inferring Viral Transmission Pathways from Within-Host Variation" (Specht et al. 2023). BadTrIP scrips from that project are available at github.com/broadinstitute/transmission-inference
.