
SARS2 Mutation Fitness Trajectories

This repository contains the source code for a reproducible snakemake pipeline that takes as input URL(s) for pre-built mutation-annotated trees (MATs) and computes mutation statistics, summarized across the specified sliding windows through time.

Quick Start

You will need conda with mamba installed; see the snakemake installation docs for a description of mamba.

To run the pipeline with a config.yaml:

mamba env create -f environment.yml
mamba activate SARS2-mut-fitness-trajectory
snakemake --use-conda -j 4 # number of CPUs to use

To run the clustering pipeline, do something like

snakemake -s Snakefile-clustering.smk

Summary

The primary steps for each tree can be summarized as:

  1. Use chronumental to get estimates for node times
  2. Summarize the founder of each sliding-window subtree, together with the mutation counts on the branches below it (a sketch of the windowing logic follows this list).
  3. Compute the actual and expected counts for every possible mutation in every window (this step is currently quite slow).
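
For intuition, here is a minimal sketch of the sliding-window logic, assuming windows are defined in days over the chronumental-inferred node times (the pipeline's exact boundary handling may differ):

def sliding_windows(t_min, t_max, size=100, step=50):
    """Yield (start, end) windows covering [t_min, t_max]."""
    start = t_min
    while start < t_max:
        yield (start, min(start + size, t_max))
        start += step

# e.g. windows over a two-year span of inferred times
for window in sliding_windows(0, 730):
    print(window)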

Inputs

The pipeline takes in a configuration file (config.yaml, by default) which looks like:

# List of matUtils pre-built mutation-annotated trees.
# The entire pipeline is run independently on each tree listed.
mat_trees:
  public_2023-06-21: http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2023/06/21/public-2023-06-21
  # some_other_tree_name: some_other_tree_url_prefix

# suffixes, generally do not change
mat_suffix: .all.masked.nextclade.pangolin.pb.gz
meta_suffix: .metadata.tsv.gz

# Reference GTF and FASTA, and location of spike coding sequence
ref_fasta: http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/wuhCor1.fa.gz
ref_gtf: http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/genes/ncbiGenes.gtf.gz

# gene annotations to add to the reference GTF, which is missing ORF9b, see:
# https://github.com/jbloomlab/SARS2-mut-fitness/issues/21
add_to_ref_gtf:
  ORF9b: [28284, 28577]

# all pairwise combinations of the parameter sets below will be run through the pipeline
window_sizes: [100] # size of sliding window for mutation spectra
window_steps: [50] # step between successive sliding windows

# The number of threads to use when computing actual/expected counts
expected_counts_cpus: 24
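
Every pairwise combination of window_sizes and window_steps is expanded into its own output directory (the size_*/step_* layout shown in Outputs below). A small illustrative sketch of that expansion, not the pipeline's own code:

import itertools
import yaml  # PyYAML

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

# one output subdirectory per (size, step) combination
for size, step in itertools.product(config["window_sizes"], config["window_steps"]):
    print(f"windows/size_{size}/step_{step}")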

Outputs

results_public_2023-06-21
├── chronumental
│   ├── strain_date_inferred.tsv      # times inferred by chronumental
│   └── strain_date_meta.tsv          # times from sampling data
├── mat
│   ├── chronumental_timetree.nwk     # the output tree from chronumental
│   ├── extracted_tree.nwk            # the newick output from matUtils extract for input to chronumental
│   ├── extracted_tree.pb             # The same extracted tree, but in pb (in case there was subsetting)
│   ├── mat_tree.pb.gz                # original downloaded MAT tree
│   └── tree_metadata.tsv.gz          # original downloaded metadata
├── ref
│   ├── coding_sites.csv              # the site annotations for each nt in reference genome
│   ├── edited_ref.gtf                # the reference GTF after modifications specified in config
│   ├── original_ref.gtf              # the original reference GTF
│   └── ref.fa                        # the reference genome
├── summary
│   ├── branch_times.csv              # Tree branch statistics
│   └── chronumental_residuals.html   # An Altair scatter plot of sample date vs. predicted date
└── windows
    └── size_100
        └── step_50
            ├── actual_counts.csv     # the actual counts for each mutation in each window
            ├── expected_counts.csv   # the expected counts for each mutation in each window
            ├── founder_muts.csv      # the founder mutations in each window
            ├── subtree_counts.csv    # the counts of mutations in each window
            └── tall_fitness_df.csv   # the fitness dataframe, with one row per mutation in each window
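
A typical downstream step is to compare actual and expected counts per mutation per window. A hypothetical pandas sketch (the column names below are assumptions; check the CSV headers before use):

import numpy as np
import pandas as pd

base = "results_public_2023-06-21/windows/size_100/step_50"
actual = pd.read_csv(f"{base}/actual_counts.csv")
expected = pd.read_csv(f"{base}/expected_counts.csv")

# assume both tables share 'window', 'mutation', and 'count' columns
merged = actual.merge(expected, on=["window", "mutation"], suffixes=("_actual", "_expected"))
merged["log_ratio"] = np.log((merged["count_actual"] + 0.5) / (merged["count_expected"] + 0.5))
print(merged.head())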

Chronumental GPU support

The chronumental environment installs a version of jax with GPU support, compatible with CUDA 11.8 and cuDNN 8.6, both of which are available on ermine as a module:

module load cuDNN/8.6.0.163-CUDA-11.8.0

In brief, if you have these modules loaded in the environment before you run the pipeline, chronumental will use the GPU to speed things up. Note that you must also specify --use-gpu in the config.yaml keyword parameters for chronumental.
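
A quick way to confirm that jax in the chronumental environment can see the GPU (with the module above loaded and the environment active):

import jax

# should list a GPU device rather than only a CPU device
print(jax.devices())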

Resources

This workflow, in particular the expected-counts calculation, is quite computationally intensive. Little effort has been put into optimizing the algorithm, as this is still exploratory.

Currently, with 24 CPUs available:

518624.12user 6593.02system 9:15:10elapsed 1576%CPU (0avgtext+0avgdata 558667800maxresident)k
741632inputs+10836320outputs (242major+3494650349minor)pagefaults 0swaps
