Skip to content

reznik-lab/mtdna-dlp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

145 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mtdna-dlp

Scripts to generate figures

This directory contains scripts essential for the visualization of data figures as presented in the main text. The R Markdown file html/figures.Rmd can be compiled into an HTML document, which can then be viewed in any web browser. Please ensure you have the necessary packages installed to compile and view the figures as intended.

Installation (in the command line)

Clone this repository:

git clone https://github.com/reznik-lab/mtdna-dlp.git

Install miniconda (at least Python v3.6): https://docs.conda.io/en/latest/miniconda.html

Create a new conda environment and activate it (make sure it is activated before installing the dependencies):

conda create -n [environment]
conda activate [environment]

Install the following:

conda install -c bioconda ensembl-vep=95.3
conda install -c bioconda vt
conda install -c bioconda pybedtools
conda install -c bioconda bcftools
conda install -c bioconda samtools
conda install -c bioconda vcf2maf
conda install -c bioconda pysam
conda install -c bioconda gatk4
conda install -c conda-forge biopython
conda install numpy
conda install pandas
conda install -c conda-forge matplotlib

Additionally, a VEP offline cache needs to be installed (NOTE: CACHE MUST BE SAME VERSION AS VEP VERSION). Please refer to http://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html for instructions on how to install a VEP cache. Due to the size of the caches, it will likely take several hours to install.

You will also need to ensure that the reference genome fasta has a dictionary. Afer downloading the correct reference genome (make sure the mitochondrial chromosome name matches the one in your bam file, i.e. chrM or MT).

To download the reference genome, create a directory [reference] to store it and run the following:

curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
tar xzf refdata-gex-GRCh38-2020-A.tar.gz 

To make "genome.dict" file run the following in the same directory as your reference genome.fa file:

java -jar [your_directory]/mtdna-dlp/python/reference/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict

Bulk pipeline

Running the bulk pipeline (in the command line)

Subsetting for mitochondrial regions (where chrM can either be chrM or MT, depending on how it is annotated in the possorted_bam file):

conda activate [environment]
samtools view -b [library_id].bam chrM > [library_id]_chrM.bam

Index the input .bam file:

samtools index [library_id]_chrM.bam

*The index file needs to always be newer than the bam file, otherwise the pipeline will throw an error

Navigate to the directory with the bulkpipeline.py file and run the following (replace all brackets):

python3 bulkpipeline.py -d [data_directory] -w [working_directory] -l [library_id] -re [results_directory] -vc [optional_vep_cache_directory] -q [optional_mapping_quality] -Q [optional_base_quality] -s [optional_strand] -t [optional_threshold] -g [optional_genome] -r [optional_reference_fasta] -n [optional_normal] -nd [optional_normaldir] -m [optional_molecule] -c [optional_mincounts]

Parameter descriptions:

  • Tumor file: Path of the tumor sample
  • Working directory: path to directory with bulkpipeline.py file in it
  • Results directory: path to directory where results will be stored
  • (OPTIONAL) VEP cache directory: path to directory with VEP cache (default="$HOME/.vep")
  • (OPTIONAL) Mapping quality: minimum mapping quality (default=20)
  • (OPTIONAL) Base quality: minimum base quality (default=20)
  • (OPTIONAL) Strand: minimum number of reads mapping to forward and reverse strand to call mutation (default=2)
  • (OPTIONAL) Threshold: critical threshold for calling a cell wild-type (default=0.1)
  • (OPTIONAL) Genome: genome version (supported genomes are GRCh37, GRCh38, GRCm38, and mm10, default="GRCh37")
  • (OPTIONAL) Reference fasta: path to fasta file (by default will use a file from the reference folder that matches the genome, but best to use same reference used to align fastqs)
  • (OPTIONAL) Normal: Path of the matched normal sample (default="")
  • (OPTIONAL) Molecule: type of molecule (dna or rna, default="dna")
  • (OPTIONAL) Minimum counts: minimum number of read counts for MTvariantpipeline (default=100)

For example, a call to run the bulk pipeline with the required paramaters could look like this:

python3 bulkpipeline.py -t /my_data/file.bam -w /my_home/mtdna-dlp/python/ -re /my_home/mtdna-dlp/results/

Bulk pipeline outputs

The bulk pipeline should have the following files and subdirectories in the results directory:

  • [sample].bam.maf
  • [sample]_mutsig.tsv
  • MuTect2_results
  • MTvariant_results
  • TEMPMAFfiles

Single cell pipeline

Splitting the bam file into one file per cell

Using the output from cellranger (i.e. /outs/filtered_feature_bc_matrix/barcodes.tsv.gz and /outs/possorted_genome_bam.bam), subset of mitochondrial regions, where chrM could also be MT depending on which genome it was aligned to.

conda activate [environment]
samtools view -b possorted_genome_bam.bam chrM > chrM.bam

Then, run the split_bam.py file. For example:

python3 split_bam.py chrM.bam output_directory --barcode_csv barcodes.tsv.gz

Delete the undetermined.bam file:

rm -r [output_directory]/undetermined.bam

Running the single cell pipeline (in the command line)

Note, check the limit of open files (ulimit -n), and adjust as needed.

Index the split input .bam files:

conda activate [environment]
for each in [library_id]/*.bam; do samtools index "$each"; done

Navigate to the directory with the scMTpipeline.py file and run the following (replace all brackets):

python3 scMTpipeline.py -d [data_directory] -w [working_directory] -l [library_id] -re [results_directory] -vc [optional_vep_cache_directory] -q [optional_mapping_quality] -Q [optional_base_quality] -s [optional_strand] -t [optional_threshold] -g [optional_genome] -r [optional_reference_fasta] -m [optional_molecule] -c [optional_mincounts]

Parameter descriptions:

  • Data directory: path to directory with split input .bam files
  • Working directory: path to directory with scMTpipeline.py file in it
  • Library ID: name of sample
  • Results directory: path to directory where results will be stored
  • (OPTIONAL) VEP cache directory: path to directory with VEP cache (default="$HOME/.vep")
  • (OPTIONAL) Mapping quality: minimum mapping quality (default=20)
  • (OPTIONAL) Base quality: minimum base quality (default=20)
  • (OPTIONAL) Strand: minimum number of reads mapping to forward and reverse strand to call mutation (default=2)
  • (OPTIONAL) Threshold: critical threshold for calling a cell wild-type (default=0.1)
  • (OPTIONAL) Genome: genome version (supported genomes are GRCh37, GRCh38, GRCm38, and mm10, default="GRCh37)
  • (OPTIONAL) Reference fasta: path to fasta file (by default will use a file from the reference folder that matches the genome, but best to use same reference used to align fastqs)
  • (OPTIONAL) Molecule: type of molecule (dna or rna, default="dna")
  • (OPTIONAL) Minimum counts: minimum number of read counts for MTvariantpipeline (default=100)

For example, a call to run the single cell pipeline with the required paramaters could look like this:

python3 scMTpipeline.py -d /my_data/ -w /my_home/mtdna-dlp/python/ -l my_sample -re /my_home/mtdna-dlp/results/

To run scMTpipeline.py on the provided example data:

python3 scMTpipeline.py -d /mtdna-dlp/python/example_data/ -w /mtdna-dlp/python/ -l SA1101-A96155C -re /mtdna-dlp/python/results/

Single cell pipeline outputs

The single cell pipeline should have the following files and subdirectories in the results directory:

  • filteredfiles
  • merged
  • mergedTEMPMAFfiles
  • MuTect2_results
  • MTvariant_results
  • TEMPMAFfiles
  • [sample]-merged.fillout
  • [sample]_master.tsv
  • [sample]_vaf.tsv
  • [sample]_binary.tsv
  • [sample]_variants.tsv
  • [sample]_depth.tsv
  • [sample]_mutprob.tsv
  • [sample]_mutsig.tsv
  • [sample]_heteroplasmy.pdf
  • [sample]_haplogroups.txt (only for human DNA)
  • filtered[sample]-merged.bam (only for human DNA)

Clonal assignment model based on mtDNA mutations

python3 runmtclone.py -a [snv_alt] -d [snv_depth] -c [numberclone]

Parameter descriptions:

  • SNV alternate: Path of the matrix of alternate read counts across mtDNA variants (rows) by cells (columns)
  • SNV depth: Path of the matrix of total read counts across mtDNA variants (rows) by cells (columns)
  • Number of clones: Prior on the number of clones

Model outputs

  • Clone labels: Cell IDs along with the clone labels assigned to each cell, identifying the clonal population each cell belongs to.
  • SNV weights: Weights indicating the significance of each SNV in determining clonal populations.

Common Errors (and how to fix them)

1. GenomeAnalysisTK.jar (mapping qualities conversion) failure

Ensure that the reference file passed as input is the same as the one used in the alignment to create the bam. Ensure you have a genome.dict file in the same directory as the genome.fa file. You can create one:

java -jar [your_directory]/mtdna-dlp/python/reference/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict

2. samtools merge failure

Most likely, your system is throwing a too many files open error. Adjust the limit on the number of files like this:

ulimit -n 10000

3. perl vcf2maf.pl failure

vcf2maf.pl will not automatically overwrite files, so if the output file already exists, either delete it or add the following argument:

--vep-overwrite

About

scripts to generate figures/tables for mtDNA DLP+ paper

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors