PacificBiosciences/trgt-instability

TRGT-instability

This tool quantifies tandem repeat (TR) instability from the output of TRGT in three stages: read-divergence extraction, per-repeat instability model fitting, and query-allele testing for excess instability. The tool models read-to-consensus divergence for each allele to define a baseline instability distribution for each repeat observed in the sequencing data.

Warning

Please note: TRGT-instability is under active development and should be used only for experimentation and feedback.

Authors and contributors

Authors and contributors    Affiliations
Egor Dolzhenko              PacBio
Adam English                Baylor College of Medicine
Tom Mokveld                 PacBio
Guilherme de Sena Brandine  PacBio
Zev Kronenberg              PacBio
Galen Wright                University of Manitoba
Britt Drogemoller           University of Manitoba
William J. Rowell           PacBio
Aaron M. Wenger             PacBio
Mark F. Bennett             Walter and Eliza Hall Institute
Ben Weisburd                Broad Institute
Graham S. Erwin             Baylor College of Medicine
Peng Jin                    Emory University School of Medicine
David Nelson                Baylor College of Medicine
Harriet Dashnow             University of Colorado Anschutz
Fritz J. Sedlazeck          Baylor College of Medicine
Michael A. Eberle           PacBio

Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle.

Terminology

  • read divergence rate: the length-normalized edit distance between a read and its allele consensus
  • allele instability profile: the 15-bin count vector summarizing read divergence rates for one allele
  • repeat instability model: the fitted Dirichlet-multinomial model for one repeat locus
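
To make these definitions concrete, here is a minimal sketch of turning per-read edit distances into a 15-bin count vector. It is not the tool's implementation. The input format (one read per line: edit distance, then consensus length), the normalization by consensus length, and the equal-width bins on [0, 1] are all assumptions made for illustration, and the helper name profile is hypothetical. The tool determines its own bin edges.

```shell
#!/bin/sh
# Sketch only: build a 15-bin count vector from per-read divergence rates.
# Assumed input on stdin (hypothetical format): edit_distance<TAB>consensus_length
# Assumed binning: 15 equal-width bins on [0, 1]; the tool picks its own edges.
profile() {
  awk -F'\t' -v nbins=15 '
    $2 > 0 {
      rate = $1 / $2                   # length-normalized edit distance
      if (rate > 1) rate = 1           # clamp so every rate falls in a bin
      b = int(rate * nbins)
      if (b == nbins) b = nbins - 1    # a rate of exactly 1 goes to the last bin
      count[b]++
    }
    END {
      for (i = 0; i < nbins; i++)
        printf "%d%s", count[i] + 0, (i < nbins - 1 ? "\t" : "\n")
    }'
}
```

For example, piping four reads with rates 0, 0.05, 0.5 and 1 through profile puts two reads in the first bin, one in the eighth, and one in the last.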

Command overview

  • trgt-instability divergence: computes the edit distances between reads and their allele sequences, which are used to determine divergence rates
  • trgt-instability model: fits an instability model from allele instability profiles for each repeat
  • trgt-instability test: tests a query allele for excess instability relative to the repeat-specific model
  • trgt-instability bin: calculates shared discretization bins for read divergence rates

Inputs

  • A repeat catalog with the repeats to analyze
  • TRGT outputs for all control samples
  • TRGT outputs for all case samples

A complete workflow

Run divergence on all case and control samples:

./trgt-instability divergence \
  --repeats repeat-catalog.bed \
  --spanning-reads sample.sorted.bam \
  --variants sample.sorted.vcf.gz \
  | gzip > sample.dists.txt.gz

divergence writes plain-text tab-delimited records to stdout, so the examples pipe that output through gzip for the downstream commands.
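
Running this step over a whole cohort can be scripted. The loop below is a sketch under stated assumptions: the input layout (trgt/controls/ and trgt/cases/ holding <name>.sorted.bam and <name>.sorted.vcf.gz) and the TRGT_INSTABILITY override variable are hypothetical, while the divergence options are the ones shown above.

```shell
#!/bin/sh
# Sketch: run `divergence` once per sample and gzip each output.
# Assumed (hypothetical) layout: trgt/<group>/<name>.sorted.bam with a
# matching <name>.sorted.vcf.gz; results are written to <group>/.
# TRGT_INSTABILITY is a hypothetical override for the binary location.

run_divergence() {
  bam="$1"
  out_dir="$2"
  name="$(basename "$bam" .sorted.bam)"
  vcf="${bam%.sorted.bam}.sorted.vcf.gz"
  "${TRGT_INSTABILITY:-./trgt-instability}" divergence \
    --repeats repeat-catalog.bed \
    --spanning-reads "$bam" \
    --variants "$vcf" \
    | gzip > "$out_dir/$name.dists.txt.gz"
}

for group in controls cases; do
  for bam in "trgt/$group"/*.sorted.bam; do
    [ -e "$bam" ] || continue   # the glob matched nothing
    mkdir -p "$group"
    run_divergence "$bam" "$group"
  done
done
```

Because the command is piped into gzip, the pipeline's exit status is gzip's, so a failed divergence run can still leave a small, valid-looking .gz file behind. In bash, set -o pipefail makes such failures visible.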

Combine divergence records for all control samples. The model command expects --data records to be sorted by repeat identifier (trid):

zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz
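
Since model depends on the sort order, it can be worth verifying the combined file before fitting. The helper below is a sketch: check_sorted is a hypothetical name, and it simply re-uses the sort keys from the command above with sort -c, which checks the order without rewriting anything.

```shell
#!/bin/sh
# Sketch: confirm a gzipped divergence file is ordered by the same keys
# used when it was combined. check_sorted is a hypothetical helper name.
check_sorted() {
  # sort -c exits non-zero (and reports the first offending line) when
  # the input is not already in the requested order.
  gzip -dc "$1" | sort -c -k 1,1 -k 2,2n
}

# Only runs if the combined file from the previous step exists.
if [ -e controls-dists.txt.gz ]; then
  check_sorted controls-dists.txt.gz && echo "controls-dists.txt.gz is sorted"
fi
```

Sort order depends on the locale, so run the check under the same locale settings (for example the same LC_ALL value) as the original sort.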

To compare instability across different repeats, calculate shared discretization bins from the full control cohort (skip this step if you are analyzing repeats individually):

./trgt-instability bin --data controls-dists.txt.gz

Generate repeat-specific instability models for all repeats:

./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz

or use shared bins in bins.txt:

./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz

Now any case sample can be tested for excess instability:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt

Use --n-sim to control the maximum parametric-bootstrap depth per tested allele (default: 100000), and --threads to parallelize testing, for example:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt

The results consist of repeat identifiers, allele sequences, and the associated p-values. A low p-value indicates that the allele instability profile is unlikely under the fitted repeat-specific baseline model, i.e. evidence for excess instability.
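
When many alleles are tested, the p-values are usually corrected for multiple testing before calling hits. The following is a minimal sketch of Benjamini-Hochberg selection on a results file. It is not part of the tool and makes assumptions about the output that you should verify first: tab-delimited lines, no header line, one line per tested allele, and the p-value in the last column. The helper name bh_select is hypothetical.

```shell
#!/bin/sh
# Sketch: Benjamini-Hochberg selection at FDR level q.
# Assumptions (verify against your output): tab-delimited, no header,
# one line per tested allele, p-value in the LAST column.
# Usage: bh_select 0.05 < sample-results.txt
bh_select() {
  awk -F'\t' '{ print $NF "\t" $0 }' \
    | LC_ALL=C sort -k 1,1g \
    | awk -F'\t' -v q="$1" '
        { line[NR] = $0; p[NR] = $1 + 0 }
        END {
          # Step-up rule: find the largest rank k with p_(k) <= q * k / m,
          # where m = NR is the number of tested alleles.
          k = 0
          for (i = 1; i <= NR; i++) if (p[i] <= q * i / NR) k = i
          # Report the k smallest p-values, dropping the helper column.
          for (i = 1; i <= k; i++)
            print substr(line[i], index(line[i], "\t") + 1)
        }'
}
```

The selected lines are printed in order of increasing p-value. Because the rule is step-up, an allele can be selected even if its own p-value misses its rank threshold, as long as a larger p-value further down the list meets its own.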

If you also want a read-depth-aware quantity for comparing alleles by their instability, test can calculate posterior effect-size information derived from the fitted instability model:

./trgt-instability test \
  --models models.gz \
  --data cases/sample.dists.txt.gz \
  --report-effect-size \
  --n-posterior-draws 4000 \
  > sample-results-with-d.txt

With --report-effect-size, each output line contains:

  • d_median: posterior median of the Wasserstein distance from the fitted repeat-specific baseline
  • d_ci_lower: lower bound of the 95% credible interval for d
  • d_ci_upper: upper bound of the 95% credible interval for d

This d score is intended for ranking alleles by how far their instability profile is from the fitted repeat-specific baseline; the p-value remains the significance measure. See docs/effect-size-reporting.md for details.

For large runs requiring multiple testing, adaptive stopping can avoid performing the full --n-sim bootstrap iterations for alleles that can no longer reach the target tail probability. You can derive this target probability automatically for Benjamini-Hochberg p-value correction using stop_p = q * rank / m, where m is the number of tested alleles:

./trgt-instability test \
  --models models.gz \
  --data cases/sample.dists.txt.gz \
  --threads 16 \
  --n-sim 24000000 \
  --adaptive-fdr-q 0.05 \
  --adaptive-fdr-rank 1 \
  > sample-results.txt
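
The threshold formula above can be evaluated with a small helper before choosing --n-sim. This is a sketch, not part of the tool: the function names stop_p and min_sims are hypothetical, and m (the number of tested alleles) must be supplied by you. min_sims rests on the assumption that a bootstrap p-value estimated from n simulations cannot resolve tail probabilities much below 1/n.

```shell
#!/bin/sh
# Sketch: evaluate stop_p = q * rank / m and the matching simulation depth.
# stop_p and min_sims are hypothetical helper names, not tool commands.

# Usage: stop_p <q> <rank> <m>
stop_p() {
  awk -v q="$1" -v rank="$2" -v m="$3" \
    'BEGIN { printf "%.3g\n", q * rank / m }'
}

# Usage: min_sims <q> <rank> <m>
# Prints 1 / stop_p, the assumed lower limit on the number of bootstrap
# iterations needed to resolve a tail probability of stop_p.
min_sims() {
  awk -v q="$1" -v rank="$2" -v m="$3" \
    'BEGIN { printf "%.0f\n", m / (q * rank) }'
}
```

For example, stop_p 0.05 1 1000 prints 5e-05, and min_sims 0.05 1 1000 prints 20000, so under this assumption an --n-sim value below 20000 could not reach the rank-1 threshold for 1000 tested alleles.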

Reference

Need help?

If you notice missing features or bugs, or need assistance with analyzing the output of TRGT-instability, please don't hesitate to open a GitHub issue or reach out to the authors by email.

Support information

TRGT-instability is pre-release software intended for research use only and not for use in diagnostic procedures. While efforts have been made to ensure that TRGT-instability lives up to the quality that PacBio strives for, we make no warranty regarding this software.

As TRGT-instability is not covered by any service level agreement or the like, please do not contact PacBio Field Applications Scientists or PacBio Customer Service for assistance with any TRGT-instability release. Please report all issues through GitHub instead. We make no warranty that any such issue will be addressed, to any extent or within any time frame.

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
