TRGT-instability

This tool quantifies tandem repeat (TR) instability from the output of TRGT in three stages: read-divergence extraction, per-repeat instability model fitting, and query-allele testing for excess instability. The tool models read-to-consensus divergence for each allele to define a baseline instability distribution for each repeat observed in the sequencing data.

Warning

Please note: TRGT-instability is under active development and should be used only for experimentation and feedback.

Authors and contributors

Authors and contributors	Affiliations
Egor Dolzhenko	PacBio
Adam English	Baylor College of Medicine
Tom Mokveld	PacBio
Guilherme de Sena Brandine	PacBio
Zev Kronenberg	PacBio
Galen Wright	University of Manitoba
Britt Drogemoller	University of Manitoba
William J. Rowell	PacBio
Aaron M. Wenger	PacBio
Mark F. Bennett	Walter and Eliza Hall Institute
Ben Weisburd	Broad Institute
Graham S. Erwin	Baylor College of Medicine
Peng Jin	Emory University School of Medicine
David Nelson	Baylor College of Medicine
Harriet Dashnow	University of Colorado Anschutz
Fritz J. Sedlazeck	Baylor College of Medicine
Michael A. Eberle	PacBio

Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle.

Terminology

read divergence rate: the length-normalized edit distance between a read and its allele consensus
allele instability profile: the 15-bin count vector summarizing read divergence rates for one allele
repeat instability model: the fitted Dirichlet-multinomial model for one repeat locus
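
To make the first two terms concrete, the following is a minimal sketch, not the tool's implementation. It assumes that the divergence rate is normalized by the consensus length and that bins are evenly spaced; the function names and the bin edges are hypothetical.

```python
import bisect


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def divergence_rate(read: str, consensus: str) -> float:
    # Assumption: normalize by consensus length; the tool may normalize differently.
    return edit_distance(read, consensus) / max(len(consensus), 1)


def instability_profile(rates, edges):
    """Count rates per bin; 16 ascending edges define 15 bins."""
    counts = [0] * (len(edges) - 1)
    for rate in rates:
        idx = bisect.bisect_right(edges, rate) - 1
        # Clamp so that rates outside the edge range still land in an end bin.
        counts[min(max(idx, 0), len(counts) - 1)] += 1
    return counts


consensus = "CAG" * 8
reads = [
    consensus,                              # identical to the consensus
    consensus,
    "CAG" * 3 + "CTG" + "CAG" * 4,          # one substitution
    "CAG" * 7,                              # one repeat unit deleted
]
edges = [i / 100 for i in range(16)]        # hypothetical edges: 0.00, 0.01, ..., 0.15
profile = instability_profile(
    [divergence_rate(r, consensus) for r in reads], edges
)
```

Here `profile` is a 15-element count vector with one entry per bin, which is the shape of data the repeat instability model is fitted to.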

Command overview

trgt-instability divergence: computes edit distances between reads and their allele sequences, from which read divergence rates are determined
trgt-instability model: fits an instability model from allele instability profiles for each repeat
trgt-instability test: tests a query allele for excess instability relative to the repeat-specific model
trgt-instability bin: calculates shared discretization bins for read divergence rates

Inputs

A repeat catalog with the repeats to analyze
TRGT outputs for all control samples
TRGT outputs for all case samples

A complete workflow

Run divergence on all case and control samples:

./trgt-instability divergence \
  --repeats repeat-catalog.bed \
  --spanning-reads sample.sorted.bam \
  --variants sample.sorted.vcf.gz \
  | gzip > sample.dists.txt.gz

divergence writes plain-text tab-delimited records to stdout, so the examples pipe that output through gzip for the downstream commands.

Combine divergence records for all control samples. The model command expects --data records to be sorted by repeat identifier (trid):

zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz
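
Before fitting models, it can be worth confirming that the combined file keeps all records for a repeat together, which is what sorting by trid guarantees. The helper below is a hypothetical sketch; it assumes the trid is the first tab-delimited column, as the sort key in the command above suggests.

```python
def trids_are_contiguous(lines):
    """Return True if every trid occupies a single contiguous run of lines."""
    seen, current = set(), None
    for line in lines:
        # Assumption: the repeat identifier is the first tab-delimited field.
        trid = line.split("\t", 1)[0]
        if trid != current:
            if trid in seen:
                return False  # this trid already appeared in an earlier run
            seen.add(trid)
            current = trid
    return True
```

To run it on the combined file, open it with `gzip.open("controls-dists.txt.gz", "rt")` and pass the file object as `lines`.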

If you want to compare the instability of different repeats, calculate shared discretization bins from the full control cohort (skip this step if you are analyzing repeats individually):

./trgt-instability bin --data controls-dists.txt.gz

Generate repeat-specific instability models for all repeats:

./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz

Alternatively, pass the shared bin edges reported by the bin command:

./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz

Now any case sample can be tested for excess instability:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt

Use --n-sim to control the maximum parametric-bootstrap depth per tested allele (default: 100000), and --threads to parallelize testing, for example:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt

The results consist of repeat identifiers, allele sequences, and the associated p-values. A low p-value indicates that the allele instability profile is unlikely under the fitted repeat-specific baseline model, i.e. evidence for excess instability.
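
To illustrate how such a p-value can arise, here is a minimal sketch of a parametric bootstrap under a Dirichlet-multinomial model. It is not the tool's implementation: the test statistic (the model log-likelihood), the add-one correction, the five-bin example, and all names and parameter values are assumptions made for illustration.

```python
import math
import random


def dm_log_likelihood(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial."""
    n, a0 = sum(counts), sum(alpha)
    ll = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(n + a0)
    for x, a in zip(counts, alpha):
        ll += math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
    return ll


def sample_dm(n, alpha, rng):
    """Draw p ~ Dirichlet(alpha), then counts ~ Multinomial(n, p)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    counts = [0] * len(alpha)
    for k in rng.choices(range(len(alpha)), weights=probs, k=n):
        counts[k] += 1
    return counts


def bootstrap_p_value(observed, alpha, n_sim=2000, seed=0):
    """Fraction of simulated profiles that are at most as likely as the observed one."""
    rng = random.Random(seed)
    obs_ll = dm_log_likelihood(observed, alpha)
    n = sum(observed)  # simulate at the same read depth as the observed allele
    as_extreme = sum(
        dm_log_likelihood(sample_dm(n, alpha, rng), alpha) <= obs_ll
        for _ in range(n_sim)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (as_extreme + 1) / (n_sim + 1)


alpha = [20.0, 5.0, 1.0, 0.5, 0.5]   # hypothetical fitted parameters, 5 bins for brevity
p_typical = bootstrap_p_value([23, 6, 1, 0, 0], alpha)
p_unstable = bootstrap_p_value([5, 5, 5, 5, 10], alpha)
```

In this sketch, a profile concentrated in the low-divergence bins receives a large p-value, while a profile with many reads in the high-divergence bins receives a p-value near the resolution limit of 1 / (n_sim + 1).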

If you also want a read-depth-aware quantity for comparing alleles by their instability, test can calculate posterior effect-size information derived from the fitted instability model:

./trgt-instability test \
  --models models.gz \
  --data cases/sample.dists.txt.gz \
  --report-effect-size \
  --n-posterior-draws 4000 \
  > sample-results-with-d.txt

With --report-effect-size, each output line contains:

d_median: posterior median of the Wasserstein distance from the fitted repeat-specific baseline
d_ci_lower: lower bound of the 95% credible interval for d
d_ci_upper: upper bound of the 95% credible interval for d

This d score is intended for ranking alleles by how far their instability profile is from the fitted repeat-specific baseline; the p-value remains the significance measure. See docs/effect-size-reporting.md for details.
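
The following sketch shows one way a posterior median and credible interval for a Wasserstein distance could be computed from a binned profile. It is an illustration only; docs/effect-size-reporting.md describes the actual method. The symmetric Dirichlet prior of 0.5, the representation of the baseline as mean bin probabilities, the distance measured in units of bin index, and all names are assumptions.

```python
import random


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on the same ordered bins.

    Measured in units of bin index: the sum of absolute CDF differences.
    """
    dist, cdf_gap = 0.0, 0.0
    for pk, qk in zip(p, q):
        cdf_gap += pk - qk
        dist += abs(cdf_gap)
    return dist


def posterior_effect_size(counts, baseline, n_draws=4000, prior=0.5, seed=0):
    """Return (median, lower, upper) of d over Dirichlet posterior draws."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Dirichlet(counts + prior) draw via normalized gamma variates.
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        draws.append(wasserstein_1d([g / total for g in gammas], baseline))
    draws.sort()

    def quantile(frac):
        return draws[min(int(frac * n_draws), n_draws - 1)]

    return quantile(0.5), quantile(0.025), quantile(0.975)


baseline = [0.7, 0.2, 0.1]   # hypothetical baseline bin probabilities
d_median, d_ci_lower, d_ci_upper = posterior_effect_size([2, 1, 0], baseline)
```

Because the posterior tightens as read counts grow, the interval narrows with depth, which is what makes a quantity of this kind read-depth aware.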

For large runs requiring multiple testing, adaptive stopping can avoid performing the full --n-sim bootstrap iterations for alleles that can no longer reach the target tail probability. You can derive this target probability automatically for Benjamini-Hochberg p-value correction using stop_p = q * rank / m, where m is the number of tested alleles:

./trgt-instability test \
  --models models.gz \
  --data cases/sample.dists.txt.gz \
  --threads 16 \
  --n-sim 24000000 \
  --adaptive-fdr-q 0.05 \
  --adaptive-fdr-rank 1 \
  > sample-results.txt
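
The arithmetic behind the target probability and the stopping decision can be sketched as follows. The formula stop_p = q * rank / m comes from the text above; the p-value estimator (exceedances + 1) / (n_sim + 1), the function names, and the value of m are assumptions for illustration and may differ from what the tool does.

```python
def bh_stop_p(q: float, rank: int, m: int) -> float:
    """Benjamini-Hochberg threshold for the allele at the given rank."""
    return q * rank / m


def can_stop_early(exceedances: int, n_sim: int, stop_p: float) -> bool:
    """True once the final p-value can no longer be <= stop_p.

    The exceedance count only grows as simulations continue, so the final
    estimate is at least (exceedances + 1) / (n_sim + 1).
    """
    return (exceedances + 1) / (n_sim + 1) > stop_p


m = 1_000_000                          # hypothetical number of tested alleles
stop_p = bh_stop_p(0.05, 1, m)         # matches --adaptive-fdr-q 0.05 --adaptive-fdr-rank 1
still_running = not can_stop_early(0, 24_000_000, stop_p)
stopped = can_stop_early(1, 24_000_000, stop_p)
```

Under these assumptions, an allele with no exceedances must run all simulations to confirm a p-value below the threshold, while any allele with a single exceedance can already be abandoned.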

Reference

Need help?

If you notice any missing features, bugs, or need assistance with analyzing the output of TRGT-instability, please don't hesitate to open a GitHub issue or reach out to the authors by email.

Support information

TRGT-instability is pre-release software intended for research use only and not for use in diagnostic procedures. While efforts have been made to ensure that TRGT-instability lives up to the quality that PacBio strives for, we make no warranty regarding this software.

As TRGT-instability is not covered by any service level agreement or the like, please do not contact a PacBio Field Applications Scientist or PacBio Customer Service for assistance with any TRGT-instability release. Please report all issues through GitHub instead. We make no warranty that any such issue will be addressed, to any extent or within any time frame.

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
