This tool quantifies tandem repeat (TR) instability from the output of TRGT in three stages: read-divergence extraction, per-repeat instability model fitting, and query-allele testing for excess instability. The tool models read-to-consensus divergence for each allele to define a baseline instability distribution for each repeat observed in the sequencing data.
Warning: TRGT-instability is under active development and should be used only for experimentation and feedback.
| Authors and contributors | Affiliations |
|---|---|
| Egor Dolzhenko | PacBio |
| Adam English | Baylor College of Medicine |
| Tom Mokveld | PacBio |
| Guilherme de Sena Brandine | PacBio |
| Zev Kronenberg | PacBio |
| Galen Wright | University of Manitoba |
| Britt Drogemoller | University of Manitoba |
| William J. Rowell | PacBio |
| Aaron M. Wenger | PacBio |
| Mark F. Bennett | Walter and Eliza Hall Institute |
| Ben Weisburd | Broad Institute |
| Graham S. Erwin | Baylor College of Medicine |
| Peng Jin | Emory University School of Medicine |
| David Nelson | Baylor College of Medicine |
| Harriet Dashnow | University of Colorado Anschutz |
| Fritz J. Sedlazeck | Baylor College of Medicine |
| Michael A. Eberle | PacBio |
Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle.
- read divergence rate: the length-normalized edit distance between a read and its allele consensus
- allele instability profile: the 15-bin count vector summarizing read divergence rates for one allele
- repeat instability model: the fitted Dirichlet-multinomial model for one repeat locus
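
To make the first two definitions concrete, here is a minimal Python sketch of a read divergence rate and an allele instability profile. It is an illustration only, not the tool's implementation: the function names are hypothetical, normalizing by the consensus length is an assumption (the text says only "length-normalized"), and the evenly spaced bin edges are placeholders rather than the bins the tool uses.

```python
from bisect import bisect_right
from typing import List, Sequence

N_BINS = 15  # profile length stated in the definitions above

# Hypothetical interior cut points (14 edges give 15 bins); the real bin
# edges are chosen by the tool and are not described in this section.
HYPOTHETICAL_EDGES = [0.01 * k for k in range(1, N_BINS)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def divergence_rate(read: str, consensus: str) -> float:
    """Edit distance divided by consensus length (assumed normalization)."""
    if not consensus:
        raise ValueError("consensus must be non-empty")
    return edit_distance(read, consensus) / len(consensus)


def instability_profile(rates: Sequence[float],
                        edges: Sequence[float] = HYPOTHETICAL_EDGES) -> List[int]:
    """Count how many rates fall into each bin defined by the cut points."""
    counts = [0] * (len(edges) + 1)
    for rate in rates:
        counts[bisect_right(edges, rate)] += 1
    return counts


if __name__ == "__main__":
    consensus = "CAG" * 10
    reads = ["CAG" * 10, "CAG" * 11, "CAG" * 9 + "CAA"]
    rates = [divergence_rate(r, consensus) for r in reads]
    print(rates)
    print(instability_profile(rates))
```

Each allele yields one such count vector, and the repeat instability model is then fitted to the vectors observed for a given repeat.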

- trgt-instability divergence: computes edit distances between reads and their allele sequences used to determine divergence rates
- trgt-instability model: fits an instability model from allele instability profiles for each repeat
- trgt-instability test: tests a query allele for excess instability relative to the repeat-specific model
- trgt-instability bin: calculates shared discretization bins for read divergence rates

- A repeat catalog with the repeats to analyze
- TRGT outputs for all control samples
- TRGT outputs for all case samples
Run divergence on all case and control samples:
./trgt-instability divergence \
--repeats repeat-catalog.bed \
--spanning-reads sample.sorted.bam \
--variants sample.sorted.vcf.gz \
| gzip > sample.dists.txt.gz

divergence writes plain-text tab-delimited records to stdout, so the examples
pipe that output through gzip for the downstream commands.
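
Because the records are plain tab-delimited text, they are easy to inspect outside the tool. The sketch below counts records per repeat in a gzipped divergence file. It assumes that the first column holds the repeat identifier (trid), which is consistent with the sort keys used in the commands in this document but is not stated explicitly here; the function name is hypothetical.

```python
import gzip
from collections import Counter


def count_records_per_repeat(path: str) -> Counter:
    """Count divergence records per repeat, assuming column 1 is the trid."""
    counts: Counter = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            trid = line.split("\t")[0]  # assumed column layout
            counts[trid] += 1
    return counts


if __name__ == "__main__":
    for trid, n in count_records_per_repeat("sample.dists.txt.gz").most_common(10):
        print(trid, n)
```

A quick count like this can help spot repeats with too few records before fitting models.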
Combine divergence records for all control samples (the model command expects
--data records to be sorted by repeat identifier, trid):

zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz

If you want to compare the instability of different repeats, calculate shared discretization bins from the full control cohort (skip this step if you are analyzing repeats individually):

./trgt-instability bin --data controls-dists.txt.gz

Generate repeat-specific instability models for all repeats:

./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz

or use shared bins in bins.txt:

./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz

Now any case sample can be tested for excess instability:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt

Use --n-sim to control the maximum parametric-bootstrap depth per tested
allele (default: 100000), and --threads to parallelize testing, for
example:

./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt

The results consist of repeat identifiers, allele sequences, and the associated p-values. A low p-value indicates that the allele instability profile is unlikely under the fitted repeat-specific baseline model, i.e., evidence for excess instability.
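
To give some intuition for the p-value described above, the following sketch runs a small parametric bootstrap against a Dirichlet-multinomial baseline using only the Python standard library. It is not the tool's implementation: the function names, the toy 5-bin alpha vector (real profiles have 15 bins), the seed, and the use of the log-likelihood as the test statistic are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def dm_log_pmf(counts: Sequence[int], alpha: Sequence[float]) -> float:
    """Log-probability of a count vector under a Dirichlet-multinomial."""
    n = sum(counts)
    a0 = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(n + a0)
    for x, a in zip(counts, alpha):
        out += math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
    return out


def sample_dm(n_reads: int, alpha: Sequence[float], rng: random.Random) -> List[int]:
    """Draw one profile: Dirichlet bin probabilities, then multinomial counts."""
    # Normalized independent gamma draws are Dirichlet-distributed;
    # rng.choices normalizes the weights internally.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    counts = [0] * len(alpha)
    for k in rng.choices(range(len(alpha)), weights=gammas, k=n_reads):
        counts[k] += 1
    return counts


def bootstrap_p_value(counts: Sequence[int], alpha: Sequence[float],
                      n_sim: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated profiles at least as unlikely as the observed one."""
    rng = random.Random(seed)
    observed = dm_log_pmf(counts, alpha)
    n_reads = sum(counts)
    as_extreme = sum(
        dm_log_pmf(sample_dm(n_reads, alpha, rng), alpha) <= observed
        for _ in range(n_sim)
    )
    # Add-one estimator (an assumption here) keeps the estimate above zero.
    return (as_extreme + 1) / (n_sim + 1)


if __name__ == "__main__":
    alpha = [20.0, 5.0, 1.0, 0.5, 0.5]  # hypothetical fitted parameters
    print(bootstrap_p_value([24, 5, 1, 0, 0], alpha))  # profile near the baseline
    print(bootstrap_p_value([5, 5, 5, 8, 7], alpha))   # profile shifted to high divergence
```

In this sketch the smallest reportable p-value is 1 / (n_sim + 1), which is one way to see why resolving very small p-values requires a large bootstrap depth.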
If you also want a read-depth-aware quantity for comparing alleles by
their instability, test can calculate posterior effect-size information
derived from the fitted instability model:
./trgt-instability test \
--models models.gz \
--data cases/sample.dists.txt.gz \
--report-effect-size \
--n-posterior-draws 4000 \
> sample-results-with-d.txt

With --report-effect-size, each output line contains:
- d_median: posterior median of the Wasserstein distance from the fitted repeat-specific baseline
- d_ci_lower: lower bound of the 95% credible interval for d
- d_ci_upper: upper bound of the 95% credible interval for d

This d score is intended for ranking alleles by how far their instability
profile is from the fitted repeat-specific baseline; the p-value
remains the significance measure. See
docs/effect-size-reporting.md for details.
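
As a rough picture of how a posterior Wasserstein distance with a credible interval can be obtained, the sketch below draws bin probabilities from a Dirichlet posterior and measures their distance to the baseline mean. This is an assumption-laden illustration and not the method documented in docs/effect-size-reporting.md: using Dirichlet(alpha + counts) as the posterior, treating bins as unit-spaced points, the toy 5-bin alpha vector, and all function names are hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two distributions on unit-spaced bins."""
    dist, cdf_gap = 0.0, 0.0
    for pk, qk in zip(p, q):
        cdf_gap += pk - qk      # running difference of the two CDFs
        dist += abs(cdf_gap)
    return dist


def posterior_effect_size(counts: Sequence[int], alpha: Sequence[float],
                          n_draws: int = 4000, seed: int = 0
                          ) -> Tuple[float, float, float]:
    """Return (d_median, d_ci_lower, d_ci_upper) from posterior draws."""
    rng = random.Random(seed)
    a0 = sum(alpha)
    baseline = [a / a0 for a in alpha]  # mean of the fitted Dirichlet
    draws: List[float] = []
    for _ in range(n_draws):
        # Assumed conjugate posterior: Dirichlet(alpha + counts).
        g = [rng.gammavariate(a + x, 1.0) for a, x in zip(alpha, counts)]
        total = sum(g)
        draws.append(wasserstein_1d([v / total for v in g], baseline))
    draws.sort()

    def quantile(frac: float) -> float:
        return draws[min(n_draws - 1, int(frac * n_draws))]

    return quantile(0.5), quantile(0.025), quantile(0.975)


if __name__ == "__main__":
    alpha = [20.0, 5.0, 1.0, 0.5, 0.5]  # hypothetical fitted parameters
    print(posterior_effect_size([24, 5, 1, 0, 0], alpha))
    print(posterior_effect_size([5, 5, 5, 8, 7], alpha))
```

Under these assumptions, an allele with few reads has a posterior that stays close to the baseline and yields a wider interval, which is one way a distance can be made read-depth aware.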
For large runs that require multiple-testing correction, adaptive stopping can avoid performing
the full --n-sim bootstrap iterations for alleles that can no longer reach the
target tail probability. You can derive this target probability automatically
for Benjamini-Hochberg p-value correction using stop_p = q * rank / m, where m is the number of tested alleles:
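
As a worked illustration of this formula, the sketch below computes stop_p and applies a simple early-stopping check. The function names, the example value of m, and the add-one p-value estimate used in the check are assumptions; the tool's actual stopping rule is not described in this section.

```python
def bh_stop_threshold(q: float, rank: int, m: int) -> float:
    """Benjamini-Hochberg critical value for a given rank: q * rank / m."""
    if not (0.0 < q <= 1.0) or not (1 <= rank <= m):
        raise ValueError("need 0 < q <= 1 and 1 <= rank <= m")
    return q * rank / m


def can_stop_early(exceedances: int, n_sim_max: int, stop_p: float) -> bool:
    """True once even the best-case final p-value would exceed stop_p.

    The best case assumes no further simulated profile is as extreme as
    the observed one, with an add-one estimate (an assumption here).
    """
    best_case_p = (exceedances + 1) / (n_sim_max + 1)
    return best_case_p > stop_p


if __name__ == "__main__":
    stop_p = bh_stop_threshold(q=0.05, rank=1, m=1000)  # m is hypothetical
    print(stop_p)
    print(can_stop_early(exceedances=0, n_sim_max=100000, stop_p=stop_p))
    print(can_stop_early(exceedances=10, n_sim_max=100000, stop_p=stop_p))
```

With rank = 1 the threshold is the strictest Benjamini-Hochberg critical value, q / m. The invocation below passes q and rank through --adaptive-fdr-q and --adaptive-fdr-rank: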
./trgt-instability test \
--models models.gz \
--data cases/sample.dists.txt.gz \
--threads 16 \
--n-sim 24000000 \
--adaptive-fdr-q 0.05 \
--adaptive-fdr-rank 1 \
> sample-results.txt

If you notice any missing features or bugs, or need assistance with analyzing the output of TRGT-instability, please don't hesitate to open a GitHub issue or reach out to the authors by email.
TRGT-instability is pre-release software intended for research use only and not for use in diagnostic procedures. While efforts have been made to ensure that TRGT-instability lives up to the quality that PacBio strives for, we make no warranty regarding this software.
As TRGT-instability is not covered by any service level agreement or the like, please do not contact PacBio Field Applications Scientists or PacBio Customer Service for assistance with any TRGT-instability release. Please report all issues through GitHub instead. We make no warranty that any such issue will be addressed, to any extent or within any time frame.
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.