|
1 | 1 | # TRGT-instability |
2 | 2 |
|
3 | | -TRGT-instability is an add-tool for [TRGT](https://github.com/PacificBiosciences/trgt/) |
4 | | -designed to quantify instability of each repeat allele. |
| 3 | +This tool quantifies tandem repeat (TR) instability from the output of |
| 4 | +[TRGT](https://github.com/PacificBiosciences/trgt) in three stages: |
| 5 | +read-divergence extraction, per-repeat instability model fitting, and |
| 6 | +query-allele testing for excess instability. The tool models |
| 7 | +read-to-consensus divergence for each allele to define a baseline |
| 8 | +instability distribution for each repeat observed in the sequencing |
| 9 | +data. |
5 | 10 |
|
6 | | -## Availability |
| 11 | +> [!WARNING] |
| 12 | +> **Please note:** TRGT-instability is under active development and should be |
| 13 | +> used only for experimentation and feedback. |
7 | 14 |
|
8 | | -The latest binary can be found in the release page. |
| 15 | +## Authors and contributors |
9 | 16 |
|
10 | | -## Usage example |
| 17 | +| Authors and contributors | Affiliations | |
| 18 | +|----------------------------|-------------------------------------| |
| 19 | +| Egor Dolzhenko | PacBio | |
| 20 | +| Adam English | Baylor College of Medicine | |
| 21 | +| Tom Mokveld | PacBio | |
| 22 | +| Guilherme de Sena Brandine | PacBio | |
| 23 | +| Zev Kronenberg | PacBio | |
| 24 | +| Galen Wright | University of Manitoba | |
| 25 | +| Britt Drogemoller | University of Manitoba | |
| 26 | +| William J. Rowell | PacBio | |
| 27 | +| Aaron M. Wenger | PacBio | |
| 28 | +| Mark F. Bennett | Walter and Eliza Hall Institute | |
| 29 | +| Ben Weisburd | Broad Institute | |
| 30 | +| Graham S. Erwin | Baylor College of Medicine | |
| 31 | +| Peng Jin | Emory University School of Medicine | |
| 32 | +| David Nelson | Baylor College of Medicine | |
| 33 | +| Harriet Dashnow | University of Colorado Anschutz | |
| 34 | +| Fritz J. Sedlazeck | Baylor College of Medicine | |
| 35 | +| Michael A. Eberle | PacBio | |
11 | 36 |
|
12 | | -The tool can be run like so: |
| 37 | +Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle. |
| 38 | + |
| 39 | + |
| 40 | +## Terminology |
| 41 | + |
| 42 | +- `read divergence rate`: the length-normalized edit distance between a read and its allele consensus |
| 43 | +- `allele instability profile`: the 15-bin count vector summarizing read divergence rates for one allele |
| 44 | +- `repeat instability model`: the fitted Dirichlet-multinomial model for one repeat locus |
| 45 | + |
| 46 | +## Command overview |
| 47 | + |
| 48 | +- `trgt-instability divergence`: computes edit distances between reads and their allele sequences used to determine divergence rates |
| 49 | +- `trgt-instability model`: fits an instability model from allele instability profiles for each repeat |
| 50 | +- `trgt-instability test`: tests a query allele for excess instability relative to the repeat-specific model |
| 51 | +- `trgt-instability bin`: calculates shared discretization bins for read divergence rates |
| 52 | + |
| 53 | +## Inputs |
| 54 | + |
| 55 | +- A repeat catalog with to analyze |
| 56 | +- TRGT outputs for all control samples |
| 57 | +- TRGT outputs for all case samples |
| 58 | + |
| 59 | +## A complete workflow |
| 60 | + |
| 61 | +Run `divergence` on all case and control samples: |
| 62 | + |
| 63 | +```bash |
| 64 | +./trgt-instability divergence \ |
| 65 | + --repeats repeat-catalog.bed \ |
| 66 | + --spanning-reads sample.sorted.bam \ |
| 67 | + --variants sample.sorted.vcf.gz \ |
| 68 | + | gzip > sample.dists.txt.gz |
| 69 | +``` |
| 70 | + |
| 71 | +`divergence` writes plain-text tab-delimited records to stdout, so the examples |
| 72 | +pipe that output through `gzip` for the downstream commands. |
| 73 | + |
| 74 | +Combine divergence records for all control samples (the `model` command expects |
| 75 | +`--data` records to be sorted by repeat identifier (`trid`)): |
| 76 | + |
| 77 | +```bash |
| 78 | +zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz |
| 79 | +``` |
| 80 | + |
| 81 | +If you want to compare instability of different repeats calculate shared discretization bins from the full control cohort (skip this step if you are |
| 82 | +analyzing repeats individually): |
13 | 83 |
|
14 | 84 | ```bash |
15 | | -./trgt-instability --repeats repeats.bed --spanning-reads sample.trgt.sorted.bam |
| 85 | +./trgt-instability bin --data controls-dists.txt.gz |
16 | 86 | ``` |
17 | 87 |
|
18 | | -where |
| 88 | +Generate repeat-specific instability models for all repeats: |
19 | 89 |
|
20 | | -- `repeats.bed` is the BED file with repeat definitions; it must be the same file |
21 | | -as the one used to run TRGT, |
22 | | -- `sample.trgt.sorted.bam` is the BAM file generated by TRGT; it must be sorted |
23 | | -and indexed. |
| 90 | +```bash |
| 91 | +./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz |
| 92 | +``` |
24 | 93 |
|
25 | | -The output will look like this: |
| 94 | +or use shared bins in `bins.txt`: |
26 | 95 |
|
27 | | -```tsv |
28 | | -trid allele motif motif_counts expansion_rate contraction_rate model_params |
29 | | -FMR1 1 CGG 84,93,94,94,95,95,95,96,97,97,99,100,100,101,101,101,104,104,107,108,109,113,116,126 0.06 0.05 0.96:0.01:0.61:0.80 |
| 96 | +```bash |
| 97 | +./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz |
30 | 98 | ``` |
31 | 99 |
|
32 | | -where: |
| 100 | +Now any case sample can be tested for excess instability: |
| 101 | + |
| 102 | +```bash |
| 103 | +./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt |
| 104 | +``` |
| 105 | + |
| 106 | +Use `--n-sim` to control the maximum parametric-bootstrap depth per tested |
| 107 | +allele (default: `100000`), and `--threads` to parallelize testing, for |
| 108 | +example: |
| 109 | + |
| 110 | +```bash |
| 111 | +./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt |
| 112 | +``` |
| 113 | + |
| 114 | +The results consist of repeat identifiers, allele sequences, and the associated |
| 115 | +p-values. A low p-value indicates that the allele instability profile is |
| 116 | +unlikely under the fitted repeat-specific baseline model, i.e. evidence for |
| 117 | +excess instability. |
| 118 | + |
| 119 | +If you also want a (read depth aware) quantity to compare alleles by |
| 120 | +their instability, `test` can calculate posterior effect-size information |
| 121 | +derived from the fitted instability model: |
| 122 | + |
| 123 | +```bash |
| 124 | +./trgt-instability test \ |
| 125 | + --models models.gz \ |
| 126 | + --data cases/sample.dists.txt.gz \ |
| 127 | + --report-effect-size \ |
| 128 | + --n-posterior-draws 4000 \ |
| 129 | + > sample-results-with-d.txt |
| 130 | +``` |
| 131 | + |
| 132 | +With `--report-effect-size`, each output line contains: |
| 133 | + |
| 134 | +- `d_median`: posterior median of the Wasserstein distance from the fitted repeat-specific baseline |
| 135 | +- `d_ci_lower`: lower bound of the 95% credible interval for `d` |
| 136 | +- `d_ci_upper`: upper bound of the 95% credible interval for `d` |
| 137 | + |
| 138 | +This `d` score is intended for ranking alleles by how far their instability |
| 139 | +profile is from the fitted repeat-specific baseline, (the p-value |
| 140 | +remains the significance measure). See |
| 141 | +[`docs/effect-size-reporting.md`](docs/effect-size-reporting.md) for details. |
| 142 | + |
| 143 | +For large runs requiring multiple testing, adaptive stopping can avoid performing |
| 144 | +the full `--n-sim` bootstrap iterations for alleles that can no longer reach the |
| 145 | +target tail probability. You can derive this target probability automatically |
| 146 | +for Benjamini-Hochberg p-value correction using `stop_p = q * rank / m`, where `m` is the number of tested alleles: |
| 147 | + |
| 148 | +```bash |
| 149 | +./trgt-instability test \ |
| 150 | + --models models.gz \ |
| 151 | + --data cases/sample.dists.txt.gz \ |
| 152 | + --threads 16 \ |
| 153 | + --n-sim 24000000 \ |
| 154 | + --adaptive-fdr-q 0.05 \ |
| 155 | + --adaptive-fdr-rank 1 \ |
| 156 | + > sample-results.txt |
| 157 | +``` |
| 158 | + |
| 159 | +## Reference |
| 160 | + |
| 161 | +- [File formats](docs/file-formats.md) |
| 162 | +- [Effect-size reporting for `test`](docs/effect-size-reporting.md) |
| 163 | + |
| 164 | +## Need help? |
| 165 | + |
| 166 | +If you notice any missing features, bugs, or need assistance with analyzing the |
| 167 | +output of TRGT-instability, please don't hesitate to open a GitHub issue or |
| 168 | +reach out to the authors by [email](mailto:edolzhenko@pacificbiosciences.com). |
| 169 | + |
| 170 | +## Support information |
| 171 | + |
| 172 | +TRGT-instability is a pre-release software intended for research use only and |
| 173 | +not for use in diagnostic procedures. While efforts have been made to ensure |
| 174 | +that TRGT-instability lives up to the quality that PacBio strives for, we make |
| 175 | +no warranty regarding this software. |
| 176 | + |
| 177 | +As TRGT-instability is not covered by any service level agreement or the like, |
| 178 | +please do not contact a PacBio Field Applications Scientists or PacBio Customer |
| 179 | +Service for assistance with any TRGT-instability release. Please report all |
| 180 | +issues through GitHub instead. We make no warranty that any such issue will be |
| 181 | +addressed, to any extent or within any time frame. |
33 | 182 |
|
34 | | -- `trid` is the tandem repeat identifier from `repeats.bed` |
35 | | -- `allele` is the allele index |
36 | | -- `motif` is the motif profiled |
37 | | -- `motif_count` is the number of motifs identified in each read |
38 | | -- `expansion_rate` is the estimated expansion rate (see below) |
39 | | -- `contraction_rate` is the estimated contraction rate (see below) |
40 | | -- `model_params` are the estimated parameters of the instability model |
| 183 | +### DISCLAIMER |
41 | 184 |
|
42 | | -The expansion and contraction rate parameters are motivated by a simple |
43 | | -model where mosaicism can occur due to slippage during DNA replication. |
44 | | -An expansion rate of 0.01 means that an allele consisting of 100 motifs |
45 | | -is expected to have a single motif expansion event during replication. |
| 185 | +THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE |
| 186 | +PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY |
| 187 | +KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES |
| 188 | +OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A |
| 189 | +PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF |
| 190 | +THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR |
| 191 | +APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY |
| 192 | +OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO |
| 193 | +NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES. |
0 commit comments