Skip to content

Commit 150aab5

Browse files
Release version 1.0.0
1 parent 92acb0d commit 150aab5

4 files changed

Lines changed: 336 additions & 28 deletions

File tree

LICENSE.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
# Pacific Biosciences Software License Agreement
2+
1. **Introduction and Acceptance.** This Software License Agreement (this “**Agreement**”) is a legal agreement between you (either an individual or an entity) and Pacific Biosciences of California, Inc. (“**PacBio**”) regarding the use of the PacBio software accompanying this Agreement, which includes documentation provided in “online” or electronic form (together, the “**Software**”). PACBIO PROVIDES THE SOFTWARE SOLELY ON THE TERMS AND CONDITIONS SET FORTH IN THIS AGREEMENT AND ON THE CONDITION THAT YOU ACCEPT AND COMPLY WITH THEM. BY DOWNLOADING, DISTRIBUTING, MODIFYING OR OTHERWISE USING THE SOFTWARE, YOU (A) ACCEPT THIS AGREEMENT AND AGREE THAT YOU ARE LEGALLY BOUND BY ITS TERMS; AND (B) REPRESENT AND WARRANT THAT: (I) YOU ARE OF LEGAL AGE TO ENTER INTO A BINDING AGREEMENT; AND (II) IF YOU REPRESENT A CORPORATION, GOVERNMENTAL ORGANIZATION OR OTHER LEGAL ENTITY, YOU HAVE THE RIGHT, POWER AND AUTHORITY TO ENTER INTO THIS AGREEMENT ON BEHALF OF SUCH ENTITY AND BIND SUCH ENTITY TO THESE TERMS. IF YOU DO NOT AGREE TO THE TERMS OF THIS AGREEMENT, PACBIO WILL NOT AND DOES NOT LICENSE THE SOFTWARE TO YOU AND YOU MUST NOT DOWNLOAD, INSTALL OR OTHERWISE USE THE SOFTWARE OR DOCUMENTATION.
3+
2. **Grant of License.** Subject to your compliance with the restrictions set forth in this Agreement, PacBio hereby grants to you a non-exclusive, non-transferable license during the Term to install, copy, use, distribute in binary form only, and host the Software. If you received the Software from PacBio in source code format, you may also modify and/or compile the Software.
4+
3. **License Restrictions.** You may not remove or destroy any copyright notices or other proprietary markings. You may only use the Software to process or analyze data generated on a PacBio instrument or otherwise provided to you by PacBio. Any use, modification, translation, or compilation of the Software not expressly authorized in Section 2 is prohibited. You may not use, modify, host, or distribute the Software so that any part of the Software becomes subject to any license that requires, as a condition of use, modification, hosting, or distribution, that (a) the Software, in whole or in part, be disclosed or distributed in source code form or (b) any third party have the right to modify the Software, in whole or in part.
5+
4. **Ownership.** The license granted to you in Section 2 is not a transfer or sale of PacBio’s ownership rights in or to the Software. Except for the license granted in Section 2, PacBio retains all right, title and interest (including all intellectual property rights) in and to the Software. The Software is protected by applicable intellectual property laws, including United States copyright laws and international treaties.
6+
5. **Third Party Materials.** The Software may include software, content, data or other materials, including related documentation and open source software, that are owned by one or more third parties and that are subject to separate licensee terms (“**Third-Party Licenses**”). A list of all materials, if any, can be found the documentation for the Software. You acknowledge and agree that such third party materials subject to Third-Party Licenses are not licensed to you pursuant to the provisions of this Agreement and that this Agreement shall not be construed to grant any such right and/or license. You shall have only such rights and/or licenses, if any, to use such third party materials as set forth in the applicable Third-Party Licenses.
7+
6. **Feedback.** If you provide any feedback to PacBio concerning the functionality and performance of the Software, including identifying potential errors and improvements (“**Feedback**”), such Feedback shall be owned by PacBio. You hereby assign to PacBio all right, title, and interest in and to the Feedback, and PacBio is free to use the Feedback without any payment or restriction.
8+
7. **Confidentiality.** You must hold in the strictest confidence the Software and any related materials or information including, but not limited to, any Feedback, technical data, research, product plans, or know-how provided by PacBio to you, directly or indirectly in writing, orally or by inspection of tangible objects (“**Confidential Information**”). You will not disclose any Confidential Information to third parties, including any of your employees who do not have a need to know such information, and you will take reasonable measures to protect the secrecy of, and to avoid disclosure and unauthorized use of, the Confidential Information. You will immediately notify the PacBio in the event of any unauthorized or suspected use or disclosure of the Confidential Information. To protect the Confidential Information contained in the Software, you may not reverse engineer, decompile, or disassemble the Software, except to the extent the foregoing restriction is expressly prohibited by applicable law.
9+
8. **Termination.** This Agreement will terminate upon the earlier of: (a) your failure to comply with any term of this Agreement; or (b) return, destruction, or deletion of all copies of the Software in your possession. PacBio’s rights and your obligations will survive the termination of this Agreement. The “**Term**” means the period beginning on when this Agreement becomes effective until the termination of this Agreement. Upon termination of this Agreement for any reason, you will delete from all of your computer libraries or storage devices or otherwise destroy all copies of the Software and derivatives thereof.
10+
9. **NO OTHER WARRANTIES.** THE SOFTWARE IS PROVIDED ON AN “AS IS” BASIS. YOU ASSUME ALL RESPONSIBILITIES FOR SELECTION OF THE SOFTWARE TO ACHIEVE YOUR INTENDED RESULTS, AND FOR THE INSTALLATION OF, USE OF, AND RESULTS OBTAINED FROM THE SOFTWARE. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, PACBIO DISCLAIMS ALL WARRANTIES, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, ACCURACY, TITLE, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE SOFTWARE AND THE ACCOMPANYING WRITTEN MATERIALS. THERE IS NO WARRANTY AGAINST INTERFERENCE WITH THE ENJOYMENT OF THE SOFTWARE OR AGAINST INFRINGEMENT. THERE IS NO WARRANTY THAT THE SOFTWARE OR PACBIO’S EFFORTS WILL FULFILL ANY OF YOUR PARTICULAR PURPOSES OR NEEDS.
11+
10. **LIMITATION OF LIABILITY.** UNDER NO CIRCUMSTANCES WILL PACBIO BE LIABLE FOR ANY CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, LOSS OF DATA OR OTHER SUCH PECUNIARY LOSS) ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE, EVEN IF PACBIO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN NO EVENT WILL PACBIO’S AGGREGATE LIABILITY FOR DAMAGES ARISING OUT OF THIS AGREEMENT EXCEED $5. THE FOREGOING EXCLUSIONS AND LIMITATIONS OF LIABILITY AND DAMAGES WILL NOT APPLY TO CONSEQUENTIAL DAMAGES FOR PERSONAL INJURY.
12+
11. **Indemnification.** You will indemnify, hold harmless, and defend PacBio (including all of its officers, employees, directors, subsidiaries, representatives, affiliates, and agents) and PacBio’s suppliers from and against any damages (including attorney’s fees and expenses), claims, and lawsuits that arise or result from your use of the Software.
13+
12. **Trademarks.** Certain of the product and PacBio names used in this Agreement, the Software may constitute trademarks of PacBio or third parties. You are not authorized to use any such trademarks.
14+
13. **Export Restrictions.** YOU UNDERSTAND AND AGREE THAT THE SOFTWARE IS SUBJECT TO UNITED STATES AND OTHER APPLICABLE EXPORT-RELATED LAWS AND REGULATIONS AND THAT YOU MAY NOT EXPORT, RE-EXPORT OR TRANSFER THE SOFTWARE OR ANY DIRECT PRODUCT OF THE SOFTWARE EXCEPT AS PERMITTED UNDER THOSE LAWS. WITHOUT LIMITING THE FOREGOING, EXPORT, RE-EXPORT, OR TRANSFER OF THE SOFTWARE TO CUBA, IRAN, NORTH KOREA, SYRIA, RUSSIA, BELARUS, AND THE REGIONS OF CRIMEA, LNR, AND DNR OF UKRAINE IS PROHIBITED.
15+
14. **General.** This Agreement is governed by the laws of the State of California, without reference to its conflict of laws principles. This Agreement is the entire agreement between you and PacBio and supersedes any other communications with respect to the Software. If any provision of this Agreement is held invalid or unenforceable, the remainder of this Agreement will continue in full force and effect.

README.md

Lines changed: 176 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,45 +1,193 @@
11
# TRGT-instability
22

3-
TRGT-instability is an add-tool for [TRGT](https://github.com/PacificBiosciences/trgt/)
4-
designed to quantify instability of each repeat allele.
3+
This tool quantifies tandem repeat (TR) instability from the output of
4+
[TRGT](https://github.com/PacificBiosciences/trgt) in three stages:
5+
read-divergence extraction, per-repeat instability model fitting, and
6+
query-allele testing for excess instability. The tool models
7+
read-to-consensus divergence for each allele to define a baseline
8+
instability distribution for each repeat observed in the sequencing
9+
data.
510

6-
## Availability
11+
> [!WARNING]
12+
> **Please note:** TRGT-instability is under active development and should be
13+
> used only for experimentation and feedback.
714
8-
The latest binary can be found in the release page.
15+
## Authors and contributors
916

10-
## Usage example
17+
| Authors and contributors | Affiliations |
18+
|----------------------------|-------------------------------------|
19+
| Egor Dolzhenko | PacBio |
20+
| Adam English | Baylor College of Medicine |
21+
| Tom Mokveld | PacBio |
22+
| Guilherme de Sena Brandine | PacBio |
23+
| Zev Kronenberg | PacBio |
24+
| Galen Wright | University of Manitoba |
25+
| Britt Drogemoller | University of Manitoba |
26+
| William J. Rowell | PacBio |
27+
| Aaron M. Wenger | PacBio |
28+
| Mark F. Bennett | Walter and Eliza Hall Institute |
29+
| Ben Weisburd | Broad Institute |
30+
| Graham S. Erwin | Baylor College of Medicine |
31+
| Peng Jin | Emory University School of Medicine |
32+
| David Nelson | Baylor College of Medicine |
33+
| Harriet Dashnow | University of Colorado Anschutz |
34+
| Fritz J. Sedlazeck | Baylor College of Medicine |
35+
| Michael A. Eberle | PacBio |
1136

12-
The tool can be run like so:
37+
Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle.
38+
39+
40+
## Terminology
41+
42+
- `read divergence rate`: the length-normalized edit distance between a read and its allele consensus
43+
- `allele instability profile`: the 15-bin count vector summarizing read divergence rates for one allele
44+
- `repeat instability model`: the fitted Dirichlet-multinomial model for one repeat locus
45+
46+
## Command overview
47+
48+
- `trgt-instability divergence`: computes edit distances between reads and their allele sequences used to determine divergence rates
49+
- `trgt-instability model`: fits an instability model from allele instability profiles for each repeat
50+
- `trgt-instability test`: tests a query allele for excess instability relative to the repeat-specific model
51+
- `trgt-instability bin`: calculates shared discretization bins for read divergence rates
52+
53+
## Inputs
54+
55+
- A repeat catalog with to analyze
56+
- TRGT outputs for all control samples
57+
- TRGT outputs for all case samples
58+
59+
## A complete workflow
60+
61+
Run `divergence` on all case and control samples:
62+
63+
```bash
64+
./trgt-instability divergence \
65+
--repeats repeat-catalog.bed \
66+
--spanning-reads sample.sorted.bam \
67+
--variants sample.sorted.vcf.gz \
68+
| gzip > sample.dists.txt.gz
69+
```
70+
71+
`divergence` writes plain-text tab-delimited records to stdout, so the examples
72+
pipe that output through `gzip` for the downstream commands.
73+
74+
Combine divergence records for all control samples (the `model` command expects
75+
`--data` records to be sorted by repeat identifier (`trid`)):
76+
77+
```bash
78+
zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz
79+
```
80+
81+
If you want to compare instability of different repeats calculate shared discretization bins from the full control cohort (skip this step if you are
82+
analyzing repeats individually):
1383

1484
```bash
15-
./trgt-instability --repeats repeats.bed --spanning-reads sample.trgt.sorted.bam
85+
./trgt-instability bin --data controls-dists.txt.gz
1686
```
1787

18-
where
88+
Generate repeat-specific instability models for all repeats:
1989

20-
- `repeats.bed` is the BED file with repeat definitions; it must be the same file
21-
as the one used to run TRGT,
22-
- `sample.trgt.sorted.bam` is the BAM file generated by TRGT; it must be sorted
23-
and indexed.
90+
```bash
91+
./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz
92+
```
2493

25-
The output will look like this:
94+
or use shared bins in `bins.txt`:
2695

27-
```tsv
28-
trid allele motif motif_counts expansion_rate contraction_rate model_params
29-
FMR1 1 CGG 84,93,94,94,95,95,95,96,97,97,99,100,100,101,101,101,104,104,107,108,109,113,116,126 0.06 0.05 0.96:0.01:0.61:0.80
96+
```bash
97+
./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz
3098
```
3199

32-
where:
100+
Now any case sample can be tested for excess instability:
101+
102+
```bash
103+
./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt
104+
```
105+
106+
Use `--n-sim` to control the maximum parametric-bootstrap depth per tested
107+
allele (default: `100000`), and `--threads` to parallelize testing, for
108+
example:
109+
110+
```bash
111+
./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt
112+
```
113+
114+
The results consist of repeat identifiers, allele sequences, and the associated
115+
p-values. A low p-value indicates that the allele instability profile is
116+
unlikely under the fitted repeat-specific baseline model, i.e. evidence for
117+
excess instability.
118+
119+
If you also want a (read depth aware) quantity to compare alleles by
120+
their instability, `test` can calculate posterior effect-size information
121+
derived from the fitted instability model:
122+
123+
```bash
124+
./trgt-instability test \
125+
--models models.gz \
126+
--data cases/sample.dists.txt.gz \
127+
--report-effect-size \
128+
--n-posterior-draws 4000 \
129+
> sample-results-with-d.txt
130+
```
131+
132+
With `--report-effect-size`, each output line contains:
133+
134+
- `d_median`: posterior median of the Wasserstein distance from the fitted repeat-specific baseline
135+
- `d_ci_lower`: lower bound of the 95% credible interval for `d`
136+
- `d_ci_upper`: upper bound of the 95% credible interval for `d`
137+
138+
This `d` score is intended for ranking alleles by how far their instability
139+
profile is from the fitted repeat-specific baseline, (the p-value
140+
remains the significance measure). See
141+
[`docs/effect-size-reporting.md`](docs/effect-size-reporting.md) for details.
142+
143+
For large runs requiring multiple testing, adaptive stopping can avoid performing
144+
the full `--n-sim` bootstrap iterations for alleles that can no longer reach the
145+
target tail probability. You can derive this target probability automatically
146+
for Benjamini-Hochberg p-value correction using `stop_p = q * rank / m`, where `m` is the number of tested alleles:
147+
148+
```bash
149+
./trgt-instability test \
150+
--models models.gz \
151+
--data cases/sample.dists.txt.gz \
152+
--threads 16 \
153+
--n-sim 24000000 \
154+
--adaptive-fdr-q 0.05 \
155+
--adaptive-fdr-rank 1 \
156+
> sample-results.txt
157+
```
158+
159+
## Reference
160+
161+
- [File formats](docs/file-formats.md)
162+
- [Effect-size reporting for `test`](docs/effect-size-reporting.md)
163+
164+
## Need help?
165+
166+
If you notice any missing features, bugs, or need assistance with analyzing the
167+
output of TRGT-instability, please don't hesitate to open a GitHub issue or
168+
reach out to the authors by [email](mailto:edolzhenko@pacificbiosciences.com).
169+
170+
## Support information
171+
172+
TRGT-instability is a pre-release software intended for research use only and
173+
not for use in diagnostic procedures. While efforts have been made to ensure
174+
that TRGT-instability lives up to the quality that PacBio strives for, we make
175+
no warranty regarding this software.
176+
177+
As TRGT-instability is not covered by any service level agreement or the like,
178+
please do not contact a PacBio Field Applications Scientists or PacBio Customer
179+
Service for assistance with any TRGT-instability release. Please report all
180+
issues through GitHub instead. We make no warranty that any such issue will be
181+
addressed, to any extent or within any time frame.
33182

34-
- `trid` is the tandem repeat identifier from `repeats.bed`
35-
- `allele` is the allele index
36-
- `motif` is the motif profiled
37-
- `motif_count` is the number of motifs identified in each read
38-
- `expansion_rate` is the estimated expansion rate (see below)
39-
- `contraction_rate` is the estimated contraction rate (see below)
40-
- `model_params` are the estimated parameters of the instability model
183+
### DISCLAIMER
41184

42-
The expansion and contraction rate parameters are motivated by a simple
43-
model where mosaicism can occur due to slippage during DNA replication.
44-
An expansion rate of 0.01 means that an allele consisting of 100 motifs
45-
is expected to have a single motif expansion event during replication.
185+
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE
186+
PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY
187+
KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES
188+
OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A
189+
PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF
190+
THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR
191+
APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY
192+
OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO
193+
NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

0 commit comments

Comments
 (0)