PacificBiosciences
diff --git a/LICENSE.md b/LICENSE.md
Lines changed: 15 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 176 additions & 28 deletions
@@ -0,0 +1,15 @@
+# Pacific Biosciences Software License Agreement
+1.	**Introduction and Acceptance.** This Software License Agreement (this “**Agreement**”) is a legal agreement between you (either an individual or an entity) and Pacific Biosciences of California, Inc. (“**PacBio**”) regarding the use of the PacBio software accompanying this Agreement, which includes documentation provided in “online” or electronic form (together, the “**Software**”). PACBIO PROVIDES THE SOFTWARE SOLELY ON THE TERMS AND CONDITIONS SET FORTH IN THIS AGREEMENT AND ON THE CONDITION THAT YOU ACCEPT AND COMPLY WITH THEM. BY DOWNLOADING, DISTRIBUTING, MODIFYING OR OTHERWISE USING THE SOFTWARE, YOU (A) ACCEPT THIS AGREEMENT AND AGREE THAT YOU ARE LEGALLY BOUND BY ITS TERMS; AND (B) REPRESENT AND WARRANT THAT: (I) YOU ARE OF LEGAL AGE TO ENTER INTO A BINDING AGREEMENT; AND (II) IF YOU REPRESENT A CORPORATION, GOVERNMENTAL ORGANIZATION OR OTHER LEGAL ENTITY, YOU HAVE THE RIGHT, POWER AND AUTHORITY TO ENTER INTO THIS AGREEMENT ON BEHALF OF SUCH ENTITY AND BIND SUCH ENTITY TO THESE TERMS. IF YOU DO NOT AGREE TO THE TERMS OF THIS AGREEMENT, PACBIO WILL NOT AND DOES NOT LICENSE THE SOFTWARE TO YOU AND YOU MUST NOT DOWNLOAD, INSTALL OR OTHERWISE USE THE SOFTWARE OR DOCUMENTATION. 
+2.	**Grant of License.** Subject to your compliance with the restrictions set forth in this Agreement, PacBio hereby grants to you a non-exclusive, non-transferable license during the Term to install, copy, use, distribute in binary form only, and host the Software. If you received the Software from PacBio in source code format, you may also modify and/or compile the Software. 
+3.	**License Restrictions.** You may not remove or destroy any copyright notices or other proprietary markings. You may only use the Software to process or analyze data generated on a PacBio instrument or otherwise provided to you by PacBio. Any use, modification, translation, or compilation of the Software not expressly authorized in Section 2 is prohibited. You may not use, modify, host, or distribute the Software so that any part of the Software becomes subject to any license that requires, as a condition of use, modification, hosting, or distribution, that (a) the Software, in whole or in part, be disclosed or distributed in source code form or (b) any third party have the right to modify the Software, in whole or in part. 
+4.	**Ownership.** The license granted to you in Section 2 is not a transfer or sale of PacBio’s ownership rights in or to the Software. Except for the license granted in Section 2, PacBio retains all right, title and interest (including all intellectual property rights) in and to the Software. The Software is protected by applicable intellectual property laws, including United States copyright laws and international treaties.  
+5.	**Third Party Materials.** The Software may include software, content, data or other materials, including related documentation and open source software, that are owned by one or more third parties and that are subject to separate licensee terms (“**Third-Party Licenses**”). A list of all materials, if any, can be found the documentation for the Software. You acknowledge and agree that such third party materials subject to Third-Party Licenses are not licensed to you pursuant to the provisions of this Agreement and that this Agreement shall not be construed to grant any such right and/or license. You shall have only such rights and/or licenses, if any, to use such third party materials as set forth in the applicable Third-Party Licenses. 
+6.	**Feedback.** If you provide any feedback to PacBio concerning the functionality and performance of the Software, including identifying potential errors and improvements (“**Feedback**”), such Feedback shall be owned by PacBio. You hereby assign to PacBio all right, title, and interest in and to the Feedback, and PacBio is free to use the Feedback without any payment or restriction.
+7.	**Confidentiality.** You must hold in the strictest confidence the Software and any related materials or information including, but not limited to, any Feedback, technical data, research, product plans, or know-how provided by PacBio to you, directly or indirectly in writing, orally or by inspection of tangible objects (“**Confidential Information**”). You will not disclose any Confidential Information to third parties, including any of your employees who do not have a need to know such information, and you will take reasonable measures to protect the secrecy of, and to avoid disclosure and unauthorized use of, the Confidential Information. You will immediately notify the PacBio in the event of any unauthorized or suspected use or disclosure of the Confidential Information. To protect the Confidential Information contained in the Software, you may not reverse engineer, decompile, or disassemble the Software, except to the extent the foregoing restriction is expressly prohibited by applicable law.  
+8.	**Termination.** This Agreement will terminate upon the earlier of:  (a) your failure to comply with any term of this Agreement; or (b) return, destruction, or deletion of all copies of the Software in your possession. PacBio’s rights and your obligations will survive the termination of this Agreement. The “**Term**” means the period beginning on when this Agreement becomes effective until the termination of this Agreement. Upon termination of this Agreement for any reason, you will delete from all of your computer libraries or storage devices or otherwise destroy all copies of the Software and derivatives thereof.
+9.	**NO OTHER WARRANTIES.** THE SOFTWARE IS PROVIDED ON AN “AS IS” BASIS. YOU ASSUME ALL RESPONSIBILITIES FOR SELECTION OF THE SOFTWARE TO ACHIEVE YOUR INTENDED RESULTS, AND FOR THE INSTALLATION OF, USE OF, AND RESULTS OBTAINED FROM THE SOFTWARE. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, PACBIO DISCLAIMS ALL WARRANTIES, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, ACCURACY, TITLE, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE SOFTWARE AND THE ACCOMPANYING WRITTEN MATERIALS. THERE IS NO WARRANTY AGAINST INTERFERENCE WITH THE ENJOYMENT OF THE SOFTWARE OR AGAINST INFRINGEMENT. THERE IS NO WARRANTY THAT THE SOFTWARE OR PACBIO’S EFFORTS WILL FULFILL ANY OF YOUR PARTICULAR PURPOSES OR NEEDS.
+10.	**LIMITATION OF LIABILITY.** UNDER NO CIRCUMSTANCES WILL PACBIO BE LIABLE FOR ANY CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, LOSS OF DATA OR OTHER SUCH PECUNIARY LOSS) ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE, EVEN IF PACBIO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN NO EVENT WILL PACBIO’S AGGREGATE LIABILITY FOR DAMAGES ARISING OUT OF THIS AGREEMENT EXCEED $5. THE FOREGOING EXCLUSIONS AND LIMITATIONS OF LIABILITY AND DAMAGES WILL NOT APPLY TO CONSEQUENTIAL DAMAGES FOR PERSONAL INJURY.
+11.	**Indemnification.** You will indemnify, hold harmless, and defend PacBio (including all of its officers, employees, directors, subsidiaries, representatives, affiliates, and agents) and PacBio’s suppliers from and against any damages (including attorney’s fees and expenses), claims, and lawsuits that arise or result from your use of the Software.
+12.	**Trademarks.** Certain of the product and PacBio names used in this Agreement, the Software may constitute trademarks of PacBio or third parties. You are not authorized to use any such trademarks.
+13.	**Export Restrictions.** YOU UNDERSTAND AND AGREE THAT THE SOFTWARE IS SUBJECT TO UNITED STATES AND OTHER APPLICABLE EXPORT-RELATED LAWS AND REGULATIONS AND THAT YOU MAY NOT EXPORT, RE-EXPORT OR TRANSFER THE SOFTWARE OR ANY DIRECT PRODUCT OF THE SOFTWARE EXCEPT AS PERMITTED UNDER THOSE LAWS. WITHOUT LIMITING THE FOREGOING, EXPORT, RE-EXPORT, OR TRANSFER OF THE SOFTWARE TO CUBA, IRAN, NORTH KOREA, SYRIA, RUSSIA, BELARUS, AND THE REGIONS OF CRIMEA, LNR, AND DNR OF UKRAINE IS PROHIBITED.
+14.	**General.** This Agreement is governed by the laws of the State of California, without reference to its conflict of laws principles. This Agreement is the entire agreement between you and PacBio and supersedes any other communications with respect to the Software. If any provision of this Agreement is held invalid or unenforceable, the remainder of this Agreement will continue in full force and effect.
@@ -1,45 +1,193 @@
 # TRGT-instability
 
-TRGT-instability is an add-tool for [TRGT](https://github.com/PacificBiosciences/trgt/)
-designed to quantify instability of each repeat allele.
+This tool quantifies tandem repeat (TR) instability from the output of
+[TRGT](https://github.com/PacificBiosciences/trgt) in three stages:
+read-divergence extraction, per-repeat instability model fitting, and
+query-allele testing for excess instability. The tool models
+read-to-consensus divergence for each allele to define a baseline
+instability distribution for each repeat observed in the sequencing
+data.
 
-## Availability
+> [!WARNING]
+> **Please note:** TRGT-instability is under active development and should be
+> used only for experimentation and feedback.
 
-The latest binary can be found in the release page.
+## Authors and contributors
 
-## Usage example
+| Authors and contributors   | Affiliations                        |
+|----------------------------|-------------------------------------|
+| Egor Dolzhenko             | PacBio                              |
+| Adam English               | Baylor College of Medicine          |
+| Tom Mokveld                | PacBio                              |
+| Guilherme de Sena Brandine | PacBio                              |
+| Zev Kronenberg             | PacBio                              |
+| Galen Wright               | University of Manitoba              |
+| Britt Drogemoller          | University of Manitoba              |
+| William J. Rowell          | PacBio                              |
+| Aaron M. Wenger            | PacBio                              |
+| Mark F. Bennett            | Walter and Eliza Hall Institute     |
+| Ben Weisburd               | Broad Institute                     |
+| Graham S. Erwin            | Baylor College of Medicine          |
+| Peng Jin                   | Emory University School of Medicine |
+| David Nelson               | Baylor College of Medicine          |
+| Harriet Dashnow            | University of Colorado Anschutz     |
+| Fritz J. Sedlazeck         | Baylor College of Medicine          |
+| Michael A. Eberle          | PacBio                              |
 
-The tool can be run like so:
+Equal contribution: Egor Dolzhenko, Adam English, Fritz J. Sedlazeck, Michael A. Eberle.
+
+
+## Terminology
+
+- `read divergence rate`: the length-normalized edit distance between a read and its allele consensus
+- `allele instability profile`: the 15-bin count vector summarizing read divergence rates for one allele
+- `repeat instability model`: the fitted Dirichlet-multinomial model for one repeat locus
+
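As an illustration of the terminology above, the sketch below turns a stream of read divergence rates into a 15-bin count vector. It is not the tool's implementation: the helper name `profile`, the equal-width bins, and the `max` upper bound are all assumptions made for the example, and the real bin edges come from `trgt-instability` itself.

```shell
# Hypothetical helper (not part of trgt-instability): reads one divergence
# rate per line on stdin and prints a comma-separated 15-bin count vector.
# Assumption: equal-width bins over [0, max); max defaults to 1.
profile() {
  awk -v n=15 -v max="${1:-1}" '
    {
      i = int($1 / max * n)        # bin index for this rate
      if (i >= n) i = n - 1        # clamp rates at or above max into the last bin
      if (i < 0) i = 0             # clamp negative input into the first bin
      count[i]++
    }
    END {
      for (i = 0; i < n; i++)
        printf "%d%s", count[i], (i < n - 1 ? "," : "\n")
    }'
}
```

For example, `printf '0.01\n0.02\n0.5\n' | profile` puts the first two rates in bin 0 and the third in bin 7.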
+## Command overview
+
+- `trgt-instability divergence`: computes edit distances between reads and their allele sequences; these distances are used to determine read divergence rates
+- `trgt-instability model`: fits an instability model from allele instability profiles for each repeat
+- `trgt-instability test`: tests a query allele for excess instability relative to the repeat-specific model
+- `trgt-instability bin`: calculates shared discretization bins for read divergence rates
+
+## Inputs
+
+- A repeat catalog with the repeats to analyze
+- TRGT outputs for all control samples
+- TRGT outputs for all case samples
+
+## A complete workflow
+
+Run `divergence` on all case and control samples:
+
+```bash
+./trgt-instability divergence \
+  --repeats repeat-catalog.bed \
+  --spanning-reads sample.sorted.bam \
+  --variants sample.sorted.vcf.gz \
+  | gzip > sample.dists.txt.gz
+```
+
+`divergence` writes plain-text tab-delimited records to stdout, so the examples
+pipe that output through `gzip` for the downstream commands.
+
+Combine divergence records for all control samples. The `model` command expects
+`--data` records to be sorted by repeat identifier (`trid`):
+
+```bash
+zcat controls/*.dists.txt.gz | sort -k 1,1 -k 2,2n | gzip > controls-dists.txt.gz
+```
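A quick sanity check on the combined file can catch ordering problems before model fitting. The sketch below assumes that `trid` is the first tab-delimited column (suggested by the `sort -k 1,1` key above, but not confirmed here). It only verifies a weaker, locale-independent condition than full sorting: that all records for a given `trid` are contiguous. The helper name `trids_contiguous` is hypothetical.

```shell
# Hypothetical helper (not part of trgt-instability): reads tab-delimited
# records on stdin and exits non-zero if any trid (assumed to be column 1)
# reappears after a different trid has been seen.
trids_contiguous() {
  awk -F'\t' '
    $1 != prev {
      if ($1 in seen) { bad = 1; exit }   # trid came back after a gap
      seen[$1] = 1
      prev = $1
    }
    END { exit bad }'
}
```

Usage: `zcat controls-dists.txt.gz | trids_contiguous && echo "grouped by trid"`.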
+
+If you want to compare the instability of different repeats, calculate shared
+discretization bins from the full control cohort (skip this step if you are
+analyzing repeats individually):
 
 ```bash
-./trgt-instability --repeats repeats.bed --spanning-reads sample.trgt.sorted.bam 
+./trgt-instability bin --data controls-dists.txt.gz
 ```
 
-where
+Generate repeat-specific instability models for all repeats:
 
-- `repeats.bed` is the BED file with repeat definitions; it must be the same file
-as the one used to run TRGT,
-- `sample.trgt.sorted.bam` is the BAM file generated by TRGT; it must be sorted
-and indexed.
+```bash
+./trgt-instability model --data controls-dists.txt.gz | gzip > models.gz
+```
 
-The output will look like this:
+or, to use the shared bins reported by the `bin` command:
 
-```tsv
-trid allele motif motif_counts expansion_rate contraction_rate model_params
-FMR1 1      CGG   84,93,94,94,95,95,95,96,97,97,99,100,100,101,101,101,104,104,107,108,109,113,116,126  0.06 0.05  0.96:0.01:0.61:0.80
+```bash
+./trgt-instability model --data controls-dists.txt.gz --bins <bin edges reported by the bin command> | gzip > models.gz
 ```
 
-where:
+Now any case sample can be tested for excess instability:
+
+```bash
+./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz > sample-results.txt
+```
+
+Use `--n-sim` to control the maximum parametric-bootstrap depth per tested
+allele (default: `100000`), and `--threads` to parallelize testing, for
+example:
+
+```bash
+./trgt-instability test --models models.gz --data cases/sample.dists.txt.gz --threads 16 --n-sim 200000 > sample-results.txt
+```
+
+The results consist of repeat identifiers, allele sequences, and the associated
+p-values. A low p-value indicates that the allele instability profile is
+unlikely under the fitted repeat-specific baseline model, i.e. evidence for
+excess instability.
+
+If you also want a read-depth-aware quantity for comparing alleles by
+their instability, `test` can calculate posterior effect-size information
+derived from the fitted instability model:
+
+```bash
+./trgt-instability test \
+  --models models.gz \
+  --data cases/sample.dists.txt.gz \
+  --report-effect-size \
+  --n-posterior-draws 4000 \
+  > sample-results-with-d.txt
+```
+
+With `--report-effect-size`, each output line contains:
+
+- `d_median`: posterior median of the Wasserstein distance from the fitted repeat-specific baseline
+- `d_ci_lower`: lower bound of the 95% credible interval for `d`
+- `d_ci_upper`: upper bound of the 95% credible interval for `d`
+
+This `d` score is intended for ranking alleles by how far their instability
+profile is from the fitted repeat-specific baseline; the p-value
+remains the significance measure. See
+[`docs/effect-size-reporting.md`](docs/effect-size-reporting.md) for details.
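To give a feel for what a Wasserstein distance between two binned profiles measures, here is a minimal sketch of the 1-Wasserstein distance between two count vectors on ordered bins, computed as the sum of absolute differences between their cumulative distributions. The unit spacing between bins and the helper name `wasserstein1` are assumptions for illustration. The definition used by `test` (including the posterior sampling behind `d_median`) is the one in `docs/effect-size-reporting.md`, not this sketch.

```shell
# Hypothetical helper (not part of trgt-instability): 1-Wasserstein distance
# between two comma-separated count vectors of equal length.
# Assumption: bins are ordered and one unit apart.
wasserstein1() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); m = split(b, y, ",")
    if (n != m) { print "length mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) { sx += x[i]; sy += y[i] }   # totals for normalization
    for (i = 1; i < n; i++) {
      cx += x[i] / sx; cy += y[i] / sy                    # running CDFs
      diff = cx - cy
      d += (diff < 0 ? -diff : diff)                      # accumulate |CDF difference|
    }
    printf "%.4f\n", d
  }'
}
```

For instance, `wasserstein1 "10,0,0" "0,0,10"` moves all mass across two bins, and identical profiles give a distance of zero.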
+
+For large runs that require multiple-testing correction, adaptive stopping can
+avoid performing the full `--n-sim` bootstrap iterations for alleles that can no
+longer reach the target tail probability. You can derive this target probability
+automatically for Benjamini-Hochberg p-value correction using
+`stop_p = q * rank / m`, where `q` and `rank` are set with `--adaptive-fdr-q`
+and `--adaptive-fdr-rank` and `m` is the number of tested alleles:
+
+```bash
+./trgt-instability test \
+  --models models.gz \
+  --data cases/sample.dists.txt.gz \
+  --threads 16 \
+  --n-sim 24000000 \
+  --adaptive-fdr-q 0.05 \
+  --adaptive-fdr-rank 1 \
+  > sample-results.txt
+```
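The threshold formula can be evaluated by hand to see what a given combination of settings implies. The sketch below only does the arithmetic for `stop_p = q * rank / m`. The helper name `bh_stop_p` and the example allele count are hypothetical, and how `test` applies the threshold internally is not shown here.

```shell
# Hypothetical helper (not part of trgt-instability): evaluates
# stop_p = q * rank / m for the given q, rank, and number of tested alleles m.
bh_stop_p() {
  awk -v q="$1" -v rank="$2" -v m="$3" 'BEGIN { printf "%.3g\n", q * rank / m }'
}

# Example with q = 0.05, rank = 1, and a hypothetical m of 1000 tested alleles.
bh_stop_p 0.05 1 1000
```

As a general property of Monte Carlo p-values, resolving tail probabilities this small needs on the order of `1 / stop_p` simulations, so larger `m` calls for a larger `--n-sim`.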
+
+## Reference
+
+- [File formats](docs/file-formats.md)
+- [Effect-size reporting for `test`](docs/effect-size-reporting.md)
+
+## Need help?
+
+If you notice any missing features or bugs, or need assistance with analyzing the
+output of TRGT-instability, please don't hesitate to open a GitHub issue or
+reach out to the authors by [email](mailto:edolzhenko@pacificbiosciences.com).
+
+## Support information
+
+TRGT-instability is pre-release software intended for research use only and
+not for use in diagnostic procedures. While efforts have been made to ensure
+that TRGT-instability lives up to the quality that PacBio strives for, we make
+no warranty regarding this software.
+
+As TRGT-instability is not covered by any service level agreement or the like,
+please do not contact PacBio Field Applications Scientists or PacBio Customer
+Service for assistance with any TRGT-instability release. Please report all
+issues through GitHub instead. We make no warranty that any such issue will be
+addressed, to any extent or within any time frame.
 
-- `trid` is the tandem repeat identifier from `repeats.bed`
-- `allele` is the allele index
-- `motif` is the motif profiled
-- `motif_count` is the number of motifs identified in each read
-- `expansion_rate` is the estimated expansion rate (see below)
-- `contraction_rate` is the estimated contraction rate (see below)
-- `model_params` are the estimated parameters of the instability model
+### DISCLAIMER
 
-The expansion and contraction rate parameters are motivated by a simple
-model where mosaicism can occur due to slippage during DNA replication.
-An expansion rate of 0.01 means that an allele consisting of 100 motifs
-is expected to have a single motif expansion event during replication.
+THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE
+PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY
+KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES
+OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A
+PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF
+THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR
+APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY
+OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO
+NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.