Commit 8ce5ee8

updated default values
1 parent bb1b0a6 commit 8ce5ee8

2 files changed

+35
-18
lines changed

README.md

+31-14
@@ -1,40 +1,54 @@
11
# WarpSTR
2-
WarpSTR is an alignment-free algorithm for analysing STR alleles using nanopore sequencing raw reads. The method uses guppy basecalling annotation output for the extraction of region of interest, and dynamic time warping based state automata for calling raw signal data. The tool can be configured to account for complex motifs containing interruptions and other variations such as substitutions or ambiguous bases.
32

4-
### Installation
3+
WarpSTR is an alignment-free algorithm for analysing STR alleles using nanopore sequencing raw reads. The method uses guppy basecalling annotation output for the extraction of region of interest, and dynamic time warping based state automata for calling raw signal data. The tool can be configured to account for complex motifs containing interruptions and other variations such as substitutions or ambiguous bases.
4+
5+
See our preprint at: <https://www.biorxiv.org/content/10.1101/2022.11.05.515275v1>
6+
7+
## Installation
8+
59
WarpSTR can be installed with conda; the environment is frozen in `conda_req.yaml`. The conda environment can be created as follows:
10+
611
```bash
712
conda env create -f conda_req.yaml
813
```
14+
915
After installation, activate the conda environment:
16+
1017
```bash
1118
conda activate warpstr
1219
```
1320

1421
WarpSTR was tested on Ubuntu 20.04.
1522

1623
## Running WarpSTR
24+
1725
Before running WarpSTR, you must prepare a config file and add the loci information.
1826

1927
### Config file
28+
2029
The input configuration file must be populated with elements such as `inputs`, `output` and `reference_path`. An example is provided in `example/config.yaml`.
2130

2231
There are also many optional advanced parameters. A list of all parameters is given in `example/advanced_params.yaml`. To set a value for any of them, add the parameter to your main config with the desired value. Otherwise, the default value is used.
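
The override behaviour described above can be pictured as a dictionary merge. The following is only a minimal sketch, not WarpSTR's actual loading code: the function `merge_config` and the `hypothetical_section` keys are invented for illustration, and only `flank_length` with its default of 110 appears in `example/config.yaml`.

```python
from copy import deepcopy


def merge_config(defaults: dict, user: dict) -> dict:
    """Return a copy of `defaults` where every key set in `user` takes precedence."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical defaults; only flank_length (default 110) is taken from the example config.
DEFAULTS = {
    "flank_length": 110,
    "hypothetical_section": {"option_a": 1, "option_b": 2},
}

# A user config that sets one advanced parameter and leaves the rest at their defaults.
user_config = {"output": "/data/out", "hypothetical_section": {"option_b": 5}}

config = merge_config(DEFAULTS, user_config)
```

Parameters that the user config does not mention keep their default values, which matches the behaviour the README describes.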
2332

2433
### Loci information
34+
2535
Information about the loci to be analyzed by WarpSTR must be given in the config file. An example is provided in `example/config.yaml`. Each locus must be defined by a name and genomic coordinates. You can then specify the repeating motifs occurring in the locus in the `motif` element, from which the input sequence for the WarpSTR state automata is created automatically (recommended for new users). Alternatively, you can configure the input sequence yourself in the `sequence` element of the locus; this is not a trivial task, so it is recommended for more advanced users. A third possibility is to use the automatic configuration and then modify it by hand.
2636

2737
### Running
38+
2839
After creating the configuration file, running WarpSTR is simple, as it requires only the path to the config file:
29-
```
40+
41+
```bash
3042
python WarpSTR.py example/config.yaml
3143
```
3244

3345
### Input data
46+
3447
The required input data are .fast5 files and .bam mapping files. In the configuration file, provide the top-level path to the data in the `inputs` element. WarpSTR presumes that your data may come from multiple sequencing runs of the same sample, which are therefore analyzed together. For example, you may have a main directory for sample `subjectXY` with subdirectories for the sequencing runs, e.g. `run_1` and `run_2`, each run directory having its own .bam mapping file and .fast5 files. You can also specify an additional input path if some data are stored elsewhere (e.g. in another mounted directory, as ONT data are very large), for example the data from another run, `run_3`.
3548

3649
For the above example, `inputs` in the config could be defined as follows:
37-
```
50+
51+
```yaml
3852
inputs:
3953
- path: /data/subjectXY
4054
runs: run_1,run_2
@@ -45,10 +59,12 @@ inputs:
4559
Each directory as given by `path` and `runs`, i.e. `/data/subjectXY/run_1` and so on, is traversed by WarpSTR to find .bam files and .fast5 files.
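
As a rough illustration of this traversal (not WarpSTR's own code), the sketch below expands each `path`/`runs` pair into run directories and collects the .fast5 and .bam files beneath them. The function names are hypothetical, and searching recursively is an assumption; the README only says the directories are traversed.

```python
from pathlib import Path


def expand_run_dirs(inputs):
    """Turn each {'path': ..., 'runs': 'a,b'} entry into one directory per run."""
    run_dirs = []
    for entry in inputs:
        base = Path(entry["path"])
        for run in entry["runs"].split(","):
            run_dirs.append(base / run.strip())
    return run_dirs


def find_input_files(run_dir):
    """Collect .fast5 and .bam files below run_dir (recursive search is an assumption)."""
    run_dir = Path(run_dir)
    fast5_files = sorted(run_dir.rglob("*.fast5"))
    bam_files = sorted(run_dir.rglob("*.bam"))
    return fast5_files, bam_files


# The `inputs` element from the example above, as it would look after YAML parsing.
inputs = [{"path": "/data/subjectXY", "runs": "run_1,run_2"}]
run_dirs = expand_run_dirs(inputs)
```

Here `run_dirs` holds `/data/subjectXY/run_1` and `/data/subjectXY/run_2`, and `find_input_files` can then be called on each of them.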
4660

4761
## Output
62+
4863
The upper path for output is given in the .yaml configuration file as the `output` element. Outputs are separated by locus into subdirectories of this upper path, each named after its locus.
4964

5065
The output structure for one locus is as follows:
51-
```
66+
67+
```bash
5268
alignments/ # contains alignments of template flanks with reads
5369
expected_signals/ # contains template flanks as sequences and expected signals
5470
fast5/ # signals extracted as encompassing the locus, stored as single .fast5 files
@@ -60,6 +76,7 @@ overview.csv # .csv file with read information and output
6076
Some output files are optional and can be controlled by the .yaml config file.
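
For orientation, the per-locus layout can be expressed as paths below the `output` directory. This is only a sketch under stated assumptions: the helper `locus_output_paths` is hypothetical, the output root `/data/out` is made up, and the list of entries covers only those named in this README (the diff elides part of the listing), so it is not exhaustive.

```python
from pathlib import Path

# Entries named in the README; part of the listing is elided in the diff.
KNOWN_ENTRIES = (
    "alignments",
    "expected_signals",
    "fast5",
    "predictions",
    "summaries",
    "overview.csv",
)


def locus_output_paths(output_root, locus_name):
    """Map each known entry to its location under <output>/<locus name>/."""
    locus_dir = Path(output_root) / locus_name
    return {name: locus_dir / name for name in KNOWN_ENTRIES}


# Example with the HD locus from example/config.yaml and a made-up output root.
paths = locus_output_paths("/data/out", "HD")
```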
6177

6278
### Predictions
79+
6380
In the `predictions` directory of each locus there is a large variety of output files in further subdirectories.
6481

6582
The **basecalls** subdirectory contains output files related to basecalling: `all.fasta` contains the basecalled sequences of all reads encompassing the locus as given by SAM/BAM, and `basecalls_all.fasta` contains only the reads in which flanks were found. The latter file is further split per strand into `basecalls_reverse.fasta` and `basecalls_template.fasta`. If muscle is run for multiple sequence alignment (MSA; controlled by the advanced_params config), there is also an `msa_all.fasta` file with the MSA. If summarizing is run, there are `group1.fasta` and `group2.fasta` files with the basecalled sequences split into groups as summarized by the last step of WarpSTR. In that case, MSA output is also created separately for the basecalled sequences of each group.
@@ -71,17 +88,17 @@ In **sequences** subdirectory there is analogous information as in **basecalls**
7188
The **DTW_alignments** subdirectory contains visualized alignments of the STR signal with the automaton (in both stages). Visualizations are truncated to the first 2000 values.
7289

7390
### Summaries
91+
7492
In the `summaries` directory of each locus there is a myriad of optional visualizations:
7593

76-
```
77-
alleles.svg - Summarized predictions of repeat lengths in 1 or 2 groups and for WarpSTR and basecall.
78-
collapsed_predictions.svg - Complex repeat structure counts, only for WarpSTR.
79-
collapsed_predictions_strand.svg - As above, but further split by strand.
80-
complex_genotypes.svg - Summarized complex repeat structure counts in 1 or 2 groups.
81-
predictions_cost.svg - Scatterplot of state-wise cost and allele lengths.
82-
predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
83-
predictions_strand.svg - Violinplots of repeat lengths as split by strand.
84-
```
94+
- alleles.svg - Summarized predictions of repeat lengths in 1 or 2 groups and for WarpSTR and basecall.
95+
- collapsed_predictions.svg - Complex repeat structure counts, only for WarpSTR.
96+
- collapsed_predictions_strand.svg - As above, but further split by strand.
97+
- complex_genotypes.svg - Summarized complex repeat structure counts in 1 or 2 groups.
98+
- predictions_cost.svg - Scatterplot of state-wise cost and allele lengths.
99+
- predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
100+
- predictions_strand.svg - Violinplots of repeat lengths as split by strand.
85101

86102
## Additional information
103+
87104
Newer .fast5 files are usually VBZ compressed, so the VBZ plugin for HDF5 must be installed for WarpSTR to handle such files. See `https://github.com/nanoporetech/vbz_compression`.

example/config.yaml

+4-4
@@ -10,10 +10,10 @@ inputs:
1010
runs: 84_15,370_2014,394_2016
1111

1212
# if you are re-running the analysis, here you can set which steps to skip by setting them to False
13-
single_read_extraction: False # Extracts reads mapped to the locus and stores them in single .fast5 format
14-
guppy_annotation: False # Annotates .fast5 files with mapping between basecalled sequence and the signal
13+
single_read_extraction: True # Extracts reads mapped to the locus and stores them in single .fast5 format
14+
guppy_annotation: True # Annotates .fast5 files with mapping between basecalled sequence and the signal
1515
exp_signal_generation: True # Generates expected signals for flanks and repeats
16-
tr_region_extraction: False # Finds tandem repeat region in read using alignment of basecalled sequence and reference repeat sequence
16+
tr_region_extraction: True # Finds tandem repeat region in read using alignment of basecalled sequence and reference repeat sequence
1717
tr_region_calling: True # Uses state automata with DTW alignment to find the number of repeats for each signal
1818
genotyping: True # Predicts the final allele lengths from the predicted repeat numbers of each read
1919

@@ -31,7 +31,7 @@ guppy_config:
3131
loci:
3232
- name: HD # Required
3333
coord: chr4:3,074,878-3,074,967 # Required
34-
noting: AGC[19]AAC[1]AGC[1]CGC[1]CAC[1]CGC[7]
34+
noting: AGC[19]AAC[1]AGC[1]CGC[1]CAC[1]CGC[7] # Concise representation of reference locus. This is only descriptive
3535
motif: AGC,CGC # Set this or 'sequence'
3636
# sequence: (AGC)AACAGCCGCCAC(CGC) # Or set this - recommended for more advanced users.
3737
# flank_length: 110 # Optional, default: 110
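
The `noting` value can be read as a run-length description of the reference locus. The sketch below is not part of WarpSTR: the helper names are hypothetical, and the assumption that `coord` is 1-based and inclusive is ours. It expands the notation and compares its length with the span given in `coord`.

```python
import re


def expand_notation(notation):
    """Expand 'AGC[19]AAC[1]...' into the plain nucleotide sequence it describes."""
    units = re.findall(r"([ACGTN]+)\[(\d+)\]", notation)
    return "".join(motif * int(count) for motif, count in units)


def coord_length(coord):
    """Length of 'chr:start-end', assuming 1-based inclusive coordinates."""
    _chrom, span = coord.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return end - start + 1


notation = "AGC[19]AAC[1]AGC[1]CGC[1]CAC[1]CGC[7]"
sequence = expand_notation(notation)
print(len(sequence), coord_length("chr4:3,074,878-3,074,967"))  # prints "90 90"
```

For the HD locus the 30 trinucleotide units expand to 90 bases, which equals the coordinate span under the inclusive-coordinate assumption.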
