WarpSTR is an alignment-free algorithm for analysing STR alleles using raw reads from nanopore sequencing. The method uses Guppy basecalling annotation output to extract the region of interest, and dynamic time warping based state automata for calling raw signal data. The tool can be configured to account for complex motifs containing interruptions and other variations such as substitutions or ambiguous bases.

See our preprint at: <https://www.biorxiv.org/content/10.1101/2022.11.05.515275v1>
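
To give a feel for the dynamic time warping step mentioned above, here is a minimal, generic DTW cost computation in Python. It is only an illustrative sketch and not WarpSTR's implementation: the function name `dtw_distance`, the absolute-difference cost and the toy signals are assumptions, and WarpSTR aligns the raw signal to a state automaton rather than to a fixed sequence.

```python
import math


def dtw_distance(signal, reference):
    """Return the classic DTW alignment cost between two numeric sequences."""
    n, m = len(signal), len(reference)
    # cost[i][j] = best cost of aligning signal[:i] with reference[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(signal[i - 1] - reference[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # another signal sample maps to the same reference value
                cost[i][j - 1],      # another reference value maps to the same signal sample
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Toy example: the same shape observed at a slower speed.
expected = [1.0, 2.0, 3.0]
observed = [1.0, 1.0, 2.0, 2.0, 2.0, 3.0]
print(dtw_distance(observed, expected))
```

Because warping lets several signal samples map to one expected value, the stretched toy signal aligns to the expected one at no cost, which is the property that makes DTW suitable for signals of varying speed.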

## Installation

WarpSTR can be installed using a conda environment, frozen in `conda_req.yaml`. Create the environment as follows:

```bash
conda env create -f conda_req.yaml
```

After installation, activate the conda environment:

```bash
conda activate warpstr
```

WarpSTR was tested on Ubuntu 20.04.

## Running WarpSTR

Before running WarpSTR, you must prepare a config file and add loci information to it.

### Config file

The input configuration file must be populated with elements such as `inputs`, `output` and `reference_path`. An example is provided in `example/config.yaml`.

There are also many optional advanced parameters, all of which are listed in `example/advanced_params.yaml`. To set any of them, add it to your main config with the desired value; otherwise, its default value is used.
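
The fallback to default values can be pictured as a simple dictionary merge. The sketch below is an assumption about the behaviour, not WarpSTR's actual code: the function `resolve_config`, the parameter names `some_threshold` and `make_plots`, and the output and reference paths are hypothetical placeholders and are not taken from `example/advanced_params.yaml`.

```python
def resolve_config(user_config, defaults):
    """Return the defaults overridden by whatever the user set explicitly."""
    resolved = dict(defaults)      # start from the default values
    resolved.update(user_config)   # user-provided values take precedence
    return resolved


# Hypothetical advanced parameters and their defaults (placeholders only).
defaults = {"some_threshold": 0.5, "make_plots": True}

# The main config sets the required elements and overrides one advanced parameter.
user_config = {
    "inputs": [{"path": "/data/subjectXY", "runs": "run_1,run_2"}],
    "output": "/data/results",               # hypothetical path
    "reference_path": "/data/reference.fa",  # hypothetical path
    "some_threshold": 0.8,
}

config = resolve_config(user_config, defaults)
print(config["some_threshold"], config["make_plots"])
```

Here `some_threshold` takes the user's value, while `make_plots`, which the main config does not mention, keeps its default.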

### Loci information

The loci to be analysed by WarpSTR must be described in the config file; see `example/config.yaml` for an example. Each locus must be defined by its name and genomic coordinates. You can then describe the repeat in one of two ways. The first is to specify the repeating motifs occurring in the locus in the `motif` element, from which the input sequence for the WarpSTR state automata is created automatically (recommended for new users). The second is to configure the input sequence yourself in the `sequence` element of the locus; this is not a trivial task, so it is recommended only for more advanced users. It is also possible to start from the automatic configuration and then modify it by hand.
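
The rules above (a name, coordinates, and either `motif` or `sequence`) can be checked before running. The following is a minimal sketch under stated assumptions: only `motif` and `sequence` are element names from this README, whereas the keys `name` and `coord`, the coordinate format and the example values are hypothetical, so consult `example/config.yaml` for the real layout.

```python
def check_locus(locus):
    """Return a list of problems found in one locus entry (empty if none)."""
    problems = []
    for key in ("name", "coord"):  # hypothetical keys for name and coordinates
        if not locus.get(key):
            problems.append(f"missing '{key}'")
    # Either `motif` (automatic) or `sequence` (manual) must describe the repeat.
    if not locus.get("motif") and not locus.get("sequence"):
        problems.append("need 'motif' or 'sequence'")
    return problems


# Hypothetical locus: the coordinates and motif are illustrative values only.
locus = {"name": "example_locus", "coord": "chr1:1000-1100", "motif": "CAG"}
print(check_locus(locus))
print(check_locus({"name": "incomplete"}))
```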

### Running

After creating the configuration file, running WarpSTR is simple, as it requires only the path to the config file:

```bash
python WarpSTR.py example/config.yaml
```

### Input data

The required input data are .fast5 files and .bam mapping files. In the configuration file, provide the upper-level path in the `inputs` element. WarpSTR assumes that your data may come from multiple sequencing runs of the same sample and are therefore to be analyzed together. For example, you may have a main directory for the sample `subjectXY` with subdirectories for the sequencing runs, e.g. `run_1` and `run_2`, each run directory having its own .bam mapping file and .fast5 files. You can also give another input path if some data are stored elsewhere (e.g. in another mounted directory, as ONT data are very large), for example the data from another run, `run_3`.

For the above example, `inputs` in the config could be defined as follows:

```yaml
inputs:
  - path: /data/subjectXY
    runs: run_1,run_2
```
@@ -45,10 +59,12 @@ inputs:
Each directory given by `path` and `runs`, i.e. `/data/subjectXY/run_1` and so on, is traversed by WarpSTR to find .bam and .fast5 files.
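
The traversal described above can be sketched as follows. This is an illustration only, not WarpSTR's code: the function `find_input_files` is a hypothetical helper, and it assumes that each `inputs` entry has a `path` and a comma-separated `runs` string as in the example.

```python
import os


def find_input_files(inputs):
    """Collect .bam and .fast5 paths under every path/run pair in `inputs`."""
    found = {"bam": [], "fast5": []}
    for entry in inputs:
        for run in entry["runs"].split(","):
            run_dir = os.path.join(entry["path"], run.strip())
            # Walk the whole run directory, including nested subdirectories.
            for root, _dirs, files in os.walk(run_dir):
                for name in sorted(files):
                    if name.endswith(".bam"):
                        found["bam"].append(os.path.join(root, name))
                    elif name.endswith(".fast5"):
                        found["fast5"].append(os.path.join(root, name))
    return found


inputs = [{"path": "/data/subjectXY", "runs": "run_1,run_2"}]
print(find_input_files(inputs))
```

Run directories that do not exist are silently skipped in this sketch, because `os.walk` yields nothing for a missing path.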

## Output

The upper-level output path is given by the `output` element of the .yaml configuration file. Outputs are separated per locus into subdirectories of this path, each named after its locus.

The output structure for one locus is as follows:

```bash
alignments/         # contains alignments of template flanks with reads
expected_signals/   # contains template flanks as sequences and expected signals
fast5/              # signals extracted as encompassing the locus, stored as single .fast5 files
```
@@ -60,6 +76,7 @@ overview.csv # .csv file with read information and output
Some output files are optional and can be controlled by the .yaml config file.
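
Since some outputs are optional, it can be handy to check which ones a run actually produced for a locus. The sketch below is only an illustration: `present_outputs` is a hypothetical helper, the output root and locus name are made up, and `KNOWN_OUTPUTS` lists only the per-locus entries named in this README, not the complete structure.

```python
import os

# Per-locus entries named in this README; the real listing contains more.
KNOWN_OUTPUTS = ["alignments", "expected_signals", "fast5", "predictions", "summaries"]


def present_outputs(output_root, locus_name):
    """Map each known per-locus output name to whether it exists on disk."""
    locus_dir = os.path.join(output_root, locus_name)
    return {name: os.path.exists(os.path.join(locus_dir, name)) for name in KNOWN_OUTPUTS}


# Hypothetical output root and locus name.
print(present_outputs("/data/results", "example_locus"))
```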

### Predictions

The `predictions` directory of each locus contains a large variety of output files in further subdirectories.

The **basecalls** subdirectory contains output files related to basecalling: `all.fasta` holds the basecalled sequences of all reads encompassing the locus as given by the SAM/BAM, while `basecalls_all.fasta` holds only the reads in which flanks were found. The latter is further split per strand into `basecalls_reverse.fasta` and `basecalls_template.fasta`. If MUSCLE is run for multiple sequence alignment (MSA; controlled by the advanced parameters in the config), there is also an `msa_all.fasta` file with the MSA. If summarizing is run, there are also `group1.fasta` and `group2.fasta` files with the basecalled sequences split into groups as summarized by the last step of WarpSTR. In that case, MSA output is also created for the basecalled sequences of each group separately.
@@ -71,17 +88,17 @@ In **sequences** subdirectory there is analogous information as in **basecalls**
The **DTW_alignments** subdirectory contains visualized alignments of the STR signal with the automaton (in both stages). Visualizations are truncated to the first 2000 values.

### Summaries

The `summaries` directory of each locus contains a number of optional visualizations:

- alleles.svg - Summarized predictions of repeat lengths in 1 or 2 groups and for WarpSTR and basecall.
- collapsed_predictions.svg - Complex repeat structure counts, only for WarpSTR.
- collapsed_predictions_strand.svg - As above, but further split by strand.
- complex_genotypes.svg - Summarized complex repeat structure counts in 1 or 2 groups.
- predictions_cost.svg - Scatterplot of state-wise cost and allele lengths.
- predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
- predictions_strand.svg - Violinplots of repeat lengths as split by strand.

## Additional information

Newer .fast5 files are usually VBZ compressed, so the VBZ plugin for HDF5 must be installed for WarpSTR to handle such files. See <https://github.com/nanoporetech/vbz_compression>.
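
A quick way to see whether a VBZ plugin is visible to HDF5 is to look through the directories listed in the `HDF5_PLUGIN_PATH` environment variable, which HDF5 uses to locate filter plugins. This is a rough sketch under assumptions: `find_vbz_plugin` is a hypothetical helper, the `*vbz*` file name pattern is a guess at how the plugin library is named, and a plugin installed in HDF5's default plugin directory would not be found this way.

```python
import glob
import os


def find_vbz_plugin(plugin_dirs):
    """Return paths of files that look like the VBZ plugin in the given directories."""
    matches = []
    for directory in plugin_dirs:
        # Assumed naming: any file with "vbz" in its name counts as a candidate.
        matches.extend(sorted(glob.glob(os.path.join(directory, "*vbz*"))))
    return matches


# HDF5 reads extra plugin directories from the HDF5_PLUGIN_PATH variable.
plugin_dirs = [d for d in os.environ.get("HDF5_PLUGIN_PATH", "").split(os.pathsep) if d]
print(find_vbz_plugin(plugin_dirs) or "no VBZ plugin found on HDF5_PLUGIN_PATH")
```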