You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardexpand all lines: README.md
+199-26
Original file line number
Diff line number
Diff line change
@@ -6,9 +6,15 @@ See our preprint at: <https://www.biorxiv.org/content/10.1101/2022.11.05.515275v
6
6
7
7
See below for some quick steps how to install and run WarpSTR, or refer to more detailed [documentation](https://fmfi-compbio.github.io/warpstr/).
8
8
9
-
## Installation
9
+
## 1 Installation
10
10
11
-
WarpSTR can be easily installed using conda environment, frozen in `conda_req.yaml`. The conda environment can be created as follows:
11
+
WarpSTR can be installed using conda or pipenv. To install conda, please follow [the official guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). To install pipenv, simple `pip install pipenv` should suffice.
12
+
13
+
WarpSTR was tested in Ubuntu 20.04 and Ubuntu 22.04. Used Python version is 3.7.
14
+
15
+
### 1.a) Installing using conda
16
+
17
+
Clone this repository. Then, create the conda environment:
12
18
13
19
```bash
14
20
conda env create -f conda_req.yaml
@@ -20,49 +26,108 @@ After installation, it is required to activate conda environment:
20
26
conda activate warpstr
21
27
```
22
28
23
-
WarpSTR was tested in Ubuntu 20.04 OS.
29
+
### 1.b) Installing using pipenv
24
30
25
-
## Running WarpSTR
31
+
Clone this repository. The pipenv environment can be installed from Pipfile.lock as follows:
26
32
27
-
Required step to do before running WarpSTR is to prepare config file and add loci information.
33
+
```bash
34
+
pipenv sync
35
+
```
36
+
37
+
After installation, it is required to activate the environment:
38
+
39
+
```bash
40
+
pipenv shell
41
+
```
28
42
29
-
### Config file
43
+
##2 Running the test case
30
44
31
-
The input configuration file must be populated with elements such as `inputs`, `output` and `reference_path`. An example is provided in `example/config.yaml`.
45
+
In `test/test_input` there is a small test dataset with 10 reads for one locus. There is also the template for config file required by WarpSTR, `test/config_template.yaml`. You can check whether WarpSTR works correctly simply by running:
32
46
33
-
There are also many advanced parameters that are optional to set. List of all parameters are found in `example/advanced_params.yaml`. To set values for those parameters, just add those parameters to your main config and set them to the desired value. In other case, default values for those parameters are taken.
47
+
```bash
48
+
bash run_test_case.sh
49
+
```
34
50
35
-
### Loci information
51
+
When running this wrapper script, the script will prompt you to provide the required paths and run the WarpSTR for you using the test data. Output files will be then stored in `test/test_output/` as given in the config file. The script should take approximately 3-5 minutes and at the end, you should see something like:
36
52
37
-
Information about loci, that are subjects for analysis by WarpSTR, must be described in the config file. An example is described `example/config.yaml`. Each loci must be defined by name and genomic coordinates. Then, you can either specify repeating motifs occuring in the locus in `motif` element, from which the input sequence for WarpSTR state automata is automatically created(this is recommended for starting users). The second way is to configure the input sequence by yourself in `sequence` element of the locus, however this is not a trivial task, so it is recommended for more advanced users. The other possibility is to use automatic configuration and then modify it by hand.
53
+
```text
54
+
Results stored in overview file XY
55
+
Allele lengths as given by WarpSTR: (44, 40)
56
+
```
38
57
39
-
### Running
58
+
##3 Running WarpSTR
40
59
41
-
After creating configuration file, running WarpSTR is simple as it requires only the path to the config file:
60
+
Running WarpSTR is simple as it requires only the path to the configuration file:
42
61
43
62
```bash
44
63
python WarpSTR.py example/config.yaml
45
64
```
46
65
47
-
### Input data
66
+
WarpSTR consists of multiple complex steps doing the following:
67
+
68
+
1. extracting reads encompassing the locus coordinates - requires BAM mapping files and multi .fast5.
69
+
2. extracting STR regions from reads - requires Guppy basecaller.
70
+
3. determining the alelle length for reads.
71
+
4. genotyping alelle lengths and determining zygosity.
72
+
73
+
If you want to run a whole WarpSTR pipeline then continue reading, else skip to the [WarpSTR steps](#5-warpstr-steps).
74
+
75
+
### 3.1 Config file
76
+
77
+
In the input configuration file (see `example/config.yaml` for an example) you must set the following elements:
78
+
79
+
-`reference_path` - path to the fasta file - the reference genome, the same which was used for mapping basecalled reads.
80
+
-`guppy_config` - path to the executable Guppy basecaller and info about the sequencing (flowcell and kit).
81
+
-`output` - path to the directory, when WarpSTR will produce output results.
82
+
-`loci` - loci information, see [below](#32-loci-information).
83
+
-`inputs` - input data, see [below](#33-input-data).
84
+
85
+
There are also many advanced parameters that are optional to set. List of all parameters are found in `example/advanced_params.yaml`. To set values for those parameters, just copy the elements to your main config and change valeus to your desired values. In other case, default values for those parameters are taken.
86
+
87
+
### 3.2 Loci information
88
+
89
+
Information about loci, that are subjects for analysis by WarpSTR, must be described in the config file. An example is described `example/config.yaml`. Each locus must be defined by name and genomic coordinates (these must match with the reference), and either motif or sequence:
90
+
91
+
```yaml
92
+
name: HD
93
+
coord: chr4:3,074,878-3,074,967
94
+
motif: AGC,CGC
95
+
# sequence: (AGC)AACAGCCGCCAC(CGC)
96
+
```
97
+
98
+
The `motif` element is recommended for beginners, as the input sequence for WarpSTR state automata is automatically created. In this element, possible repeat units should be provided.
99
+
100
+
The second way is to configure the input sequence for automata by yourself in the `sequence` element of the locus. This is not a trivial task, so it is recommended for more advanced users. The other possibility is to use automatic configuration and then modify it by hand. See the preprint for the information about the state automata.
101
+
102
+
### 3.3 Input data
103
+
104
+
Required input data are .fast5 files and .bam mapping files. In configuration file, the user is required to provide the path to the upper level path, in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs, but are of the same sample, and thus should be analyzed together, see [documentation](https://fmfi-compbio.github.io/warpstr/) in that case.
48
105
49
-
Required input data are .fast5 files and .bam mapping files. In configuration file, the user is required to provide the path to the upper level path, in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs, but are of the same sample, and thus are to be analyzed together. For example, you have main directory for sample `subjectXY` with many subdirectories denoting sequencing runs i.e. `run_1`, `run_2`, with each run directory having its own .bam mapping file and .fast5 files. It is also possible to denote another path to input, in case of having data stored somewhere else (i.e. on the other mounted directory, as ONT data are very large), for example with the data from another run, i.e. `run_3`.
106
+
The simple case is like in the test case:
50
107
51
-
For the above example, `inputs` in the config could be defined as follows:
108
+
```bash
109
+
test_input/
110
+
└── test_run1
111
+
├── fast5s
112
+
│ └── batch_0.fast5
113
+
└── mapping
114
+
├── mapping.bam
115
+
└── mapping.bam.bai
116
+
```
117
+
118
+
The names `test_run1` and `test_input` are then used in the configuration file for the `inputs` element:
52
119
53
120
```yaml
54
121
inputs:
55
-
- path: /data/subjectXY
56
-
runs: run_1,run_2
57
-
- path: /mnt/subjectXY
58
-
runs: run_3
122
+
- path: test/test_input
123
+
runs: test_run1
59
124
```
60
125
61
-
Each directory as given by `path` and `runs`, i.e. `/data/subjectXY/run_1` and so on, is traversed by WarpSTR to find .bam files and .fast5 files.
126
+
Names of subdirectories such as `fast5s` and `mapping` are not important, but .fast5 files and .bam files must have the correct extension.
62
127
63
-
## Output
128
+
## 4 Output
64
129
65
-
The upper path for output is given in the .yaml configuration file as `output` element. Outputs are separated for each locus as subdirectories of this upper path, where names of subdirectories are the same as the locus name.
130
+
The upper path for output is given in the .yaml configuration file as `output` element. Each locus has the separate output - a new subdirectory of this upper path with locus name is created, where the output is stored.
66
131
67
132
The output structure for one locus is as follows:
68
133
@@ -77,7 +142,7 @@ overview.csv # .csv file with read information and output
77
142
78
143
Some output files are optional and can be controlled by the .yaml config file.
79
144
80
-
### Predictions
145
+
### 4.1 Predictions
81
146
82
147
In the `predictions` directory of each locus there would be a large variety of outputted files in other subdirectories.
83
148
@@ -89,7 +154,7 @@ In **sequences** subdirectory there is analogous information as in **basecalls**
89
154
90
155
In **DTW_alignments** subdirectory there are visualized alignments of STR signal with automaton (in both stages). Visualizations are truncated to first 2000 values.
91
156
92
-
### Summaries
157
+
### 4.2 Summaries
93
158
94
159
In the `summaries` directory of each locus there is a myriad of optional visualizations:
95
160
@@ -101,6 +166,114 @@ In the `summaries` directory of each locus there is a myriad of optional visuali
101
166
- predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
102
167
- predictions_strand.svg - Violinplots of repeat lengths as split by strand.
103
168
104
-
## Additional information
169
+
## 5 WarpSTR steps
170
+
171
+
WarpSTR pipeline steps are toggleable in the config file, i.e. you can skip them, by turning them to False:
172
+
173
+
```yaml
174
+
single_read_extraction: True # Extracts reads mapped to the locus and stores them in single .fast5 format
175
+
guppy_annotation: True # Annotates .fast5 files with mapping between basecalled sequence and the signal
176
+
exp_signal_generation: True # Generates expected signals for flanks and repeats
177
+
tr_region_extraction: True # Finds tandem repeat region in read using alignment of basecalled sequence and reference repeat sequence
178
+
tr_region_calling: True # Uses state automata with DTW alignment to find the number of repeats for each signal
179
+
genotyping: True # Predicts the final allele lengths from the predicted repeat numbers of each read
180
+
```
181
+
182
+
### 5.1 Extraction of locus reads
183
+
184
+
Here, .BAM and multi-fast5 files are required. The following config elements must be set:
185
+
186
+
- `inputs`element - defining directories containing .BAM and .fast5 files
187
+
- `loci`element - defining genomic coordinates
188
+
- `single_read_extraction`element set to `True`
189
+
190
+
In the output directory (given by `output` element) the state of the locus output subdirectory after running this step would be:
191
+
192
+
```tree
193
+
{locus_name}
194
+
├── fast5
195
+
│ └── {run_id}
196
+
│ ├── {read_name1}.fast5
197
+
│ ├── {read_name2}.fast5
198
+
│ └── ...
199
+
└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read
200
+
```
201
+
202
+
#### Skipping this step
203
+
204
+
If you have already single .fast5s ready for the locus and want to skip this step, you should simulate the outcome of the first step:
205
+
206
+
1. Create the subdirectory in the output directory with the same as the name of the locus in the config.
207
+
2. In the locus subdir create the `fast5/run_id` directory, where you copy single .fast5 reads (See above the output example)
208
+
3. In the locus subdir create `overview.csv` file where for each read signal there should be a row with three columns: `name`,`run_id`,`reverse`, Where `name` is the name as the read_name, and `reverse` having either True or False value, denoting the strand.
209
+
210
+
For example, the overview.csv for the above case would be:
211
+
212
+
```csv
213
+
read_name,run_id,reverse
214
+
read_name1,run_id,False
215
+
read_name2,run_id,True
216
+
...
217
+
```
218
+
219
+
Then, do not forget to turn off the step in the config file:
220
+
221
+
```yaml
222
+
single_read_extraction: False # Extracts reads mapped to the locus and stores them in single .fast5 format
In this step, reads are basecalled again so they would be annotated with the mapping between basecalls and signal values. This mapping is then used to localize the STR region in signals.
230
+
231
+
The state of the locus output directory after running this step would be:
232
+
233
+
```tree
234
+
{locus_name}
235
+
├── fast5
236
+
│ └── {run_id}
237
+
│ ├── annot
238
+
│ │ ├── {read_name1}.fast5
239
+
│ │ ├── {read_name2}.fast5
240
+
│ │ └── ...
241
+
└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read.
242
+
In addition, there would be 'l_start_raw', 'r_end_raw' values, corresponding to signal positions, where starts the left flank and ends the right flank.
243
+
```
244
+
245
+
#### Skipping this step
246
+
247
+
If you have already .fast5 signals with localized STR regions, you again must simulate the output of this step. The other option is to use our script `prepare_caller_only.py`. It requires two things:
248
+
249
+
`--config CONFIG` - the same as you would use further. The important thing is to set the `output` and `loci`
250
+
`--file CSV` - .csv file with one row for .fast5 signal, and these required columns:
251
+
252
+
-`fast5_path` - path to the .fast5 read path.
253
+
-`locus` - name of the locus associated with the read.
254
+
-`read_name` - name of the read.
255
+
-`reverse` - strand information, True or False.
256
+
-`l_start_raw` - signal position where the left flank starts.
257
+
-`r_end_raw` - signal position where the right flank ends.
258
+
259
+
The example case is in the repository in `test/test_caller_only`. To run, provide the reference_path (!!!) there in the config, and run using:
- The `l_start_raw` and `r_end_raw` can be set approximately, 100-200 positions off should pose no problem for the correct result.
274
+
- The `l_start_raw` and `r_end_raw` must correspond to the flank positions, i.e. the flank length must be set to the same value in the config for `loci`.
275
+
- We currently do not support direct input of signal values of STR for STR calling.
276
+
277
+
## 6 Additional information
105
278
106
-
Newer .fast5 files are usually VBZ compressed, therefore VBZ plugin for HD5 is required to be installed, so WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
279
+
Newer .fast5 files are usually VBZ compressed, therefore VBZ plugin for HD5 is required to be installed, so WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
0 commit comments