Commit 86a0d2d

Merge pull request #4 from fmfi-compbio/refactoring

Refactoring

2 parents 33c8364 + 09aa445


58 files changed: +3602 -3344 lines

.dockerignore (+2)

@@ -0,0 +1,2 @@
+warpstr_docs
+.git

.gitignore (+2)

@@ -1,3 +1,5 @@
+example/config_test.yaml
+test/Human_STR_1108232/
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

.pre-commit-config.yaml (+26)

@@ -0,0 +1,26 @@
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.3.0
+    hooks:
+    -   id: double-quote-string-fixer
+    -   id: end-of-file-fixer
+    -   id: check-added-large-files
+    -   id: check-docstring-first
+    -   id: check-merge-conflict
+-   repo: https://github.com/pycqa/flake8
+    rev: '5.0.4'
+    hooks:
+    -   id: flake8
+        args:
+        - "--max-line-length=120"
+-   repo: https://github.com/pre-commit/mirrors-isort
+    rev: v5.10.1
+    hooks:
+    -   id: isort
+        args:
+        - "--line-length=120"
+-   repo: https://github.com/pre-commit/mirrors-autopep8
+    rev: 'v1.7.0'
+    hooks:
+    -   id: autopep8
+        args: ["-i", "--max-line-length", "120"]

Dockerfile (+31)

@@ -0,0 +1,31 @@
+FROM python:3.7-slim as base
+
+# Setup environment
+ENV LANG C.UTF-8
+ENV LC_ALL C.UTF-8
+ENV PYTHONDONTWRITEBYTECODE 1
+ENV PYTHONFAULTHANDLER 1
+
+FROM base AS deps
+
+# Install pipenv and gcc
+RUN pip install pipenv
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends gcc
+
+# Install python dependencies in /.venv
+COPY Pipfile .
+COPY Pipfile.lock .
+RUN PIPENV_VENV_IN_PROJECT=1 pipenv install --deploy
+
+FROM base AS runtime
+
+# Copy virtual env from python-deps stage
+COPY --from=deps /.venv /.venv
+ENV HDF5_PLUGIN_PATH="example/deps/"
+ENV PATH="/.venv/bin:$PATH"
+
+# Install application into container
+RUN mkdir -p /app
+WORKDIR /app
+COPY . .

Pipfile (+21)

@@ -0,0 +1,21 @@
+[[source]]
+name = "pypi"
+url = "https://pypi.org/simple"
+verify_ssl = true
+
+[packages]
+numpy = "==1.20"
+biopython = "==1.75"
+cached-property = "*"
+h5py = "==3.4.0"
+matplotlib = "==3.3.4"
+multiprocess = "==0.70.12.2"
+pandas = "==1.2.5"
+pysam = "==0.16.0.1"
+pyyaml = "==5.4.1"
+scikit-learn = "==1.0.1"
+scipy = "==1.6.3"
+seaborn = "==0.11.1"
+
+[requires]
+python_version = "3.7"

Pipfile.lock (+497)

Generated file; diff not rendered by default.

README.md (+199 -26)
@@ -6,9 +6,15 @@ See our preprint at: <https://www.biorxiv.org/content/10.1101/2022.11.05.515275v
 
 See below for some quick steps how to install and run WarpSTR, or refer to more detailed [documentation](https://fmfi-compbio.github.io/warpstr/).
 
-## Installation
+## 1 Installation
 
-WarpSTR can be easily installed using conda environment, frozen in `conda_req.yaml`. The conda environment can be created as follows:
+WarpSTR can be installed using conda or pipenv. To install conda, please follow [the official guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). To install pipenv, a simple `pip install pipenv` should suffice.
+
+WarpSTR was tested on Ubuntu 20.04 and Ubuntu 22.04. The Python version used is 3.7.
+
+### 1.a) Installing using conda
+
+Clone this repository. Then, create the conda environment:
 
 ```bash
 conda env create -f conda_req.yaml
@@ -20,49 +26,108 @@ After installation, it is required to activate conda environment:
 conda activate warpstr
 ```
 
-WarpSTR was tested in Ubuntu 20.04 OS.
+### 1.b) Installing using pipenv
 
-## Running WarpSTR
+Clone this repository. The pipenv environment can be installed from Pipfile.lock as follows:
 
-Required step to do before running WarpSTR is to prepare config file and add loci information.
+```bash
+pipenv sync
+```
+
+After installation, it is required to activate the environment:
+
+```bash
+pipenv shell
+```
 
-### Config file
+## 2 Running the test case
 
-The input configuration file must be populated with elements such as `inputs`, `output` and `reference_path`. An example is provided in `example/config.yaml`.
+In `test/test_input` there is a small test dataset with 10 reads for one locus. There is also a template for the config file required by WarpSTR, `test/config_template.yaml`. You can check whether WarpSTR works correctly simply by running:
 
-There are also many advanced parameters that are optional to set. List of all parameters are found in `example/advanced_params.yaml`. To set values for those parameters, just add those parameters to your main config and set them to the desired value. In other case, default values for those parameters are taken.
+```bash
+bash run_test_case.sh
+```
 
-### Loci information
+This wrapper script will prompt you for the required paths and run WarpSTR on the test data. Output files will then be stored in `test/test_output/`, as given in the config file. The script should take approximately 3-5 minutes, and at the end you should see something like:
 
-Information about loci, that are subjects for analysis by WarpSTR, must be described in the config file. An example is described `example/config.yaml`. Each loci must be defined by name and genomic coordinates. Then, you can either specify repeating motifs occuring in the locus in `motif` element, from which the input sequence for WarpSTR state automata is automatically created(this is recommended for starting users). The second way is to configure the input sequence by yourself in `sequence` element of the locus, however this is not a trivial task, so it is recommended for more advanced users. The other possibility is to use automatic configuration and then modify it by hand.
+```text
+Results stored in overview file XY
+Allele lengths as given by WarpSTR: (44, 40)
+```
 
-### Running
+## 3 Running WarpSTR
 
-After creating configuration file, running WarpSTR is simple as it requires only the path to the config file:
+Running WarpSTR is simple, as it requires only the path to the configuration file:
 
 ```bash
 python WarpSTR.py example/config.yaml
 ```
 
-### Input data
+WarpSTR consists of multiple complex steps doing the following:
+
+1. extracting reads encompassing the locus coordinates - requires BAM mapping files and multi .fast5 files.
+2. extracting STR regions from reads - requires the Guppy basecaller.
+3. determining the allele length for each read.
+4. genotyping allele lengths and determining zygosity.
+
+If you want to run the whole WarpSTR pipeline, continue reading; otherwise, skip to the [WarpSTR steps](#5-warpstr-steps).
+
+### 3.1 Config file
+
+In the input configuration file (see `example/config.yaml` for an example) you must set the following elements:
+
+- `reference_path` - path to the fasta file of the reference genome, the same one that was used for mapping basecalled reads.
+- `guppy_config` - path to the executable Guppy basecaller and info about the sequencing (flowcell and kit).
+- `output` - path to the directory where WarpSTR will produce output results.
+- `loci` - loci information, see [below](#32-loci-information).
+- `inputs` - input data, see [below](#33-input-data).
+
+There are also many advanced parameters that are optional to set. A list of all parameters can be found in `example/advanced_params.yaml`. To set values for those parameters, copy the elements to your main config and change the values as desired; otherwise, default values are taken.
+
+### 3.2 Loci information
+
+Information about loci that are subjects for analysis by WarpSTR must be described in the config file. An example is given in `example/config.yaml`. Each locus must be defined by a name and genomic coordinates (these must match the reference), and by either a motif or a sequence:
+
+```yaml
+name: HD
+coord: chr4:3,074,878-3,074,967
+motif: AGC,CGC
+# sequence: (AGC)AACAGCCGCCAC(CGC)
+```
+
+The `motif` element is recommended for beginners, as the input sequence for the WarpSTR state automaton is created from it automatically. In this element, possible repeat units should be provided.
+
+The second way is to configure the input sequence for the automaton yourself in the `sequence` element of the locus. This is not a trivial task, so it is recommended for more advanced users. The other possibility is to use the automatic configuration and then modify it by hand. See the preprint for information about the state automaton.
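The `coord` string packs chromosome, start, and end into one field with thousands separators. As an illustration of that format only (this helper is hypothetical and not part of WarpSTR), such a coordinate could be parsed like this:

```python
import re

def parse_coord(coord):
    """Split a locus coordinate such as 'chr4:3,074,878-3,074,967' into
    chromosome, start and end. Illustrative helper, not WarpSTR's code."""
    m = re.match(r'^([^:]+):([\d,]+)-([\d,]+)$', coord)
    if not m:
        raise ValueError(f'unrecognized coordinate: {coord}')
    chrom = m.group(1)
    start = int(m.group(2).replace(',', ''))  # drop thousands separators
    end = int(m.group(3).replace(',', ''))
    return chrom, start, end
```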
+
+### 3.3 Input data
+
+Required input data are .fast5 files and .bam mapping files. In the configuration file, the user is required to provide the upper-level path in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs but are of the same sample, and thus should be analyzed together; see the [documentation](https://fmfi-compbio.github.io/warpstr/) for that case.
 
-Required input data are .fast5 files and .bam mapping files. In configuration file, the user is required to provide the path to the upper level path, in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs, but are of the same sample, and thus are to be analyzed together. For example, you have main directory for sample `subjectXY` with many subdirectories denoting sequencing runs i.e. `run_1`, `run_2`, with each run directory having its own .bam mapping file and .fast5 files. It is also possible to denote another path to input, in case of having data stored somewhere else (i.e. on the other mounted directory, as ONT data are very large), for example with the data from another run, i.e. `run_3`.
+The simple case is like in the test case:
 
-For the above example, `inputs` in the config could be defined as follows:
+```bash
+test_input/
+└── test_run1
+    ├── fast5s
+    │   └── batch_0.fast5
+    └── mapping
+        ├── mapping.bam
+        └── mapping.bam.bai
+```
+
+The names `test_run1` and `test_input` are then used in the configuration file for the `inputs` element:
 
 ```yaml
 inputs:
-  - path: /data/subjectXY
-    runs: run_1,run_2
-  - path: /mnt/subjectXY
-    runs: run_3
+  - path: test/test_input
+    runs: test_run1
 ```
 
-Each directory as given by `path` and `runs`, i.e. `/data/subjectXY/run_1` and so on, is traversed by WarpSTR to find .bam files and .fast5 files.
+Names of subdirectories such as `fast5s` and `mapping` are not important, but .fast5 files and .bam files must have the correct extensions.
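The layout rules above (file extensions matter, subdirectory names do not) can be pictured as a recursive scan of each run directory. The sketch below is illustrative of the described behavior, not WarpSTR's actual implementation:

```python
import os

def find_run_files(path, runs):
    """Collect .fast5 and .bam files under each run directory given by the
    `path`/`runs` config values. Illustrative sketch, not WarpSTR's code."""
    found = {'fast5': [], 'bam': []}
    for run in runs.split(','):
        run_dir = os.path.join(path, run.strip())
        for root, _dirs, files in os.walk(run_dir):
            for name in files:
                if name.endswith('.fast5'):
                    found['fast5'].append(os.path.join(root, name))
                elif name.endswith('.bam'):
                    found['bam'].append(os.path.join(root, name))
    return found
```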
 
-## Output
+## 4 Output
 
-The upper path for output is given in the .yaml configuration file as `output` element. Outputs are separated for each locus as subdirectories of this upper path, where names of subdirectories are the same as the locus name.
+The upper path for output is given in the .yaml configuration file as the `output` element. Each locus has separate output: a new subdirectory of this upper path, named after the locus, is created to store the output.
 
 The output structure for one locus is as follows:
 
@@ -77,7 +142,7 @@ overview.csv # .csv file with read information and output
 
 Some output files are optional and can be controlled by the .yaml config file.
 
-### Predictions
+### 4.1 Predictions
 
 In the `predictions` directory of each locus there would be a large variety of outputted files in other subdirectories.
 
@@ -89,7 +154,7 @@ In **sequences** subdirectory there is analogous information as in **basecalls**
 
 In **DTW_alignments** subdirectory there are visualized alignments of STR signal with automaton (in both stages). Visualizations are truncated to first 2000 values.
 
-### Summaries
+### 4.2 Summaries
 
 In the `summaries` directory of each locus there is a myriad of optional visualizations:
 
@@ -101,6 +166,114 @@ In the `summaries` directory of each locus there is a myriad of optional visualizations:
 - predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
 - predictions_strand.svg - Violinplots of repeat lengths as split by strand.
 
-## Additional information
+## 5 WarpSTR steps
+
+WarpSTR pipeline steps are toggleable in the config file, i.e. you can skip them by setting them to False:
+
+```yaml
+single_read_extraction: True # Extracts reads mapped to the locus and stores them in single .fast5 format
+guppy_annotation: True # Annotates .fast5 files with the mapping between the basecalled sequence and the signal
+exp_signal_generation: True # Generates expected signals for flanks and repeats
+tr_region_extraction: True # Finds the tandem repeat region in a read using alignment of the basecalled sequence and the reference repeat sequence
+tr_region_calling: True # Uses the state automaton with DTW alignment to find the number of repeats for each signal
+genotyping: True # Predicts the final allele lengths from the predicted repeat numbers of each read
+```
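The toggles above amount to an ordered dispatch over the pipeline steps. A minimal sketch of that idea follows; the step names come from the config above, but the dispatch logic itself is an assumption, not WarpSTR's actual code:

```python
# Pipeline step names, in execution order (taken from the config above).
STEPS = [
    'single_read_extraction',
    'guppy_annotation',
    'exp_signal_generation',
    'tr_region_extraction',
    'tr_region_calling',
    'genotyping',
]

def enabled_steps(config):
    """Return the steps that would run: a step is skipped only when its
    toggle is explicitly False. Illustrative sketch, not WarpSTR's code."""
    return [step for step in STEPS if config.get(step, True)]
```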
+
+### 5.1 Extraction of locus reads
+
+Here, .BAM and multi-fast5 files are required. The following config elements must be set:
+
+- `inputs` element - defining directories containing .BAM and .fast5 files
+- `loci` element - defining genomic coordinates
+- `single_read_extraction` element set to `True`
+
+In the output directory (given by the `output` element), the state of the locus output subdirectory after running this step would be:
+
+```tree
+{locus_name}
+├── fast5
+│   └── {run_id}
+│       ├── {read_name1}.fast5
+│       ├── {read_name2}.fast5
+│       └── ...
+└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read
+```
+
+#### Skipping this step
+
+If you already have single .fast5s ready for the locus and want to skip this step, you should simulate the outcome of the first step:
+
+1. Create a subdirectory in the output directory with the same name as the locus in the config.
+2. In the locus subdir, create the `fast5/run_id` directory, into which you copy the single .fast5 reads (see the output example above).
+3. In the locus subdir, create an `overview.csv` file with one row per read signal and three columns: `name`,`run_id`,`reverse`, where `name` is the read name and `reverse` is either True or False, denoting the strand.
+
+For example, the overview.csv for the above case would be:
+
+```csv
+read_name,run_id,reverse
+read_name1,run_id,False
+read_name2,run_id,True
+...
+```
+
+Then, do not forget to turn off the step in the config file:
+
+```yaml
+single_read_extraction: False # Extracts reads mapped to the locus and stores them in single .fast5 format
+```
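If you script this simulation, the `overview.csv` from the example above could be written with the standard csv module. The read names and run id below are the placeholders from the example, not real data:

```python
import csv

# Placeholder read records: (read name, run id, reverse-strand flag).
reads = [
    ('read_name1', 'run_id', False),
    ('read_name2', 'run_id', True),
]

with open('overview.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['read_name', 'run_id', 'reverse'])
    for name, run_id, reverse in reads:
        writer.writerow([name, run_id, reverse])
```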
+
+### 5.2 Extraction of STR regions
+
+Requires an executable Guppy basecaller (and the completed previous pipeline step).
+
+In this step, reads are basecalled again so that they are annotated with the mapping between basecalls and signal values. This mapping is then used to localize the STR region in the signals.
+
+The state of the locus output directory after running this step would be:
+
+```tree
+{locus_name}
+├── fast5
+│   └── {run_id}
+│       └── annot
+│           ├── {read_name1}.fast5
+│           ├── {read_name2}.fast5
+│           └── ...
+└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read.
+    In addition, there are `l_start_raw` and `r_end_raw` values, corresponding to the signal positions where the left flank starts and the right flank ends.
+```
+
+#### Skipping this step
+
+If you already have .fast5 signals with localized STR regions, you again must simulate the output of this step. The other option is to use our script `prepare_caller_only.py`. It requires two things:
+
+- `--config CONFIG` - the same config as you would use further; the important thing is to set `output` and `loci`.
+- `--file CSV` - a .csv file with one row per .fast5 signal and these required columns:
+  - `fast5_path` - path to the .fast5 read.
+  - `locus` - name of the locus associated with the read.
+  - `read_name` - name of the read.
+  - `reverse` - strand information, True or False.
+  - `l_start_raw` - signal position where the left flank starts.
+  - `r_end_raw` - signal position where the right flank ends.
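Such a CSV can be generated with the standard csv module. In this sketch the file name, read path, and flank positions are placeholders to be replaced with your own values; only the column names come from the list above:

```python
import csv

FIELDS = ['fast5_path', 'locus', 'read_name', 'reverse',
          'l_start_raw', 'r_end_raw']

# Placeholder rows; fill in your actual read paths and flank positions.
rows = [
    {'fast5_path': 'reads/read1.fast5', 'locus': 'Human_STR_1108232',
     'read_name': 'read1', 'reverse': False,
     'l_start_raw': 12000, 'r_end_raw': 18000},
]

with open('caller_only_input.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```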
+
+The example case is in the repository in `test/test_caller_only`. To run it, you must provide the `reference_path` in the config there, and then run:
+
+```bash
+python prepare_caller_only.py --config test/test_caller_only/config_caller_only.yaml --file test/test_caller_only/example.csv
+```
+
+This creates a simulated output of the previous step in `test/test_caller_only/Human_STR_1108232`. Then you can run WarpSTR:
+
+```bash
+python WarpSTR.py test/test_caller_only/config_caller_only.yaml
+```
+
+#### Important notes
+
+- `l_start_raw` and `r_end_raw` can be set approximately; being 100-200 positions off should pose no problem for the correct result.
+- `l_start_raw` and `r_end_raw` must correspond to the flank positions, i.e. the flank length must be set to the same value in the config for `loci`.
+- We currently do not support direct input of STR signal values for STR calling.
+
+## 6 Additional information
 
-Newer .fast5 files are usually VBZ compressed, therefore VBZ plugin for HD5 is required to be installed, so WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
+Newer .fast5 files are usually VBZ compressed; therefore, the VBZ plugin for HDF5 must be installed so that WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
