Commit 86a0d2d

Merge pull request #4 from fmfi-compbio/refactoring

Refactoring

2 parents 33c8364 + 09aa445


58 files changed: +3602 -3344 lines

.dockerignore (+2)

@@ -0,0 +1,2 @@
+warpstr_docs
+.git

.gitignore (+2)

@@ -1,3 +1,5 @@
+example/config_test.yaml
+test/Human_STR_1108232/
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

.pre-commit-config.yaml (+26)

@@ -0,0 +1,26 @@
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.3.0
+    hooks:
+    -   id: double-quote-string-fixer
+    -   id: end-of-file-fixer
+    -   id: check-added-large-files
+    -   id: check-docstring-first
+    -   id: check-merge-conflict
+-   repo: https://github.com/pycqa/flake8
+    rev: '5.0.4'
+    hooks:
+    -   id: flake8
+        args:
+        - "--max-line-length=120"
+-   repo: https://github.com/pre-commit/mirrors-isort
+    rev: v5.10.1
+    hooks:
+    -   id: isort
+        args:
+        - "--line-length=120"
+-   repo: https://github.com/pre-commit/mirrors-autopep8
+    rev: 'v1.7.0'
+    hooks:
+    -   id: autopep8
+        args: ["-i", "--max-line-length", "120"]

Dockerfile (+31)

@@ -0,0 +1,31 @@
+FROM python:3.7-slim as base
+
+# Setup environment
+ENV LANG C.UTF-8
+ENV LC_ALL C.UTF-8
+ENV PYTHONDONTWRITEBYTECODE 1
+ENV PYTHONFAULTHANDLER 1
+
+FROM base AS deps
+
+# Install pipenv and gcc
+RUN pip install pipenv
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends gcc
+
+# Install python dependencies in /.venv
+COPY Pipfile .
+COPY Pipfile.lock .
+RUN PIPENV_VENV_IN_PROJECT=1 pipenv install --deploy
+
+FROM base AS runtime
+
+# Copy virtual env from python-deps stage
+COPY --from=deps /.venv /.venv
+ENV HDF5_PLUGIN_PATH="example/deps/"
+ENV PATH="/.venv/bin:$PATH"
+
+# Install application into container
+RUN mkdir -p /app
+WORKDIR /app
+COPY . .

Pipfile (+21)

@@ -0,0 +1,21 @@
+[[source]]
+name = "pypi"
+url = "https://pypi.org/simple"
+verify_ssl = true
+
+[packages]
+numpy = "==1.20"
+biopython = "==1.75"
+cached-property = "*"
+h5py = "==3.4.0"
+matplotlib = "==3.3.4"
+multiprocess = "==0.70.12.2"
+pandas = "==1.2.5"
+pysam = "==0.16.0.1"
+pyyaml = "==5.4.1"
+scikit-learn = "==1.0.1"
+scipy = "==1.6.3"
+seaborn = "==0.11.1"
+
+[requires]
+python_version = "3.7"

Pipfile.lock (+497)

Generated file; diff not rendered by default.

README.md (+199 -26)
@@ -6,9 +6,15 @@ See our preprint at: <https://www.biorxiv.org/content/10.1101/2022.11.05.515275v
 
 See below for some quick steps how to install and run WarpSTR, or refer to more detailed [documentation](https://fmfi-compbio.github.io/warpstr/).
 
-## Installation
+## 1 Installation
 
-WarpSTR can be easily installed using conda environment, frozen in `conda_req.yaml`. The conda environment can be created as follows:
+WarpSTR can be installed using conda or pipenv. To install conda, please follow [the official guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). To install pipenv, a simple `pip install pipenv` should suffice.
+
+WarpSTR was tested on Ubuntu 20.04 and Ubuntu 22.04. The Python version used is 3.7.
+
+### 1.a) Installing using conda
+
+Clone this repository. Then, create the conda environment:
 
 ```bash
 conda env create -f conda_req.yaml
@@ -20,49 +26,108 @@ After installation, it is required to activate conda environment:
 conda activate warpstr
 ```
 
-WarpSTR was tested in Ubuntu 20.04 OS.
+### 1.b) Installing using pipenv
 
-## Running WarpSTR
+Clone this repository. The pipenv environment can be installed from Pipfile.lock as follows:
 
-Required step to do before running WarpSTR is to prepare config file and add loci information.
+```bash
+pipenv sync
+```
+
+After installation, it is required to activate the environment:
+
+```bash
+pipenv shell
+```
 
-### Config file
+## 2 Running the test case
 
-The input configuration file must be populated with elements such as `inputs`, `output` and `reference_path`. An example is provided in `example/config.yaml`.
+In `test/test_input` there is a small test dataset with 10 reads for one locus. There is also a template for the config file required by WarpSTR, `test/config_template.yaml`. You can check whether WarpSTR works correctly simply by running:
 
-There are also many advanced parameters that are optional to set. List of all parameters are found in `example/advanced_params.yaml`. To set values for those parameters, just add those parameters to your main config and set them to the desired value. In other case, default values for those parameters are taken.
+```bash
+bash run_test_case.sh
+```
 
-### Loci information
+This wrapper script will prompt you for the required paths and run WarpSTR on the test data. Output files will then be stored in `test/test_output/`, as given in the config file. The script should take approximately 3-5 minutes, and at the end you should see something like:
 
-Information about loci, that are subjects for analysis by WarpSTR, must be described in the config file. An example is described `example/config.yaml`. Each loci must be defined by name and genomic coordinates. Then, you can either specify repeating motifs occuring in the locus in `motif` element, from which the input sequence for WarpSTR state automata is automatically created(this is recommended for starting users). The second way is to configure the input sequence by yourself in `sequence` element of the locus, however this is not a trivial task, so it is recommended for more advanced users. The other possibility is to use automatic configuration and then modify it by hand.
+```text
+Results stored in overview file XY
+Allele lengths as given by WarpSTR: (44, 40)
+```
 
-### Running
+## 3 Running WarpSTR
 
-After creating configuration file, running WarpSTR is simple as it requires only the path to the config file:
+Running WarpSTR is simple, as it requires only the path to the configuration file:
 
 ```bash
 python WarpSTR.py example/config.yaml
 ```
 
-### Input data
+WarpSTR consists of multiple complex steps doing the following:
+
+1. extracting reads encompassing the locus coordinates - requires BAM mapping files and multi .fast5 files.
+2. extracting STR regions from reads - requires the Guppy basecaller.
+3. determining the allele length for each read.
+4. genotyping allele lengths and determining zygosity.
+
+If you want to run the whole WarpSTR pipeline, continue reading; otherwise, skip to the [WarpSTR steps](#5-warpstr-steps).
+
+### 3.1 Config file
+
+In the input configuration file (see `example/config.yaml` for an example) you must set the following elements:
+
+- `reference_path` - path to the fasta file of the reference genome, the same one that was used for mapping basecalled reads.
+- `guppy_config` - path to the executable Guppy basecaller and info about the sequencing (flowcell and kit).
+- `output` - path to the directory where WarpSTR will produce output results.
+- `loci` - loci information, see [below](#32-loci-information).
+- `inputs` - input data, see [below](#33-input-data).
+
+There are also many advanced parameters that are optional to set. A list of all parameters can be found in `example/advanced_params.yaml`. To set values for those parameters, copy the elements to your main config and change the values as desired; otherwise, default values are taken.
+
+### 3.2 Loci information
+
+Information about loci that are subjects for analysis by WarpSTR must be described in the config file. An example is given in `example/config.yaml`. Each locus must be defined by a name and genomic coordinates (these must match the reference), and by either a motif or a sequence:
+
+```yaml
+name: HD
+coord: chr4:3,074,878-3,074,967
+motif: AGC,CGC
+# sequence: (AGC)AACAGCCGCCAC(CGC)
+```
+
+The `motif` element is recommended for beginners, as the input sequence for the WarpSTR state automaton is created from it automatically. In this element, possible repeat units should be provided.
+
+The second way is to configure the input sequence for the automaton yourself in the `sequence` element of the locus. This is not a trivial task, so it is recommended for more advanced users. The other possibility is to use the automatic configuration and then modify it by hand. See the preprint for information about the state automaton.
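The `coord` string packs chromosome, start, and end into one field with thousands separators. As an illustration of that format only (this helper is hypothetical and not part of WarpSTR), such a coordinate could be parsed like this:

```python
import re

def parse_coord(coord):
    """Split a locus coordinate such as 'chr4:3,074,878-3,074,967' into
    chromosome, start and end. Illustrative helper, not WarpSTR's code."""
    m = re.match(r'^([^:]+):([\d,]+)-([\d,]+)$', coord)
    if not m:
        raise ValueError(f'unrecognized coordinate: {coord}')
    chrom = m.group(1)
    start = int(m.group(2).replace(',', ''))  # drop thousands separators
    end = int(m.group(3).replace(',', ''))
    return chrom, start, end
```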
+
+### 3.3 Input data
+
+Required input data are .fast5 files and .bam mapping files. In the configuration file, the user is required to provide the upper-level path in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs but are of the same sample, and thus should be analyzed together; see the [documentation](https://fmfi-compbio.github.io/warpstr/) for that case.
 
-Required input data are .fast5 files and .bam mapping files. In configuration file, the user is required to provide the path to the upper level path, in the `inputs` element. WarpSTR presumes that your data can come from multiple sequencing runs, but are of the same sample, and thus are to be analyzed together. For example, you have main directory for sample `subjectXY` with many subdirectories denoting sequencing runs i.e. `run_1`, `run_2`, with each run directory having its own .bam mapping file and .fast5 files. It is also possible to denote another path to input, in case of having data stored somewhere else (i.e. on the other mounted directory, as ONT data are very large), for example with the data from another run, i.e. `run_3`.
+The simple case is like in the test case:
 
-For the above example, `inputs` in the config could be defined as follows:
+```bash
+test_input/
+└── test_run1
+    ├── fast5s
+    │   └── batch_0.fast5
+    └── mapping
+        ├── mapping.bam
+        └── mapping.bam.bai
+```
+
+The names `test_run1` and `test_input` are then used in the configuration file for the `inputs` element:
 
 ```yaml
 inputs:
-  - path: /data/subjectXY
-    runs: run_1,run_2
-  - path: /mnt/subjectXY
-    runs: run_3
+  - path: test/test_input
+    runs: test_run1
 ```
 
-Each directory as given by `path` and `runs`, i.e. `/data/subjectXY/run_1` and so on, is traversed by WarpSTR to find .bam files and .fast5 files.
+Names of subdirectories such as `fast5s` and `mapping` are not important, but .fast5 files and .bam files must have the correct extensions.
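The layout rules above (file extensions matter, subdirectory names do not) can be pictured as a recursive scan of each run directory. The sketch below is illustrative of the described behavior, not WarpSTR's actual implementation:

```python
import os

def find_run_files(path, runs):
    """Collect .fast5 and .bam files under each run directory given by the
    `path`/`runs` config values. Illustrative sketch, not WarpSTR's code."""
    found = {'fast5': [], 'bam': []}
    for run in runs.split(','):
        run_dir = os.path.join(path, run.strip())
        for root, _dirs, files in os.walk(run_dir):
            for name in files:
                if name.endswith('.fast5'):
                    found['fast5'].append(os.path.join(root, name))
                elif name.endswith('.bam'):
                    found['bam'].append(os.path.join(root, name))
    return found
```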
 
-## Output
+## 4 Output
 
-The upper path for output is given in the .yaml configuration file as `output` element. Outputs are separated for each locus as subdirectories of this upper path, where names of subdirectories are the same as the locus name.
+The upper path for output is given in the .yaml configuration file as the `output` element. Each locus has separate output: a new subdirectory of this upper path, named after the locus, is created to store the output.
 
 The output structure for one locus is as follows:
 
@@ -77,7 +142,7 @@ overview.csv # .csv file with read information and output
 
 Some output files are optional and can be controlled by the .yaml config file.
 
-### Predictions
+### 4.1 Predictions
 
 In the `predictions` directory of each locus there would be a large variety of outputted files in other subdirectories.
 
@@ -89,7 +154,7 @@ In **sequences** subdirectory there is analogous information as in **basecalls**
 
 In **DTW_alignments** subdirectory there are visualized alignments of STR signal with automaton (in both stages). Visualizations are truncated to first 2000 values.
 
-### Summaries
+### 4.2 Summaries
 
 In the `summaries` directory of each locus there is a myriad of optional visualizations:
 
@@ -101,6 +166,114 @@ In the `summaries` directory of each locus there is a myriad of optional visualizations:
 - predictions_phase.svg - Violinplots of repeat lengths in the first and second phase.
 - predictions_strand.svg - Violinplots of repeat lengths as split by strand.
 
-## Additional information
+## 5 WarpSTR steps
+
+WarpSTR pipeline steps are toggleable in the config file, i.e. you can skip them by setting them to False:
+
+```yaml
+single_read_extraction: True # Extracts reads mapped to the locus and stores them in single .fast5 format
+guppy_annotation: True # Annotates .fast5 files with the mapping between the basecalled sequence and the signal
+exp_signal_generation: True # Generates expected signals for flanks and repeats
+tr_region_extraction: True # Finds the tandem repeat region in a read using alignment of the basecalled sequence and the reference repeat sequence
+tr_region_calling: True # Uses the state automaton with DTW alignment to find the number of repeats for each signal
+genotyping: True # Predicts the final allele lengths from the predicted repeat numbers of each read
+```
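The toggles above amount to an ordered dispatch over the pipeline steps. A minimal sketch of that idea follows; the step names come from the config above, but the dispatch logic itself is an assumption, not WarpSTR's actual code:

```python
# Pipeline step names, in execution order (taken from the config above).
STEPS = [
    'single_read_extraction',
    'guppy_annotation',
    'exp_signal_generation',
    'tr_region_extraction',
    'tr_region_calling',
    'genotyping',
]

def enabled_steps(config):
    """Return the steps that would run: a step is skipped only when its
    toggle is explicitly False. Illustrative sketch, not WarpSTR's code."""
    return [step for step in STEPS if config.get(step, True)]
```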
+
+### 5.1 Extraction of locus reads
+
+Here, .BAM and multi-fast5 files are required. The following config elements must be set:
+
+- `inputs` element - defining directories containing .BAM and .fast5 files
+- `loci` element - defining genomic coordinates
+- `single_read_extraction` element set to `True`
+
+In the output directory (given by the `output` element), the state of the locus output subdirectory after running this step would be:
+
+```tree
+{locus_name}
+├── fast5
+│   └── {run_id}
+│       ├── {read_name1}.fast5
+│       ├── {read_name2}.fast5
+│       └── ...
+└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read
+```
+
+#### Skipping this step
+
+If you already have single .fast5s ready for the locus and want to skip this step, you should simulate the outcome of the first step:
+
+1. Create a subdirectory in the output directory with the same name as the locus in the config.
+2. In the locus subdir, create the `fast5/run_id` directory, into which you copy the single .fast5 reads (see the output example above).
+3. In the locus subdir, create an `overview.csv` file with one row per read signal and three columns: `name`,`run_id`,`reverse`, where `name` is the read name and `reverse` is either True or False, denoting the strand.
+
+For example, the overview.csv for the above case would be:
+
+```csv
+read_name,run_id,reverse
+read_name1,run_id,False
+read_name2,run_id,True
+...
+```
+
+Then, do not forget to turn off the step in the config file:
+
+```yaml
+single_read_extraction: False # Extracts reads mapped to the locus and stores them in single .fast5 format
+```
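If you script this simulation, the `overview.csv` from the example above could be written with the standard csv module. The read names and run id below are the placeholders from the example, not real data:

```python
import csv

# Placeholder read records: (read name, run id, reverse-strand flag).
reads = [
    ('read_name1', 'run_id', False),
    ('read_name2', 'run_id', True),
]

with open('overview.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['read_name', 'run_id', 'reverse'])
    for name, run_id, reverse in reads:
        writer.writerow([name, run_id, reverse])
```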
+
+### 5.2 Extraction of STR regions
+
+Requires an executable Guppy basecaller (and the completed previous pipeline step).
+
+In this step, reads are basecalled again so that they are annotated with the mapping between basecalls and signal values. This mapping is then used to localize the STR region in the signals.
+
+The state of the locus output directory after running this step would be:
+
+```tree
+{locus_name}
+├── fast5
+│   └── {run_id}
+│       └── annot
+│           ├── {read_name1}.fast5
+│           ├── {read_name2}.fast5
+│           └── ...
+└── overview.csv - index of extracted reads with `name`,`run_id`,`reverse` values for each read.
+    In addition, there are `l_start_raw` and `r_end_raw` values, corresponding to the signal positions where the left flank starts and the right flank ends.
+```
+
+#### Skipping this step
+
+If you already have .fast5 signals with localized STR regions, you again must simulate the output of this step. The other option is to use our script `prepare_caller_only.py`. It requires two things:
+
+- `--config CONFIG` - the same config as you would use further; the important thing is to set `output` and `loci`.
+- `--file CSV` - a .csv file with one row per .fast5 signal and these required columns:
+  - `fast5_path` - path to the .fast5 read.
+  - `locus` - name of the locus associated with the read.
+  - `read_name` - name of the read.
+  - `reverse` - strand information, True or False.
+  - `l_start_raw` - signal position where the left flank starts.
+  - `r_end_raw` - signal position where the right flank ends.
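Such a CSV can be generated with the standard csv module. In this sketch the file name, read path, and flank positions are placeholders to be replaced with your own values; only the column names come from the list above:

```python
import csv

FIELDS = ['fast5_path', 'locus', 'read_name', 'reverse',
          'l_start_raw', 'r_end_raw']

# Placeholder rows; fill in your actual read paths and flank positions.
rows = [
    {'fast5_path': 'reads/read1.fast5', 'locus': 'Human_STR_1108232',
     'read_name': 'read1', 'reverse': False,
     'l_start_raw': 12000, 'r_end_raw': 18000},
]

with open('caller_only_input.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```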
+
+The example case is in the repository in `test/test_caller_only`. To run it, you must provide the `reference_path` in the config there, and then run:
+
+```bash
+python prepare_caller_only.py --config test/test_caller_only/config_caller_only.yaml --file test/test_caller_only/example.csv
+```
+
+This creates a simulated output of the previous step in `test/test_caller_only/Human_STR_1108232`. Then you can run WarpSTR:
+
+```bash
+python WarpSTR.py test/test_caller_only/config_caller_only.yaml
+```
+
+#### Important notes
+
+- `l_start_raw` and `r_end_raw` can be set approximately; being 100-200 positions off should pose no problem for the correct result.
+- `l_start_raw` and `r_end_raw` must correspond to the flank positions, i.e. the flank length must be set to the same value in the config for `loci`.
+- We currently do not support direct input of STR signal values for STR calling.
+
+## 6 Additional information
 
-Newer .fast5 files are usually VBZ compressed, therefore VBZ plugin for HD5 is required to be installed, so WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
+Newer .fast5 files are usually VBZ compressed; therefore, the VBZ plugin for HDF5 must be installed so that WarpSTR can handle such files. See `https://github.com/nanoporetech/vbz_compression`.
