Commit b077084

committed
Manual iteration
1 parent 9ada1f2 commit b077084

1 file changed

+45
-56
lines changed

assembler/src/projects/pathracer/README.md

Lines changed: 45 additions & 56 deletions
@@ -9,74 +9,74 @@ MANUAL
99
<!-- The tool finds all proper alignments rather than only the best one. -->
1010
<!-- That allows extracting all genes satisfying HMM gene model from the assembly. -->
1111
<!-- -->
12-
**PathRacer** is a novel standalone tool that aligns profile HMM directly to the
12+
**PathRacer** is a standalone tool that performs profile HMM alignment directly to the
1313
assembly graph (performing the codon translation on the fly for amino acid pHMMs).
1414
The tool provides the set of most probable paths traversed by a HMM through the
1515
whole assembly graph, regardless of whether the sequence of interest is encoded
16-
on the single contig or scattered across the set of edges, therefore
16+
on the single contig or scattered across the set of edges, therefore
1717
significantly improving the recovery of sequences of interest even from
1818
fragmented metagenome assemblies.
1919

2020
### Input
21-
For this moment the tool supports only _de Bruijn_ graphs in GFA format produced by **SPAdes**.
21+
Currently the tool supports only _de Bruijn_ graphs in GFA format, as produced by **SPAdes** or an assembler compatible in this respect (e.g. **MEGAHIT**).
2222
Contact us if you need some other format support.
2323

24-
Profile HMM should be in **HMMer3** format, but one can pass nucleotide or amino acid sequence(s) to be converted to pHMM(s) that would be equivalent
25-
to performing Levenshtein search for each input sequence.
26-
27-
### Output
28-
For each pHMM (gene) the tool reports:
29-
30-
- **&lt;gene\_name&gt;.seqs.fa**: sequences correspondent to _N_ (parameter, see below) best score paths ordered by score along with their alignment in CIGAR format
31-
- **&lt;gene\_name&gt;.nucs.fa**: _(for amino acids pHHMs only)_ the same sequences in nucleotides
32-
- **&lt;gene\_name&gt;.edges.fa**: unique unitig (edge) paths correspondent to best score paths above
33-
- **&lt;gene\_name&gt;.{domtblout, pfamtblout, tblout}**: _(optional)_ unitig paths realignment by **HMMer3** `hmmalign` in various formats
34-
- **event\_graph\_&lt;gene\_name&gt;\_component\_&lt;component\_id&gt;\_size\_&lt;component\_size&gt;.cereal**: _(optional, debug output)_ connected components of the aligned graph
35-
- **&lt;component\_id&gt;.dot**: _(optional, plot)_ connected component of matched neighborhood subgraph
36-
- **&lt;component\_id&gt;\_&lt;path\_index&gt;.dot**: _(optional, plot)_ neighborhood of the found path
37-
38-
In addition:
39-
40-
- **all.edges.fa**: unique unitig paths for all pHMMs in one file
41-
- **pathracer.log**: log file
42-
- **graph\_with\_hmm\_paths.gfa**: _(optional)_ input graph with annotated unitig paths
24+
Profile HMMs should be in **HMMer3** format, but one can pass nucleotide or amino acid sequences as well. Each such sequence is converted to a proxy pHMM, and aligning that pHMM is equivalent to aligning the input sequence using Levenshtein distance.
4325

4426

4527
### Command line options
4628
Required positional arguments:
4729

48-
1. Query gene models file (.hmm file or .fasta)
49-
2. Graph in GFA format
50-
3. _k_ (_de Bruijn_ overlap size) for the input graph
30+
1. Query file (.hmm file or .fasta)
31+
2. Assembly graph in GFA format
32+
3. _k_ (_de Bruijn_ vertex overlap size) for the input graph
5133

5234
Main options:
5335

5436
- `--output`, `-o` DIR: output directory
55-
- `--hmm` | `--nt` | `--aa`: match against pHMM(s) [default] | nucleotide sequences | amino acid sequences
37+
- `--hmm` | `--nt` | `--aa`: perform match against pHMM(s) [default] | nucleotide sequences | amino acid sequences
5638
- `--queries` Q1 [Q2 [...]]: query names to look up [default: all queries from input query file]
5739
- `--global` | `--local`: perform HMM-global, graph-local (aka _glocal_, default) or HMM-local, graph-local HMM matching
5840
- `--length`, `-l` L: minimal length of the resulting matched sequence; if &le;1, the value is multiplied by the aligned HMM length [default: 0.9]
59-
- `--top` N: extract up to _N_ top paths [default: 10000]
60-
- `--rescore`: rescore paths by **HMMer3**
61-
- `--threads`, `-t` T: the number of parallel threads [default: 16]
41+
- `--top` N: extract up to _N_ top scored paths [default: 10000]; only unique paths are reported and therefore
42+
- `--rescore`: rescore resulting paths by **HMMer** and produce output tables in **HMMer** standard formats
43+
- `--threads`, `-t` T: the total number of CPU threads to use [default: 16]
6244
- `--parallel-components`: process connected components of neighborhood subgraph in parallel
63-
- `--memory`, `-m` M: RAM limit for PathRacer in GB (terminates if exceeded) [default: 100]
64-
65-
Debug output control:
66-
67-
- `--debug`: enable extensive debug console output
68-
- `--draw`: draw pictures around the interesting edges
45+
- `--memory`, `-m` M: RAM limit in GB (PathRacer terminates if the limit is exceeded) [default: 100]
6946
- `--annotate-graph`: emit paths in GFA graph
70-
- `--export-event-graph`: export Event Graph in .cereal format
7147

7248
Heuristics options:
7349

7450
- `--max-size` MAX\_SIZE: maximal component size to consider [default: INF]
7551
- `--max-insertion-length`: maximal allowed number of successive I-emissions [default: 30]
7652
- `--no-top-score-filter`: disable top score Event Graph vertices filter. Increases sensitivity of deep analysis (`--top` &gt; 50000)
7753

54+
Debug output control:
55+
56+
- `--debug`: enable extensive debug console output
57+
- `--draw`: draw pictures around the interesting edges
58+
- `--export-event-graph`: export Event Graph in .cereal format
59+
7860
_In addition:_ there are some other developer options that are not supposed to be tuned by the end user. They may be removed in future releases.
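
The options above can be combined as in the following minimal sketch. The input file names are borrowed from the Examples section; the output directory name and the option values (`--top 100`, `--threads 8`, `--memory 32`) are arbitrary illustrative choices, not recommendations. The script prints the assembled command and runs it only if `pathracer` is on `PATH` and both input files exist.

```shell
# Illustrative values: input names come from the Examples section,
# the output directory and option values are arbitrary assumptions.
query=bla_all.hmm
graph=urban_strain.gfa
k=55
outdir=pathracer_bla_top100

# Three required positional arguments first, then the options.
cmd="pathracer $query $graph $k --output $outdir --top 100 --threads 8 --memory 32 --rescore"

# Dry run: always show the command that would be executed.
echo "$cmd"

# Execute only when the tool and both inputs are actually available.
if command -v pathracer >/dev/null 2>&1 && [ -f "$query" ] && [ -f "$graph" ]; then
    # Unquoted on purpose: relies on word splitting, so paths must not contain spaces.
    $cmd
fi
```

Since `--hmm` is the default mode, it is omitted here; for a FASTA query one would add `--nt` or `--aa` as described above.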
7961

62+
### Output
63+
For each input pHMM (gene model) PathRacer reports:
64+
65+
- **&lt;gene\_name&gt;.seqs.fa**: sequences corresponding to the _N_ best-scoring paths, ordered by score, along with their alignments in CIGAR format
66+
- **&lt;gene\_name&gt;.nucls.fa**: _(for amino acid pHMMs only)_ the same sequences in nucleotides
67+
- **&lt;gene\_name&gt;.edges.fa**: sequences of the unique graph edge paths corresponding to the best-scoring paths
68+
- **&lt;gene\_name&gt;.{domtblout, pfamtblout, tblout}**: _(optional)_ edge paths realignment by **HMMer** in various default output formats
69+
- **event\_graph\_&lt;gene\_name&gt;\_component\_&lt;component\_id&gt;\_size\_&lt;component\_size&gt;.cereal**: _(optional, debug output)_ connected components of the event graph
70+
- **&lt;component\_id&gt;.dot**: _(optional, plot)_ connected component of matched neighborhood subgraph
71+
- **&lt;component\_id&gt;\_&lt;path\_index&gt;.dot**: _(optional, plot)_ neighborhood of the found path
72+
73+
In addition:
74+
75+
- **all.edges.fa**: unique edge paths for all pHMMs in one file
76+
- **pathracer.log**: log file
77+
- **graph\_with\_hmm\_paths.gfa**: _(optional)_ input graph with top scored paths added
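
As a rough illustration of working with these files, the sketch below counts the records in a `seqs.fa` file. The gene name `geneA`, the temporary directory, and the FASTA headers and sequences are made-up placeholders standing in for real output; the actual header layout written by PathRacer is not described in this section.

```shell
# Create a throwaway directory with a stand-in for <gene_name>.seqs.fa.
# Headers and sequences below are placeholders, not real PathRacer output.
tmpdir=$(mktemp -d)
cat > "$tmpdir/geneA.seqs.fa" <<'EOF'
>placeholder_path_1
ACGTACGT
>placeholder_path_2
ACGTTCGT
EOF

# Each FASTA record starts with '>', so counting those lines counts reported paths.
n_seqs=$(grep -c '^>' "$tmpdir/geneA.seqs.fa")
echo "geneA: $n_seqs reported sequences"

# Clean up the temporary files.
rm -rf "$tmpdir"
```

On a real run, the same `grep -c '^>'` call can be pointed at a file inside the directory given to `--output`.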
78+
79+
8080
### Examples
8181
One can download example datasets from here <http://cab.spbu.ru/software/pathracer/>
8282

@@ -89,46 +89,35 @@ One can download example datasets from here <http://cab.spbu.ru/software/pathrac
8989

9090
Lookup for beta-lactamase genes (amino acid pHMMs) in Singapore wastewater
9191
```
92-
./pathracer bla_all.hmm urban_strain.gfa 55 --output pathracer_urban_strain_bla_all
92+
pathracer bla_all.hmm urban_strain.gfa 55 --output pathracer_urban_strain_bla_all
9393
```
9494

9595
Lookup for _16S_/_5S_/_23S_ (nucleotide HMMs) in _E.coli_ multicell assembly
9696
```
97-
./pathracer bac.hmm ecoli_mc.gfa 55 --output pathracer_ecoli_mc_bac
97+
pathracer bac.hmm ecoli_mc.gfa 55 --output pathracer_ecoli_mc_bac
9898
```
9999

100-
Look up for known _16S_ sequences in _E.coli_ multicell assembly
100+
Lookup for known _16S_ sequences in _E.coli_ multicell assembly
101101
```
102-
./pathracer synth16S_new.fa ecoli_mc.gfa 55 --nt --output pathracer_ecoli_mc_16S_seqs
102+
pathracer synth16S_new.fa ecoli_mc.gfa 55 --nt --output pathracer_ecoli_mc_16S_seqs
103103
```
104104

105-
Look up for known _16S_ sequences in SYNTH mock metagenome assembly
105+
Lookup for known _16S_ sequences in SYNTH mock metagenome assembly
106106
```
107-
./pathracer synth16S_new.fa synth_strain_gbuilder.gfa 55 --nt --output pathracer_synth_strain_gbuider_16S_seqs
107+
pathracer synth16S_new.fa synth_strain_gbuilder.gfa 55 --nt --output pathracer_synth_strain_gbuider_16S_seqs
108108
```
109109

110110
Let us extract **all** _16S_ sequences from SYNTH mock metagenome assembly.
111111
For this we increase `--top` and disable the Event Graph vertices filter (`--no-top-score-filter`).
112112
Deep analysis of an extremely complicated dataset also requires tuning of stack and memory limits:
113113
```
114-
ulimit -s unlimited
114+
ulimit -s unlimited &&
115115
export OMP_STACKSIZE=1G
116-
./pathracer bac.hmm synth_strain_gbuilder.gfa 55 --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter
116+
pathracer bac.hmm synth_strain_gbuilder.gfa 55 --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter
117117
```
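
One possible variation, shown here only as a sketch, is to apply the limits inside a subshell so that they do not persist in the calling shell. The guard around the `pathracer` call and the warning message are additions for illustration; the command itself is the one from the example above.

```shell
# Remember the parent shell's setting to show that it is left untouched.
before=${OMP_STACKSIZE:-unset}

(
    # Raise the stack limit for this subshell only; some systems refuse this.
    ulimit -s unlimited 2>/dev/null || echo "warning: could not raise the stack limit" >&2
    export OMP_STACKSIZE=1G
    echo "OMP_STACKSIZE inside subshell: $OMP_STACKSIZE"

    # Run the deep analysis only if the tool and inputs are present.
    if command -v pathracer >/dev/null 2>&1 && [ -f bac.hmm ] && [ -f synth_strain_gbuilder.gfa ]; then
        pathracer bac.hmm synth_strain_gbuilder.gfa 55 --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter
    fi
)

# Back in the parent shell the variable is unchanged.
after=${OMP_STACKSIZE:-unset}
echo "OMP_STACKSIZE in parent shell: $after"
```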
118118

119119
### References
120-
If you are using **PathRacer** in your research, please refer to <https://www.biorxiv.org/content/10.1101/562579v1>
120+
If you are using **PathRacer** in your research, please cite <https://www.biorxiv.org/content/10.1101/562579v1>
121121

122122
In case of any problems running **PathRacer** please contact SPAdes support <spades.support@cab.spbu.ru> attaching the log file.
123123
Your suggestions are also very welcome!
124-
125-
126-
127-
128-
129-
130-
131-
132-
133-
134-
