README.md: 86 additions & 39 deletions
@@ -6,6 +6,13 @@
6
6
7
7
BinSPreader is a novel tool that attempts to refine metagenome-assembled genomes (MAGs) obtained from existing tools. BinSPreader exploits the assembly graph topology and other connectivity information, such as paired-end and Hi-C reads, to refine the existing binning, correct binning errors, propagate binning from longer contigs to shorter contigs and infer contigs belonging to multiple bins.
8
8
9
+
### Dependencies
10
+
11
+
- g++ (version 5.3.1 or higher)
12
+
- cmake (version 3.12 or higher)
13
+
- zlib
14
+
- libbz2
15
+
9
16
### Installation
10
17
11
18
```
@@ -15,50 +22,89 @@ make bin-refine
15
22
```
16
23
To run BinSPreader, move to the `assembler/` folder and execute
17
24
18
-
`build/bin/hicspades-binner`
25
+
`build/bin/bin-refine`
19
26
20
27
### Input
21
28
22
-
The tool has two mandatory options:
23
-
- Assembly graph file in [GFA 1.0 format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with scaffolds included as path lines. Alternatively, scaffolds can be provided separately using `--path` option.
24
-
-Initial
29
+
The tool requires an initial binning to refine, as well as an assembly graph as a source of information for refining. Optionally, BinSPreader can be provided with multiple Hi-C and/or paired-end libraries.
30
+
31
+
Required positional arguments:
32
+
33
+
- Assembly graph file in [GFA 1.0 format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with scaffolds included as path lines. Alternatively, scaffold paths can be provided separately using the `--paths` option in the `.paths` format accepted by Bandage (see [Bandage wiki](https://github.com/rrwick/Bandage/wiki/Graph-paths) for details).
34
+
- Binning output from an existing tool (in `.tsv` format)
35
+
36
+
Synopsis: `bin-refine <graph (in GFA)> <binning (in .tsv)> <output directory> [OPTION...]`
37
+
38
+
Main options:
25
39
26
-
Synopsis: `hicspades-binner <graph (in GFA)> <dataset description (in YAML)> <output directory> [OPTION...]`
40
+
-`--paths` provide contig paths from a file, separately from the GFA
41
+
-`--dataset` Dataset in YAML format (see #yaml) describing Hi-C and paired-end reads
27
42
28
-
The options are:
43
+
-`-l` L Library index (0-based, default: 0). Only the library specified by this index will be used.
44
+
-`-t` T # of threads to use (default: 1/2 of available threads)
45
+
-`-e` E convergence relative tolerance threshold (default: 1e-5)
46
+
-`-n` ITERATIONS maximum number of iterations (default: 5000)
47
+
-`-m` allow multiple bin assignment (default: false)
48
+
-`-Smax|-Smle` simple maximum or maximum likelihood binning assignment strategy (default: max likelihood)
49
+
-`-Rcorr|-Rprop` Select propagation or correction mode (default: correction)
50
+
-`--cami` use CAMI bioboxes binning format
51
+
-`--zero-bin` emit zero bin for unbinned sequences
52
+
-`--tall-multi` use tall table for multiple binning result
53
+
-`--bin-dist` estimate pairwise bin distance (could be slow on large graphs!)
54
+
-`-la` LA labels correction regularization parameter for labeled data (default: 0.6)
29
55
30
-
`-t, --threads <int> `
31
-
# of threads to use
56
+
Sparse propagation options:
57
+
-`--sparse-propagation` Gradually reduce regularization parameter from binned to unbinned edges. Recommended for sparse binnings with low assembly fraction.
58
+
-`--no-unbinned-bin` Do not create a special bin for unbinned contigs. More aggressive strategy.
59
+
-`-ma, --metaalpha` Regularization parameter for sparse propagation procedure. Increase/decrease for more aggressive/conservative refining (default: 0.6)
60
+
-`-lt, --length-threshold` LENGTH_THRESHOLD Binning will not be propagated to edges longer than threshold
61
+
-`-db, --distance-bound` DISTANCE_BOUND Binning will not be propagated further than bound from initially binned edges
32
62
33
-
`-e, --enzymes <string> `
34
-
Comma-separated string of restriction enzyme recognition sites
63
+
Read splitting options:
64
+
-`-r, --reads` Split reads according to binning. Can be used for reassembly.
65
+
-`-b, --bin-weight` BIN_WEIGHT Reads bin weight threshold (default: 0.1).
35
66
36
-
`--tmp-dir <dir name> `
37
-
scratch directory to use
67
+
Developer options:
68
+
-`--bin-load` Load binary-converted reads from tmpdir
69
+
-`--debug` produce lots of debug data
70
+
-`--tmp-dir` TMP_DIR scratch directory to use
71
+
-`-h, --help ` print help message
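As an illustration of the synopsis and options listed above, the following sketch assembles one possible command line. The input and output names (`assembly_graph.gfa`, `initial_binning.tsv`, `binspreader_out`) are hypothetical placeholders, and the choice of `-t 8` and `-Rprop` is only an example of combining documented options, not a recommended configuration.

```bash
# Hypothetical input and output paths; replace them with your own files.
GRAPH="assembly_graph.gfa"
BINNING="initial_binning.tsv"
OUTDIR="binspreader_out"

# Positional arguments come first (graph, binning, output directory),
# followed by options: 8 threads and propagation mode.
BIN="build/bin/bin-refine"
ARGS="$GRAPH $BINNING $OUTDIR -t 8 -Rprop"

# Show the command that would be executed.
echo "$BIN $ARGS"

# Run it only when the binary and both input files are actually present.
if [ -x "$BIN" ] && [ -f "$GRAPH" ] && [ -f "$BINNING" ]; then
    $BIN $ARGS
fi
```

The guard at the end keeps the snippet harmless when it is copied before the tool has been built.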
38
72
39
-
`--min-ctg-len <int> `
40
-
Minimum contig length for binning
73
+
### BinSPreader output
41
74
42
-
`--path-links-thr <int> `
43
-
Minimum total number of links between contigs
75
+
BinSPreader stores all output files in output directory `<output_dir>`, which is set by the user.
44
76
45
-
`--edge-links-thr <int>`
46
-
Minimum number of links between long edges
77
+
-`<output_dir>/binning.tsv` contains refined binning in `.tsv` format
78
+
-`<output_dir>/bin_stats.tsv` contains various per-bin statistics
79
+
-`<output_dir>/bin_weights.tsv` contains soft bin weights per contig
80
+
-`<output_dir>/edge_weights.tsv` contains soft bin weights per edge
47
81
48
-
`-h, --help `
49
-
print help message
82
+
In addition:
83
+
84
+
-`<output_dir>/bin_dist.tsv` contains refined bin distance matrix (if `--bin-dist` was used)
85
+
-`<output_dir>/bin_label_1.fastq, <output_dir>/bin_label_2.fastq` read set for bin labeled by `bin_label` (if `--reads` was used)
86
+
-`<output_dir>/pe_links.tsv` list of paired-end links between assembly graph edges with weights (if `--debug` was used)
87
+
-`<output_dir>/graph_links.tsv` list of graph links between assembly graph edges with weights (if `--debug` was used)
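A quick way to inspect a refined binning is to count contigs per bin. The sketch below assumes that the binning file has two tab-separated columns, contig name and bin label; this layout is an assumption and is not specified in this README, so check your own `binning.tsv` first. The sample file and its contents are made up for illustration.

```bash
# Create a small made-up example file (assumed layout: contig<TAB>bin).
BINNING_TSV="example_binning.tsv"
printf 'contig_1\tbin_A\ncontig_2\tbin_A\ncontig_3\tbin_B\n' > "$BINNING_TSV"

# Count how many contigs are assigned to each bin label (second column),
# then sort the result so the output order is stable.
SUMMARY=$(awk -F '\t' '{ count[$2]++ } END { for (b in count) print b "\t" count[b] }' "$BINNING_TSV" | sort)

echo "$SUMMARY"
```

To apply this to real output, point `BINNING_TSV` at `<output_dir>/binning.tsv` and skip the `printf` line.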
50
88
51
89
<a name="yaml"></a>
52
90
**_Specifying input data with YAML data set file_**
53
91
54
-
hicSPAdes-binner currently supports a single Hi-C library described in a YAML file. For example, if your Hi-C library is split into two pairs of files
92
+
BinSPreader currently supports multiple paired-end or Hi-C libraries described in a YAML file. For example, if you have one paired-end library split into two sets of files
55
93
56
94
```bash
57
95
58
-
lib_hic_left_1.fastq
59
-
lib_hic_right_1.fastq
60
-
lib_hic_left_2.fastq
61
-
lib_hic_right_2.fastq
96
+
lib_pe1_left_1.fastq
97
+
lib_pe1_right_1.fastq
98
+
lib_pe1_left_2.fastq
99
+
lib_pe1_right_2.fastq
100
+
```
101
+
102
+
and one Hi-C library
103
+
104
+
```bash
105
+
106
+
lib_hic1_left.fastq
107
+
lib_hic1_right.fastq
62
108
```
63
109
64
110
YAML file should look like this:
@@ -68,24 +114,25 @@ YAML file should look like this:
68
114
[
69
115
{
70
116
orientation: "fr",
71
-
type: "hic",
117
+
type: "paired-end",
118
+
right reads: [
119
+
"/FULL_PATH_TO_DATASET/lib_pe1_right_1.fastq",
120
+
"/FULL_PATH_TO_DATASET/lib_pe1_right_2.fastq"
121
+
],
122
+
left reads: [
123
+
"/FULL_PATH_TO_DATASET/lib_pe1_left_1.fastq",
124
+
"/FULL_PATH_TO_DATASET/lib_pe1_left_2.fastq"
125
+
]
126
+
},
127
+
{
128
+
orientation: "fr",
129
+
type: "hic",
72
130
right reads: [
73
-
"/FULL_PATH_TO_DATASET/lib_hic_right_1.fastq",
74
-
"/FULL_PATH_TO_DATASET/lib_hic_right_2.fastq"
131
+
"/FULL_PATH_TO_DATASET/lib_hic1_right.fastq"
75
132
],
76
133
left reads: [
77
-
"/FULL_PATH_TO_DATASET/lib_hic_left_1.fastq",
78
-
"/FULL_PATH_TO_DATASET/lib_hic_left_2.fastq"
134
+
"/FULL_PATH_TO_DATASET/lib_hic1_left.fastq"
79
135
]
80
136
}
81
137
]
82
138
```
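For a smaller starting point, the sketch below writes a dataset file that describes only the paired-end library from the example and prints a command that passes it through `--dataset`. The file name `dataset.yaml` and the graph, binning, and output names are hypothetical; `/FULL_PATH_TO_DATASET` is the same placeholder used above and must be replaced with a real absolute path.

```bash
# Placeholder directory and hypothetical dataset file name.
DATA_DIR="/FULL_PATH_TO_DATASET"
DATASET="dataset.yaml"

# Write a description of a single paired-end library split into two sets of files.
cat > "$DATASET" <<EOF
[
    {
        orientation: "fr",
        type: "paired-end",
        right reads: [
            "$DATA_DIR/lib_pe1_right_1.fastq",
            "$DATA_DIR/lib_pe1_right_2.fastq"
        ],
        left reads: [
            "$DATA_DIR/lib_pe1_left_1.fastq",
            "$DATA_DIR/lib_pe1_left_2.fastq"
        ]
    }
]
EOF

# Print a command line that would use this dataset (input names are hypothetical).
echo "build/bin/bin-refine assembly_graph.gfa initial_binning.tsv binspreader_out --dataset $DATASET"
```

Generating the file from a script keeps the read paths in one variable, which is convenient when the same layout is reused across samples.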
83
-
84
-
### Output
85
-
86
-
hicSPAdes-binner stores all output files in `<output_dir> `, which is set by the user.
87
-
88
-
-`<output_dir>/clustering.mcl` contains resulting scaffold clustering in MCL format
89
-
-`<output_dir>/clustering.tsv` contains resulting scaffold clustering in TSV format
90
-
-`<output_dir>/basic_stats.tsv` contains various per-cluster statistics
91
-
-`<output_dir>/contact_map.tsv` contains hicSPAdes scores between input scaffolds, as well as other scaffold statistics