-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathigquast_manual.html
312 lines (275 loc) · 18.2 KB
/
igquast_manual.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
<!DOCTYPE html>
<html lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>IgQUAST Manual</title>
<style type="text/css">
.code {
background-color: lightgray;
white-space: pre-wrap; /* css-3 */
white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
white-space: -pre-wrap; /* Opera 4-6 */
white-space: -o-pre-wrap; /* Opera 7 */
word-wrap: break-word; /* Internet Explorer 5.5+ */
}
.sc {
font-variant: small-caps;
}
</style>
</head>
<body>
<h1><font class="sc">IgQUAST</font> manual</h1>
1. <a href="#intro">What is <font class="sc">IgQUAST</font>?</a><br>
1.1. <a href="#igquast_input">Input</a><br>
1.2. <a href="#igquast_output">Output</a><br>
2. <a href="#igquast_install">Installation</a><br>
3. <a href="#igquast_usage"><font class="sc">IgQUAST</font> usage</a><br>
3.1. <a href="#igquast_input_options">Input options</a><br>
3.2. <a href="#igquast_output_options">Output options</a><br>
3.3. <a href="#igquast_scenarios_options">Performed scenarios</a><br>
3.4. <a href="#igquast_misc_options">Miscellaneous options</a><br>
3.5. <a href="#igquast_examples">Examples</a><br>
3.6. <a href="#igquast_output_files">Output files</a><br>
4. <a href="#references">Citations</a><br>
5. <a href="#feedback">Feedback and bug reports</a><br>
<!-- -------- -->
<h2 id="intro">1. What is <font class="sc">IgQUAST</font>?</h2> <font class="sc">IgQUAST</font> (<b>I</b>mmuno<b>G</b>lobulin <b>QU</b>ality <b>AS</b>sessment <b>T</b>ool)
is a tool for adaptive immune repertoires quality assessment.
<font class="sc">IgQUAST</font> can be used for benchmarking of adaptive immune repertoire construction tools and for quality estimation of constructed repertoires.
<font class="sc">IgQUAST</font> performs reference-based and reference-free analysis:
<ul>
<li>During reference-based analysis the tool compares two input repertoires: the <i>reference repertoire</i> and the <i>constructed repertoire.</i>
The analysis is separated into two scenarios:
<ul>
<li><b>Repertoire-to-repertoire matching</b> only uses repertoire sequences.
The tool aligns each of two repertoires against the other one and computes
<i>sensitivity</i> and <i>precision</i> metrics,
detects error positions in erroneously constructed sequences,
and compares reference and constructed abundances for ideally reconstructed sequences.
<!-- TODO reformulate -->
</li>
<li><b>Partition-based analysis</b> only uses partitions induced by the RCMs (read-to-cluster maps).
The tool compares two partitions and computes partition similarity metrics (like <i>Rand index</i>).
Also it computes cluster quality measures (like <i>purity</i> and <i>discordance</i>) and plots their distributions for both input repertoires.
<!-- TODO reformulate -->
</li>
</ul>
</li>
<li><b>Reference-free analysis</b> is performed on the constructed repertoire.
The tool detects overestimated clusters in the repertoire
using amplification-free Poisson model.
It also estimates error rate and error profile of the initial read library.
The same analysis is performed on the reference repertoire if it is provided.
All reference-free analysis requires both repertoire sequences and read-to-cluster map (RCM).
</li>
</ul>
<h3 id="igquast_input">1.1. Input</h3> <font class="sc">IgQUAST</font> takes as an input:
<ul>
<li>Initial Rep-seq read library;</li>
<li>Analyzed adaptive immune repertoire constructed on this library;</li>
<li>For reference-based analysis, the reference repertoire constructed on the same library.</li>
</ul>
Initial Rep-seq library should be in FASTA or FASTQ format.
Reads should be properly cropped (should start from V gene beginning
and finish by J gene ending), reads obtained from negative strand should
be reversed, and contaminative reads should be filtered out.
Cropping, strand correction and contamination filtering may be performed by
the <font class="sc">VJFinder</font> tool from the <font class="sc">IgRepertoireConstructor</font> package
(see <a href="manual.html"><font class="sc">IgRepertoireConstructor</font> manual</a> for <font class="sc">VJFinder</font> input-output format description).
<br>
Analyzed constructed and reference repertoires should be presented by two files each,
repertoire sequences in FASTA format and read-to-cluster map (RCM)
file in a special format.
See <a href="manual.html#repertoire_files"><font class="sc">IgRepertoireConstructor</font> manual</a>
for the comprehensive repertoire format description.
If you have only one of these files,
the tool can reconstruct it using available information (use <code>--reconstruct</code> option).
<h3 id="igquast_output">1.2. Output</h3> <font class="sc">IgQUAST</font> reports:
<ul>
<li>Plots for visual analysis;</li>
<li>Metrics report.</li>
</ul>
Plots are reported in PNG, PDF, and SVG formats. Metrics are reported in text (brief) and JSON (full) formats.
<h2 id=igquast_install>2. Installation</h2>
<font class="sc">IgQUAST</font> has the following dependencies:
<ul>
<li>64-bit Linux system</li>
<li>g++ (version 4.7 or higher)</li>
<li>cmake (version 2.8.8 or higher)</li>
<li>Python 2 (version 2.7 or higher), including:
<ul>
<li><a href = "http://biopython.org/wiki/Download">BioPython</a></li>
<li><a href = "http://matplotlib.org/users/installing.html">Matplotlib</a></li>
<li><a href = "http://www.numpy.org/">NumPy</a></li>
<li><a href = "http://www.scipy.org/install.html">SciPy</a></li>
<li><a href = "http://pandas.pydata.org/">pandas</a></li>
<li><a href = "https://web.stanford.edu/~mwaskom/software/seaborn/">Seaborn</a></li>
</ul></li>
</ul>
<font class="sc">IgQUAST</font> is a part of <font class="sc">IgRepertoireConstructor</font> package.
See <a href="manual.html#install"><font class="sc">IgRepertoireConstructor</font> manual</a> for the installation instructions.
<br>
<br>
Please verify your <font class="sc">IgQUAST</font> installation
before the first run of <font class="sc">IgQUAST</font>:
<pre class="code">
<code>
./igquast.py --test
</code>
</pre> If the installation is succeeded, you will find the following information at the end of the log:
<pre class="code">
<code>
Thank you for using IgQUAST!
Log was written to igquast_test/igquast.log
</code>
</pre>
<h2 id="igquast_usage">3. <font class="sc">IgQUAST</font> usage</h2>
To run <font class="sc">IgQUAST</font>, type:
<pre class="code">
<code>
./igquast.py [options] -s <initial reads> -c <constructed repertoire FASTA> -C <constructed repertoire RCM> -r <reference repertoire FASTA> -R <reference repertoire RCM> -o <output dir for plots>
</code>
</pre>
<h3 id="igquast_input_options">3.1. Input options</h3>
<code>-c / --constructed-repertoire <constructed repertoire FASTA></code><br> FASTA file with constructed repertoire sequences. Can be gzipped.
<br><br>
<code>-C / --constructed-rcm <constructed repertoire RCM></code><br> RCM file with constructed repertoire read-cluster map. Can be gzipped.
<br><br>
<code>-r / --reference-repertoire <reference repertoire FASTA></code><br> FASTA file with reference repertoire sequences. Can be gzipped.
<br><br>
<code>-R / --reference-rcm <reference repertoire RCM></code><br> RCM file with reference repertoire read-cluster map. Can be gzipped.
<br><br>
<code>-s / --initial-reads <initial reads></code><br> Initial Rep-seq reads in FASTA or FASTQ format. Can be gzipped.
<br><br>
<code>--reconstruct | --no-reconstruct</code><br> Whether to reconstruct missing repertoire files if it is possible. Disabled by default.
<br><br>
<h3 id="igquast_output_options">3.2. Output options</h3>
<code>-o / --output-dir <output dir></code><br> output directory (required).
<br><br>
<code>--text <text report file></code><br> File for text report output.
Default: <code><output dir>/igquast.txt</code>.
<br><br>
<code>--json <JSON report file></code><br> File for JSON report output.
Default: <code><output dir>/igquast.json</code>.
<br><br>
<code>-F / --figure-format <figure formats(s)></code><br> Figure
format(s) for plots. Allowed values are <code>png</code>,
<code>pdf</code> and <code>svg</code>.
One can pass several values separated by commas.
Empty string means do not produce plots.
Default value is <code>png,pdf,svg</code>.
<br><br>
<h3 id="igquast_scenarios_options">3.3. Performed scenarios</h3>
<code>--repertoire-to-repertoire-matching | --no-repertoire-to-repertoire-matching</code><br> Whether to perform repertoire-to-repertoire matching. Enabled by default.
<br><br>
<code>--partition-based | --no-partition-based</code><br> Enable/disable partition-based metrics and plots. Enabled by default.
<br><br>
<code>--reference-free | --no-reference-free</code><br> Enable/disable reference-free metrics and plots. Disabled by default.
<br><br>
<code>--export-bad-clusters | --no-export-bad-clusters</code><br> Whether to export untrustworthy clusters during reference-free analysis. Disabled by default.
<br><br>
<h3 id="igquast_additional_options">3.4. Algorithm parameters</h3>
<code>--reference-size-cutoff <positive integer></code><br> Cutoff
for reference cluster size. Smaller reference clusters are discarded during
repertoire-to-repertoire comparison. Default value is <code>5</code>.
<br><br>
<h3 id="igquast_misc_options">3.4. Miscellaneous options</h3>
<code>--test</code><br> Running on the toy test dataset. Command line corresponding to the test run is equivalent to the following:
<pre class="code">
<code>
./igquast.py -s test_dataset/igquast/test/input_reads.fa.gz -c test_dataset/igquast/igrec/final_repertoire.fa.gz -C test_dataset/igquast/igrec/final_repertoire.rcm -r test_dataset/igquast/test/repertoire.fa.gz -R test_dataset/igquast/test/repertoire.rcm -o igquast_test_test
</code>
</pre>
<code>-h / --help</code><br> Show help and exit.
<br><br>
<h3 id="igquast_examples">3.5. Examples</h3>
Perform reference-free analysis only:
<pre class="code">
<code>
./igquast.py -s test_dataset/igquast/test/input_reads.fa.gz -c test_dataset/igquast/test/igrec_bad/final_repertoire.fa.gz -C test_dataset/igquast/test/igrec_bad/final_repertoire.rcm -o igquast_test --reference-free
</code>
</pre> Do not plot figures, make reports only:
<pre class="code">
<code>
./igquast.py -s test_dataset/igquast/test/input_reads.fa.gz -c test_dataset/igquast/test/igrec/final_repertoire.fa.gz -C test_dataset/igquast/test/igrec/final_repertoire.rcm -r test_dataset/igquast/test/repertoire.fa.gz -R test_dataset/igquast/test/repertoire.rcm --figure-format= -o igquast_test
</code>
</pre>
<h3 id="igquast_output_files">3.6. Output files</h3> <font class="sc">IgQUAST</font> creates output directory
(its name is specified using option <code>-o</code>) and outputs the following files there:
<ul>
<li>
<b>reference_based</b> — Directory with reference-based plots:
<ul>
<li><b>error_position_distribution</b> — distribution of error positions for constructed repertoire sequences reconstructed with only one error.
This plot helps to detect sequencing technology artifacts and repertoire construction strategy artifacts</li>
<li><b>sensitivity_precision</b> — sensitivity and precision depending on cluster size threshold for the constructed repertoire</li>
<li><b>distance_distribution</b> — constructed to reference and reference to constructed distance distribution depending on the cluster size threshold for the constructed repertoire. 8 plots on one figure</li>
<li><b>{constructed_to_reference,reference_to_constructed}_distance_distribution_size_{1,3,5,10}</b> — the same 8 plots separately</li>
<li><b>abundance_distributions</b> — cluster size distribution for the constructed and reference repertoires</li>
<li><b>abundance_distributions_log</b> — the same plot with logarithmic Y-scale</li>
<li><b>cluster_abundances_scatterplot</b> — scatterplot of constructed cluster sizes against reference cluster sizes for ideally reconstructed clusters</li>
<li><b>constructed_purity_distribution</b> — distribution of cluster purity in the constructed repertoire. This plot helps to detect overcorrection</li>
<li><b>constructed_purity_distribution_large</b> — the same plot for large clusters only</li>
<!-- <li><b>constructed_purity_distribution_large_ylog</b> &mdash; the same plot with logarithmic Y-scale</li> -->
<li><b>reference_discordance_distribution</b> — distribution of constructed cluster discordance
(relative contribution of the second popular reference cluster into the particular constructed cluster).
This plot helps to detect overcorrection</li>
<li><b>reference_discordance_distribution_large</b> — the same plot for large clusters only</li>
<!-- <li><b>reference_discordance_distribution_large_ylog</b> &mdash; the same plot with logarithmic Y-scale</li> -->
<li><b>reference_purity_distribution</b> — distribution of cluster purity in the reference repertoire. This plot helps to detect undercorrection</li>
<li><b>reference_purity_distribution_large</b> — the same plot for large clusters only</li>
<!-- <li><b>reference_purity_distribution_large_ylog</b> &mdash; the same plot with logarithmic Y-scale</li> -->
<li><b>constructed_discordance_distribution</b> — distribution of reference cluster discordance
(relative contribution of the second popular constructed cluster into the particular reference cluster).
This plot helps to detect overcorrection</li>
<li><b>constructed_discordance_distribution_large</b> — the same plot for large clusters only</li>
<!-- <li><b>constructed_discordance_distribution_large_ylog</b> &mdash; the same plot with logarithmic Y-scale</li> -->
</ul>
</li>
<li>
<b>reference_free</b> — Directory with reference-free plots:
<ul>
<li><b>constructed_cluster_error_profile</b> — error profile estimation for entire constructed repertoire</li>
<li><b>constructed_cluster_error_profile_largest_{1,2,3,4,5}</b> — error profile estimation for 5 largest constructed clusters separately</li>
<li><b>constructed_cluster_discordance_profile</b> — discordance (second vote letter) profile for entire constructed repertoire</li>
<li><b>constructed_cluster_error_profile_largest_{1,2,3,4,5}</b> — discordance profile for 5 largest constructed clusters</li>
<li><b>constructed_distribution_of_errors_in_reads</b> — distribution of the number of errors in reads for the constructed repertoire clusters</li>
<li><b>constructed_max_error_scatter</b> — scatterplot of max errors by position against cluster size for the constructed repertoire.
This plot helps to detect overcorrection</li>
<li><b>reference_*</b> — the same plots for the reference repertoire</li>
<li>
<b>bad_constructed_clusters</b> — Directory with clusters from the constructed repertoire,
which are identified as overcorrected during reference-free analysis
</li>
<li>
<b>bad_reference_clusters</b> — Directory with clusters from the reference repertoire,
which are identified as overcorrected during reference-free analysis
</li>
</ul>
</li>
<li>
<b>constructed.rcm, reference.rcm, constructed.fa.gz. reference.fa.gz</b> — Reconstructed missing repertoire files
</li>
<li>
<b>igquast.log</b> — full log of the <font class="sc">IgQUAST</font> run
</li>
</ul>
Files for the reports (in text and JSON formats) are specified by the corresponding options.
Some files can be absent depending on provided input. Note that reference-free analysis is disabled by default since it is very time- and memory-consuming.
One should use the option <code>--reference-free</code> to enable it.
<h2 id = "references">4. Citations</h2>
<p align = "justify">
Alexander Shlemov, Sergey Bankevich, Andrey Bzikadze,
Dmitriy M. Chudakov,
Yana Safonova,
and Pavel A. Pevzner.
Reconstructing antibody repertoires from error-prone immunosequencing datasets <i>(submitted)</i>
</p>
<!-- -------- -->
<h2 id="feedback">5. Feedback and bug reports</h2> Your comments, bug
reports, and suggestions are very welcome. They will help us to further
improve <font class="sc">IgQUAST</font>.
<br><br> If you have any trouble running <font class="sc">IgQUAST</font>, please provide us the log file from the output directory.
</body>
</html>