<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Yiwei Niu's Note</title>
<subtitle>to share, to learn</subtitle>
<link href="/blog/atom.xml" rel="self"/>
<link href="https://yiweiniu.github.io/blog/"/>
<updated>2019-07-17T08:42:48.542Z</updated>
<id>https://yiweiniu.github.io/blog/</id>
<author>
<name>Yiwei Niu</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Install/Update R and R packages</title>
<link href="https://yiweiniu.github.io/blog/2019/07/Install-Update-R-and-R-packages/"/>
<id>https://yiweiniu.github.io/blog/2019/07/Install-Update-R-and-R-packages/</id>
<published>2019-07-17T08:38:44.000Z</published>
<updated>2019-07-17T08:42:48.542Z</updated>
<content type="html"><![CDATA[<p>Purpose in short: to ease the pain of installing and updating <code>R</code> and <code>R</code> packages.</p><p><strong>Note</strong>: I mainly work on CentOS and Windows, so I am not familiar with macOS.</p><h2 id="basic-r-configuration"><a class="markdownIt-Anchor" href="#basic-r-configuration"></a> Basic R configuration</h2><p>Before we start talking about installing packages, it is worth doing some basic configuration of the <code>R</code> library. There are several advantages of doing this:</p><ul><li>it helps you understand how <code>R</code> starts up</li><li>it makes it easy to set or change the configuration (saving time)</li><li>it makes it easy to manage your R environment</li></ul><h3 id="basic-r-environment-variables"><a class="markdownIt-Anchor" href="#basic-r-environment-variables"></a> Basic R environment variables</h3><p>There are several basic environment variables for R. <mark>These variables are not the same as variables of the system (like those in <code>.bashrc</code> or <code>.bash_profile</code>)</mark>.</p><p>Variables for important directories for <code>R</code>:</p><ul><li><p><code>HOME</code>, the user’s home directory. 
This can be got from <code>path.expand('~')</code> or <code>Sys.getenv('HOME')</code>.</p> <figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># windows</span></span><br><span class="line">> path.expand(<span class="string">'~'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/Users/NIU/Documents"</span></span><br><span class="line"></span><br><span class="line">> path.expand(<span class="string">'~'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/Users/NIU/Documents"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Linux</span></span><br><span class="line">> Sys.getenv(<span class="string">'HOME'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw"</span></span><br></pre></td></tr></table></figure></li><li><p><code>R_HOME</code>, the directory in which R is installed. 
This can be got from <code>R.home()</code> or <code>Sys.getenv('R_HOME')</code>.</p> <figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># windows</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_HOME'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/PROGRA~1/R/R-35~1.3"</span></span><br><span class="line"></span><br><span class="line">> R.home()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/PROGRA~1/R/R-35~1.3"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Linux</span></span><br><span class="line">> R.home()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw/software/R.3.5.3/lib64/R"</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_HOME'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw/software/R.3.5.3/lib64/R"</span></span><br></pre></td></tr></table></figure></li><li><p>Current working directory. 
This is reported by <code>getwd()</code>.</p> <figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">> getwd()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"D:/test/R"</span></span><br></pre></td></tr></table></figure></li></ul><p><strong>Variables for libraries.</strong></p><ul><li><code>R_LIBS</code>, a colon-separated list of directories</li><li><code>R_LIBS_USER</code>, a colon-separated list of directories</li><li><code>R_LIBS_SITE</code>, a colon-separated list of directories</li></ul><p>By default <code>R_LIBS</code> and <code>R_LIBS_SITE</code> are unset, and <code>R_LIBS_USER</code> is set to directory <code>R/R.version$platform-library/x.y</code> of the home directory (or <code>Library/R/x.y/library</code> for CRAN macOS builds), for R.x.y.z.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># windows</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">""</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS_USER'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/Users/NIU/Documents/R/win-library/3.5"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Linux</span></span><br><span class="line">> Sys.getenv(<span 
class="string">'R_LIBS'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">""</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS_USER'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"~/R/x86_64-pc-linux-gnu-library/3.5"</span></span><br></pre></td></tr></table></figure><h3 id="package-search-paths"><a class="markdownIt-Anchor" href="#package-search-paths"></a> Package search paths</h3><p>The package search paths are the directories where <code>R</code> looks for, installs, and removes packages. They are reported by the <code>.libPaths()</code> function.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># windows</span></span><br><span class="line">> .libPaths()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"C:/Program Files/R/R-3.5.3/library"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Linux</span></span><br><span class="line">> .libPaths()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw/software/R.3.5.3/lib64/R/library"</span></span><br></pre></td></tr></table></figure><p><code>.libPaths()</code> can also be used to set the search paths for packages.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">.libPaths(new)</span><br></pre></td></tr></table></figure><p>If called with argument <code>new</code>, the library search path is set to the existing directories in <code>unique(c(new, .Library.site, .Library))</code> and this is returned. 
If given no argument, a character vector with the currently active library trees is returned.</p><ul><li><code>.Library</code> is a character string giving the location of the default library, the ‘library’ subdirectory of <code>R_HOME</code>.</li><li><code>.Library.site</code> is a (possibly empty) character vector giving the locations of the site libraries, by default the ‘site-library’ subdirectory of <code>R_HOME</code> (which may not exist).</li></ul><p>At startup, the library search path is initialized from the environment variables: first <strong>R_LIBS</strong>, then <strong>R_LIBS_USER</strong> and finally <strong>R_LIBS_SITE</strong>. <mark>Only directories which exist at the time will be included</mark>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Linux</span></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">""</span></span><br><span class="line"></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS_USER'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"~/R/x86_64-pc-linux-gnu-library/3.5"</span></span><br><span 
class="line"></span><br><span class="line">> Sys.getenv(<span class="string">'R_LIBS_SITE'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">""</span></span><br><span class="line"></span><br><span class="line">> .Library</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw/software/R.3.5.3/lib64/R/library"</span></span><br><span class="line"></span><br><span class="line">> .Library.site</span><br><span class="line">character(<span class="number">0</span>)</span><br><span class="line"></span><br><span class="line">> .libPaths()</span><br><span class="line">[<span class="number">1</span>] <span class="string">"/home/niuyw/software/R.3.5.3/lib64/R/library"</span></span><br><span class="line"></span><br><span class="line">> file.exists(<span class="string">'~/R/x86_64-pc-linux-gnu-library/3.5'</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="literal">FALSE</span></span><br></pre></td></tr></table></figure><p>As can be seen, both <code>R_LIBS</code> and <code>R_LIBS_SITE</code> are empty by default. Although the variable <code>R_LIBS_USER</code> was set, its directory was not included in <code>.libPaths()</code> because the directory did not exist.</p><p>Calling <code>.libPaths('')</code> (with an empty string) will remove all entries except the library sub-directory of the distribution.</p><h3 id="r-startup"><a class="markdownIt-Anchor" href="#r-startup"></a> R startup</h3><p><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html" target="_blank" rel="noopener">R: Initialization at Start of an R Session</a> has a clear description of how R starts up.</p><blockquote><p>In <strong>R</strong>, the startup mechanism is as follows.</p><p>Unless <code>--no-environ</code> was given on the command line, <strong>R</strong> searches for site and user files to process for setting environment variables. 
The name of the site file is the one pointed to by the environment variable <code>R_ENVIRON</code>; if this is unset, <code>R_HOME/etc/Renviron.site</code> is used (if it exists, which it does not in a ‘factory-fresh’ installation). The name of the user file can be specified by the <code>R_ENVIRON_USER</code> environment variable; if this is unset, the files searched for are <code>.Renviron</code> in the current or in the user’s home directory (in that order).</p><p>Then <strong>R</strong> searches for the site-wide startup profile file of <strong>R</strong> code unless the command line option <code>--no-site-file</code> was given. The path of this file is taken from the value of the R_PROFILE environment variable (after <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/path.expand.html" target="_blank" rel="noopener">tilde expansion</a>). If this variable is unset, the default is <code>R_HOME/etc/Rprofile.site</code>, which is used if it exists (which it does not in a ‘factory-fresh’ installation). This code is sourced into the base package. Users need to be careful not to unintentionally overwrite objects in base, and it is normally advisable to use <code>local</code> if code needs to be executed: see the examples.</p><p>Then, unless <code>--no-init-file</code> was given, <strong>R</strong> searches for a user profile, a file of <strong>R</strong> code. The path of this file can be specified by the <code>R_PROFILE_USER</code> environment variable (and <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/path.expand.html" target="_blank" rel="noopener">tilde expansion</a> will be performed). If this is unset, a file called <code>.Rprofile</code> is searched for in the current directory or in the user’s home directory (in that order). 
The user profile file is sourced into the workspace.</p><p>Note that when the site and user profile files are sourced only the base package is loaded, so objects in other packages need to be referred to by e.g. <code>utils::dump.frames</code> or after explicitly loading the package concerned.</p><p><strong>R</strong> then loads a saved image of the user workspace from ‘.RData’ in the current directory if there is one (unless --no-restore-data or --no-restore was specified on the command line).</p><p>Next, if a function <code>.First</code> is found on the search path, it is executed as <code>.First()</code>. Finally, function <code>.First.sys()</code> in the base package is run. This calls <code>require</code> to attach the default packages specified by <code>options("defaultPackages")</code>. If the methods package is included, this will have been attached earlier (by function <code>.OptRequireMethods()</code>) so that namespace initializations such as those from the user workspace will proceed correctly.</p><p>A function <code>.First</code> (and <code>.Last</code>) can be defined in appropriate ‘.Rprofile’ or ‘Rprofile.site’ files or have been saved in ‘.RData’. If you want a different set of packages than the default ones when you start, insert a call to <code>options</code> in the <code>.Rprofile</code> or <code>Rprofile.site</code> file. For example, <code>options(defaultPackages = character())</code> will attach no extra packages on startup (only the base package) (or set <code>R_DEFAULT_PACKAGES=NULL</code> as an environment variable before running <strong>R</strong>). 
Using <code>options(defaultPackages = "")</code> or <code>R_DEFAULT_PACKAGES=""</code> enforces the R <em>system</em> default.</p><p>On front-ends which support it, the commands history is read from the file specified by the environment variable <code>R_HISTFILE</code> (default ‘.Rhistory’ in the current directory) unless --no-restore-history or --no-restore was specified.</p><p>The command-line option --vanilla implies --no-site-file, --no-init-file, --no-environ and (except for <code>R CMD</code>) --no-restore</p></blockquote><p>There are two sorts of files used in startup: environment files which contain lists of environment variables to be set, and profile files which contain R code.</p><p>At startup, R will try to read a number of files in a particular order. The contents of these files determine how R behaves in that session.</p><p>Files in three folders are important in this process:</p><ul><li><p><code>R_HOME</code></p><ul><li><code>R_HOME/etc/Renviron.site</code></li><li><code>R_HOME/etc/Rprofile.site</code></li></ul></li><li><p><code>HOME</code></p><ul><li><code>.Renviron</code></li><li><code>.Rprofile</code></li></ul></li><li><p>Current working directory.</p><ul><li><code>.Renviron</code></li><li><code>.Rprofile</code></li></ul></li></ul><p><strong>R only uses one user <code>.Rprofile</code> and one user <code>.Renviron</code> in any session:</strong></p><ul><li>An <code>.Rprofile</code> file in the current working directory takes precedence over the <code>.Rprofile</code> in <code>HOME</code>, which is then not read. The site files in <code>R_HOME/etc</code>, if they exist, are processed before the user file rather than replaced by it.</li><li>The same applies to <code>.Renviron</code></li></ul><h4 id="renviron"><a class="markdownIt-Anchor" href="#renviron"></a> .Renviron</h4><p>The <code>.Renviron</code> file is used to store environment variables for R. 
We can create this file in <code>HOME</code> or in the current working directory.</p><p>A typical use of the <code>.Renviron</code> file is to specify the <code>R_LIBS</code> path:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Linux</span></span></span><br><span class="line">R_LIBS=~/R/library</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Windows</span></span></span><br><span class="line">R_LIBS=C:/R/library</span><br></pre></td></tr></table></figure><p>This variable points to a directory where R packages will be installed. When <code>install.packages</code> is called without a <code>lib</code> argument, new packages are installed into the first directory on the search path, which is the first directory in <code>R_LIBS</code> if that directory exists.</p><h4 id="rprofile"><a class="markdownIt-Anchor" href="#rprofile"></a> .Rprofile</h4><p>The <code>.Rprofile</code> file contains R code that runs each time R starts.</p><p>Use <code>help(Rprofile)</code> in R to read the documentation on these startup settings.</p><p>We can set the CRAN mirror in <code>.Rprofile</code>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## local creates a new, empty environment</span></span><br><span class="line"><span class="comment">## This avoids polluting the global environment with</span></span><br><span class="line"><span class="comment">## the object r</span></span><br><span class="line">local({</span><br><span 
class="line"> r = getOption(<span class="string">"repos"</span>)</span><br><span class="line"> r[<span class="string">"CRAN"</span>] = <span class="string">"https://cran.rstudio.com/"</span></span><br><span class="line"> options(repos = r)</span><br><span class="line">})</span><br></pre></td></tr></table></figure><p>From: <a href="https://csgillespie.github.io/efficientR/3-3-r-startup.html#r-startup" target="_blank" rel="noopener">Efficient R programming - 3.3 R startup</a></p><blockquote><p>The RStudio mirror is a virtual machine run by Amazon’s EC2 service, and it syncs with the main CRAN mirror in Austria once per day. Since RStudio is using Amazon’s CloudFront, the repository is automatically distributed around the world, so no matter where you are in the world, the data does not need to travel very far, and is therefore fast to download.</p></blockquote><h2 id="installupdate-packages"><a class="markdownIt-Anchor" href="#installupdate-packages"></a> Install/update packages</h2><h3 id="from-cran"><a class="markdownIt-Anchor" href="#from-cran"></a> From CRAN</h3><p>Choose mirror before installing packages. 
See the mirrors available: <a href="https://cran.r-project.org/mirrors.html" target="_blank" rel="noopener">https://cran.r-project.org/mirrors.html</a></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">chooseCRANmirror()</span><br></pre></td></tr></table></figure><p>Or specify the mirror when installing a package.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">install.packages(<span class="string">'RMySQL'</span>, repos=<span class="string">'https://mirrors.tuna.tsinghua.edu.cn/CRAN/'</span>)</span><br></pre></td></tr></table></figure><p>Or through <code>BiocManager</code>:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">BiocManager::install(<span class="string">'ggplot2'</span>)</span><br></pre></td></tr></table></figure><h3 id="from-bioconductor"><a class="markdownIt-Anchor" href="#from-bioconductor"></a> From Bioconductor</h3><p>See: <a href="https://www.bioconductor.org/install/" target="_blank" rel="noopener">Bioconductor - install</a></p><p>Choose a mirror. 
See the mirror list here: <a href="https://www.bioconductor.org/about/mirrors/" target="_blank" rel="noopener">https://www.bioconductor.org/about/mirrors/</a></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## Change default Bioconductor mirrors</span></span><br><span class="line">chooseBioCmirror()</span><br></pre></td></tr></table></figure><h4 id="r-350"><a class="markdownIt-Anchor" href="#r-350"></a> R < 3.5.0</h4><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">source</span>(<span class="string">"https://bioconductor.org/biocLite.R"</span>)</span><br><span class="line">biocLite(c(<span class="string">"GenomicFeatures"</span>, <span class="string">"AnnotationDbi"</span>))</span><br><span class="line"></span><br><span class="line"><span class="comment">## install a package from source:</span></span><br><span class="line">biocLite(<span class="string">"IRanges"</span>, type=<span class="string">"source"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment">## install all Bioconductor software packages</span></span><br><span 
class="line">biocLite(all_group())</span><br><span class="line"><span class="comment">## End(Not run)</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## Show the Bioconductor and CRAN repositories that will be used to</span></span><br><span class="line"><span class="comment">## install/update packages.</span></span><br><span class="line">> biocinstallRepos()</span><br><span class="line"> BioCsoft </span><br><span class="line"> <span class="string">"https://bioconductor.org/packages/3.6/bioc"</span> </span><br><span class="line"> BioCann </span><br><span class="line"><span class="string">"https://bioconductor.org/packages/3.6/data/annotation"</span> </span><br><span class="line"> BioCexp </span><br><span class="line"><span class="string">"https://bioconductor.org/packages/3.6/data/experiment"</span> </span><br><span class="line"> CRAN </span><br><span class="line"> <span class="string">"http://cloud.r-project.org"</span></span><br></pre></td></tr></table></figure><p>Or through package <code>BiocInstaller</code>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(BiocInstaller)</span><br><span class="line">biocLite(<span class="string">"DESeq2"</span>)</span><br></pre></td></tr></table></figure><h4 id="r-350-2"><a class="markdownIt-Anchor" href="#r-350-2"></a> R >= 3.5.0</h4><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> (!requireNamespace(<span class="string">"BiocManager"</span>, quietly = <span class="literal">TRUE</span>))</span><br><span class="line"> install.packages(<span class="string">"BiocManager"</span>)</span><br><span 
class="line"></span><br><span class="line">BiocManager::install(<span class="string">"FlowSorted.Blood.EPIC"</span>, version = <span class="string">"3.8"</span>)</span><br></pre></td></tr></table></figure><p>Package <code>BiocManager</code> can also be used to install packages not in Bioconductor.</p><p><code>BiocManager::repositories()</code> returns the Bioconductor and CRAN repositories used by <code>install()</code>.</p><p>See: <a href="https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html" target="_blank" rel="noopener">https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html</a></p><p><strong>Specify a version</strong></p><p>Use the <code>version=</code> argument to update all packages to a specific <em>Bioconductor</em> version</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">BiocManager::install(version=<span class="string">"3.7"</span>)</span><br></pre></td></tr></table></figure><p>A special version, <code>version="devel"</code>, allows use of <em>Bioconductor</em> packages that are under development.</p><h3 id="from-github"><a class="markdownIt-Anchor" href="#from-github"></a> From Github</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># check the devtools package</span></span><br><span class="line"><span class="keyword">if</span> (!requireNamespace(<span class="string">"devtools"</span>, quietly = <span class="literal">TRUE</span>))</span><br><span class="line"> install.packages(<span class="string">"devtools"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># install 
package</span></span><br><span class="line">devtools::install_github(<span class="string">"markgene/maxprobes"</span>)</span><br></pre></td></tr></table></figure><p>Or through <code>BiocManager</code>:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">BiocManager::install(<span class="string">'MadsAlbertsen/ampvis2'</span>)</span><br></pre></td></tr></table></figure><h3 id="from-source"><a class="markdownIt-Anchor" href="#from-source"></a> From source</h3><p>Packages can also be installed from source. Download the package tarball first, then pass the file to <code>install.packages</code>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">install.packages(<span class="string">"M3Drop_3.05.00.tar.gz"</span>, type=<span class="string">"source"</span>)</span><br></pre></td></tr></table></figure><h2 id="update-r"><a class="markdownIt-Anchor" href="#update-r"></a> Update R</h2><p>Reference</p><ul><li><a href="https://stackoverflow.com/questions/1401904/painless-way-to-install-a-new-version-of-r" target="_blank" rel="noopener">stack overflow - Painless way to install a new version of R?</a></li></ul><h3 id="upgrade-packages-after-installing-a-new-r"><a class="markdownIt-Anchor" href="#upgrade-packages-after-installing-a-new-r"></a> Upgrade packages after installing a new R</h3><p>Copy the library to the new library path. 
Then use the following code to update the packages.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">update.packages(checkBuilt=<span class="literal">TRUE</span>, ask=<span class="literal">FALSE</span>)</span><br></pre></td></tr></table></figure><h3 id="upgrade-packages-of-bioconductor"><a class="markdownIt-Anchor" href="#upgrade-packages-of-bioconductor"></a> Upgrade packages of Bioconductor</h3><p>Install packages from a newer version of Bioconductor.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">BiocManager::install(version = <span class="string">'xx'</span>)</span><br></pre></td></tr></table></figure><h2 id="special-tools"><a class="markdownIt-Anchor" href="#special-tools"></a> Special tools</h2><p>Because installing and updating R and R packages can be troublesome, there are tools specialized for this job.</p><h3 id="rvcheck"><a class="markdownIt-Anchor" href="#rvcheck"></a> rvcheck</h3><p><a href="https://github.com/GuangchuangYu/rvcheck" target="_blank" rel="noopener">rvcheck</a>, created by Guangchuang Yu, is a simple and easy-to-use package for checking R and package versions.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># install</span></span><br><span class="line">install.packages(<span 
class="string">"rvcheck"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Or the development version</span></span><br><span class="line"><span class="comment">## install.packages("devtools")</span></span><br><span class="line">devtools::install_github(<span class="string">"GuangchuangYu/rvcheck"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Usage examples</span></span><br><span class="line"><span class="keyword">library</span>(rvcheck)</span><br><span class="line">check_r()</span><br><span class="line">check_bioc(<span class="string">'ggtree'</span>)</span><br><span class="line">check_cran(<span class="string">'emojifont'</span>)</span><br><span class="line">check_github(<span class="string">"guangchuangyu/clusterProfiler"</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Update all!</span></span><br><span class="line">rvcheck::update_all(check_R = <span class="literal">TRUE</span>, which = c(<span class="string">"CRAN"</span>, <span class="string">"BioC"</span>, <span class="string">"github"</span>))</span><br></pre></td></tr></table></figure><h3 id="installr"><a class="markdownIt-Anchor" href="#installr"></a> installr</h3><p><a href="https://github.com/talgalili/installr" target="_blank" rel="noopener">installr</a>, created by Tal Galili, includes functions for installing software from within R (currently, only on Windows OS), with a special focus on R itself.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># install</span></span><br><span class="line">install.packages(<span class="string">'installr'</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Usage examples</span></span><br><span class="line"><span class="comment">## update R</span></span><br><span class="line"><span class="keyword">if</span>(!<span class="keyword">require</span>(<span class="string">"installr"</span>)) install.packages(<span class="string">'installr'</span>)</span><br><span class="line"><span class="keyword">library</span>(<span class="string">"installr"</span>)</span><br><span class="line">updateR() <span class="comment"># this will open dialog boxes to take you through the steps.</span></span><br><span class="line"><span class="comment"># OR use:</span></span><br><span class="line"><span class="comment"># updateR(TRUE) # this will use common defaults and will be the safest/fastest option</span></span><br><span class="line"></span><br><span class="line"><span class="comment">## install a new software</span></span><br><span class="line"><span class="keyword">library</span>(<span class="string">"installr"</span>)</span><br><span class="line">installr() <span class="comment"># user can easily select (via a GUI interface) a software to install.</span></span><br></pre></td></tr></table></figure><p>Further reading: <a href="https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/" target="_blank" rel="noopener">Updating R from R (on Windows) – using the {installr} package</a></p><h2 id="reference"><a class="markdownIt-Anchor" href="#reference"></a> Reference</h2><ul><li><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html" target="_blank" rel="noopener">R: Search Paths for Packages</a></li><li><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html" target="_blank" 
rel="noopener">R: Initialization at Start of an R Session</a></li><li><a href="https://csgillespie.github.io/efficientR/3-3-r-startup.html#r-startup" target="_blank" rel="noopener">Efficient R programming - 3.3 R startup</a></li><li><a href="https://www.r-bloggers.com/fun-with-rprofile-and-customizing-r-startup/" target="_blank" rel="noopener">Rbloggers - Fun with .Rprofile and customizing R startup</a></li><li><a href="https://www.r-bloggers.com/package-paths-in-r/" target="_blank" rel="noopener">Rbloggers - Package Paths in R</a></li><li><a href="https://www.statmethods.net/interface/customizing.html" target="_blank" rel="noopener">Quick-R - Customizing Startup</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180703: create the note.</li><li>20190717: complete the note.</li></ul>]]></content>
<summary type="html">
<p>Purpose in short: to ease the pain when installing/updating <code>R</code> and <code>R</code> packages.</p>
<p><strong>Note</strong>: I m
</summary>
<category term="R" scheme="https://yiweiniu.github.io/blog/categories/R/"/>
<category term="R" scheme="https://yiweiniu.github.io/blog/tags/R/"/>
<category term="install R" scheme="https://yiweiniu.github.io/blog/tags/install-R/"/>
<category term="R package" scheme="https://yiweiniu.github.io/blog/tags/R-package/"/>
</entry>
<entry>
<title>Cancer gene collections</title>
<link href="https://yiweiniu.github.io/blog/2019/06/Cancer-gene-collections/"/>
<id>https://yiweiniu.github.io/blog/2019/06/Cancer-gene-collections/</id>
<published>2019-06-03T11:15:23.000Z</published>
<updated>2019-06-03T11:20:01.000Z</updated>
<content type="html"><![CDATA[<p>Reliable cancer gene collections are useful in cancer research. Here I list several such resources.</p><p>These sets can be divided into two categories: mutation- or data-based, and literature- or knowledge-based.</p><h2 id="mutation-based"><a class="markdownIt-Anchor" href="#mutation-based"></a> Mutation-based</h2><h3 id="cosmic-cancer-gene-census-cgc"><a class="markdownIt-Anchor" href="#cosmic-cancer-gene-census-cgc"></a> COSMIC Cancer Gene Census (CGC)</h3><ul><li>Paper: <a href="https://www.nature.com/articles/s41568-018-0060-1" target="_blank" rel="noopener">The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers</a></li><li>Web page: <a href="https://cancer.sanger.ac.uk/census" target="_blank" rel="noopener">https://cancer.sanger.ac.uk/census</a></li></ul><p>The Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census (CGC) is an expert-curated description of the genes driving human cancer. The following figure shows the curation process.</p><img src="/blog/2019/06/Cancer-gene-collections/20190603155415507_9925.png"><p>It comprises two tiers.</p><ul><li>To classify as Tier 1, a gene must possess a documented and reproducible activity relevant to cancer, along with evidence of mutations in cancer that change the activity of the gene product in a way that promotes oncogenic transformation.</li><li>Included in Tier 2 are genes with mutation patterns typical for oncogenes or TSGs but that have less well-established functional evidence in the scientific literature. 
Similarly, genes with strong published evidence for a function in cancer but unclear mutation patterns or genes known to be dysregulated solely by epigenetic means (for example, by changes to promoter methylation) are also included in Tier 2.</li></ul><img src="/blog/2019/06/Cancer-gene-collections/20190603155627379_31195.png"><p><strong>Comments</strong></p><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 2019</a></p><blockquote><p>The Cancer Gene Census (CGC) uses data from the Catalogue of Somatic Mutations in Cancer (COSMIC) to list known oncogenes and tumor suppressors.</p></blockquote><h3 id="2020-rule"><a class="markdownIt-Anchor" href="#2020-rule"></a> 20/20 rule</h3><ul><li>Paper: <a href="https://science.sciencemag.org/content/339/6127/1546" target="_blank" rel="noopener">Cancer Genome Landscapes</a></li></ul><p>The authors used mutation patterns to classify genes of the COSMIC database into oncogenes and tumor suppressor genes.</p><ul><li>To be classified as an oncogene, we simply require that >20% of the recorded mutations in the gene are at recurrent positions and are missense.</li><li>To be classified as a tumor suppressor gene, we analogously require that >20% of the recorded mutations in the gene are inactivating.</li></ul><h3 id="network-of-cancer-genes-ncg"><a class="markdownIt-Anchor" href="#network-of-cancer-genes-ncg"></a> Network of Cancer Genes (NCG)</h3><ul><li>Paper: <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1612-0" target="_blank" rel="noopener">The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens</a></li><li>Web page: <a href="http://ncg.kcl.ac.uk/" target="_blank" rel="noopener">http://ncg.kcl.ac.uk/</a></li></ul><p>Genes in NCG were collected from 275 publications, including two sources of known cancer genes and 273 cancer sequencing screens of more than 100 cancer types 
from 34,905 cancer donors and multiple primary sites.</p><img src="/blog/2019/06/Cancer-gene-collections/20190602163121333_28026.png"><p><strong>Comments</strong></p><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 2019</a></p><blockquote><p>The Network of Cancer Genes builds upon the CGC and integrates a wide variety of additional contextual data, such as the frequency of mutations.</p></blockquote><h3 id="intogen"><a class="markdownIt-Anchor" href="#intogen"></a> IntOGen</h3><ul><li>Paper: <a href="https://www.nature.com/articles/nmeth.2642" target="_blank" rel="noopener">IntOGen-mutations identifies cancer drivers across tumor types</a></li><li>Web page: <a href="https://www.intogen.org/search" target="_blank" rel="noopener">https://www.intogen.org/search</a></li></ul><p>IntOGen-mutations is a Web platform used to identify cancer drivers across tumor types and to present the results of the systematic analysis of most currently available large data sets of tumor somatic mutations.</p><p><strong>Comments</strong></p><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 
2019</a></p><blockquote><p>IntOGen uses data from large-scale sequencing projects (for example, the Cancer Genome Atlas (TCGA)) to collate the importance of cancer genes.</p></blockquote><h3 id="candidate-cancer-gene-database-ccgd"><a class="markdownIt-Anchor" href="#candidate-cancer-gene-database-ccgd"></a> Candidate Cancer Gene Database (CCGD)</h3><ul><li>Paper: <a href="https://academic.oup.com/nar/article/43/D1/D844/2439469" target="_blank" rel="noopener">The Candidate Cancer Gene Database: a database of cancer driver genes from forward genetic screens in mice</a></li><li>Web page: <a href="http://ccgd-starrlab.oit.umn.edu/about.php" target="_blank" rel="noopener">http://ccgd-starrlab.oit.umn.edu/about.php</a></li></ul><p>The Candidate Cancer Gene Database (CCGD) was developed to disseminate the results of transposon-based forward genetic screens in mice that identify candidate cancer genes. The purpose of the database is to allow cancer researchers to quickly determine whether or not a gene, or list of genes, has been identified as a potential cancer driver in a forward genetic screen in mice.</p><h2 id="literature-based"><a class="markdownIt-Anchor" href="#literature-based"></a> Literature-based</h2><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 
2019</a></p><blockquote><p>All manually curated databases face the overwhelming curation burden of expert curator time and costs necessary to stay up-to-date.</p></blockquote><h3 id="cancermine"><a class="markdownIt-Anchor" href="#cancermine"></a> CancerMine</h3><ul><li>Paper: <a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer</a></li><li>Web page: <a href="http://bionlp.bcgsc.ca/cancermine/" target="_blank" rel="noopener">http://bionlp.bcgsc.ca/cancermine/</a></li><li>Github: <a href="https://github.com/jakelever/cancermine" target="_blank" rel="noopener">https://github.com/jakelever/cancermine</a></li></ul><p>CancerMine is a literature-mined database of drivers, oncogenes and tumor suppressors in cancer. The authors first manually annotated the drivers, oncogenes and tumor suppressors discussed in 1,500 sentences as training data. Then they trained a logistic regression classifier on word frequencies and semantic features. To lower the number of false positives, a high classification threshold was used, resulting in relatively high precision and low recall (average precision of 85.6% and recall of 29.4% across the three gene role types).</p><p>CancerMine contains substantially more cancer gene associations than other resources but has poor overlap with CGC and IntOGen. 
The authors explained this as “gene associations in the CGC and IntOGen are not mentioned in the literature”.</p><img src="/blog/2019/06/Cancer-gene-collections/20190602161846581_1735.png"><h3 id="ongene"><a class="markdownIt-Anchor" href="#ongene"></a> ONGene</h3><ul><li>Paper: <a href="https://www.sciencedirect.com/science/article/pii/S1673852716302053" target="_blank" rel="noopener">ONGene: A literature-based database for human oncogenes</a></li><li>Web page: <a href="http://ongene.bioinfo-minzhao.org" target="_blank" rel="noopener">http://ongene.bioinfo-minzhao.org</a></li></ul><p>ONGene is a database for oncogenes. The authors manually curated abstracts from PubMed (Dec 25th, 2015) and collected 803 human oncogenes (698 protein-coding genes and 105 non-coding genes.)</p><p>This database is simple and has high utility.</p><p><strong>Comments</strong></p><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 2019</a></p><blockquote><p>ONGene and TSGene list oncogenes and tumor suppressors but do not associate them with specific cancer types.</p></blockquote><h3 id="tsgene"><a class="markdownIt-Anchor" href="#tsgene"></a> TSGene</h3><ul><li>Paper: <a href="https://academic.oup.com/nar/article/44/D1/D1023/2503080" target="_blank" rel="noopener">TSGene 2.0: an updated literature-based knowledgebase for tumor suppressor genes</a></li><li>Web page: <a href="http://bioinfo.mc.vanderbilt.edu/TSGene/" target="_blank" rel="noopener">http://bioinfo.mc.vanderbilt.edu/TSGene/</a></li></ul><p>TSGene is a database for tumor suppressor genes. The authors manually curated abstracts from PubMed (25 April 2015) and collected 1217 human TSGs (1018 protein-coding genes and 199 non-coding genes.)</p><p><strong>Comments</strong></p><p><a href="https://www.nature.com/articles/s41592-019-0422-y" target="_blank" rel="noopener">Lever et al. 
2019</a></p><blockquote><p>ONGene and TSGene list oncogenes and tumor suppressors but do not associate them with specific cancer types.</p></blockquote><h3 id="clinical-interpretation-of-variants-in-cancer-civic"><a class="markdownIt-Anchor" href="#clinical-interpretation-of-variants-in-cancer-civic"></a> Clinical Interpretation of Variants in Cancer (CIViC)</h3><ul><li>Paper: <a href="https://www.nature.com/articles/ng.3774" target="_blank" rel="noopener">CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer</a></li><li>Web page: <a href="https://civicdb.org/home" target="_blank" rel="noopener">https://civicdb.org/home</a></li></ul><p>CIViC is an expert-crowdsourced knowledgebase for Clinical Interpretation of Variants in Cancer describing the therapeutic, prognostic, diagnostic and predisposing relevance of inherited and somatic variants of all types.</p><h2 id="related-resources"><a class="markdownIt-Anchor" href="#related-resources"></a> Related resources</h2><ul><li><a href="https://www.biostars.org/p/15890/" target="_blank" rel="noopener">Biostars - Database Of Tumor Suppressors And/Or Oncogenes</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20190602: create the note.</li></ul>]]></content>
<summary type="html">
<p>Reliable cancer gene collections are useful in cancer research. Here I list several such resources.</p>
<p>These sets can be divided into
</summary>
<category term="cancer" scheme="https://yiweiniu.github.io/blog/categories/cancer/"/>
<category term="cancer oncogene tumor suppressor" scheme="https://yiweiniu.github.io/blog/tags/cancer-oncogene-tumor-suppressor/"/>
</entry>
<entry>
<title>ATAC-seq data analysis: from FASTQ to peaks</title>
<link href="https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/"/>
<id>https://yiweiniu.github.io/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/</id>
<published>2019-03-20T10:15:57.000Z</published>
<updated>2019-08-29T01:36:31.417Z</updated>
<content type="html"><![CDATA[<p>The content was compiled from multiple resources on the internet (forums, papers, workshops, etc.). I could not indicate all the sources, but I want to thank them for sharing experiences/knowledge/code.</p><h2 id="atac-seq-overview"><a class="markdownIt-Anchor" href="#atac-seq-overview"></a> ATAC-seq overview</h2><p>ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is a method for determining chromatin accessibility across the genome. It utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions.</p><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190320181006767_5393.png"><p>ATAC-seq overview (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/" target="_blank" rel="noopener">Buenrostro <em>et al.</em>, 2015</a>).</p><p>The peaks look like the following figure.</p><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190320180433696_25889.png"><p>ATAC-seq peaks <a href="https://doi.org/10.1186/1756-8935-7-33" target="_blank" rel="noopener">(Tsompana and Buck, 2014)</a></p><p>ATAC-seq can be used to:</p><ul><li>generate epigenomic profiles</li><li>map accessible chromatin across tissues or conditions</li><li>retrieve nucleosome positions</li><li>identify important transcription factors</li><li>generate occupancy profiles of TFs (footprinting)</li></ul><h2 id="experimental-design"><a class="markdownIt-Anchor" href="#experimental-design"></a> Experimental design</h2><p>See <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/" target="_blank" rel="noopener">Buenrostro <em>et al.</em>, 2015</a>, <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a>, and <a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a> for 
details.</p><ul><li>two or more biological replicates</li><li>each replicate has 25 million non-duplicate, non-mitochondrial aligned reads for single-end sequencing and 50 million for paired-end sequencing</li><li>typically, no need for “input”</li><li>use as few PCR cycles as possible when constructing the library</li><li>paired-end sequencing is preferred</li></ul><h2 id="data-analysis"><a class="markdownIt-Anchor" href="#data-analysis"></a> Data analysis</h2><p>Several useful pipelines are listed below.</p><ul><li><a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a> – clear and up-to-date.</li><li><a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a></li><li><a href="https://github.com/tobiasrausch/ATACseq" target="_blank" rel="noopener">Tobias Rausch - ATAC-seq analysis pipeline</a></li><li><a href="https://github.com/ENCODE-DCC/atac-seq-pipeline" target="_blank" rel="noopener">ENCODE ATAC-seq pipeline</a></li><li><a href="https://github.com/ParkerLab/bioinf525" target="_blank" rel="noopener">Parker Lab - ATAC-seq lab for BIOINF525</a></li><li><a href="https://github.com/ay-lab/ATACProc" target="_blank" rel="noopener">Ferhat Ay Lab - ATAC-seq processing pipeline</a></li><li><a href="https://rockefelleruniversity.github.io/RU_ATACseq/" target="_blank" rel="noopener">Rockefeller University, ATACseq in R</a></li><li><a href="http://qiubio.com/new/book/chapter-06/#%E7%AC%AC%E4%BA%94%E7%AB%A0-atac-seq%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90chapter-5-atac-seq-data-analysis" target="_blank" rel="noopener">Introductory R tutorial for bioinformatics students - Chapter 5: ATAC-seq data analysis</a> – in R (written in Chinese).</li></ul><p>The following pipeline includes several common analyses in the ATAC-seq setting, from read trimming to peak calling. 
Some steps are optional, like merging BAMs.</p><h3 id="quality-control"><a class="markdownIt-Anchor" href="#quality-control"></a> Quality control</h3><p>Just like analyzing other NGS data, quality control is needed for raw <code>FASTQ</code>, and there are many programs available, such as <a href="http://www.usadellab.org/cms/?page=trimmomatic" target="_blank" rel="noopener">Trimmomatic</a> and <a href="https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/" target="_blank" rel="noopener">Trim Galore</a>.</p><h3 id="alignment-and-filter"><a class="markdownIt-Anchor" href="#alignment-and-filter"></a> Alignment and filter</h3><p>The next step is to align reads to a reference genome. Two popular aligners are <a href="http://bio-bwa.sourceforge.net/bwa.shtml" target="_blank" rel="noopener">BWA</a> and <a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml" target="_blank" rel="noopener">Bowtie2</a>. I will use <code>Bowtie2</code> (since it was used in many tutorials and papers).</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> better alignment results are frequently achieved with --very-sensitive</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> use -X 2000 to allow larger fragment size (default is 500)</span></span><br><span class="line">bowtie2 --very-sensitive -X 2000 -x $Bowtie2Index -1 ${sample}*_1.fq.gz -2 ${sample}*_2.fq.gz \</span><br><span class="line"> -p $PPN 2> ${sample}.bowtie2.log | $path2samtools sort -@ $PPN -O bam -o ${sample}.sorted.bam</span><br><span class="line"><span class="meta">$</span><span class="bash">path2samtools index -@ <span class="variable">$PPN</span> <span class="variable">$WORKDIR</span>/bowtie2/<span 
class="variable">${sample}</span>.sorted.bam</span></span><br></pre></td></tr></table></figure><p>Then, the alignment results should be filtered.</p><h4 id="mitochondrial-reads"><a class="markdownIt-Anchor" href="#mitochondrial-reads"></a> Mitochondrial reads</h4><p>Ref: <a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a></p><blockquote><p>Since there are no ATAC-seq peaks of interest in the mitochondrial genome, these reads will only complicate the subsequent steps. Therefore, we recommend that they be removed from further analysis, via one of the following methods:</p><ol><li>Remove the mitochondrial genome from the reference genome before aligning the reads. In human/mouse genome builds, the mitochondrial genome is labeled ‘chrM’. That sequence can be deleted from the reference prior to building the genome indexes. The downside of this approach is that the alignment numbers will look much worse; all of the mitochondrial reads will count as unaligned.</li><li>Remove the mitochondrial reads after alignment. A python script, creatively named removeChrom, is available in the ATAC-seq module to accomplish this.</li></ol></blockquote><p>Since the percentage of mtDNA reads is an indicator of library quality, we usually remove mitochondrial reads after alignment. 
This can be done with <code>samtools</code> and <code>grep</code> as follows:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">samtools view -@ $PPN -h ${sample}.bam | grep -v chrM | samtools sort -@ $PPN -O bam -o ${sample}.rmChrM.bam</span><br></pre></td></tr></table></figure><h4 id="pcr-duplicates"><a class="markdownIt-Anchor" href="#pcr-duplicates"></a> PCR duplicates</h4><p>Ref: <a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a></p><blockquote><p>PCR duplicates are exact copies of DNA fragments that arise during PCR. Since they are artifacts of the library preparation procedure, they may interfere with the biological signal of interest. Therefore, they should be removed as part of the analysis pipeline.</p><p>One commonly used program for removing PCR duplicates is Picard’s <code>MarkDuplicates</code>.</p></blockquote><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash">{path2java} -XX:ParallelGCThreads=<span class="variable">${PPN}</span> -Djava.io.tmpdir=/tmp -jar <span class="variable">${path2picard}</span> MarkDuplicates \</span></span><br><span class="line"> QUIET=true INPUT=${sample}.bam OUTPUT=${sample}.marked.bam METRICS_FILE=${sample}.sorted.metrics \</span><br><span class="line"> REMOVE_DUPLICATES=false CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp</span><br><span class="line"><span class="meta">#</span><span class="bash"> REMOVE_DUPLICATES=<span class="literal">false</span>: mark duplicate reads without removing them.</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> Change it to <span class="literal">true</span> to 
remove duplicate reads.</span></span><br></pre></td></tr></table></figure><p><code>MarkDuplicates</code> sets the FLAG bit <code>1024</code> on duplicate reads, which can then be removed from the marked BAM using <code>samtools</code>:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">samtools view -h -b -F 1024 ${sample}.marked.bam > ${sample}.rmDup.bam</span><br></pre></td></tr></table></figure><h4 id="non-unique-alignments"><a class="markdownIt-Anchor" href="#non-unique-alignments"></a> Non-unique alignments</h4><p>Ref: <a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a></p><blockquote><p>Some researchers choose to remove non-uniquely aligned reads, using the <code>-q</code> parameter of <code>samtools view</code>.</p></blockquote><p>Different genome aligners implement mapping quality (MAPQ) differently. See <a href="https://www.acgt.me/?offset=1426809676847" target="_blank" rel="noopener">More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)</a>. So, when using MAPQ to filter non-unique alignments, check how the aligner you are using assigns MAPQ values.</p><p>For Bowtie2, people usually keep reads with MAPQ ≥ 30.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Remove multi-mapped reads (i.e. 
those with MAPQ < 30, using -q <span class="keyword">in</span> SAMtools)</span></span><br><span class="line">samtools view -h -b -q 30 ${sample}.bam > ${sample}.rmMulti.bam</span><br></pre></td></tr></table></figure><h4 id="others"><a class="markdownIt-Anchor" href="#others"></a> Others</h4><p>In the ENCODE pipeline and in some papers, the following reads were also removed (samtools flag 1796 or 1804).</p><ul><li>reads unmapped</li><li>mate unmapped (flag 1804 only)</li><li>not primary alignment</li><li>reads failing platform/vendor quality checks</li><li>duplicates</li></ul><p>The remaining reads are so-called “properly mapped reads”.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Remove reads unmapped, mate unmapped, not primary alignment, reads failing platform, duplicates (-F 1804)</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> Retain properly paired reads -f 2</span></span><br><span class="line">samtools view -h -b -F 1804 -f 2 ${sample}.bam > ${sample}.filtered.bam</span><br></pre></td></tr></table></figure><h4 id="merging-bams-optional"><a class="markdownIt-Anchor" href="#merging-bams-optional"></a> Merging BAMs (optional)</h4><p>When several libraries were constructed for one experimental condition (i.e. one experimental condition included several biological and/or technical replicates), one may want to merge different <code>BAM</code> files before calling peaks (e.g. merge BAM files from technical replicates, or merge BAM files to get greater read depth).</p><p>Considerations about “merge BAMs” or “merge peaks” have been discussed:</p><ul><li><p><a href="https://www.biostars.org/p/112778/" target="_blank" rel="noopener">ChIP-Seq: Calling peaks with replicates</a></p><blockquote><p>First, are these A) technical or B) biological replicates? 
That is, the same biological sample run several times with the same antibody (same lot also if polyclonal) protocol, or different biological samples run the same way with the same protocol?</p><p>If it is A it may be reasonable to merge them for some analyses, such as just annotating peaks. I would merge the bam alignment files and then do the calls versus merging the calls.</p><p>However, first you have analyze your replicates to check they they all perform the same. We did a lot of performance comparisons here: <a href="https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-016-0100-6" target="_blank" rel="noopener">https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-016-0100-6</a></p><p>You can steal those ideas, especially using the ENCODE segmentation tracks if it’s human and they have tracks for something like your cell type. But just counting the reads in bins and then doing a correlation is pretty informative.</p><p>But even in our data, and we used a robot and do it a lot, one of our technical replicates behaved strangely. See supplemental figure S6.</p><p>If it is B, biological replicates, you almost certainly don’t want to merge them. You will lose your information about biological variance is present. If you are looking at something like differential peaks between conditions DESeq and really all reputable programs will want some sort of replicates, almost always biological. 
In general, if you want to compute a p value on anything you need separate replicates (not merged).</p><p>If you are just annotating peaks you don’t need a p value.</p></blockquote></li><li><p><a href="https://www.biostars.org/p/191474/" target="_blank" rel="noopener">how to pool together biological replicates?</a></p></li><li><p><a href="https://www.biostars.org/p/210564/" target="_blank" rel="noopener">ChipSeq: merge bam file before peak calling</a></p></li><li><p><a href="https://www.biostars.org/p/230055/" target="_blank" rel="noopener">Chip-Seq merging peak files</a></p></li></ul><p>And merging BAM can be done using <code>samtools</code>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">samtools merge -@ $PPN condition1.merged.bam sample1.bam sample2.bam sample3.bam</span><br><span class="line">samtools index -@ $PPN condition1.merged.bam</span><br></pre></td></tr></table></figure><p>Also, <code>multiBamSummary</code> in <code>deepTools</code> can be used to check the correlations between <code>BAM</code> files before merging.</p><h3 id="shifting-reads"><a class="markdownIt-Anchor" href="#shifting-reads"></a> Shifting reads</h3><p>In the first ATAC-seq paper (<a href="https://www.nature.com/articles/nmeth.2688" target="_blank" rel="noopener">Buenrostro et al., 2013</a>), all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp, since Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp <a href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-12-r119" target="_blank" rel="noopener">(Adey et al., 2010)</a>.</p><p>Ref: <a href="https://github.com/GreenleafLab/NucleoATAC/issues/58" target="_blank" rel="noopener">shifting reads bam for NucleoATAC?</a></p><blockquote><p>However, for peak calling, shifting of reads is 
not likely very important, as it is a pretty minor adjustment and peaks are 100s of basepairs. The shifting is only crucial when doing things where the exact position of the insertion matters at single base resolution, e.g. TF motif footprinting.</p></blockquote><p>Also, remember that not all TF footprinting tools need shifted reads. Some of them may do this internally, e.g. <code>NucleoATAC</code>.</p><p>But <strong>how do we adjust the read alignments?</strong></p><p>First, we could do this using <code>bedtools</code> and <code>awk</code>.</p><p>Ref: <a href="https://www.biostars.org/p/187204/#187206" target="_blank" rel="noopener">Shifting reads for ATAC-seq alignments</a></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> the BAM file should be sorted by <span class="built_in">read</span> name beforehand</span></span><br><span class="line">samtools sort -n -T aln.sorted -o aln.sorted.bam aln.bam</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> The bedtools <span class="built_in">command</span> should extract the paired-end alignments as bedpe format, <span class="keyword">then</span> the awk <span class="built_in">command</span> should <span class="built_in">shift</span> the fragments as needed</span></span><br><span class="line">bedtools bamtobed -i aln.sorted.bam -bedpe | awk -v OFS="\t" '{if ($9=="+") {print $1,$2+4,$6+4} \</span><br><span class="line"> else if ($9=="-") {print $1,$2-5,$6-5}}' > fragments.bed</span><br></pre></td></tr></table></figure>
<p>Or, we could do this using <a href="https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html" target="_blank" rel="noopener">alignmentSieve</a>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> use --ATACshift</span></span><br><span class="line">alignmentSieve --numberOfProcessors 8 --ATACshift --bam sample1.bam -o sample1.tmp.bam</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> the bam file needs to be sorted again</span></span><br><span class="line">samtools sort -@ 8 -O bam -o sample1.shifted.bam sample1.tmp.bam</span><br><span class="line">samtools index -@ 8 sample1.shifted.bam</span><br><span class="line">rm sample1.tmp.bam</span><br></pre></td></tr></table></figure><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190318163447907_19139.png"><p>We could also do this in <code>R</code> using <a href="https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html" target="_blank" rel="noopener">ATACseqQC</a>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">## load the library</span></span><br><span class="line"><span class="keyword">library</span>(ATACseqQC)</span><br><span class="line"></span><br><span class="line"><span class="comment">## input is bamFile</span></span><br><span class="line">bamfile <- system.file(<span class="string">"extdata"</span>, <span class="string">"GL1.bam"</span>, package=<span class="string">"ATACseqQC"</span>, mustWork=<span class="literal">TRUE</span>)</span><br><span class="line">bamfile.labels <- gsub(<span class="string">".bam"</span>, <span class="string">""</span>, basename(bamfile))</span><br><span class="line"></span><br><span class="line"><span class="comment">## bamfile tags</span></span><br><span class="line">tags <- c(<span class="string">"AS"</span>, <span class="string">"XN"</span>, <span class="string">"XM"</span>, <span class="string">"XO"</span>, <span class="string">"XG"</span>, <span class="string">"NM"</span>, <span class="string">"MD"</span>, <span class="string">"YS"</span>, <span class="string">"YT"</span>)</span><br><span class="line"><span class="comment">## files will be output into outPath</span></span><br><span class="line">outPath <- <span class="string">"splited"</span></span><br><span class="line">dir.create(outPath)</span><br><span class="line"><span class="comment">## shift the bam file by the 5'ends</span></span><br><span class="line"><span class="keyword">library</span>(BSgenome.Hsapiens.UCSC.hg19)</span><br><span class="line">seqlev <- <span class="string">"chr1"</span> <span class="comment">## subsample data for quick run</span></span><br><span class="line">which <- as(seqinfo(Hsapiens)[seqlev], <span class="string">"GRanges"</span>)</span><br><span class="line">gal <- readBamFile(bamfile, tag=tags, which=which, asMates=<span 
class="literal">TRUE</span>)</span><br><span class="line">gal1 <- shiftGAlignmentsList(gal)</span><br><span class="line">shiftedBamfile <- file.path(outPath, <span class="string">"shifted.bam"</span>)</span><br><span class="line">export(gal1, shiftedBamfile)</span><br></pre></td></tr></table></figure><h3 id="peak-calling-using-macs2"><a class="markdownIt-Anchor" href="#peak-calling-using-macs2"></a> Peak calling using MACS2</h3><p>Ref: <a href="https://www.biostars.org/p/265061/#265063" target="_blank" rel="noopener">Biostars - ATACseq with STAR and macs2</a></p><blockquote><p>You typically use the <code>--nomodel</code> option, as the shifting model of MACS does not really make sense for open chromation data. As you have probably paired-end data, also use the <code>-f BAMPE</code> option. It forces MACS to pileup the real fragment length instead of an estimate, which maked sense imho, due to the quiet different fragment sizes that the library prep creates, also if you have paired-end why not make full use of it. Check reproducibility of the peaks between replicate samples, then rerun MACS with the merged bam file and feed the count matrix into DESeq2. Reference e.g. Corces et al 2016 Nat Genetics.</p></blockquote><p>Ref: <a href="https://github.com/taoliu/MACS/issues/145" target="_blank" rel="noopener">ATAC-seq settings · Issue #145 · taoliu/MACS</a></p><blockquote><p>Liu Tao: If you followed original protocol for ATAC-Seq, you should get Paired-End reads. If so, I would suggest you just use <code>--format BAMPE</code> to let MACS2 pileup the whole fragments in general. But if you want to focus on looking for where the ‘cutting sites’ are, then <code>--nomodel --shift -100 --extsize 200</code> should work.</p></blockquote><p>Since paired-end sequencing is commonly used in ATAC-seq, we tell <code>MACS2</code> that the data are paired-end using the <code>-f BAMPE</code> argument.
This way, <code>MACS2</code> only analyzes properly mapped reads (the BAM file was already filtered above). The fragments are defined by the paired alignments, and there is no modeling or artificial extension.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> -f BAMPE, use paired-end information</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> --keep-dup all, keep all duplicate reads.</span></span><br><span class="line">macs2 callpeak -f BAMPE -g hs --keep-dup all --cutoff-analysis -n sample1 \</span><br><span class="line"> -t sample1.shifted.bam --outdir macs2/sample1 2> macs2.log</span><br></pre></td></tr></table></figure><h3 id="creating-browser-tracks"><a class="markdownIt-Anchor" href="#creating-browser-tracks"></a> Creating browser tracks</h3><p>If the <code>-B</code> parameter is used when running <code>macs2 callpeak</code>, you get bedGraph files together with the narrowPeak files. Some people use these bedGraph files to create browser tracks (e.g. <a href="https://github.com/ParkerLab/bioinf545" target="_blank" rel="noopener">ParkerLab - ATAC-seq lab for BIOINF545</a>), while others say they look kind of weird (<a href="https://www.biostars.org/p/325946/#325951" target="_blank" rel="noopener">Biostars - How to compare bigwig tracks of two ATAC libraries</a>). I do not know yet.</p><p>I prefer to create bigWig files for visualization using <code>bamCoverage</code> in <code>deepTools</code>.
It provides several different ways to normalize the signal.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> bam to bigwig, normalize using 1x effective genome size</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> effective genome size: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html</span></span><br><span class="line">bamCoverage --numberOfProcessors 8 --binSize 10 --normalizeUsing RPGC \</span><br><span class="line"> --effectiveGenomeSize $effect_genome_size --bam sample1.shifted.bam -o sample1.shifted.bw</span><br></pre></td></tr></table></figure><h3 id="quality-check"><a class="markdownIt-Anchor" href="#quality-check"></a> Quality check</h3><p>While processing ATAC-seq data, we obtain several metrics for checking the quality of the data and libraries.</p><p>Ref:</p><ul><li><a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">ENCODE - Terms and Definitions</a></li><li><a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a></li></ul><p>Here are the standards used by ENCODE; each term is explained in detail below.</p><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190111215229195_23911.png" title="ENCODE ATAC-seq standards (accessed: 20190111)"><h4 id="fragmentinsert-size"><a class="markdownIt-Anchor" href="#fragmentinsert-size"></a> Fragment/Insert size</h4><p>Ref: <a href="https://dbrg77.wordpress.com/2017/02/10/atac-seq-insert-size-plotting/" target="_blank" rel="noopener">Not A Rocket Scientist - ATAC-seq insert size plotting</a></p><blockquote><p>One common QC for the data
is to plot the fragment size density of your libraries. The successful construction of a ATAC library requires a proper pair of Tn5 transposase cutting events at the ends of DNA. In the nucleosome-free open chromatin regions, many molecules of Tn5 can kick in and chop the DNA into small pieces; around nucleosome-occupied regions, Tn5 can only access the linker regions. Therefore, in a normal ATAC-seq library, you should expect to see a sharp peak at the <100 bp region (open chromatin), and a peak at ~200bp region (mono-nucleosome), and other larger peaks (multi-nucleosomes). Examples from one of my data:</p><p>Regular scale:</p><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190103170441947_28541.png"><p>Log scale:</p><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190103170526610_9297.png"><p>This clear nucleosome phasing pattern indicates a good quality of the experiment.</p><p>Different people probably have different way of plotting this using different codes, but making this plot can be simply achieved by a combination of one line of code and Excel. 
You don’t need some complicated scripts to do it at all.</p><p><code>samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt</code></p></blockquote><p>See discussions here: <a href="https://www.biostars.org/p/332440/" target="_blank" rel="noopener">Biostars - ATAC-seq fragment length distribution</a></p><h4 id="mitochondrial-reads-2"><a class="markdownIt-Anchor" href="#mitochondrial-reads-2"></a> %mitochondrial reads</h4><p>A high fraction of mitochondrial reads used to be a well-known problem, but the more recent Omni-ATAC protocol largely addresses it (<a href="https://www.nature.com/articles/nmeth.4396" target="_blank" rel="noopener">Corces et al., 2017</a>).</p><p>I guess in the future people will not care about this any more, but I keep this section just in case.</p><p>Ref: <a href="https://github.com/harvardinformatics/ATAC-seq" target="_blank" rel="noopener">Harvard FAS Informatics - ATAC-seq Guidelines</a></p><blockquote><p>It is a well-known problem that ATAC-seq datasets usually contain a large percentage of reads that is derived from mitochondrial DNA (for example, see <a href="http://seqanswers.com/forums/showthread.php?t=35318" target="_blank" rel="noopener">this discussion</a>). Some have gone as far as <a href="https://www.nature.com/articles/s41598-017-02547-w" target="_blank" rel="noopener">using CRISPR to reduce mitochondrial contamination</a>.
The recently published <a href="https://www.nature.com/articles/nmeth.4396" target="_blank" rel="noopener">Omni-ATAC method</a> uses detergents to remove mitochondria and is likely to be more accessible for most researchers (but, <a href="https://www.biorxiv.org/content/early/2018/12/17/496521" target="_blank" rel="noopener">do <strong>not</strong> follow their computational workflow</a>).</p></blockquote><p>The following code can be used to compute the percentage of mitochondrial reads.</p><p>Ref: <a href="https://www.biostars.org/p/170294/#196173" target="_blank" rel="noopener">Biostars - ATACseq alignment issues</a></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/bash</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">### Calculate percentage of reads mapped to mitochondrial genome (mtDNA) using SAMtools idxstats</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">### Can be useful for ATAC-seq data. 
Requires an indexed BAM file:</span></span></span><br><span class="line"></span><br><span class="line">if [[ $# -eq 0 ]] ; then</span><br><span class="line"> echo '[ERROR]: No input file given!'</span><br><span class="line"> exit 1</span><br><span class="line">fi</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Check if index is present. If not, create it:</span></span></span><br><span class="line">if [[ ! -e ${1}.bai ]];</span><br><span class="line"> then</span><br><span class="line"> echo '[INFO]: File does not seem to be indexed. Indexing now:'</span><br><span class="line"> samtools index $1</span><br><span class="line"> fi</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Calculate %mtDNA:</span></span></span><br><span class="line">mtReads=$(samtools idxstats $1 | grep 'chrM' | cut -f 3)</span><br><span class="line">totalReads=$(samtools idxstats $1 | awk '{SUM += $3} END {print SUM}')</span><br><span class="line"></span><br><span class="line">echo '==> mtDNA Content:' $(bc <<< "scale=2;100*$mtReads/$totalReads")'%'</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Usage: ./script.sh atacseq.bam</span></span></span><br></pre></td></tr></table></figure><h4 id="library-complexity"><a class="markdownIt-Anchor" href="#library-complexity"></a> Library complexity</h4><p>Ref: <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a></p><blockquote><p>Library complexity is measured using the <a href="https://www.encodeproject.org/data-standards/terms/#library" target="_blank" rel="noopener">Non-Redundant Fraction (NRF) and PCR Bottlenecking Coefficients 1 and 2, or PBC1 and PBC2</a>.
The preferred values are as follows: NRF>0.9, PBC1>0.9, and PBC2>3.</p></blockquote><img src="/blog/2019/03/ATAC-seq-data-analysis-from-FASTQ-to-peaks/20190318163604339_31625.png"><p>The following code was from <a href="https://github.com/ENCODE-DCC/atac-seq-pipeline" target="_blank" rel="noopener">ENCODE ATAC-seq pipeline</a>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> the bam used here is sorted bam after duplicates marking</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> sort bam by names</span></span><br><span class="line">samtools sort -@ 10 -n -O BAM -o tmp.bam ${sample}.bam</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> calculate PBC metrics</span></span><br><span class="line">bedtools bamtobed -bedpe -i tmp.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' \</span><br><span class="line"> | grep -v 'chrM' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0}($1==1){m1=m1+1} \</span><br><span class="line"><span class="meta"> ($</span><span class="bash">1==2){m2=m2+1} {m0=m0+1} {mt=mt+<span class="variable">$1</span>} \</span></span><br><span class="line"> END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n", mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${sample}.pbc.qc</span><br><span class="line">rm tmp.bam</span><br></pre></td></tr></table></figure><h4 id="fraction-of-reads-in-peaks-frip"><a class="markdownIt-Anchor" href="#fraction-of-reads-in-peaks-frip"></a> Fraction of reads in peaks (FRiP)</h4><p>Ref: <a 
href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">ENCODE - Terms and Definitions</a></p><blockquote><p>Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)</p></blockquote><p>Ref: <a href="https://www.biostars.org/p/337872/#356888" target="_blank" rel="noopener">Biostars - FRIP score ATAC-seq</a></p><blockquote><p>In paired-end sequencing, we use the word fragment because the two reads that are produced always originate from the same DNA fragment and are therefore not independent of each other as reads from single-end sequencing would be. As FRiP comes from single-end ChIP-seq data, this is why they probably termed it reads. ATAC-seq is most commonly paired-end. You can use BEDtools for paired-end data but it requires more pre-processing of your data, that is why I use featureCounts, being faster and more convinient with plenty of customizable options. Choice is still yours. FRiP is probably not a very objective measure anyway, as it highly depends on how you prefilter your data, e.g. in terms of mapping quality, the definition of a properly-paired reads and the stringency of your peak calling (last sentence thinking aloud).</p></blockquote><p>Ref: <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a></p><blockquote><p>The fraction of reads in called peak regions (<a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">FRiP score</a>) should be >0.3, though values greater than 0.2 are acceptable. For EN-TEx tissues (ENCODE GTEx tissue sample), FRiP scores will not be enforced as QC metric. 
TSS enrichment remains in place as a key signal to noise measure.</p></blockquote><p>FRiP score can be calculated by <code>samtools</code> and <code>bedtools</code>:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> total reads</span></span><br><span class="line">total_reads=$(samtools view -c ${sample}.bam)</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> reads <span class="keyword">in</span> peaks</span></span><br><span class="line">reads_in_peaks=$(bedtools sort -i ${sample}_peaks.narrowPeak \</span><br><span class="line"> | bedtools merge -i stdin | bedtools intersect -u -nonamecheck \</span><br><span class="line"> -a ${sample}.bam -b stdin -ubam | samtools view -c)</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> FRiP score</span></span><br><span class="line">FRiP=$(awk "BEGIN {print "${reads_in_peaks}"/"${total_reads}"}")</span><br></pre></td></tr></table></figure><p>While someone recommended to use <code>featureCounts</code> (<a href="https://www.biostars.org/p/337872/#337890" target="_blank" rel="noopener">https://www.biostars.org/p/337872/#337890</a>), I found the results of <code>featureCounts</code> and <code>intersectBed</code> were close. 
So, I would like to use <code>bedtools</code>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">samtools view -c ${sample}.bam</span><br><span class="line">38439210</span><br><span class="line"></span><br><span class="line">bedtools sort -i ${sample}_peaks.narrowPeak \</span><br><span class="line"> | bedtools merge -i stdin | bedtools intersect -u -nonamecheck \</span><br><span class="line"> -a ${sample}.bam -b stdin -ubam | samtools view -c</span><br><span class="line">428359</span><br><span class="line"></span><br><span class="line">awk 'BEGIN{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"} \</span><br><span class="line"> {print $4, $1, $2+1, $3, "."}' ${sample}_peaks.narrowPeak > ${sample}_peaks.saf</span><br><span class="line">featureCounts -p -a M0.A.005_peaks.saf -F SAF -o readCountInPeaks.txt ${sample}.bam</span><br><span class="line">|| Total alignments : 19219605 ||</span><br><span class="line">|| Successfully assigned alignments : 230826 (1.2%) ||</span><br><span class="line"></span><br><span class="line"><span class="meta">></span><span class="bash"> 428359/38439210</span></span><br><span class="line">[1] 0.0111438</span><br><span class="line"><span class="meta">></span><span class="bash"> 230826/19219605</span></span><br><span class="line">[1] 0.01200992</span><br></pre></td></tr></table></figure><p>Find more details in 
note: <a href="/blog/2019/03/Calculate-FRiP-score/" title="Calculate FRiP score">Calculate FRiP score</a>.</p><h4 id="transcription-start-site-tss-enrichment"><a class="markdownIt-Anchor" href="#transcription-start-site-tss-enrichment"></a> Transcription start site (TSS) enrichment</h4><p>Ref: <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a></p><blockquote><p><a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">Transcription start site (TSS) enrichment</a> values are dependent on the reference files used; cutoff values for high quality data are listed in the table below.</p></blockquote><p>Ref: <a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">ENCODE - Terms and Definitions</a></p><blockquote><p><strong>Transcription Start Site (TSS) Enrichment Score</strong> - The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp). This distribution is then normalized by taking the average read depth in the 100 bps at each of the end flanks of the distribution (for a total of 200bp of averaged data) and calculating a fold change at each position over that average read depth. This means that the flanks should start at 1, and if there is high read signal at transcription start sites (highly open regions of the genome) there should be an increase in signal up to a peak in the middle. We take the signal value at the center of the distribution after this normalization as our TSS enrichment metric. 
<strong>Used to evaluate ATAC-seq.</strong></p></blockquote><p>The following code was from <a href="https://github.com/kundajelab/ataqc" target="_blank" rel="noopener">ATAqC</a>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span 
class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">import</span> matplotlib</span><br><span class="line">matplotlib.use(<span class="string">'Agg'</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> pybedtools</span><br><span class="line"><span class="keyword">import</span> metaseq</span><br><span class="line"><span class="keyword">import</span> logging</span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> matplotlib <span class="keyword">import</span> pyplot <span class="keyword">as</span> plt</span><br><span class="line"><span class="keyword">from</span> matplotlib <span class="keyword">import</span> mlab</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">make_tss_plot</span><span class="params">(bam_file, tss, prefix, chromsizes, read_len, bins=<span class="number">400</span>, bp_edge=<span class="number">2000</span>,</span></span></span><br><span class="line"><span 
class="function"><span class="params"> processes=<span class="number">8</span>, greenleaf_norm=True)</span>:</span></span><br><span class="line"> <span class="string">'''</span></span><br><span class="line"><span class="string"> Take bootstraps, generate tss plots, and get a mean and</span></span><br><span class="line"><span class="string"> standard deviation on the plot. Produces 2 plots. One is the</span></span><br><span class="line"><span class="string"> aggregation plot alone, while the other also shows the signal</span></span><br><span class="line"><span class="string"> at each TSS ordered by strength.</span></span><br><span class="line"><span class="string"> '''</span></span><br><span class="line"> logging.info(<span class="string">'Generating tss plot...'</span>)</span><br><span class="line"> tss_plot_file = <span class="string">'{0}_tss-enrich.pdf'</span>.format(prefix)</span><br><span class="line"> tss_plot_large_file = <span class="string">'{0}_large_tss-enrich.pdf'</span>.format(prefix)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Load the TSS file</span></span><br><span class="line"> tss = pybedtools.BedTool(tss)</span><br><span class="line"> tss_ext = tss.slop(b=bp_edge, g=chromsizes)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Load the bam file</span></span><br><span class="line"> bam = metaseq.genomic_signal(bam_file, <span class="string">'bam'</span>) <span class="comment"># Need to shift reads and just get ends, just load bed file?</span></span><br><span class="line"> bam_array = bam.array(tss_ext, bins=bins, shift_width = -read_len/<span class="number">2</span>, <span class="comment"># Shift to center the read on the cut site</span></span><br><span class="line"> processes=processes, stranded=<span class="keyword">True</span>)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> greenleaf_norm:</span><br><span class="line"> <span 
class="comment"># Use enough bins to cover 100 bp on either end</span></span><br><span class="line"> num_edge_bins = int(<span class="number">100</span>/(<span class="number">2</span>*bp_edge/bins))</span><br><span class="line"> bin_means = bam_array.mean(axis=<span class="number">0</span>)</span><br><span class="line"> avg_noise = (sum(bin_means[:num_edge_bins]) +</span><br><span class="line"> sum(bin_means[-num_edge_bins:]))/(<span class="number">2</span>*num_edge_bins)</span><br><span class="line"> bam_array /= avg_noise</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> bam_array /= bam.mapped_read_count() / <span class="number">1e6</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># Generate a line plot</span></span><br><span class="line"> fig = plt.figure()</span><br><span class="line"> ax = fig.add_subplot(<span class="number">111</span>)</span><br><span class="line"> x = np.linspace(-bp_edge, bp_edge, bins)</span><br><span class="line"></span><br><span class="line"> ax.plot(x, bam_array.mean(axis=<span class="number">0</span>), color=<span class="string">'r'</span>, label=<span class="string">'Mean'</span>)</span><br><span class="line"> ax.axvline(<span class="number">0</span>, linestyle=<span class="string">':'</span>, color=<span class="string">'k'</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Note the middle high point (TSS)</span></span><br><span class="line"> tss_point_val = max(bam_array.mean(axis=<span class="number">0</span>))</span><br><span class="line"></span><br><span class="line"> ax.set_xlabel(<span class="string">'Distance from TSS (bp)'</span>)</span><br><span class="line"> ax.set_ylabel(<span class="string">'Average read coverage (per million mapped reads)'</span>)</span><br><span class="line"> ax.legend(loc=<span class="string">'best'</span>)</span><br><span class="line"></span><br><span class="line"> 
fig.savefig(tss_plot_file)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Print a more complicated plot with lots of info</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># Find a safe upper percentile - we can't use X if the Xth percentile is 0</span></span><br><span class="line"> upper_prct = <span class="number">99</span></span><br><span class="line"> <span class="keyword">if</span> mlab.prctile(bam_array.ravel(), upper_prct) == <span class="number">0.0</span>:</span><br><span class="line"> upper_prct = <span class="number">100.0</span></span><br><span class="line"></span><br><span class="line"> plt.rcParams[<span class="string">'font.size'</span>] = <span class="number">8</span></span><br><span class="line"> fig = metaseq.plotutils.imshow(bam_array,</span><br><span class="line"> x=x,</span><br><span class="line"> figsize=(<span class="number">5</span>, <span class="number">10</span>),</span><br><span class="line"> vmin=<span class="number">5</span>, vmax=upper_prct, percentile=<span class="keyword">True</span>,</span><br><span class="line"> line_kwargs=dict(color=<span class="string">'k'</span>, label=<span class="string">'All'</span>),</span><br><span class="line"> fill_kwargs=dict(color=<span class="string">'k'</span>, alpha=<span class="number">0.3</span>),</span><br><span class="line"> sort_by=bam_array.mean(axis=<span class="number">1</span>))</span><br><span class="line"></span><br><span class="line"> <span class="comment"># And save the file</span></span><br><span class="line"> fig.savefig(tss_plot_large_file)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> tss_plot_file, tss_plot_large_file, tss_point_val</span><br></pre></td></tr></table></figure><h3 id="blacklist-filtering-for-peaks"><a class="markdownIt-Anchor" href="#blacklist-filtering-for-peaks"></a> Blacklist filtering for peaks</h3><p>One may want to filter peaks using <a 
href="https://www.encodeproject.org/annotations/ENCSR636HFF/" target="_blank" rel="noopener">DAC Blacklisted Regions</a>.</p><p>The following code is from <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">PEAK="${PREFIX}.narrowPeak"</span><br><span class="line">FILTERED_PEAK="${PREFIX}.narrowPeak.filt.gz"</span><br><span class="line">bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \</span><br><span class="line"> | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \</span><br><span class="line"> | grep -P 'chr[0-9XY]+(?!_)' | gzip -nc > ${FILTERED_PEAK}</span><br></pre></td></tr></table></figure><p>The 5th column of a narrowPeak file is an integer score for display, calculated as int(-10*log10qvalue). Because this value can fall outside the [0-1000] range defined in the <a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format12" target="_blank" rel="noopener">UCSC Encode narrowPeak format</a>, the <code>awk</code> command caps values greater than 1000 at 1000.</p><h3 id="merging-peaks-optional"><a class="markdownIt-Anchor" href="#merging-peaks-optional"></a> Merging peaks (optional)</h3><p>One may want to merge peaks from different libraries or different samples.</p><p><a href="https://www.nature.com/articles/ng.3646" target="_blank" rel="noopener">Corces et al., 2016</a> used the following method to merge peaks, which I like:</p><blockquote><p>To generate a non-redundant list of hematopoiesis- and cancer-related peaks, we first extended summits to 500-bp windows (±250 bp). 
We then ranked the 500-bp peaks by summit significance value (defined by MACS2) and chose a list of non-overlapping, maximally significant peaks.</p></blockquote><p>We can do this by using <code>BEDOPS</code> (ref: <a href="https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html" target="_blank" rel="noopener">Collapsing multiple BED files into a master list by signal</a>)</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span 
class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/bash</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> modified from https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html</span></span><br><span class="line"></span><br><span class="line">summit_bed=(sample1_summits.bed sample2_summits.bed sample3_summits.bed)</span><br><span class="line"></span><br><span class="line">out=fAdrenal.master.merge.bed</span><br><span class="line"></span><br><span class="line">tmpd=/tmp/tmp$$</span><br><span class="line">mkdir -p $tmpd</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># First, union all the peaks together into a single file.</span></span></span><br><span class="line">bedlist=""</span><br><span class="line">for bed in ${summit_bed[*]}</span><br><span class="line">do</span><br><span class="line"> bedlist="$bedlist $bed"</span><br><span class="line">done</span><br><span 
class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> extended summits to 500-bp windows (±250 bp)</span></span><br><span class="line">bedops --range 250 -u $bedlist > $tmpd/tmp.bed</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># The master list is constructed iteratively. For each pass through</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># the loop, elements not yet in the master list are merged into</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># non-overlapping intervals that span the union (this is just bedops</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># -m). Then for each merged interval, an original element of highest</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># score within the interval is selected to go in the master list.</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Anything that overlaps the selected element is thrown out, and the</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># process then repeats.</span></span></span><br><span class="line">iters=1</span><br><span class="line">solns=""</span><br><span class="line">stop=0</span><br><span class="line">while [ $stop == 0 ]</span><br><span class="line">do</span><br><span class="line"> echo "merge steps..."</span><br><span class="line"></span><br><span class="line"> ## Condense the union into merged intervals. 
This klugey bit</span><br><span class="line"> ## before and after the merging is because we don't want to merge</span><br><span class="line"> ## regions that are simply adjacent but not overlapping</span><br><span class="line"> bedops -m --range 0:-1 $tmpd/tmp.bed \</span><br><span class="line"> | bedops -u --range 0:1 - \</span><br><span class="line"> > $tmpd/tmpm.bed</span><br><span class="line"></span><br><span class="line"> ## Grab the element with the highest score among all elements forming each interval.</span><br><span class="line"> ## If multiple elements tie for the highest score, just grab one of them.</span><br><span class="line"> ## Result is the current master list. Probably don't need to sort, but do it anyway</span><br><span class="line"> ## to be safe since we're not using --echo with bedmap call.</span><br><span class="line"> bedmap --max-element $tmpd/tmpm.bed $tmpd/tmp.bed \</span><br><span class="line"> | sort-bed - \</span><br><span class="line"> > $tmpd/$iters.bed</span><br><span class="line"> solns="$solns $tmpd/$iters.bed"</span><br><span class="line"> echo "Adding `awk 'END { print NR }' $tmpd/$iters.bed` elements"</span><br><span class="line"></span><br><span class="line"> ## Are there any elements that don't overlap the current master</span><br><span class="line"> ## list? If so, add those in, and repeat. If not, we're done.</span><br><span class="line"> bedops -n 1 $tmpd/tmp.bed $tmpd/$iters.bed \</span><br><span class="line"> > $tmpd/tmp2.bed</span><br><span class="line"></span><br><span class="line"> mv $tmpd/tmp2.bed $tmpd/tmp.bed</span><br><span class="line"></span><br><span class="line"> if [ ! 
-s $tmpd/tmp.bed ]</span><br><span class="line"> then</span><br><span class="line"> stop=1</span><br><span class="line"> fi</span><br><span class="line"></span><br><span class="line"> ((iters++))</span><br><span class="line">done</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># final solution</span></span></span><br><span class="line">bedops -u $solns \</span><br><span class="line"><span class="meta"> ></span><span class="bash"> <span class="variable">$out</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Clean up</span></span></span><br><span class="line">rm -r $tmpd</span><br><span class="line"></span><br><span class="line">exit 0</span><br></pre></td></tr></table></figure><h3 id="outreach"><a class="markdownIt-Anchor" href="#outreach"></a> Outreach</h3><p>This completes the main upstream ATAC-seq data analysis. For downstream analysis, one may want to annotate peaks, find enriched TF motifs, compare peaks across conditions, and combine ATAC-seq data with other data types such as RNA-seq.</p><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20190103: create the note.</li><li>20190320: complete the note.</li><li>20190401: update the section of “Merging BAMs”. Add some discussions about “pooling replicates”</li></ul>]]></content>
<summary type="html">
<p>The content was compiled from multiple resources on the internet (forum, papers, workshop etc.). I could not indicate all the sources, b
</summary>
<category term="Peak" scheme="https://yiweiniu.github.io/blog/categories/Peak/"/>
<category term="ATAC-seq" scheme="https://yiweiniu.github.io/blog/categories/Peak/ATAC-seq/"/>
<category term="peak" scheme="https://yiweiniu.github.io/blog/tags/peak/"/>
<category term="ATAC-seq" scheme="https://yiweiniu.github.io/blog/tags/ATAC-seq/"/>
<category term="MACS" scheme="https://yiweiniu.github.io/blog/tags/MACS/"/>
<category term="bowtie2" scheme="https://yiweiniu.github.io/blog/tags/bowtie2/"/>
</entry>
<entry>
<title>Calculate FRiP score</title>
<link href="https://yiweiniu.github.io/blog/2019/03/Calculate-FRiP-score/"/>
<id>https://yiweiniu.github.io/blog/2019/03/Calculate-FRiP-score/</id>
<published>2019-03-20T05:46:54.000Z</published>
<updated>2019-03-20T05:48:34.000Z</updated>
<content type="html"><![CDATA[<p>Ref: <a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">ENCODE - Terms and Definitions</a></p><blockquote><p>Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)</p></blockquote><p>In <a href="https://www.encodeproject.org/atac-seq/" target="_blank" rel="noopener">ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline</a>, the FRiP score is calculated from <code>tagAlign</code> files with <code>intersectBed</code>, and <code>bamtobed</code> is used to convert <code>BAM</code> files to <code>tagAlign</code>. However, <code>intersectBed</code> can also count intersections directly from <code>BAM</code> files. In addition, some have argued that one should use <code>featureCounts</code> to get accurate results (<a href="https://www.biostars.org/p/337872/#337890" target="_blank" rel="noopener">https://www.biostars.org/p/337872/#337890</a>).</p><p>Below, I compare different ways to calculate the FRiP score.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span 
class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span 
class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> 1. prepare</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># convert BAM (BAM used to call peaks) to BED</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> bedtools bamtobed -i <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.bam | awk <span class="string">'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}'</span> > <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.PE2SE.tagAlign</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> 2. total reads <span class="keyword">in</span> BAM/BED</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> samtools view -c <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.bam</span></span><br><span class="line">38439210</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> wc -l <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.PE2SE.tagAlign</span></span><br><span class="line">38439210 ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> 3. 
count reads <span class="keyword">in</span> peak regions</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># 3.1 tagAlign, intersectBed -a tagAlign -b bed</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> time bedtools sort -i <span class="variable">${sample}</span>_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -a <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.PE2SE.tagAlign -b stdin | wc -l</span></span><br><span class="line">428359</span><br><span class="line"></span><br><span class="line">real 0m27.012s</span><br><span class="line">user 0m25.726s</span><br><span class="line">sys 0m1.357s</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># 3.2 tagAlign, intersectBed -a bed -b tagAlign</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> time bedtools sort -i <span class="variable">${sample}</span>_peaks.narrowPeak |bedtools merge -i stdin | bedtools intersect -c -a stdin -b <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.PE2SE.tagAlign | awk <span class="string">'{{ sum+=$4 }} END {{ print sum }}'</span></span></span><br><span class="line">428359</span><br><span class="line"></span><br><span class="line">real 0m51.089s</span><br><span class="line">user 0m39.199s</span><br><span class="line">sys 0m11.945s</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># 3.3 BAM, intersectBed -a bam -b bed</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> time bedtools sort -i <span class="variable">${sample}</span>_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -a <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.bam -b 
stdin -ubam | samtools view -c</span></span><br><span class="line">428359</span><br><span class="line"></span><br><span class="line">real 1m12.844s</span><br><span class="line">user 1m11.979s</span><br><span class="line">sys 0m0.951s</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># 3.4 BAM, intersectBed -a bed -b bam</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> time bedtools sort -i <span class="variable">${sample}</span>_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -c -a stdin -b <span class="variable">${sample}</span>.sorted.marked.filtered.shifted.bam | awk <span class="string">'{{ sum+=$4 }} END {{ print sum }}'</span></span></span><br><span class="line">428359</span><br><span class="line"></span><br><span class="line">real 1m49.981s</span><br><span class="line">user 1m28.747s</span><br><span class="line">sys 0m20.837s</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># 3.5 featureCounts</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">## convert BED (the peaks) to SAF</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> awk <span class="string">'BEGIN{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $4, $1, $2+1, $3, "."}'</span> <span class="variable">${sample}</span>_peaks.narrowPeak > <span class="variable">${sample}</span>_peaks.saf</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">## count</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> featureCounts -p -a <span class="variable">${sample}</span>_peaks.saf -F SAF -o readCountInPeaks.txt <span 
class="variable">${sample}</span>.sorted.marked.filtered.shifted.bam</span></span><br><span class="line">//========================== featureCounts setting ===========================\\</span><br><span class="line">|| ||</span><br><span class="line">|| Input files : 1 BAM file ||</span><br><span class="line">|| P M0.A.005.sorted.marked.filtered.shifted.bam ||</span><br><span class="line">|| ||</span><br><span class="line">|| Output file : readCountInPeaks.txt ||</span><br><span class="line">|| Summary : readCountInPeaks.txt.summary ||</span><br><span class="line">|| Annotation : M0.A.005_peaks.saf (SAF) ||</span><br><span class="line">|| Dir for temp files : ./ ||</span><br><span class="line">|| ||</span><br><span class="line">|| Threads : 1 ||</span><br><span class="line">|| Level : meta-feature level ||</span><br><span class="line">|| Paired-end : yes ||</span><br><span class="line">|| Multimapping reads : not counted ||</span><br><span class="line">|| Multi-overlapping reads : not counted ||</span><br><span class="line">|| Min overlapping bases : 1 ||</span><br><span class="line">|| ||</span><br><span class="line">|| Chimeric reads : counted ||</span><br><span class="line">|| Both ends mapped : not required ||</span><br><span class="line">|| ||</span><br><span class="line">\\===================== http://subread.sourceforge.net/ ======================//</span><br><span class="line"></span><br><span class="line">//================================= Running ==================================\\</span><br><span class="line">|| ||</span><br><span class="line">|| Load annotation file M0.A.005_peaks.saf ... ||</span><br><span class="line">|| Features : 5948 ||</span><br><span class="line">|| Meta-features : 5948 ||</span><br><span class="line">|| Chromosomes/contigs : 24 ||</span><br><span class="line">|| ||</span><br><span class="line">|| Process BAM file M0.A.005.sorted.marked.filtered.shifted.bam... ||</span><br><span class="line">|| Paired-end reads are included. 
||</span><br><span class="line">|| Assign alignments (paired-end) to features... ||</span><br><span class="line">|| ||</span><br><span class="line">|| WARNING: reads from the same pair were found not adjacent to each ||</span><br><span class="line">|| other in the input (due to read sorting by location or ||</span><br><span class="line">|| reporting of multi-mapping read pairs). ||</span><br><span class="line">|| ||</span><br><span class="line">|| Pairing up the read pairs. ||</span><br><span class="line">|| ||</span><br><span class="line">|| Total alignments : 19219605 ||</span><br><span class="line">|| Successfully assigned alignments : 230826 (1.2%) ||</span><br><span class="line">|| Running time : 0.49 minutes ||</span><br><span class="line">|| ||</span><br><span class="line">|| ||</span><br><span class="line">|| Summary of counting results can be found in file "readCountInPeaks.txt.su ||</span><br><span class="line">|| mmary" ||</span><br><span class="line">|| ||</span><br><span class="line">\\===================== http://subread.sourceforge.net/ ======================//</span><br></pre></td></tr></table></figure><p>The read counts in peak regions were identical whether computed from the <code>BAM</code> file or the <code>tagAlign</code> file. Although counting from <code>BAM</code> takes more time, one does not need to convert <code>BAM</code> to <code>tagAlign</code> first.</p><p>For <code>featureCounts</code>, the resulting fraction is close to that from <code>intersectBed</code>, as expected. <code>featureCounts</code> counts the number of fragments, while <code>intersectBed</code> counts the number of reads. 
But the number of reads is twice the number of fragments (considering only properly paired reads).</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># from intersectBed</span></span><br><span class="line">> <span class="number">428359</span>/<span class="number">38439210</span></span><br><span class="line">[<span class="number">1</span>] <span class="number">0.0111438</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># from featureCounts</span></span><br><span class="line">> <span class="number">230826</span>/<span class="number">19219605</span></span><br><span class="line">[<span class="number">1</span>] <span class="number">0.01200992</span></span><br></pre></td></tr></table></figure><p><code>featureCounts</code> may be more accurate when assigning reads that span multiple features, but the extra effort may not be worthwhile.</p><p>So, in practice, I prefer to use <code>intersectBed</code> to calculate the FRiP score from <code>BAM</code> files.</p><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20190320: create the note.</li></ul>]]></content>
<summary type="html">
<p>Ref: <a href="https://www.encodeproject.org/data-standards/terms/#enrichment" target="_blank" rel="noopener">ENCODE - Terms and Definitio
</summary>
<category term="Peak" scheme="https://yiweiniu.github.io/blog/categories/Peak/"/>
<category term="QC" scheme="https://yiweiniu.github.io/blog/categories/Peak/QC/"/>
<category term="peak" scheme="https://yiweiniu.github.io/blog/tags/peak/"/>
<category term="QC" scheme="https://yiweiniu.github.io/blog/tags/QC/"/>
</entry>
<entry>
<title>Understand Bioconductor Annotation Packages</title>
<link href="https://yiweiniu.github.io/blog/2018/09/Understand-Bioconductor-Annotation-Packages/"/>
<id>https://yiweiniu.github.io/blog/2018/09/Understand-Bioconductor-Annotation-Packages/</id>
<published>2018-09-18T15:04:52.000Z</published>
<updated>2018-09-18T15:51:23.000Z</updated>
<content type="html"><![CDATA[<p>This note is to help me figure out the design schema of annotation packages in Bioconductor. It is mainly compiled from:</p><ul><li><a href="https://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf" target="_blank" rel="noopener">Annotation Packages: the big picture</a>. Fantastic slides!</li><li><a href="https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf" target="_blank" rel="noopener">Introduction To Bioconductor Annotation Packages</a></li><li><a href="https://bioconductor.org/packages/release/bioc/manuals/AnnotationDbi/man/AnnotationDbi.pdf" target="_blank" rel="noopener">Package ‘AnnotationDbi’ Reference Manual</a></li><li><a href="https://bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf" target="_blank" rel="noopener">Making and Utilizing TxDb Objects</a></li></ul><h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2><p>Packages in <a href="https://bioconductor.org/" target="_blank" rel="noopener">Bioconductor</a> can be divided into three categories: software, annotation and experiment.</p><blockquote><p>All of the ‘.db’ (and most other Bioconductor annotation packages) are updated every 6 months corresponding to each release of Bioconductor. Exceptions are made for packages where the actual resources that the packages are based on have not themselves been updated.</p></blockquote><h2 id="schema-of-annotation-packages"><a class="markdownIt-Anchor" href="#schema-of-annotation-packages"></a> Schema of Annotation Packages</h2><p>Here is a representative diagram, although it is not informative enough on its own.</p><img src="/blog/2018/09/Understand-Bioconductor-Annotation-Packages/1537268974_20811.png"><p>There are three major types of annotation in Bioconductor:</p><ul><li>Gene centric <code>AnnotationDb</code> packages:<ul><li>Organism level: e.g. 
<code>org.Mm.eg.db</code>.</li><li>Platform level: e.g. <code>hgu133plus2.db</code>, <code>hgu133plus2.probes</code>, <code>hgu133plus2.cdf</code>.</li><li>Homology level: e.g. <code>hom.Dm.inp.db</code>.</li><li>System-biology level: e.g. <code>GO.db</code>.</li></ul></li><li>Genome centric <code>GenomicFeatures</code> packages include:<ul><li>Transcriptome level: e.g. <code>TxDb.Hsapiens.UCSC.hg19.knownGene</code>, <code>EnsDb.Hsapiens.v75</code>.</li><li>Generic genome features: can be generated via <code>GenomicFeatures</code>.</li></ul></li><li>One web-based resource, <code>biomart</code>, accessed via the <code>biomaRt</code> package:<ul><li>Query the web-based ‘biomart’ resource for genes, sequences, SNPs, etc.</li></ul></li></ul><h2 id="working-with-anotationdb-objects"><a class="markdownIt-Anchor" href="#working-with-anotationdb-objects"></a> Working with AnnotationDb objects</h2><p><code>AnnotationDb</code> is the virtual base class for all annotation packages. It contains a database connection and is meant to be the parent for a set of classes in the Bioconductor annotation packages. 
These classes will provide a means of dispatch for a widely available set of select methods and thus allow the easy extraction of data from the annotation packages.</p><p>All annotation packages that are based on the <code>AnnotationDb</code> class expose an object named exactly the same as the package itself.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(org.Hs.eg.db)</span><br><span class="line">class(org.Hs.eg.db)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">[1] "OrgDb"</span><br><span class="line">attr(,"package")</span><br><span class="line">[1] "AnnotationDbi"</span><br></pre></td></tr></table></figure><p>The more specific classes (the ones that you will actually see in the wild) have names like <code>OrgDb</code>, <code>ChipDb</code> or <code>TxDb</code>. These names correspond to the kind of package (and underlying schema) being represented.</p><p>For example:</p><ul><li><code>org.Hs.eg.db</code> - Genome wide annotation for Human.</li><li><code>TxDb.Hsapiens.UCSC.hg19.knownGene</code> - Annotation package for TxDb object(s).</li><li><code>hgu95av2.db</code> - annotations for the hgu95av2 Affymetrix platform.</li></ul><p><strong>Methods</strong></p><p><code>select</code>, <code>columns</code> and <code>keys</code> are used together to extract data from an <code>AnnotationDb</code> object (or any object derived from the parent class). 
Examples of classes derived from the <code>AnnotationDb</code> object include (but are not limited to): <code>ChipDb</code>, <code>OrgDb</code>, <code>GODb</code>, <code>InparanoidDb</code> and <code>ReactomeDb</code>.</p><ul><li><code>columns</code> shows which kinds of data can be returned for the AnnotationDb object.</li><li><code>keytypes</code> allows the user to discover which keytypes can be passed to <code>select</code> or <code>keys</code> via the <code>keytype</code> argument.</li><li><code>keys</code> returns keys for the database contained in the <code>AnnotationDb</code> object. This method is already documented in the keys manual page but is mentioned again here because its usage with select is so intimate. By default it will return the primary keys for the database, but if used with the <code>keytype</code> argument, it will return the keys from that <code>keytype</code>.</li><li><code>select</code> will retrieve the data as a <code>data.frame</code> based on the <code>keys</code>, <code>columns</code> and <code>keytype</code> arguments. <mark>Users should be warned that if you call <code>select</code> and request columns that have multiple matches for your keys, select will return a <code>data.frame</code> with one row for each possible match. This has the effect that if you request multiple columns and some of them have a many to one relationship to the keys, things will continue to multiply accordingly.</mark> So it’s not a good idea to request a large number of columns unless you know that what you are asking for should have a one to one relationship with the initial set of keys. In general, if you need to retrieve a column (like GO) that has a many to one relationship to the original keys, it is most useful to extract that separately.</li><li><code>mapIds</code> gets the mapped ids (column) for a set of keys that are of a particular keytype. 
Usually returned as a named character vector, a list or even a SimpleCharacterList.</li><li><code>saveDb</code> will take an <code>AnnotationDb</code> object and save the database to the file specified by the path passed in to the file argument.</li><li><code>loadDb</code> takes a .sqlite database file as an argument and uses data in the metadata table of that file to return an AnnotationDb style object of the appropriate type.</li><li><code>species</code> shows the genus and species label currently attached to the AnnotationDb object's database.</li><li><code>dbfile</code> gets the database file associated with an object.</li><li><code>dbconn</code> gets the database connection associated with an object.</li><li><code>taxonomyId</code> gets the taxonomy ID associated with an object (if available).</li></ul><h3 id="chipdb"><a class="markdownIt-Anchor" href="#chipdb"></a> ChipDb</h3><p>Platform-based (chip-based) annotation packages are an extremely common kind of annotation package. The following examples show how to use standard methods to interact with an object of this type.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(<span class="string">"hgu95av2.db"</span>)</span><br></pre></td></tr></table></figure><p>Objects loaded along with this package:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ls(<span class="string">"package:hgu95av2.db"</span>)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td 
class="code"><pre><span class="line"> [1] "hgu95av2" "hgu95av2.db" "hgu95av2_dbconn" "hgu95av2_dbfile" "hgu95av2_dbInfo" </span><br><span class="line"> [6] "hgu95av2_dbschema" "hgu95av2ACCNUM" "hgu95av2ALIAS2PROBE" "hgu95av2CHR" "hgu95av2CHRLENGTHS" </span><br><span class="line">[11] "hgu95av2CHRLOC" "hgu95av2CHRLOCEND" "hgu95av2ENSEMBL" "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID" </span><br><span class="line">[16] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE" "hgu95av2GENENAME" "hgu95av2GO" "hgu95av2GO2ALLPROBES" </span><br><span class="line">[21] "hgu95av2GO2PROBE" "hgu95av2MAP" "hgu95av2MAPCOUNTS" "hgu95av2OMIM" "hgu95av2ORGANISM" </span><br><span class="line">[26] "hgu95av2ORGPKG" "hgu95av2PATH" "hgu95av2PATH2PROBE" "hgu95av2PFAM" "hgu95av2PMID" </span><br><span class="line">[31] "hgu95av2PMID2PROBE" "hgu95av2PROSITE" "hgu95av2REFSEQ" "hgu95av2SYMBOL" "hgu95av2UNIGENE" </span><br><span class="line">[36] "hgu95av2UNIPROT"</span><br></pre></td></tr></table></figure><p>These packages appear to contain a lot of data but it is an illusion.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">> <span class="keyword">library</span>(hgu95av2.db)</span><br><span class="line">> hgu95av2()</span><br><span class="line">> hgu95av2_dbInfo()</span><br><span class="line">> hgu95av2GENENAME</span><br><span class="line">> hgu95av2_dbschema()</span><br></pre></td></tr></table></figure><p>Use <code>columns()</code> to see possible values for columns.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">columns(hgu95av2.db)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"> [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" </span><br><span class="line"> [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" "MAP" "OMIM" "ONTOLOGY" </span><br><span class="line">[17] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" </span><br><span class="line">[25] "UCSCKG" "UNIGENE" "UNIPROT"</span><br></pre></td></tr></table></figure><p>Use <code>help("xxx")</code> to see the description of a column.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">help(<span class="string">'SYMBOL'</span>)</span><br></pre></td></tr></table></figure><p>Use <code>keytypes()</code> to see possible values for keytypes. In reality, some kinds of values make poor keys, so this list may be shorter than the one above.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">keytypes(hgu95av2.db)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"> [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" </span><br><span class="line"> [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" "MAP" "OMIM" "ONTOLOGY" </span><br><span class="line">[17] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROBEID" "PROSITE" "REFSEQ" "SYMBOL" </span><br><span class="line">[25] "UCSCKG" "UNIGENE" "UNIPROT"</span><br></pre></td></tr></table></figure><p>Use <code>keys()</code> to get some sample keys. 
(The default is the primary key.)</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">head(keys(hgu95av2.db))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[1] "1000_at" "1001_at" "1002_f_at" "1003_s_at" "1004_at" "1005_at"</span><br></pre></td></tr></table></figure><p>Or for a particular keytype.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">head(keys(hgu95av2.db, keytype=<span class="string">'SYMBOL'</span>))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"</span><br></pre></td></tr></table></figure><p>Use <code>select()</code> to retrieve data.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#1st get some example keys</span></span><br><span class="line">k <- head(keys(hgu95av2.db, keytype=<span class="string">"PROBEID"</span>))</span><br><span class="line"><span class="comment"># then call select</span></span><br><span class="line">select(hgu95av2.db, keys=k, columns=c(<span class="string">"SYMBOL"</span>,<span class="string">"GENENAME"</span>), keytype=<span class="string">"PROBEID"</span>)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">'select()' returned 1:1 mapping between keys and columns</span><br><span class="line"> PROBEID SYMBOL GENENAME</span><br><span class="line">1 1000_at MAPK3 mitogen-activated protein kinase 3</span><br><span class="line">2 1001_at TIE1 tyrosine kinase with immunoglobulin like and EGF like domains 1</span><br><span class="line">3 1002_f_at CYP2C19 cytochrome P450 family 2 subfamily C member 19</span><br><span class="line">4 1003_s_at CXCR5 C-X-C motif chemokine receptor 5</span><br><span class="line">5 1004_at CXCR5 C-X-C motif chemokine receptor 5</span><br><span class="line">6 1005_at DUSP1 dual specificity phosphatase 1</span><br></pre></td></tr></table></figure><p>If one wants to get only one column of data, <code>mapIds</code> can be used.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mapIds(hgu95av2.db, keys=k, column=c(<span class="string">"GENENAME"</span>), keytype=<span class="string">"PROBEID"</span>)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">'select()' returned 1:1 mapping between keys and columns</span><br><span class="line"> 1000_at 1001_at </span><br><span class="line"> "mitogen-activated protein kinase 3" "tyrosine kinase with immunoglobulin like and EGF like domains 1" </span><br><span class="line"> 1002_f_at 1003_s_at </span><br><span class="line"> "cytochrome P450 family 2 subfamily C member 19" "C-X-C motif chemokine receptor 5" </span><br><span 
class="line"> 1004_at 1005_at </span><br><span class="line"> "C-X-C motif chemokine receptor 5" "dual specificity phosphatase 1"</span><br></pre></td></tr></table></figure><h3 id="orgdb"><a class="markdownIt-Anchor" href="#orgdb"></a> OrgDb</h3><p>An organism level package (an ‘org’ package) uses a central gene identifier (e.g. Entrez Gene id) and contains mappings between this identifier and other kinds of identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.).</p><p>The name of an org package is always of the form <code>org.<Ab>.<id>.db</code> (e.g. <code>org.Sc.sgd.db</code>) where <code><Ab></code> is a 2-letter abbreviation of the organism (e.g. <code>Sc</code> for <em>Saccharomyces cerevisiae</em>) and <code><id></code> is an abbreviation (in lower-case) describing the type of central identifier (e.g. <code>sgd</code> for gene identifiers assigned by the Saccharomyces Genome Database, or <code>eg</code> for Entrez Gene ids).</p><p>Using <code>OrgDb</code> packages is just like using <code>ChipDb</code> packages.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(org.Hs.eg.db)</span><br><span class="line"></span><br><span class="line">> columns(org.Hs.eg.db)</span><br><span class="line"> [<span class="number">1</span>] <span class="string">"ACCNUM"</span> <span class="string">"ALIAS"</span> <span class="string">"ENSEMBL"</span> <span 
class="string">"ENSEMBLPROT"</span> <span class="string">"ENSEMBLTRANS"</span> <span class="string">"ENTREZID"</span> <span class="string">"ENZYME"</span> <span class="string">"EVIDENCE"</span> </span><br><span class="line"> [<span class="number">9</span>] <span class="string">"EVIDENCEALL"</span> <span class="string">"GENENAME"</span> <span class="string">"GO"</span> <span class="string">"GOALL"</span> <span class="string">"IPI"</span> <span class="string">"MAP"</span> <span class="string">"OMIM"</span> <span class="string">"ONTOLOGY"</span> </span><br><span class="line">[<span class="number">17</span>] <span class="string">"ONTOLOGYALL"</span> <span class="string">"PATH"</span> <span class="string">"PFAM"</span> <span class="string">"PMID"</span> <span class="string">"PROSITE"</span> <span class="string">"REFSEQ"</span> <span class="string">"SYMBOL"</span> <span class="string">"UCSCKG"</span> </span><br><span class="line">[<span class="number">25</span>] <span class="string">"UNIGENE"</span> <span class="string">"UNIPROT"</span></span><br><span class="line"></span><br><span class="line">> keytypes(org.Hs.eg.db)</span><br><span class="line"> [<span class="number">1</span>] <span class="string">"ACCNUM"</span> <span class="string">"ALIAS"</span> <span class="string">"ENSEMBL"</span> <span class="string">"ENSEMBLPROT"</span> <span class="string">"ENSEMBLTRANS"</span> <span class="string">"ENTREZID"</span> <span class="string">"ENZYME"</span> <span class="string">"EVIDENCE"</span> </span><br><span class="line"> [<span class="number">9</span>] <span class="string">"EVIDENCEALL"</span> <span class="string">"GENENAME"</span> <span class="string">"GO"</span> <span class="string">"GOALL"</span> <span class="string">"IPI"</span> <span class="string">"MAP"</span> <span class="string">"OMIM"</span> <span class="string">"ONTOLOGY"</span> </span><br><span class="line">[<span class="number">17</span>] <span class="string">"ONTOLOGYALL"</span> <span class="string">"PATH"</span> 
<span class="string">"PFAM"</span> <span class="string">"PMID"</span> <span class="string">"PROSITE"</span> <span class="string">"REFSEQ"</span> <span class="string">"SYMBOL"</span> <span class="string">"UCSCKG"</span> </span><br><span class="line">[<span class="number">25</span>] <span class="string">"UNIGENE"</span> <span class="string">"UNIPROT"</span></span><br><span class="line"></span><br><span class="line">> head(keys(org.Hs.eg.db))</span><br><span class="line">[<span class="number">1</span>] <span class="string">"1"</span> <span class="string">"2"</span> <span class="string">"3"</span> <span class="string">"9"</span> <span class="string">"10"</span> <span class="string">"11"</span></span><br></pre></td></tr></table></figure><h3 id="godb"><a class="markdownIt-Anchor" href="#godb"></a> GO.db</h3><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span 
class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(GO.db)</span><br><span class="line"></span><br><span class="line">> GO.db</span><br><span class="line">GODb object:</span><br><span class="line">| GOSOURCENAME: Gene Ontology</span><br><span class="line">| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/</span><br><span class="line">| GOSOURCEDATE: <span class="number">2018</span>-Mar28</span><br><span class="line">| Db type: GODb</span><br><span class="line">| package: AnnotationDbi</span><br><span class="line">| DBSCHEMA: GO_DB</span><br><span class="line">| GOEGSOURCEDATE: <span class="number">2018</span>-Apr4</span><br><span class="line">| GOEGSOURCENAME: Entrez Gene</span><br><span class="line">| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA</span><br><span class="line">| DBSCHEMAVERSION: <span class="number">2.1</span></span><br><span class="line"></span><br><span class="line">Please see: help(<span class="string">'select'</span>) <span class="keyword">for</span> usage information</span><br><span class="line"></span><br><span class="line">> columns(GO.db)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"DEFINITION"</span> <span class="string">"GOID"</span> <span class="string">"ONTOLOGY"</span> <span class="string">"TERM"</span> </span><br><span class="line"></span><br><span class="line">> keytypes(GO.db)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"DEFINITION"</span> <span class="string">"GOID"</span> <span class="string">"ONTOLOGY"</span> <span class="string">"TERM"</span></span><br><span class="line"></span><br><span class="line">> 
head(keys(GO.db))</span><br><span class="line">[<span class="number">1</span>] <span class="string">"GO:0000001"</span> <span class="string">"GO:0000002"</span> <span class="string">"GO:0000003"</span> <span class="string">"GO:0000006"</span> <span class="string">"GO:0000007"</span> <span class="string">"GO:0000009"</span></span><br><span class="line"></span><br><span class="line">> head(keys(GO.db, keytype=<span class="string">'ONTOLOGY'</span>))</span><br><span class="line">[<span class="number">1</span>] <span class="string">"BP"</span> <span class="string">"CC"</span> <span class="string">"MF"</span> <span class="string">"universal"</span></span><br><span class="line"></span><br><span class="line">> head(keys(GO.db, keytype=<span class="string">'TERM'</span>))</span><br><span class="line">[<span class="number">1</span>] <span class="string">"mitochondrion inheritance"</span> <span class="string">"mitochondrial genome maintenance"</span> </span><br><span class="line">[<span class="number">3</span>] <span class="string">"reproduction"</span> <span class="string">"ribosome biogenesis"</span> </span><br><span class="line">[<span class="number">5</span>] <span class="string">"protein binding involved in protein folding"</span> <span class="string">"unfolded protein binding"</span></span><br><span class="line"></span><br><span class="line">> keys <- head(keys(GO.db))</span><br><span class="line">> select(GO.db, keys=keys, columns=c(<span class="string">"TERM"</span>,<span class="string">"ONTOLOGY"</span>), keytype=<span class="string">"GOID"</span>)</span><br><span class="line"><span class="string">'select()'</span> returned <span class="number">1</span>:<span class="number">1</span> mapping between keys and columns</span><br><span class="line"> GOID TERM ONTOLOGY</span><br><span class="line"><span class="number">1</span> GO:<span class="number">0000001</span> mitochondrion inheritance BP</span><br><span class="line"><span class="number">2</span> GO:<span 
class="number">0000002</span> mitochondrial genome maintenance BP</span><br><span class="line"><span class="number">3</span> GO:<span class="number">0000003</span> reproduction BP</span><br><span class="line"><span class="number">4</span> GO:<span class="number">0000006</span> high-affinity zinc transmembrane transporter activity MF</span><br><span class="line"><span class="number">5</span> GO:<span class="number">0000007</span> low-affinity zinc ion transmembrane transporter activity MF</span><br><span class="line"><span class="number">6</span> GO:<span class="number">0000009</span> alpha-<span class="number">1</span>,<span class="number">6</span>-mannosyltransferase activity MF</span><br></pre></td></tr></table></figure><h3 id="txdb"><a class="markdownIt-Anchor" href="#txdb"></a> TxDb</h3><p>A <code>TxDb</code> package connects a set of genomic coordinates to various transcript oriented features. The package can also contain identifiers to features such as genes and transcripts, and the internal schema describes the relationships between these different elements.</p><img src="/blog/2018/09/Understand-Bioconductor-Annotation-Packages/1537281088_5251.png"><p>This class maps the 5’ and 3’ untranslated regions (UTRs), protein coding sequences (CDSs) and exons for a set of mRNA transcripts to their associated genome. 
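</p><p>As a rough sketch of what this mapping looks like in practice, the UTRs, CDSs and exons can be pulled out with accessor functions from <code>GenomicFeatures</code>. The variable names below (<code>utr5</code>, <code>utr3</code>, <code>cds_by_tx</code>, <code>ex</code>) are only illustrative, and the hg19 <code>TxDb</code> package is assumed to be installed.</p><figure class="highlight r"><table><tr><td class="code"><pre><span class="line">library(GenomicFeatures)</span><br><span class="line">library(TxDb.Hsapiens.UCSC.hg19.knownGene)</span><br><span class="line">txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene</span><br><span class="line"></span><br><span class="line"># 5' and 3' UTRs, grouped by transcript (each call returns a GRangesList)</span><br><span class="line">utr5 <- fiveUTRsByTranscript(txdb, use.names = TRUE)</span><br><span class="line">utr3 <- threeUTRsByTranscript(txdb, use.names = TRUE)</span><br><span class="line"></span><br><span class="line"># coding sequence parts, grouped by transcript</span><br><span class="line">cds_by_tx <- cdsBy(txdb, by = "tx", use.names = TRUE)</span><br><span class="line"></span><br><span class="line"># all exons as a single flat GRanges object</span><br><span class="line">ex <- exons(txdb)</span><br></pre></td></tr></table></figure><p>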
<code>TxDb</code> objects have numerous accessor functions to allow such features to be retrieved individually or grouped together in a way that reflects the underlying biology.</p><p>All <code>TxDb</code>-containing packages follow a specific naming scheme that tells where the data came from as well as which build of the genome it comes from.</p><p>The <code>GenomicFeatures</code> package contains a set of tools and methods to make and manipulate transcript-centric annotation.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(TxDb.Hsapiens.UCSC.hg19.knownGene)</span><br><span class="line">txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene <span class="comment"># shorthand (for convenience)</span></span><br><span class="line">txdb</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">TxDb object:</span><br><span class="line"># Db type: TxDb</span><br><span class="line"># Supporting package: 
GenomicFeatures</span><br><span class="line"># Data source: UCSC</span><br><span class="line"># Genome: hg19</span><br><span class="line"># Organism: Homo sapiens</span><br><span class="line"># Taxonomy ID: 9606</span><br><span class="line"># UCSC Table: knownGene</span><br><span class="line"># Resource URL: http://genome.ucsc.edu/</span><br><span class="line"># Type of Gene ID: Entrez Gene ID</span><br><span class="line"># Full dataset: yes</span><br><span class="line"># miRBase build ID: GRCh37</span><br><span class="line"># transcript_nrow: 82960</span><br><span class="line"># exon_nrow: 289969</span><br><span class="line"># cds_nrow: 237533</span><br><span class="line"># Db created by: GenomicFeatures package from Bioconductor</span><br><span class="line"># Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)</span><br><span class="line"># GenomicFeatures version at creation time: 1.21.30</span><br><span class="line"># RSQLite version at creation time: 1.0.0</span><br><span class="line"># DBSCHEMAVERSION: 1.1</span><br><span class="line"></span><br><span class="line">> class(txdb)</span><br><span class="line">[1] "TxDb"</span><br><span class="line">attr(,"package")</span><br><span class="line">[1] "GenomicFeatures"</span><br></pre></td></tr></table></figure><p>In addition to access via <code>select</code>, <code>TxDb</code> objects also provide access via the more familiar <code>transcripts</code>, <code>exons</code>, <code>cds</code>, <code>transcriptsBy</code>, <code>exonsBy</code> and <code>cdsBy</code> methods, which return <code>GRanges</code> or <code>GRangesList</code> objects.</p><p>The ‘ungrouped’ functions <code>transcripts</code>, <code>exons</code>, <code>cds</code>, <code>genes</code> and <code>promoters</code> return the coordinate information as a <code>GRanges</code> object.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">GR = transcripts(txdb)</span><br><span class="line">> GR[<span class="number">1</span>:<span class="number">3</span>]</span><br><span class="line">GRanges object with <span class="number">3</span> ranges and <span class="number">2</span> metadata columns:</span><br><span class="line"> seqnames ranges strand | tx_id tx_name</span><br><span class="line"> <Rle> <IRanges> <Rle> | <integer> <character></span><br><span class="line"> [<span class="number">1</span>] chr1 <span class="number">11874</span>-<span class="number">14409</span> + | <span class="number">1</span> uc001aaa.3</span><br><span class="line"> [<span class="number">2</span>] chr1 <span class="number">11874</span>-<span class="number">14409</span> + | <span class="number">2</span> uc010nxq.1</span><br><span class="line"> [<span class="number">3</span>] chr1 <span class="number">11874</span>-<span class="number">14409</span> + | <span class="number">3</span> uc010nxr.1</span><br><span class="line"> -------</span><br><span class="line"> seqinfo: <span class="number">93</span> sequences (<span class="number">1</span> circular) from hg19 genome</span><br><span class="line"></span><br><span class="line">> length(GR)</span><br><span class="line">[<span class="number">1</span>] <span class="number">82960</span></span><br></pre></td></tr></table></figure><p>The ‘grouped’ functions <code>transcriptsBy</code>, <code>exonsBy</code>, <code>cdsBy</code>, <code>intronsByTranscript</code>, <code>fiveUTRsByTranscript</code> and <code>threeUTRsByTranscript</code> extract genomic features of a given type, grouped by another type of genomic feature.</p><figure 
class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line">> GRList <- transcriptsBy(txdb, by = <span class="string">"gene"</span>)</span><br><span class="line">> GRList</span><br><span class="line">GRangesList object of length <span class="number">23459</span>:</span><br><span class="line">$<span class="number">1</span> </span><br><span class="line">GRanges object with <span class="number">2</span> ranges and <span class="number">2</span> metadata columns:</span><br><span class="line"> seqnames ranges strand | tx_id tx_name</span><br><span class="line"> <Rle> <IRanges> <Rle> | <integer> <character></span><br><span class="line"> [<span class="number">1</span>] chr19 <span class="number">58858172</span>-<span class="number">58864865</span> - | <span class="number">70455</span> uc002qsd.4</span><br><span class="line"> [<span class="number">2</span>] chr19 <span class="number">58859832</span>-<span class="number">58874214</span> - | <span class="number">70456</span> uc002qsf.2</span><br><span class="line"></span><br><span class="line">$<span class="number">10</span> </span><br><span class="line">GRanges object with <span class="number">1</span> range and <span 
class="number">2</span> metadata columns:</span><br><span class="line"> seqnames ranges strand | tx_id tx_name</span><br><span class="line"> [<span class="number">1</span>] chr8 <span class="number">18248755</span>-<span class="number">18258723</span> + | <span class="number">31944</span> uc003wyw.1</span><br><span class="line"></span><br><span class="line">$<span class="number">100</span> </span><br><span class="line">GRanges object with <span class="number">1</span> range and <span class="number">2</span> metadata columns:</span><br><span class="line"> seqnames ranges strand | tx_id tx_name</span><br><span class="line"> [<span class="number">1</span>] chr20 <span class="number">43248163</span>-<span class="number">43280376</span> - | <span class="number">72132</span> uc002xmj.3</span><br><span class="line"></span><br><span class="line"><span class="keyword">...</span></span><br><span class="line"><<span class="number">23456</span> more elements></span><br><span class="line">-------</span><br><span class="line">seqinfo: <span class="number">93</span> sequences (<span class="number">1</span> circular) from hg19 genome</span><br></pre></td></tr></table></figure><p>The <code>transcriptsBy</code> function returns a <code>GRangesList</code> object. The <code>show</code> method for a <code>GRangesList</code> object displays it as a list of <code>GRanges</code> objects, with the seqinfo shown once at the bottom for the entire list.</p><p>Standard <code>GRanges</code> and <code>GRangesList</code> accessors can then be used to work with the returned objects, and many <code>IRanges</code> methods apply to them as well.</p><h3 id="ensdb"><a class="markdownIt-Anchor" href="#ensdb"></a> EnsDb</h3><p>Similar to the <code>TxDb</code> objects/packages, <code>EnsDb</code> objects/packages provide genomic coordinates of gene models along with additional annotations (e.g. 
gene names, biotypes, etc.) but are tailored to annotations provided by Ensembl.</p><p>The central methods implemented for <code>EnsDb</code> objects also allow the use of the <code>EnsDb</code>-specific filtering framework to retrieve only selected information from the database.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(EnsDb.Hsapiens.v86)</span><br><span class="line">edb = EnsDb.Hsapiens.v86</span><br><span class="line"></span><br><span class="line">> class(edb)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"EnsDb"</span></span><br><span class="line">attr(,<span class="string">"package"</span>)</span><br><span class="line">[<span class="number">1</span>] <span class="string">"ensembldb"</span></span><br></pre></td></tr></table></figure><p>For <code>EnsDb</code> objects, the <code>keys()</code> function has an additional <code>filter</code> parameter, which accepts <code>AnnotationFilter</code> objects.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">keys <- head(keys(edb, keytype=<span class="string">"GENEID"</span>))</span><br><span class="line"></span><br><span class="line">keys(edb, filter=list(GeneBiotypeFilter(<span class="string">"lincRNA"</span>), SeqNameFilter(<span class="string">"Y"</span>)))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"> [1] "ENSG00000129816" "ENSG00000129845" "ENSG00000131538" "ENSG00000147753" "ENSG00000147761" "ENSG00000176728" "ENSG00000180910"</span><br><span class="line"> [8] "ENSG00000183385" "ENSG00000184991" "ENSG00000185700" "ENSG00000212855" "ENSG00000212856" "ENSG00000215560" "ENSG00000223517"</span><br><span class="line">[15] "ENSG00000223641" "ENSG00000224075" "ENSG00000224989" "ENSG00000225516" "ENSG00000225520" "ENSG00000226362" "ENSG00000226906"</span><br><span class="line">[22] "ENSG00000227439" "ENSG00000228240" "ENSG00000228296" "ENSG00000228379" "ENSG00000228786" "ENSG00000228890" "ENSG00000229236"</span><br><span class="line">[29] "ENSG00000229308" "ENSG00000229643" "ENSG00000230663" "ENSG00000231141" "ENSG00000231535" "ENSG00000232348" "ENSG00000232419"</span><br><span class="line">[36] "ENSG00000233522" "ENSG00000233699" "ENSG00000233864" "ENSG00000235059" "ENSG00000235412" "ENSG00000236951" "ENSG00000237048"</span><br><span class="line">[43] "ENSG00000237069" "ENSG00000237563" "ENSG00000239225" "ENSG00000240450" "ENSG00000251510" "ENSG00000254488" "ENSG00000260197"</span><br><span class="line">[50] "ENSG00000277930" "ENSG00000278847" "ENSG00000280961"</span><br></pre></td></tr></table></figure><p>The <code>keys</code> argument of <code>mapIds</code> and <code>select</code> also accepts <code>AnnotationFilter</code> objects.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">txs <- select(edb, keys=list(GeneBiotypeFilter(<span class="string">"lincRNA"</span>), SeqNameFilter(<span class="string">"Y"</span>)), columns=c(<span class="string">"TXID"</span>, 
<span class="string">"TXSEQSTART"</span>, <span class="string">"TXBIOTYPE"</span>))</span><br><span class="line"></span><br><span class="line">> head(txs, n=<span class="number">3</span>)</span><br><span class="line"> TXID TXSEQSTART TXBIOTYPE GENEBIOTYPE SEQNAME</span><br><span class="line"><span class="number">1</span> ENST00000250776 <span class="number">6390431</span> lincRNA lincRNA Y</span><br><span class="line"><span class="number">2</span> ENST00000250805 <span class="number">9753156</span> lincRNA lincRNA Y</span><br><span class="line"><span class="number">3</span> ENST00000253838 <span class="number">22439593</span> lincRNA lincRNA Y</span><br></pre></td></tr></table></figure><h2 id="other-questions"><a class="markdownIt-Anchor" href="#other-questions"></a> Other Questions</h2><p><a href="https://www.biostars.org/p/287871/" target="_blank" rel="noopener">Question: Difference between GO.db, biomaRt, and org.Hs.eg.db in GO annotations</a></p><blockquote><p><code>GO.db</code> and <code>org.Hs.eg.db</code> are copies of the GO annotations. <code>GO.db</code> is updated every 6 months with each release of Bioconductor. <code>org.Hs.eg.db</code> is also updated at the same time and using <code>GO.db</code>. <code>biomaRt</code> connects to the server where the informations is stored, so it will be the most up to date. If you want a stable release you can use either <code>GO.db</code> or <code>org.Hs.eg.db</code>, if you want the most up to date (from yesterday) data every time you do an analysis you can use <code>biomaRt</code>.</p></blockquote><p><a href="https://support.bioconductor.org/p/84593/#84594" target="_blank" rel="noopener">Question: org.Hs.eg.db - hg38 build?</a></p><blockquote><p>The orgDb packages don’t really contain any positional annotation. They used to, but these days you will be directed to a TxDb package if you try to get positional info. And the TxDb have the build in the package name. 
The orgDb packages mostly contain mappings between various databases and some functional annotation, none of which is based on any build. In fact, most of that stuff is updated weekly or monthly, so the orgDb packages get outdated to a certain extent rather quickly.</p></blockquote><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180918: create the note.</li></ul>]]></content>
<summary type="html">
<p>This note is to help me figure out the design schema of annotation packages in Bioconductor. And this note is mainly compiled from:</p>
<
</summary>
<category term="R" scheme="https://yiweiniu.github.io/blog/categories/R/"/>
<category term="Bioconductor" scheme="https://yiweiniu.github.io/blog/categories/R/Bioconductor/"/>
<category term="R" scheme="https://yiweiniu.github.io/blog/tags/R/"/>
<category term="Bioconductor" scheme="https://yiweiniu.github.io/blog/tags/Bioconductor/"/>
<category term="Bioconductor package" scheme="https://yiweiniu.github.io/blog/tags/Bioconductor-package/"/>
<category term="annotation package" scheme="https://yiweiniu.github.io/blog/tags/annotation-package/"/>
</entry>
<entry>
<title>Biological ID Conversion</title>
<link href="https://yiweiniu.github.io/blog/2018/09/Biological-ID-Conversion/"/>
<id>https://yiweiniu.github.io/blog/2018/09/Biological-ID-Conversion/</id>
<published>2018-09-18T07:54:27.000Z</published>
<updated>2018-09-18T15:06:00.000Z</updated>
<content type="html"><![CDATA[<p>ID mapping is annoying, but it is something we have to face very often. This note collects methods for dealing with it.</p><h2 id="r-bioconductor"><a class="markdownIt-Anchor" href="#r-bioconductor"></a> R (Bioconductor)</h2><p>Bioconductor has many annotation packages, and together they cover far more kinds of annotation than any single task needs. Different families of annotation packages are designed for different purposes, and these differences should be kept in mind in practice.</p><p>For ID conversion, two main resources can be used: <code>biomaRt</code>, the R interface of <a href="http://www.biomart.org/" target="_blank" rel="noopener">BioMart</a>, and various specialized annotation packages.</p><h3 id="biomart"><a class="markdownIt-Anchor" href="#biomart"></a> biomaRt</h3><p>Ref: <a href="https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html" target="_blank" rel="noopener">The biomaRt users guide</a></p><p><code>biomaRt</code> is an R interface to <a href="http://www.biomart.org/" target="_blank" rel="noopener">BioMart</a> databases. It is very powerful, and ID conversion is only one of its many applications.</p><blockquote><p>The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, COSMIC, Uniprot, HGNC, Gramene, Wormbase and dbSNP mapped to Ensembl. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from gene annotation to database mining. 
<em>via: <a href="http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html" target="_blank" rel="noopener">http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html</a></em></p></blockquote><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(<span class="string">"biomaRt"</span>)</span><br></pre></td></tr></table></figure><p>Display all available BioMart web services:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">listMarts()</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"> biomart version</span><br><span class="line">1 ENSEMBL_MART_ENSEMBL Ensembl Genes 93</span><br><span class="line">2 ENSEMBL_MART_MOUSE Mouse strains 93</span><br><span class="line">3 ENSEMBL_MART_SNP Ensembl Variation 93</span><br><span class="line">4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 93</span><br></pre></td></tr></table></figure><p>Choose the Ensembl BioMart database to query:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ensembl=useMart(<span class="string">"ensembl"</span>)</span><br></pre></td></tr></table></figure><p>List the datasets available in the selected BioMart with <code>listDatasets()</code>:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">listDatasets(ensembl)[1:5, ]</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"> dataset description version</span><br><span class="line">1 acarolinensis_gene_ensembl Anole lizard genes (AnoCar2.0) AnoCar2.0</span><br><span class="line">2 amelanoleuca_gene_ensembl Panda genes (ailMel1) ailMel1</span><br><span class="line">3 amexicanus_gene_ensembl Cave fish genes (AstMex102) AstMex102</span><br><span class="line">4 anancymaae_gene_ensembl Ma's night monkey genes (Anan_2.0) Anan_2.0</span><br><span class="line">5 aplatyrhynchos_gene_ensembl Duck genes (BGI_duck_1.0) BGI_duck_1.0</span><br></pre></td></tr></table></figure><p>Update the Mart object using the function <code>useDataset()</code></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ensembl = useDataset(<span class="string">"hsapiens_gene_ensembl"</span>, mart=ensembl)</span><br></pre></td></tr></table></figure><p>Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ensembl = useMart(<span class="string">"ensembl"</span>, dataset=<span class="string">"hsapiens_gene_ensembl"</span>)</span><br></pre></td></tr></table></figure><p>Shows all available filters in the selected dataset</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">filters = listFilters(ensembl)</span><br><span class="line">filters[<span class="number">1</span>:<span 
class="number">5</span>,]</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">## name description</span><br><span class="line">## 1 chromosome_name Chromosome/scaffold name</span><br><span class="line">## 2 start Start</span><br><span class="line">## 3 end End</span><br><span class="line">## 4 band_start Band Start</span><br><span class="line">## 5 band_end Band End</span><br></pre></td></tr></table></figure><p>Show all available attributes in the selected dataset:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">attributes = listAttributes(ensembl)</span><br><span class="line">attributes[<span class="number">1</span>:<span class="number">5</span>,]</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"> name description page</span><br><span class="line">1 ensembl_gene_id Gene stable ID feature_page</span><br><span class="line">2 ensembl_gene_id_version Gene stable ID version feature_page</span><br><span class="line">3 ensembl_transcript_id Transcript stable ID feature_page</span><br><span class="line">4 ensembl_transcript_id_version Transcript stable ID version feature_page</span><br><span class="line">5 ensembl_peptide_id Protein stable ID feature_page</span><br></pre></td></tr></table></figure><p>Annotate a set of Affymetrix identifiers with the Entrez Gene IDs 
of the corresponding genes:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">affyids=c("202763_at","209310_s_at","207500_at")</span><br><span class="line">getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), </span><br><span class="line"> filters = 'affy_hg_u133_plus_2', </span><br><span class="line"> values = affyids, </span><br><span class="line"> mart = ensembl)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"> affy_hg_u133_plus_2 entrezgene</span><br><span class="line">1 209310_s_at 837</span><br><span class="line">2 207500_at 838</span><br><span class="line">3 202763_at 836</span><br></pre></td></tr></table></figure><p>Retrieve all HUGO gene symbols of genes that are located on chromosomes 17, 20 or Y, and are associated with specific GO terms.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">go=c(<span class="string">"GO:0051330"</span>,<span class="string">"GO:0000080"</span>,<span class="string">"GO:0000114"</span>,<span class="string">"GO:0000082"</span>)</span><br><span class="line">chrom=c(<span class="number">17</span>,<span class="number">20</span>,<span class="string">"Y"</span>)</span><br><span class="line">getBM(attributes= <span class="string">"hgnc_symbol"</span>,</span><br><span class="line"> filters=c(<span class="string">"go"</span>,<span 
class="string">"chromosome_name"</span>),</span><br><span class="line"> values=list(go, chrom), mart=ensembl)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"> hgnc_symbol</span><br><span class="line">1 RPS6KB1</span><br><span class="line">2 CDC6</span><br><span class="line">3 RPA1</span><br><span class="line">4 CDK3</span><br><span class="line">5 MCM8</span><br><span class="line">6 CRLF3</span><br></pre></td></tr></table></figure><p>Annotate a set of EntrezGene identifiers with GO annotation.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">entrez=c(<span class="string">"673"</span>,<span class="string">"837"</span>)</span><br><span class="line">goids = getBM(attributes = c(<span class="string">'entrezgene'</span>, <span class="string">'go_id'</span>), </span><br><span class="line"> filters = <span class="string">'entrezgene'</span>, </span><br><span class="line"> values = entrez, </span><br><span class="line"> mart = ensembl)</span><br><span class="line">head(goids)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"> entrezgene go_id</span><br><span class="line">1 673 
GO:0000166</span><br><span class="line">2 673 GO:0004672</span><br><span class="line">3 673 GO:0004674</span><br><span class="line">4 673 GO:0005524</span><br><span class="line">5 673 GO:0006468</span><br><span class="line">6 673</span><br></pre></td></tr></table></figure><p>Ensembl id to gene symbol and entrez id.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(biomaRt)</span><br><span class="line">ensembl = useMart(<span class="string">"ensembl"</span>, dataset = <span class="string">"hsapiens_gene_ensembl"</span>)</span><br><span class="line"></span><br><span class="line">ensg = c(<span class="string">'ENSG00000242268.2'</span>, <span class="string">'ENSG00000158486.13'</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># get stable id</span></span><br><span class="line">ensg.no_version = sapply(strsplit(as.character(ensg),<span class="string">"\\."</span>),<span class="string">"[["</span>,<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line">getBM(attributes = c(<span class="string">'ensembl_gene_id'</span>, <span class="string">'entrezgene'</span>, <span class="string">'hgnc_symbol'</span>), filters = <span class="string">'ensembl_gene_id'</span>, values=ensg.no_version, mart=ensembl)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"> ensembl_gene_id entrezgene hgnc_symbol</span><br><span class="line">1 
ENSG00000158486 55567 DNAH3</span><br><span class="line">2 ENSG00000242268 NA LINC02082</span><br></pre></td></tr></table></figure><p>If you do not want <code>NA</code> values in the result, use <code>na.omit</code> to remove the genes that could not be mapped.</p><h3 id="orgdb-packages-bitr"><a class="markdownIt-Anchor" href="#orgdb-packages-bitr"></a> OrgDb packages + bitr</h3><p>Ref: <a href="https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html#bitr-biological-id-translator" target="_blank" rel="noopener">clusterProfiler - bitr</a></p><p><code>OrgDb</code> packages are gene-centric annotation packages at the organism level, such as <code>org.Hs.eg.db</code> and <code>org.Mmu.eg.db</code>.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(org.Hs.eg.db)</span><br></pre></td></tr></table></figure><p>Here is the <code>org.Hs.eg.db</code> package. 
We can see all the resources used to build this package.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line">> org.Hs.eg.db</span><br><span class="line">OrgDb object:</span><br><span class="line">| DBSCHEMAVERSION: <span class="number">2.1</span></span><br><span class="line">| Db type: OrgDb</span><br><span class="line">| Supporting package: AnnotationDbi</span><br><span class="line">| DBSCHEMA: HUMAN_DB</span><br><span class="line">| ORGANISM: Homo sapiens</span><br><span class="line">| SPECIES: Human</span><br><span class="line">| EGSOURCEDATE: <span class="number">2018</span>-Apr4</span><br><span class="line">| EGSOURCENAME: Entrez Gene</span><br><span class="line">| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA</span><br><span class="line">| CENTRALID: EG</span><br><span class="line">| TAXID: <span class="number">9606</span></span><br><span class="line">| 
GOSOURCENAME: Gene Ontology</span><br><span class="line">| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/</span><br><span class="line">| GOSOURCEDATE: <span class="number">2018</span>-Mar28</span><br><span class="line">| GOEGSOURCEDATE: <span class="number">2018</span>-Apr4</span><br><span class="line">| GOEGSOURCENAME: Entrez Gene</span><br><span class="line">| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA</span><br><span class="line">| KEGGSOURCENAME: KEGG GENOME</span><br><span class="line">| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes</span><br><span class="line">| KEGGSOURCEDATE: <span class="number">2011</span>-Mar15</span><br><span class="line">| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)</span><br><span class="line">| GPSOURCEURL: </span><br><span class="line">| GPSOURCEDATE: <span class="number">2018</span>-Mar26</span><br><span class="line">| ENSOURCEDATE: <span class="number">2017</span>-Dec04</span><br><span class="line">| ENSOURCENAME: Ensembl</span><br><span class="line">| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta</span><br><span class="line">| UPSOURCENAME: Uniprot</span><br><span class="line">| UPSOURCEURL: http://www.UniProt.org/</span><br><span class="line">| UPSOURCEDATE: Mon Apr <span class="number">9</span> <span class="number">20</span>:<span class="number">58</span>:<span class="number">54</span> <span class="number">2018</span></span><br><span class="line"></span><br><span class="line">Please see: help(<span class="string">'select'</span>) <span class="keyword">for</span> usage information</span><br></pre></td></tr></table></figure><p>Use <code>keytypes()</code> to list all supporting types.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">keytypes(org.Hs.eg.db)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"> [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" </span><br><span class="line"> [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" "MAP" "OMIM" "ONTOLOGY" </span><br><span class="line">[17] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" </span><br><span class="line">[25] "UNIGENE" "UNIPROT"</span><br></pre></td></tr></table></figure><p>The key types supported can differ between packages.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">keytypes(org.Ss.eg.db)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"> [1] "ACCNUM" "ALIAS" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL" </span><br><span class="line">[10] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"</span><br></pre></td></tr></table></figure><p>Convert Ensembl IDs to Entrez IDs and gene symbols.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(clusterProfiler)</span><br><span class="line"><span class="keyword">library</span>(org.Hs.eg.db)</span><br><span class="line"></span><br><span class="line">ensg = c(<span 
class="string">'ENSG00000242268.2'</span>, <span class="string">'ENSG00000158486.13'</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># remove version number</span></span><br><span class="line">ensg.no_version = sapply(strsplit(as.character(ensg),<span class="string">"\\."</span>),<span class="string">"[["</span>,<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line">bitr(ensg.no_version, fromType=<span class="string">"ENSEMBL"</span>, toType=c(<span class="string">"ENTREZID"</span>, <span class="string">"SYMBOL"</span>), OrgDb=<span class="string">"org.Hs.eg.db"</span>)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">'select()' returned 1:1 mapping between keys and columns</span><br><span class="line"> ENSEMBL ENTREZID SYMBOL</span><br><span class="line">1 ENSG00000242268 100507661 LINC02082</span><br><span class="line">2 ENSG00000158486 55567 DNAH3</span><br></pre></td></tr></table></figure><h3 id="other-annotation-packages"><a class="markdownIt-Anchor" href="#other-annotation-packages"></a> Other Annotation Packages</h3><p>Apart from the <code>OrgDb</code> packages, there are many other annotation packages, such as the <code>TxDb</code> and <code>EnsDb</code> packages, which provide various kinds of information. Most of them are based on the <code>AnnotationDb</code> object, so the standard <code>select</code> function can be used to retrieve the information needed.</p><h2 id="ncbi-gene-data"><a class="markdownIt-Anchor" href="#ncbi-gene-data"></a> NCBI gene DATA</h2><p>Sometimes we want to have all information on local disks and use in-house scripts to do the conversion. 
<a href="ftp://ftp.ncbi.nih.gov/gene/DATA" target="_blank" rel="noopener">ftp://ftp.ncbi.nih.gov/gene/DATA</a> provides the most up-to-date and comprehensive collection of gene-centric information.</p><blockquote><p>By incorporating the data from LocusLink in an Entrez database with gene-specific data from other species, you now have a single point of lookup for gene-specific information for the taxa within the scope of the RefSeq project. You also have more immediate access to related data that was cumbersome to maintain independent of Entrez, and can harness the power of Entrez-based tools such as Entrez Programming Utilities (E-Utilities) and MyNCBI. <em>via: <a href="https://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html#files" target="_blank" rel="noopener">https://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html#files</a></em></p></blockquote><p>This <a href="ftp://ftp.ncbi.nih.gov/gene/DATA/README" target="_blank" rel="noopener">README</a> describes all the files included. Here is a short summary.</p><table><thead><tr><th>Entrez Gene file name</th><th>Comments</th></tr></thead><tbody><tr><td>DATA/ASN_BINARY</td><td>Files in this directory contain comprehensive extractions from Entrez Gene in ASN.1 format.</td></tr><tr><td>DATA/GENE_INFO</td><td>extractions from Entrez Gene in the same format as the gene_info file. Each file contains a subset of data for the species or taxonomic group indicated by the file name.</td></tr><tr><td>DATA/expression</td><td>reports of normalized RNA expression levels computed from RNA-seq data for human, mouse, and rat genes.</td></tr><tr><td>gene2accession</td><td>a comprehensive report of the accessions that are related to a GeneID. It includes sequences from the international sequence collaboration, Swiss-Prot, and RefSeq. 
The RefSeq subset of this file is also available as gene2refseq… If you want to convert any accessions into GeneIDs, this one file should suffice.</td></tr><tr><td>gene2ensembl</td><td>This file reports matches between NCBI and Ensembl annotation based on comparison of rna and protein features.</td></tr><tr><td>gene2vega</td><td>This file reports matches between NCBI and Vega annotation.</td></tr><tr><td>gene2go</td><td>GeneID/GO ID/Evidence Code. Consolidated summary based on gene_association files from the GO Consortium and Entrez Gene’s gene_info file.</td></tr><tr><td>gene2pubmed</td><td>gene2pubmed includes the identifier for the species of the GeneID (i.e. the Taxonomy ID).</td></tr><tr><td>gene2refseq</td><td>This file is the RefSeq subset of gene2accession. The file in Entrez Gene does not include information about secondary accessions. This function is now provided from the RefSeq ftp site, as documented in the current release notes: <a href="ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release#.txt" target="_blank" rel="noopener">ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release#.txt</a>, where # is the value of the current release number.</td></tr><tr><td>gene2sts</td><td>GeneID/UniSTS marker ID relationship</td></tr><tr><td>gene2unigene</td><td>GeneID/UniGene cluster relationship</td></tr><tr><td>gene_group</td><td>report of genes and their relationships to other genes</td></tr><tr><td>gene_orthologs</td><td>report of orthologous genes</td></tr><tr><td>gene_history</td><td>comprehensive information about GeneIDs that are no longer current</td></tr><tr><td>gene_info</td><td>GeneID, names, map locations, and database cross-reference.</td></tr><tr><td>gene_neighbors</td><td>reports neighboring genes for all genes placed on a given genomic sequence.</td></tr><tr><td>gene_refseq_uniprotkb_collab</td><td>report of the relationship between NCBI Reference Sequence protein accessions and UniProtKB protein 
accessions</td></tr><tr><td>mim2gene_medgen</td><td>report of the relationship between MIM numbers (OMIM), GeneIDs, and Records in MedGen</td></tr></tbody></table><h2 id="api"><a class="markdownIt-Anchor" href="#api"></a> API</h2><p>Many databases provide APIs for accessing their data, and some of them can be used for ID conversion. But I do not recommend using these APIs directly unless you are willing to spend a fair amount of time on it, as they can change over time and users have to be familiar with the data structures they return. Many commonly used APIs already have client software or packages, so it is worth searching for one before calling the API directly.</p><h3 id="ensembl-rest-api"><a class="markdownIt-Anchor" href="#ensembl-rest-api"></a> Ensembl REST API</h3><p><a href="http://rest.ensembl.org/" target="_blank" rel="noopener">Ensembl REST API</a> provides many user-friendly endpoints for retrieving information. Three of them map between biological IDs.</p><ul><li><a href="http://rest.ensembl.org/documentation/info/xref_external" target="_blank" rel="noopener">GET xrefs/symbol/:species/:symbol</a> looks up an external symbol and returns all Ensembl objects linked to it.</li><li><a href="http://rest.ensembl.org/documentation/info/xref_id" target="_blank" rel="noopener">GET xrefs/id/:id</a> performs lookups of Ensembl identifiers and retrieves their external references in other databases.</li><li><a href="http://rest.ensembl.org/documentation/info/xref_name" target="_blank" rel="noopener">GET xrefs/name/:species/:name</a> performs a lookup based upon the primary accession or display label of an external reference and returns the information Ensembl holds about the entry.</li></ul><p>I guess the aforementioned <code>biomaRt</code> is actually a well-encapsulated package that communicates with the databases through such APIs.</p><h3 id="kegg-api"><a class="markdownIt-Anchor" href="#kegg-api"></a> KEGG API</h3><p><a href="https://www.kegg.jp/kegg/rest/keggapi.html" 
target="_blank" rel="noopener">KEGG API</a> is a REST-style Application Programming Interface to the KEGG database resource.</p><p>We can use this API through <code>bitr_kegg</code> in the <code>clusterProfiler</code> package or through the <code>KEGGREST</code> package.</p><h4 id="bitr_kegg"><a class="markdownIt-Anchor" href="#bitr_kegg"></a> bitr_kegg</h4><p>Ref: <a href="https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html#bitr_kegg-converting-biological-ids-using-kegg-api" target="_blank" rel="noopener">clusterProfiler - bitr_kegg</a></p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(clusterProfiler)</span><br><span class="line"></span><br><span class="line">hg = c(<span class="string">"4597"</span>, <span class="string">"7111"</span>, <span class="string">"5266"</span>, <span class="string">"2175"</span>, <span class="string">"755"</span>, <span class="string">"23046"</span>)</span><br><span class="line"></span><br><span class="line">bitr_kegg(hg, fromType=<span class="string">'kegg'</span>, toType=<span class="string">'ncbi-proteinid'</span>, organism=<span class="string">'hsa'</span>)</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"> kegg ncbi-proteinid</span><br><span class="line">1 2175 NP_000126</span><br><span class="line">2 23046 NP_001239029</span><br><span class="line">3 4597 NP_002452</span><br><span class="line">4 5266 
NP_002629</span><br><span class="line">5 7111 NP_001159588</span><br><span class="line">6 755 NP_004919</span><br></pre></td></tr></table></figure><blockquote><p>The ID type (both <code>fromType</code> & <code>toType</code>) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is <code>entrezgene</code> ID for eukaryote species and <code>Locus</code> ID for prokaryotes.</p></blockquote><blockquote><p><mark>Many prokaryote species don’t have entrezgene ID available</mark>. For example we can check the gene information of <code>ece:Z5100</code> in <a href="http://www.genome.jp/dbget-bin/www_bget?ece:Z5100" target="_blank" rel="noopener">http://www.genome.jp/dbget-bin/www_bget?ece:Z5100</a>, which have <code>NCBI-ProteinID</code> and <code>UnitProt</code> links in the <code>Other DBs</code> Entry, but not <code>NCBI-GeneID</code>.</p></blockquote><blockquote><p>The full list of KEGG supported organisms can be accessed via <a href="http://www.genome.jp/kegg/catalog/org_list.html" target="_blank" rel="noopener">http://www.genome.jp/kegg/catalog/org_list.html</a>.</p></blockquote><h4 id="keggrest"><a class="markdownIt-Anchor" href="#keggrest"></a> KEGGREST</h4><p><a href="http://bioconductor.org/packages/KEGGREST/" target="_blank" rel="noopener">KEGGREST</a> provides a client interface to the KEGG REST server. 
Its <code>keggConv()</code> function converts identifiers.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">library</span>(KEGGREST)</span><br></pre></td></tr></table></figure><p>Convert between KEGG identifiers and outside identifiers.</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">keggConv(<span class="string">"ncbi-proteinid"</span>, c(<span class="string">"hsa:10458"</span>, <span class="string">"ece:Z5100"</span>))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"> hsa:10458 ece:Z5100 </span><br><span class="line">"ncbi-proteinid:NP_059345" "ncbi-proteinid:AAG58814"</span><br></pre></td></tr></table></figure><p>…or get the mapping for an entire species:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">head(keggConv(<span class="string">"eco"</span>, <span class="string">"ncbi-geneid"</span>))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ncbi-geneid:944742 ncbi-geneid:945803 ncbi-geneid:947498 ncbi-geneid:945198 ncbi-geneid:944747 ncbi-geneid:944749 </span><br><span class="line"> "eco:b0001" "eco:b0002" "eco:b0003" "eco:b0004" "eco:b0005" "eco:b0006"</span><br></pre></td></tr></table></figure><p>Reversing the arguments does the opposite mapping:</p><figure class="highlight r"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">head(keggConv(<span class="string">"ncbi-geneid"</span>, <span class="string">"eco"</span>))</span><br></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"> eco:b0001 eco:b0002 eco:b0003 eco:b0004 eco:b0005 eco:b0006 </span><br><span class="line">"ncbi-geneid:944742" "ncbi-geneid:945803" "ncbi-geneid:947498" "ncbi-geneid:945198" "ncbi-geneid:944747" "ncbi-geneid:944749"</span><br></pre></td></tr></table></figure><h2 id="web-server"><a class="markdownIt-Anchor" href="#web-server"></a> Web Server</h2><ul><li><a href="https://david.ncifcrf.gov/conversion.jsp" target="_blank" rel="noopener">DAVID - Gene ID Conversion Tool</a>. easy to use, but a bit old.</li><li><a href="http://www.ensembl.org/biomart/martview" target="_blank" rel="noopener">BioMart - Ensembl</a>. up-to-date and powerful.</li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180918: create the note.</li></ul>]]></content>
<summary type="html">
    <p>ID mapping is annoying, but we have to face it very often. This note is a collection of methods for dealing with it.</p>
<h2 id="r-bioc
</summary>
<category term="bioinformatics" scheme="https://yiweiniu.github.io/blog/categories/bioinformatics/"/>
<category term="basic" scheme="https://yiweiniu.github.io/blog/categories/bioinformatics/basic/"/>
<category term="ID conversion" scheme="https://yiweiniu.github.io/blog/tags/ID-conversion/"/>
<category term="bioconductor" scheme="https://yiweiniu.github.io/blog/tags/bioconductor/"/>
<category term="biomaRt" scheme="https://yiweiniu.github.io/blog/tags/biomaRt/"/>
<category term="OrgDb" scheme="https://yiweiniu.github.io/blog/tags/OrgDb/"/>
<category term="API" scheme="https://yiweiniu.github.io/blog/tags/API/"/>
</entry>
<entry>
<title>Remove Microbial Contamination in Reads</title>
<link href="https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/"/>
<id>https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/</id>
<published>2018-07-25T10:37:59.000Z</published>
<updated>2019-01-15T03:02:27.000Z</updated>
<content type="html"><![CDATA[<p>Purpose in short: I’ve got both Illumina (PE and MPE) and PacBio reads of an insect for <em>de novo</em> genome assembly. Since whole bodies were used for DNA extraction, I suspected that the raw sequencing reads contained some contamination, so I wanted to remove it before assembling.</p><p>The main tool I used for Illumina reads was <code>BBDuk</code>, which belongs to the <a href="https://sourceforge.net/projects/bbmap/" target="_blank" rel="noopener">BBTools suite</a>. <code>Seal</code> and <code>BBSplit</code> can also do this, but <code>BBDuk</code> is the most suitable for my purpose. See the discussion here: <a href="https://www.biostars.org/p/165059/" target="_blank" rel="noopener">Question: How to remove contamination from the transcriptome assembly</a>. <code>BBTools</code> is quite versatile, fast, and convenient! Although I don’t yet fully understand every tool in it, my experience with the ones I’ve tried has been very good. The author, Brian Bushnell, is very active and responsive. Here is a handy summary of <code>BBMap</code>: <a href="http://seqanswers.com/forums/showthread.php?t=58221" target="_blank" rel="noopener">Yes … BBMap can do that!</a>.</p><p>But for PacBio long reads, I didn’t find any tool specialized for this. I tried several tools, including <code>mapPacBio.sh</code> (based on <code>bbmap</code> and tuned for the error profile of long reads), <code>blasr</code>, <code>MashMap</code>, and <code>minimap2</code>. In the end, I found that <code>minimap2</code> best met my needs.</p><h2 id="prepare-the-contamination-library"><a class="markdownIt-Anchor" href="#prepare-the-contamination-library"></a> Prepare the contamination library</h2><ul><li><p>Download all the bacterial, viral, fungal, protozoan, and archaeal sequences using <a href="https://github.com/sschmeier/refseq2kraken" target="_blank" rel="noopener">refseq2kraken</a>. 
See note <a href="/blog/2018/06/Detect-Microbial-Contamination-in-Contigs-by-Kraken/" title="Detect Microbial Contamination in Contigs by Kraken">Detect Microbial Contamination in Contigs by Kraken</a> for details. We can also use <a href="https://github.com/kblin/ncbi-genome-download" target="_blank" rel="noopener">ncbi-genome-download</a> to download them.</p> <figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> folder structures</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ls genomes/refseq/</span></span><br><span class="line">archaea bacteria fungi protozoa viral</span><br></pre></td></tr></table></figure></li><li><p>Merge all the sequences into one single FASTA file.</p></li><li><p>Since I’ve got the mitochondrial DNA sequences of the target organism, I added them to the contaminant library to remove reads from mtDNA too. This library is referred to as <code>$contaminants</code> below.</p></li><li><p>After discussing with the author of <code>MashMap</code>, I included the genomes of several insects from the same order as my target insect to improve the specificity of the aligners. This library is referred to as <code>$contaminants2</code>.</p></li></ul><h2 id="bbduk-for-short-reads"><a class="markdownIt-Anchor" href="#bbduk-for-short-reads"></a> BBDuk for short reads</h2><p><a href="http://seqanswers.com/forums/showthread.php?t=42776" target="_blank" rel="noopener">BBDuk</a> is a member of the <a href="https://sourceforge.net/projects/bbmap/" target="_blank" rel="noopener">BBTools</a> package. “Duk” stands for Decontamination Using Kmers. 
BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">WORKDIR=/parastor300/niuyw/Project/beetle_genome_171231/data/second</span><br><span class="line"></span><br><span class="line">ANNODIR=/home/niuyw/RefData</span><br><span class="line">TOOLDIR=/home/niuyw/software</span><br><span class="line">path2java=$TOOLDIR/jre1.8.0_111/bin/java</span><br><span class="line"></span><br><span class="line">DATADIR=/home/niuyw/Project/beetle_genome_171231/data/second</span><br><span class="line"></span><br><span class="line">PPN=24</span><br><span class="line"></span><br><span class="line">path2bbduk=$TOOLDIR/bbmap/bbduk.sh</span><br><span class="line">contaminants=$ANNODIR/NCBI_contaminants/contaminants_4_bettle_genome.fa</span><br><span class="line"></span><br><span class="line">for sample in 270B 500B 800B 3k_1 5k-1 5k-2 10k</span><br><span class="line">do</span><br><span class="line"><span class="meta">$</span><span class="bash">path2bbduk ref=<span class="variable">${contaminants}</span> threads=<span class="variable">${PPN}</span> ordered=t k=31 <span class="keyword">in</span>=<span class="variable">${DATADIR}</span>/TrimGalore/<span class="variable">${sample}</span>_R1_val_1.fq.gz in2=<span class="variable">${DATADIR}</span>/TrimGalore/<span 
class="variable">${sample}</span>_R2_val_2.fq.gz out=<span class="variable">${DATADIR}</span>/bbduk/<span class="variable">${sample}</span>.R1.fq out2=<span class="variable">${DATADIR}</span>/bbduk/<span class="variable">${sample}</span>.R2.fq outm=<span class="variable">${DATADIR}</span>/bbduk/<span class="variable">${sample}</span>.bad.R1.fq outm2=<span class="variable">${DATADIR}</span>/bbduk/<span class="variable">${sample}</span>.bad.R2.fq</span></span><br><span class="line">done</span><br></pre></td></tr></table></figure><p>The ‘good’ reads will be in <code>${sample}.R*.fq</code>, and the ‘bad’ ones will be in <code>${sample}.bad.R*.fq</code>.</p><h2 id="long-reads"><a class="markdownIt-Anchor" href="#long-reads"></a> Long reads</h2><h3 id="mappacbio"><a class="markdownIt-Anchor" href="#mappacbio"></a> mapPacBio</h3><p><a href="http://seqanswers.com/forums/showthread.php?p=133568#post133568" target="_blank" rel="noopener">mapPacBio.sh</a> is based on <code>bbmap</code>, and is tuned for the error profile of long reads.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">WORKDIR=/parastor300/niuyw/Project/beetle_genome_171231</span><br><span class="line"></span><br><span class="line">ANNODIR=/home/niuyw/RefData</span><br><span class="line"></span><br><span class="line">TOOLDIR=/home/niuyw/software</span><br><span class="line">path2java=$TOOLDIR/jre1.8.0_111/bin/java</span><br><span 
class="line"></span><br><span class="line">DATADIR=/home/niuyw/Project/beetle_genome_171231/data/third</span><br><span class="line"></span><br><span class="line">PPN=24</span><br><span class="line"></span><br><span class="line">path2bbmap=$TOOLDIR/bbmap</span><br><span class="line">contaminants=$ANNODIR/NCBI_contaminants/contaminants_4_bettle_genome.fa</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash">path2bbmap/mapPacBio.sh threads=<span class="variable">${PPN}</span> ref=<span class="variable">${contaminants}</span> <span class="keyword">in</span>=third_all.fasta maxlen=6000 out=bbmap.sam</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2bbmap/reformat.sh unmappedonly <span class="keyword">in</span>=bbmap.sam out=good.fa</span></span><br></pre></td></tr></table></figure><p>The output of <code>bbmap</code> or <code>mapPacBio.sh</code> is <code>SAM</code>, and both mapped and unmapped reads are saved in one file. <code>reformat.sh</code> with <code>unmappedonly</code> was used to extract the unmapped reads and convert them to <code>FASTA</code>. See this: <a href="https://www.biostars.org/p/233118/" target="_blank" rel="noopener">Question: bbmap command to extract mapped and unmapped pair end reads</a>.</p><p>But <code>mapPacBio</code> was very, very slow, even with 24 threads. After about 25 days, the output was only about 4.6G (my input was about 60G!). Although it was still growing, I killed the job.</p><h3 id="blasr"><a class="markdownIt-Anchor" href="#blasr"></a> blasr</h3><p><a href="https://github.com/PacificBiosciences/blasr" target="_blank" rel="noopener">blasr</a> may be one of the first long-read aligners <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. 
So I gave it a shot.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/anaconda2/bin/blasr third_all.fasta <span class="variable">$contaminants</span> --nproc <span class="variable">$PPN</span> --out blasr.bad --unaligned blasr.good</span></span><br></pre></td></tr></table></figure><p>Related: <a href="https://github.com/PacificBiosciences/blasr/issues/347" target="_blank" rel="noopener">output unmapped reads</a>.</p><p>But it went wrong:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">[INFO] 2018-07-02T18:56:18 [blasr] started.</span><br><span class="line">ERROR! Reading fasta files greater than 4Gbytes is not supported.</span><br></pre></td></tr></table></figure><p>What?! My FASTA file was about 60G. To use <code>blasr</code>, I would have had to split the input into smaller files, which I didn’t want to do.</p><h3 id="mashmap"><a class="markdownIt-Anchor" href="#mashmap"></a> MashMap</h3><p><a href="https://github.com/marbl/MashMap" target="_blank" rel="noopener">MashMap</a> is a fast approximate aligner for long DNA sequences <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. 
It’s very fast and easy to use.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">PPN=8</span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/mashmap-Linux64-v2.0/mashmap -r <span class="variable">$contaminants</span> -q third_all.fasta -o mashmap.out</span></span><br></pre></td></tr></table></figure><p><code>-s</code> is the minimum query length (default is 5000), and <code>--pi</code> is the minimum identity to be reported (default is 85).</p><p>The output of <code>MashMap</code> looks like this. The space-separated fields are, in order: <code>query name, length, 0-based start, end, strand, target name, length, start, end and mapping nucleotide identity</code>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104/24935_31494 6559 0 4999 - NC_037282.1 2038340 596479 601478 82.1711</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/357/31541_41029 9488 0 4999 - NC_004326.2 1343557 117176 122175 81.9933</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/562/0_12626 12626 7626 12625 + NC_001224.1 85779 50909 55908 82.2314</span><br></pre></td></tr></table></figure><p>There is also a thread about contamination: <a href="https://github.com/marbl/MashMap/issues/6" target="_blank" rel="noopener">Decontamination of bacterial sequences in an assembly</a>. The author suggested that “One more thing that may be helpful for you is to include the representative genome (corresponding to your assembly) in the database as well. 
This would help improve the specificity of the method for correct portions of your assembly.”</p><p>But the default parameters of <code>MashMap</code> are a bit loose, and I wanted to use stricter ones. When I lowered <code>-s</code> and <code>--pi</code>, the program needed a huge amount of RAM and couldn’t run on our machine.</p><p>I’ve reported this issue to the author: <a href="https://github.com/marbl/MashMap/issues/6#issuecomment-402911053" target="_blank" rel="noopener">Decontamination of bacterial sequences in an assembly</a>.</p><p>I also had some questions about its alignments: <a href="https://github.com/marbl/MashMap/issues/12" target="_blank" rel="noopener">Questions about the alignment of MashMap</a>.</p><p>I first ran <code>MashMap</code> with different parameters and got several outputs:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1, with default parameters</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2mashmap -r <span class="variable">$contaminants</span> -q third_all.fasta -o mashmap.out</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run2, with -s 2500 --pi 80</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2mashmap -t 8 -r <span class="variable">$contaminants</span> -q third_all.fasta -s 2500 --pi 80 -o mashmap2.out</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run3, with -s 500 --pi 85</span></span><br><span class="line"><span class="meta">$</span><span 
class="bash">path2mashmap -t 8 -r <span class="variable">$contaminants</span> -q third_all.fasta -s 500 --pi 85 -o mashmap3.out</span></span><br></pre></td></tr></table></figure><p>And the outputs from three runs varied:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> there are 6633142 sequences of input</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> grep -c <span class="string">'>'</span> third_all.fasta</span></span><br><span class="line">6633142</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run1</span></span><br><span class="line">cut -f 1 -d ' ' mashmap.out |sort|uniq|wc -l</span><br><span class="line">463569</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run2</span></span><br><span class="line">cut -f 1 -d ' ' mashmap2.out |sort|uniq|wc -l</span><br><span class="line">2821004</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run3</span></span><br><span class="line">cut -f 1 -d ' ' mashmap3.out |sort|uniq|wc -l</span><br><span class="line">6189307</span><br></pre></td></tr></table></figure><p>As can be seen, nearly all the sequences were aligned to contaminant library. 
That really shocked me!</p><p>Then I used <code>blastn</code> to check the 10 sequences with the highest identity and the 10 with the lowest identity from the first run. The highest ones were fine. There were some differences between the hits reported by <code>blastn</code> and <code>MashMap</code>, but that may be because they used different databases. The lowest ones, however, were problematic. Most of them returned ‘No significant similarity found’ with the default <code>blastn</code> parameters, and when I unselected the ‘Low complexity regions’ filter, the alignments were unreliable. This may have something to do with low-complexity regions or repeats.</p><p>And the author explained:</p><blockquote><p>Mashmap identity is an estimate based on Jaccard similarity- not the precise identity; unfortunately the Jaccard-similarity based metric delivers poor specificity in cases when the source of reads is absent from database. See if you can include an insect reference genome (s) in the reference list to avoid this.</p></blockquote><p>Then I added several genomes of insects from the same order as the target insect to the contaminant library and ran <code>MashMap</code> again:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash">path2mashmap -t 20 -r <span class="variable">$contaminants2</span> -q third_all.fasta -s 500 --pi 80 -o mashmap4.out</span></span><br></pre></td></tr></table></figure><p>Each read is in one of two states, mapped or unmapped, and each mapped read falls into one of three categories: mapped to insects (good), mapped to contaminants (bad), or mapped to both (ambivalent). 
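</p><p>The categories above can be sketched as follows. This is only a minimal illustration, not the script used for the counts in this note. It assumes that contaminant references can be recognized by the <code>kraken:taxid|</code> prefix seen in the example output, that every other reference is an insect sequence, and that the output is whitespace-delimited with the read name in column 1 and the reference name in column 6. The names <code>classify_reads</code> and <code>CONTAMINANT_PREFIX</code> are hypothetical.</p>

```python
from collections import defaultdict

# Assumption: contaminant references carry this prefix; anything else is insect.
CONTAMINANT_PREFIX = "kraken:taxid|"


def classify_reads(mashmap_lines):
    """Return {read_id: 'good' | 'bad' | 'ambivalent'} for mapped reads.

    Each line is expected to hold at least six whitespace-delimited columns,
    with the read name first and the reference name sixth.
    """
    kinds = defaultdict(set)
    for line in mashmap_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        read_id, ref_name = fields[0], fields[5]
        is_contaminant = ref_name.startswith(CONTAMINANT_PREFIX)
        kinds[read_id].add("bad" if is_contaminant else "good")
    # A read that hit both kinds of reference is ambivalent.
    return {
        read_id: "ambivalent" if len(hits) == 2 else next(iter(hits))
        for read_id, hits in kinds.items()
    }


if __name__ == "__main__":
    # Made-up records in the same column layout as the MashMap output.
    example = [
        "readA 4016 0 499 + NW_017852934.1 2683736 1681820 1682319 79.42",
        "readA 4016 500 999 - kraken:taxid|76857|NZ_CP022123.1 2521394 1537365 1537864 81.31",
        "readB 3000 0 499 + NC_007418.3 31381287 24981081 24981580 79.55",
        "readC 2000 0 499 - kraken:taxid|1202539|NC_018417.1 157543 40864 41363 79.43",
    ]
    for read_id, label in sorted(classify_reads(example).items()):
        print(read_id, label)
```

<p>Reads that are absent from the output never appear in the returned dictionary, so the unmapped count is the total number of reads minus the number of keys.</p><p>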
So I counted the reads in each category:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">No. of total reads: 6633142</span><br><span class="line"> No. of reads in the mashmap.output: 6214500</span><br><span class="line"> No. of good: 853515</span><br><span class="line"> No. of bad: 138799</span><br><span class="line"> No. of ambivalent: 5222186</span><br><span class="line"> No. of reads not in the mashmap.output: 418642</span><br></pre></td></tr></table></figure><p>As can be seen, the majority of reads were mapped ambivalently. Here is an example of such a read:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3500 4015 + NW_017852934.1 2683736 1681820 1682319 79.4204</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1000 1999 + NW_019280650.1 1003565 813077 813577 78.0766</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 2500 2999 - LJIG01019880.1 38067 34865 35364 79.3626</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 500 999 + NC_007418.3 31381287 24981081 24981580 79.5573</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3000 3499 - kraken:taxid|76857|NZ_CP022123.1 2521394 1537365 1537864 
81.3159</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 0 499 - kraken:taxid|1202539|NC_018417.1 157543 40864 41363 79.4397</span><br><span class="line">m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1500 2499 + kraken:taxid|1936081|NZ_CP019389.1 3752836 1813582 1814081 76.7726</span><br></pre></td></tr></table></figure><p>So it’s very hard to extract the good ones.</p><p>After all this, I took away two points:</p><ul><li>I should add some insect genomes to the library to reduce false positive hits.</li><li>Alignments from <code>MashMap</code> may not be very reliable.</li></ul><h3 id="minimap2"><a class="markdownIt-Anchor" href="#minimap2"></a> minimap2</h3><p><a href="https://github.com/lh3/minimap2" target="_blank" rel="noopener">minimap2</a> is a versatile pairwise aligner for genomic and spliced nucleotide sequences created by Heng Li <sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>. 
It’s easy to use and runs very fast.</p><h4 id="mimimap2"><a class="markdownIt-Anchor" href="#mimimap2"></a> minimap2</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">PPN=24</span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/minimap2-2.8_x64-linux/minimap2 -x map-pb <span class="variable">$contaminants</span> third_all.fasta -t <span class="variable">$PPN</span> -a -Q > minimap.sam</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -a: output in the SAM format</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -Q: do not output base quality <span class="keyword">in</span> SAM</span></span><br></pre></td></tr></table></figure><p>I had about 60G of input, and <code>minimap2</code> was fast: it finished after 77 CPU hours (about 4 hours of wall-clock time with 24 threads).</p><p>The output was huge (~685G!) and looks like this (truncated, with the sequences replaced by ‘seq’):</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">m54174_171023_074758/9962478/22561_26454 4 * 0 0 * * 0 0 seq *</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 4 * 0 0 * * 0 0 seq *</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 16 kraken:taxid|1094466|NC_017025.1 1304571 1 3661S5M2D6M4I8M1D4M1D13M1D11M1I16M1D13M1D13M1D27M1D4M1D4M2I4M2D12M2D9M1D2M1D13M1D12M1D14M1D13M1D6M2I7M1D7M * 0 0 seq * NM:i:43 ms:i:220 AS:i:220 nn:i:0 tp:A:P cm:i:3 s1:i:40 s2:i:0 
dv:f:0.0200</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 4 * 0 0 * * 0 0 seq *</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 16 kraken:taxid|1323664|NZ_CP012748.1 2198444 1 903S12M1I12M1I12M1I4M1I6M1I2M1I12M1I39M1I9M1I6M5I4M4I7M1I2M1I5M1D6M1I12M1I12M1I12M1I13M1I11M2I5M1D8M2I22M1I13M1I23M1I12M1I12M1I10M1I3M2I10M3I5M1I6M1I3M1I16M1D6M1I12M2I6M1D3M1I2M1I7M1D4M1I5M1I7M1I12M1I6M7I3M1I3M1D6M1I11M1I12M1I12M2464S * 0 0 seq * NM:i:95 ms:i:432 AS:i:432 nn:i:0 tp:A:P cm:i:4 s1:i:49 s2:i:77 dv:f:0.0701 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198449,-,1461S464M47I1921S,15,95;kraken:taxid|1323664|NZ_CP012748.1,2198443,-,1853S470M47I1523S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 2064 kraken:taxid|1323664|NZ_CP012748.1 2198449 15 1461H7M1I12M1I7M1D4M1I12M1I4M1D10M1I10M1D24M1I18M2I9M1I8M1I12M2I9M1I2M6I7M1I13M5I5M1I7M1D4M1I10M1I2M1I12M1I12M1I12M2I13M1I11M1I11M1I5M1I7M1I5M1I8M1I12M1I5M1D4M1I2M1I4M2I2M1I7M5D5M1D7M2I21M1I24M4I8M1I12M1I12M1I4M1I20M1I12M1921H * 0 0 seq * NM:i:95 ms:i:420 AS:i:420 nn:i:0 tp:A:P cm:i:6 s1:i:80 s2:i:0 dv:f:0.0526 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198444,-,903S469M57I2464S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198443,-,1853S470M47I1523S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;</span><br><span 
class="line">m54174_171023_074758/9962478/22561_26454 272 kraken:taxid|1323664|NZ_CP012748.1 2198445 0 1010S9M1I2M1I4M1I8M1I12M1I5M1D6M1I12M1I39M1I11M2I3M1I10M1I22M1I13M1I23M1I12M1I12M1I10M1I3M2I11M1I1M2I3M1I6M1I3M1I16M1D6M1I13M4D7M1D5M2I7M1D4M1I5M1I6M1I13M1I6M1I9M1I2M1D7M1I11M1I10M1I15M1I10M1I13M1I12M1I12M1I7M1D4M1I13M2386S * 0 0 * * NM:i:86 ms:i:418 AS:i:418 nn:i:0 tp:A:S cm:i:6 s1:i:77 dv:f:0.0589</span><br><span class="line">m54174_171023_074758/9962478/22561_26454 2064 kraken:taxid|1323664|NZ_CP012748.1 2198443 1 1853H19M1I22M4I8M1I12M1I38M1I18M2I9M1I3M1I5M1I12M1I24M1I12M2I24M2I3M1I3M1D5M3I12M1I13M2I3M1I8M1I11M2I8M1D3M1I2M1I11M1I12M1I12M1I5M1D3M1D2M1I12M1I12M1I13M2I1M3I4M1I10M1D14M2I2M1I4M1D4M1I4M1I8M1I5M1D4M1I2M1I12M2I12M1I7M1D10M1523H * 0 0 seq * NM:i:95 ms:i:414 AS:i:414 nn:i:0 tp:A:P cm:i:7 s1:i:60 s2:i:97 dv:f:0.0466 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198444,-,903S469M57I2464S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,1461S464M47I1921S,15,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;</span><br></pre></td></tr></table></figure><p>We can clearly see that the same read was mapped to different reference sequences, and that it was reported several times with exactly the same line content.</p><p>This is a limitation of <code>minimap2</code>, as reported: <a href="https://github.com/lh3/minimap2/issues/164" target="_blank" rel="noopener">Multiple empty hits?</a> and <a href="https://github.com/lh3/minimap2/issues/141" target="_blank" rel="noopener">How does using a multi part index affect the accuracy?</a>. 
Specifically, when a huge reference is used, <code>minimap2</code> splits it into multiple parts and aligns all queries against each part independently. For most parts, <code>minimap2</code> prints unmapped records.</p><p>The <strong>good news</strong> is that <a href="https://github.com/lh3/minimap2/releases/tag/v2.12" target="_blank" rel="noopener">Minimap2-2.12 (r827)</a> has addressed this problem.</p><h4 id="minimap2-arm"><a class="markdownIt-Anchor" href="#minimap2-arm"></a> minimap2-arm</h4><p><a href="https://github.com/hasindu2008/minimap2" target="_blank" rel="noopener">minimap2-arm</a> is a solution provided by Hasindu Gamaarachchi. It merges the results from a multi-part index to produce output very similar to that of a single-part index. So I cloned this modified <code>minimap2</code> and ran the job again.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> install</span></span><br><span class="line">git clone https://github.com/hasindu2008/minimap2 minimap2-arm && cd minimap2-arm && git checkout multipart-merge-tmp && make</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> run</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/minimap2-arm/minimap2 -x map-pb -I 500G -t <span class="variable">$PPN</span> -a -Q --multi-prefix tmp <span class="variable">$contaminants2</span> third_all.fasta > minimap2.sam</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> --multi-prefix: <span class="built_in">enable</span> merging</span></span><br><span class="line"><span class="meta">#</span><span 
class="bash"> -I: split index <span class="keyword">for</span> every ~500G input bases; this number is far larger than the reference.</span></span><br></pre></td></tr></table></figure><p>I counted the reads in each category as I did in the <code>MashMap</code> part:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">No. of total reads: 6633142</span><br><span class="line"> No. of reads in the SAM: 6633142</span><br><span class="line"> No. of mapped: 661646</span><br><span class="line"> No. of good: 490150</span><br><span class="line"> No. of bad: 125390</span><br><span class="line"> No. of ambivalent: 46106</span><br><span class="line"> No. of Unmapped: 5971496</span><br><span class="line"> No. of reads not in the SAM: 0</span><br></pre></td></tr></table></figure><p>Then I kept the ‘unmapped’, ‘good’ and ‘ambivalent’ reads for downstream analysis.</p><p>To parse the result and get the clean sequences, I used a simple Python script to extract the clean FASTA and the bad FASTA IDs. 
(appendix 1)</p><p>To validate the effectiveness, I checked several sequences manually with <code>blastn</code>.</p><p>Below are two examples.</p><p>This is the alignment of ‘m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873’ by <code>minimap2</code>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> grep <span class="string">'m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873'</span> minimap2.sam</span></span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 16 kraken:taxid|66084|NC_012416.1 1207118 5 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D5M1I3M2D18M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I11M1I2M2D6M2D3M1D31M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I5M1D3M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I28M1D3M1D3M1I16M1D2M1D3M1I6M1I14M2I12M1D28M1D3M1D6M2I5M1I32M1D8M1D25M1D8M1D30M1I11M1I1M1I16M1D14M1D17M1D19M1I32M1D14M1D6M1D12M1I50M1D26M1I8M1D3M1D12M1D13M1I8M1D18M1D12M1I24M1D4M2D6M3I10M1D3M2D18M1D3M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D9M1I17M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D3M1I11M1D6M1I1M1I9M1I14M2I6M2I14M1D2M1D6M1D9M2D1M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S * 0 0 seq * NM:i:424 ms:i:2078 AS:i:2078 nn:i:0 tp:A:P cm:i:28 
s1:i:477 s2:i:474 dv:f:0.0866 SA:Z:kraken:taxid|66084|NC_012416.1,1360016,-,616S1173M32I3052S,9,195;</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1236909|NC_021089.1 1067958 0 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I11M1I2M2D6M2D3M1D31M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I5M1D3M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I28M1D3M1D3M1I16M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D14M1D6M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I35M1I9M1D6M1I22M1I11M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D9M1I17M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I4M1D2M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D13M2D4M1D6M1D7M2D5M2I11M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I9M1I14M2I6M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I12M1D1M1D8M1D8M1I3M1I1M1I5M1I18M1I17M9S * 0 0 * NM:i:435 ms:i:2028 AS:i:2028 nn:i:0 tp:A:S cm:i:28 s1:i:474 dv:f:0.0866</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|225364|NZ_LK055284.1 991010 0 
2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S * NM:i:444 ms:i:1984 AS:i:1984 nn:i:0 tp:A:S cm:i:24 s1:i:437 dv:f:0.0923</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|163164|NC_002978.6 1039834 0 
2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S * 0 NM:i:444 ms:i:1984 AS:i:1984 nn:i:0 tp:A:S cm:i:25 s1:i:452 dv:f:0.0908</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1633785|NZ_CP011148.1 1039878 0 
2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S * NM:i:446 ms:i:1972 AS:i:1972 nn:i:0 tp:A:S cm:i:25 s1:i:452 dv:f:0.0908</span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 2064 kraken:taxid|66084|NC_012416.1 1360016 9 616H6M1I3M1I9M1D12M1I3M1I4M1I1M1I6M3I5M1D10M1D2M1D4M1D4M1I28M2I7M4I3M2I10M1I1M1I4M1D4M6I7M1I6M1I9M1I2M1I13M1I1M1I8M1D4M1I8M1I15M1D3M1I15M1D5M1I1M3D4M1I10M1I8M2I7M1I9M1D4M1I7M1I12M2D11M1I27M1D21M1D5M2I23M5I3M1I9M1I6M3I14M1I6M1D3M1D43M1I2M1D4M1D8M1D21M2I16M1I7M1D5M1D5M1I11M1I21M1D6M1D2M1D7M2I3M1I15M3I2M1I1M2I6M2I4M1I6M1D9M1D62M1I10M1D3M1I11M2D7M1D13M1I4M1I10M1D4M1I8M1D15M1D12M1D10M1D10M1D10M1I12M1D2M2D1M1D14M1I8M1D6M1D11M2D15M1I13M1I14M1I8M1I32M1D4M1I18M1D30M2D5M1D10M1D5M1D11M1D13M1D6M1I7M1I4M1D13M1I1M2D18M2D8M1I8M1I3M1I7M3052H * 0 0 seq * NM:i:195 ms:i:1194 AS:i:1194 nn:i:0 tp:A:P cm:i:21 s1:i:287 s2:i:284 dv:f:0.0502 SA:Z:kraken:taxid|66084|NC_012416.1,1207118,-,2514S2299M51I9S,5,424;</span><br><span 
class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1633785|NZ_CP011148.1 1051218 0 616S6M1I3M1I9M1D12M1I3M1I4M1I1M1I6M3I5M1D10M1D2M1D4M1D4M1I28M2I7M4I3M2I10M1I1M1I4M1D4M6I7M1I6M1I9M1I2M1I13M1I1M1I8M1D4M1I8M1I15M1D3M1I15M1D7M2D4M1I10M1I8M2I7M1I9M1D4M1I7M1I12M2D11M1I27M1D29M1I21M5I8M1I4M1I6M3I14M1I6M1D3M1D43M1I2M1D4M1D8M1D21M2I16M1I7M1D5M1D5M1I11M1I21M1D6M1D2M1D7M2I3M1I15M3I2M1I2M4I9M1I6M1D9M1D62M1I10M1D3M1I11M2D5M1D15M1I4M1I10M1D4M1I8M1D15M1D12M1D10M1D10M1D10M1I12M1D2M2D1M1D14M1I8M1D6M1D11M2D15M1I13M1I14M1I8M1I32M1D4M1I18M1D30M2D5M1D10M1D5M1D11M1D13M1D6M1I7M1I4M1D13M1I1M2D18M2D8M1I8M1I3M1I7M3052S * 0 0 * * NM:i:200 ms:i:1164 AS:i:1164 nn:i:0 tp:A:S cm:i:22 s1:i:284 dv:f:0.0655</span><br></pre></td></tr></table></figure><p>These are the results of <code>blastn</code>, which I marked with taxonomy IDs. They agreed well with those of <code>minimap2</code>.</p><img src="/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/blast_1532401032_2561.png"><p>Here is another example.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> grep <span class="string">'m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902'</span> minimap2.sam</span></span><br><span class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902 16 kraken:taxid|573570|NZ_CP016796.1 218142 29 390S6M1D15M1D10M3I6M1D15M1I3M2I11M3I4M1I14M3I4M1I9M1D7M1D3M2I9M3I20M2I6M1D15M1I3M2I17M3I9M2I4M1I5M1D11M3I9M3I4M4I5M1I10M1I12M1I6M3D10M2D5M2D12M2I11M3177S * 0 0 seq * NM:i:85 ms:i:178 AS:i:178 nn:i:0 tp:A:P cm:i:5 s1:i:93 s2:i:52 dv:f:0.0291 SA:Z:kraken:taxid|573570|NZ_CP016796.1,218159,-,296S287M9I3310S,22,82;</span><br><span 
class="line">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902 2064 kraken:taxid|573570|NZ_CP016796.1 218159 22 296H9M3I6M4I7M1D11M3I3M1I3M1D11M3I8M1I10M3I6M1D11M3I13M1I5M3I6M1D15M1I3M2I11M4D4M5D7M5D11M3I4M1I9M1D7M1D3M2I6M3I23M2I6M1D17M5D4M2D5M2D13M3310H * 0 0 seq * NM:i:82 ms:i:178 AS:i:178 nn:i:0 tp:A:P cm:i:7 s1:i:166 s2:i:120 dv:f:0.0235 SA:Z:kraken:taxid|573570|NZ_CP016796.1,218142,-,390S304M31I3177S,29,85;</span><br></pre></td></tr></table></figure><p>This time, I got different results using <code>blastn</code>.</p><img src="/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/1532401743_26861.png"><p>Here is the detailed alignment of the top hit.</p><img src="/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/1532401866_17878.png"><p>But when I checked the raw sequence, I found lots of repeats. I don’t know what they are.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span 
class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">></span><span class="bash">m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902 RQ=0.860</span></span><br><span class="line">CAGAAGATGAGATTTAAATTGCATCCCAATTGTCATATATAATTAATCATATAAATTATATATGTGTAAG</span><br><span class="line">TTTTTTTTTTTTGTGAACATTATAGAAATAATAATAATATTCCATGGCCCTGACAAAGGGCCGAGTATTT</span><br><span class="line">ACTCAATTTGGACAAAGAATAAAAAATTTAAGCCTTCCTTTCAGTTTCTTGGGCAACAGAACATATTTGG</span><br><span class="line">TAAAACACCACCTTGTGCAATCGTACACCAAACAGAAGTTTAATTTATTCTTCGTCCACCATCATACGAA</span><br><span class="line">TTGCAATTGCAAATGTCTATGGAATAATACGAGTTTTCTTGTTGTCCACAACTCTTATGAGCAGCATTAC</span><br><span class="line">CGCGCCAATTCTAGTACTATCAAGCAGCTAAATATTACCCATGACAGCAGCCAAATCATAAACTGGTGCA</span><br><span class="line">CCAGCTCCAACACGTTCAGCATAATTTCCTTTGAAAAAATTTAAAAATTTTGAAAAATTCAAAATTTTGA</span><br><span class="line">AAAATTCAAAAATTTTGAAAAATTACAACAAATTTGAAAAAAATTCAAAAATTTGAAAAAATTCAATAAA</span><br><span class="line">ATTTTGAAAAAATTCAAAAAAATTTTAAAAAATCTCAAAAATTTTGAAAATTCATAAATTTTGAAAAAAT</span><br><span class="line">TCAAAACCTTTTGAAAACAAAATCAAAATTTGAAAAAAATTCAAAAATTTTGAAAAATTCAAAAATTTGA</span><br><span 
class="line">AAAACCATTCAAAAATTTTGAAAAAAAATTCAAAAATTTTGAAAAAATTCAAAAATTTTGAAATAAAAAA</span><br><span class="line">TTTCAAAAATTTTGAAAAAAATTCAAAAATTTTGAAAAATCTCAAAAAATTTTGAAAAAGTTCAATGAAA</span><br><span class="line">TTTTGAAAAAGTTTAAAAGATTTTGAAAAAAATACCAAAAATTTTGAAAAAATCAAGAATTTTGAAAAAA</span><br><span class="line">ATTCCAAAATTTTGAAAAAATTCAATTAAATTTTGAAAAATTCAAAATCTTTGAAAAAAATTCAACAAAT</span><br><span class="line">TTTGAAAAAATTCACAAAAATTTTGAAAAAATTCAAAAATTTTGAAAAATTCAAAAATTTTTGAAAAAAC</span><br><span class="line">AAACTCAAAATTTTGAAAAAATTCAAAAATTTGAAAAAATTCAAAAAATTTGAAAAAATTCAAAAACTTT</span><br><span class="line">TGAAAAAAATTCAAAAATCTTTGAAAAAATTCAAAAATTTTTGAAAAAATTCAAAAATTTTGAAAAAAAT</span><br><span class="line">TCAAAAATTTTGAAAAAATTCAAAAATTTTTGAAAAAATCAAAAATTTGAAAAAATTCCAAAAATTTTTG</span><br><span class="line">AAAAAATTCAAAAAATTTTCGAAAAATTCAATTTTGAAAAATTCAAAAATTTACTGAAAAAATTTCAAAA</span><br><span class="line">ATTTTGAAAAAATTCAATAAATTTTGAAAAAATTCAAACATTTTGAAAAATTCAAAAATTTTGAAAAAAT</span><br><span class="line">CAACAAATTTTTGAAAAAATTTCAAAAATTTTTGAAAAATCAAAAAATTTTTTGAAAAAAATTCACGCCA</span><br><span class="line">AATTTTGAAAAAATTTAAAAATTTTGAAAAAATTAAATAAATTTTAAAATTTGTATGATTTTTCAAATTT</span><br><span class="line">TTGATAAGTTTTCATTTTGAAAAATTTAAATTTTTTCAAAATTTTTTGAATTTTTTCAAAATTTTGAATT</span><br><span class="line">TTTTCACCCAATTTTTGAATTTTCTTTCAAAAAATTTTTAGAAATTAACAAATAACCTATTTCAAATTTT</span><br><span class="line">TGAATTTTTCAAATTTTTGATTTTTTCAAAATTTTGAATTTTTTTTCAAAATTTTGAATTTTTAAATACA</span><br><span class="line">AATTTTTGAATTTTTCAAAATTGTTTGAATTTTTATCAAAATTTTTGAATTTTTTTAAAAATTTTTGATT</span><br><span class="line">TTTTCAAAGGAAATTATTGCTGAAGCCGTGTTGGAGCTGTGCACCAGTTTTATTTGGCTGCCTGTTCATG</span><br><span class="line">GAACTTTTAGCTGCTGAGTACTAAGAATTGGCGGGTAATGCTGTCAGTGACAGAACAGAAAACTCGTATT</span><br><span class="line">ATTTCCTAGACATTTGCAATTGGCAATTCAGTAAATGACGAACCAATTAAATAATACTTCTGTCTGGTGT</span><br><span class="line">TTACGAATGCACAAGGTGAGTGTTTTACCAAATAACAAGCTGTTCTGATGCCTCAACGAAAACTGAAAGA</span><br><span 
class="line">AGGCTTAAATTTTTTTATTCCTTTGTCCAAATTAGTAAACTACTTCGGCCCTTTTCAGGGCCATAATATT</span><br><span class="line">CATTATCTATTTTCACTCAGATGTTCGCAAAAAAAAAACTTACACATATAATAATTTTATATATTAATTT</span><br><span class="line">GCAAATTTGGATGCAATTTAAAATTGAATGATAAAGTGCAATGGTGTTGTCTAGTCTACAAAAATTCTAT</span><br><span class="line">AAACGTACACAAAAAAAATATTCGCTAAATTGAATTGTTGATAAAAAAAATATTTTTACTTAGAAAATCT</span><br><span class="line">TAAAAAAAAAAAACACAATAGATATGATTGTATATAATCAAAAAATGCTATTGAATGTAAATAATTTTTT</span><br><span class="line">AACAATTTCAAATTTTTAAAAACGTTCTCGCTTAGTTATATCAAATTATTCGGATTTTGGTTTTTTTATT</span><br><span class="line">ATTAATTATTATTATATTAATAATAAATTATTATTCTACTCAATTATTAAAATTTGAATAATTTTGTGGC</span><br><span class="line">CTCAAAGGGCCATTTGTTTTATAATGACAATTTTATTGAAGGAATACCAAAAGAATTGAAATAAACATAC</span><br><span class="line">GATTGATATATTAAAGTATTAGCGTGTTTTATTTCTTTCTTTTAGGTGAAAATGCTGCCTTCCTGGGCTT</span><br><span class="line">TAGGTGAAGTATGGATGTTTTTGGCCAGTTTTATGGTTTATTTGGTGCCTTTGGTTTTTAGTTTGGACCT</span><br><span class="line">TTTAGCTGATTTTTTGGCCTTTAATCGGTGAATTTTGCAGTTTGTAGTCGTACGAATGTAGTTTTAGGCA</span><br><span class="line">GCAGATGGTGGTTTTTTCTTCTCGTCTTTTTTTCGGTTTAGCAGCTTTAATTGGTGATTTTTTTTGCTTT</span><br><span class="line">GATTGATTTCTTACAGCCGGTTTTATGTTTTAATTGTCCAGTTTGCTAAAGTAGAATTCGTTTTACAGTT</span><br><span class="line">GCAAGCTTTTTTGGGTTTTGTACCAGAAGCCCTTCTTTCTTTTTCTTTTAGAACTTTTTTTTCTTAGAAC</span><br><span class="line">TTTTTTTTTTTAGAACATTTTTTTTTAGGAACTTTTTTTTTAGAACTTTTTTTAGAAACTTTTGTTATCT</span><br><span class="line">TTTCTGTTTGTTCTTGTTATCTTTTTGTTTGTTTTGTTATCATTTTTTTGTTTCTTTTGGTTCTCTTTTT</span><br><span class="line">GTTTGTTTTTGTTATCATTTTTTGTTTGTTTTTGGGTATCTCTTTTTTTGTTTGTTTTGTTATCTTTTTG</span><br><span class="line">TTGTTTGGTTTATCTTTTCCTTTGTTTTGTTTATTTTTTTGTTTGTTTTGGTTATATTTTTGTTTGTTTG</span><br><span class="line">TTATCTTTTTTGTTTTTTTTGTTATCTTTTTGTTTGTTTGTATCTTTTTTTGTTTGTTTTGTTATCTTTT</span><br><span class="line">TTTGTTTGTTTTGTTATCTTTTTGTTTGTTTTTGTTATCTTTTTGTTTGTTTTGTTTATCTTTATTGTTT</span><br><span 
class="line">GTTTTGTTATCTTTTTGTTTGTTTTGTTATCTTCTTTTGTTTGTTTTGTTATCTTTTTGCTTTGTTTTGT</span><br><span class="line">TATCTTTTTGTTTTGTTTTGTTATCTTTTTGTTGTTCTTTATTCTGGTTAATCATTTTTTTGGTTGTTTT</span><br><span class="line">GTTCATCTTTTTGTGTTTGTTTTTGTTATCTTTTTGTTTGTTTTGTTAATCTTTTTGTTTGCTTTGTTAT</span><br><span class="line">CATTTTTTGTTTGCTTTGTTATTTTTTGTTTGCTTTGTTATATTTTTGTTGCTTTGTTATCATTTTTGTT</span><br><span class="line">TGCTTTGGTTATCTTTGTTTGCTTTGTTATCTTTTTTTGTTGGCTTTGTTTATCGTTTTTGTATTTGCTT</span><br><span class="line">TGTTATCGTTTTTGTTTGCTTTGTTATCTTTTTTGTTTTGCGTTTTTTTAGC</span><br></pre></td></tr></table></figure><p>The differences between <code>minimap2</code> and <code>blastn</code> are explainable: they use different algorithms and different databases. In general, I think <code>minimap2</code> is reliable.</p><h2 id="in-summary"><a class="markdownIt-Anchor" href="#in-summary"></a> In summary</h2><p>In summary, I removed contaminants from the reads with the following steps:</p><ul><li>Prepare a contamination library (bacteria, viruses, fungi, protozoa, and archaea from RefSeq, plus mtDNAs).</li><li>Use <code>BBDuk</code> to remove contaminants from Illumina short reads.</li><li>Use <code>minimap2</code> to remove contaminants from PacBio long reads.</li></ul><p>But some concerns remain:</p><ul><li>Some real sequences may also be removed along with the contaminants.</li><li>Many repeat sequences were removed, and I don’t know where they came from.</li><li>I kept the ‘ambivalent’ reads for downstream analysis, but I don’t know whether some of them should be removed, for example by discarding reads whose primary alignment is to a contaminant.</li></ul><p>There are other long-read mappers, and people have also tried tuning the parameters of short-read aligners to work with long reads. 
There are some threads/posts talking about this:</p><ul><li><a href="https://www.biostars.org/p/116696/" target="_blank" rel="noopener">Question: Long read alignment</a></li><li><a href="https://www.biostars.org/p/130978/" target="_blank" rel="noopener">Question: Long read-alignment + variant calling</a></li><li><a href="https://www.biostars.org/p/231120/" target="_blank" rel="noopener">Question: Alternative to BLASR ?</a></li><li><a href="https://www.biostars.org/p/231559/" target="_blank" rel="noopener">Question: BWA-MEM using long PacBio reads</a></li><li><a href="https://davetang.org/muse/2012/10/12/mapping-long-reads-with-bowtie/" target="_blank" rel="noopener">Mapping long reads with Bowtie</a></li><li><a href="https://groups.google.com/forum/#!searchin/rna-star/starlong/rna-star/-2mBTPWRCJY/PKSPCNcade8J" target="_blank" rel="noopener">STAR: segmentation fault when using long reads</a></li></ul><p>I don’t really understand the “mapping” things now, but I expect that there will be several dominant tools for long-reads mapping, just as short-reads mapping.</p><h2 id="useful-links"><a class="markdownIt-Anchor" href="#useful-links"></a> Useful links</h2><ul><li><a href="https://www.biostars.org/p/244912/" target="_blank" rel="noopener">Question: Removing contaminations from PacBio reads</a></li><li><a href="https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete" target="_blank" rel="noopener">How can I download RefSeq data for all complete bacterial genomes?</a></li><li><a href="http://seqanswers.com/forums/showthread.php?t=41288" target="_blank" rel="noopener">Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries</a></li><li><a href="http://seqanswers.com/forums/showthread.php?t=58221" target="_blank" rel="noopener">Yes … BBMap can do that!</a></li><li><a href="https://github.com/HRGV/DGHM2017_assembly/wiki/From-raw-reads-to-assembly---STEP-by-STEP" target="_blank" rel="noopener">From raw reads to assembly STEP by 
STEP</a></li><li><a href="http://seqanswers.com/forums/showthread.php?t=78397" target="_blank" rel="noopener"> Removing contamination with BBDUK</a></li><li><a href="https://www.biostars.org/p/143019/" target="_blank" rel="noopener">Question: Tool to separate human and mouse rna seq reads</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180628: create the note.</li><li>20180725: complete the note.</li><li>20180807: update the ‘minimap-arm’ part, add the results of <code>kraken</code>.</li><li>20180815: add the part of ‘Minimap2-2.12 (r827)’</li></ul><hr class="footnotes-sep"><section class="footnotes"><ol class="footnotes-list"><li id="fn1" class="footnote-item"><p>Chaisson MJ, Tesler G. 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 13:238. doi:10.1186/1471-2105-13-238. <a href="#fnref1" class="footnote-backref">↩︎</a></p></li><li id="fn2" class="footnote-item"><p>Jain C, Dilthey A, Koren S, Aluru S, Phillippy AM. 2018 Apr 30. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. Journal of Computational Biology. doi:10.1089/cmb.2018.0036. [accessed 2018 Jul 2]. <a href="https://www.liebertpub.com/doi/10.1089/cmb.2018.0036" target="_blank" rel="noopener">https://www.liebertpub.com/doi/10.1089/cmb.2018.0036</a>. <a href="#fnref2" class="footnote-backref">↩︎</a></p></li><li id="fn3" class="footnote-item"><p>Li H. 2017 Aug 4. Minimap2: versatile pairwise alignment for nucleotide sequences. arXiv:170801492 [q-bio]. [accessed 2018 Jan 10]. <a href="http://arxiv.org/abs/1708.01492" target="_blank" rel="noopener">http://arxiv.org/abs/1708.01492</a>. <a href="#fnref3" class="footnote-backref">↩︎</a></p></li></ol></section>]]></content>
<summary type="html">
<p>Purpose in short: I’ve got both Illumina (PE and MPE) and PacBio reads of an insect for <em>de novo</em> genome assembly. Since whole bodi
</summary>
<category term="reads" scheme="https://yiweiniu.github.io/blog/categories/reads/"/>
<category term="contamination" scheme="https://yiweiniu.github.io/blog/categories/reads/contamination/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="contamination" scheme="https://yiweiniu.github.io/blog/tags/contamination/"/>
<category term="reads" scheme="https://yiweiniu.github.io/blog/tags/reads/"/>
<category term="BBDuk" scheme="https://yiweiniu.github.io/blog/tags/BBDuk/"/>
<category term="MashMap" scheme="https://yiweiniu.github.io/blog/tags/MashMap/"/>
<category term="long-reads" scheme="https://yiweiniu.github.io/blog/tags/long-reads/"/>
<category term="long-reads alignment" scheme="https://yiweiniu.github.io/blog/tags/long-reads-alignment/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: wtdbg</title>
<link href="https://yiweiniu.github.io/blog/2018/06/Genome-Assembly-Pipeline-wtdbg/"/>
<id>https://yiweiniu.github.io/blog/2018/06/Genome-Assembly-Pipeline-wtdbg/</id>
<published>2018-06-30T07:16:55.000Z</published>
<updated>2018-07-03T10:49:59.000Z</updated>
<content type="html"><![CDATA[<h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2><p><code>wtdbg</code> has two Git repositories: <a href="https://github.com/ruanjue/wtdbg" target="_blank" rel="noopener">wtdbg</a> and <a href="https://github.com/ruanjue/wtdbg-1.2.8" target="_blank" rel="noopener">wtdbg-1.2.8</a>, and the author Jue Ruan (who also developed <a href="https://github.com/ruanjue/smartdenovo" target="_blank" rel="noopener">SMARTdenovo</a>) introduces them as:</p><blockquote><p>wtdbg: A fuzzy Bruijn graph approach to long noisy reads assembly. wtdbg is desiged to assemble huge genomes in very limited time, it requires a PowerPC with multiple-cores and very big RAM (1Tb+). wtdbg can assemble a 100 X human pacbio dataset within one day.</p></blockquote><blockquote><p>wtdbg-1.2.8: Important update of wtdbg</p></blockquote><p>Jue Ruan <a href="https://github.com/ruanjue/wtdbg-1.2.8/issues/2#issuecomment-372202134" target="_blank" rel="noopener">prefers <code>wtdbg-1.2.8</code></a>:</p><blockquote><p>In personal feeling, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006.</p></blockquote><p>This tool had not been published as of 20180307; I found it in an evaluation paper in Briefings in Bioinformatics:</p><blockquote><p>Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics. 2017 Nov 3:bbx147-bbx147. 
doi:10.1093/bib/bbx147</p></blockquote><p>My impressions:</p><ul><li>very fast</li><li>easy to install</li><li>easy to use</li><li>docs and discussions about this tool are limited</li><li>aggressive</li><li>good N50 (at least in our two genome projects, an insect and a plant)</li><li>relatively poor completeness</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><p>Because <code>wtdbg</code> has two different versions and I didn’t know which one would suit my data better, I tried both.</p><h3 id="wtdbg-v11006"><a class="markdownIt-Anchor" href="#wtdbg-v11006"></a> wtdbg v1.1.006</h3><h4 id="install"><a class="markdownIt-Anchor" href="#install"></a> Install</h4><p>I ran into <a href="https://github.com/ruanjue/wtdbg/issues/4" target="_blank" rel="noopener">a problem</a> when compiling the software. The issue was caused by the <code>CPATH</code> setting on our system and was eventually solved with the help of Jue Ruan.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/ruanjue/wtdbg.git && cd wtdbg</span><br><span class="line">make</span><br></pre></td></tr></table></figure><h4 id="examples-in-the-doc"><a class="markdownIt-Anchor" href="#examples-in-the-doc"></a> Examples in the doc</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> assembly of contigs</span></span><br><span class="line">wtdbg-1.1.006 -t 96 -i pb-reads.fa -o dbg -H -k 21 -S 1.02 -e 3 2>&1 | tee log.wtdbg</span><br><span class="line"><span class="meta">#</span><span class="bash"> -t: number of threads, please <span class="built_in">type</span> <span class="string">'wtdbg-1.1.006 -h'</span> to get a document</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -i: you can <span class="built_in">set</span> more than one sequences files, such as -i 1.fa. -i 2.fq -i 3.fa.gz -i 4.fq.gz</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -o: the prefix of results</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -S: 1.01 will use all kmers, 1.02 will use half by sumsampling, 1.04 will use 1/4, and so on</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> 2.01 will use half by picking minimizers, but not fully tested</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> -e: <span class="keyword">if</span> too low coverage(< 30 X), try to <span class="built_in">set</span> -e 2</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> please note that dbg.ctg.fa is full of errors from raw reads</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> first round of polishment</span></span><br><span class="line">wtdbg-cns -t 96 -i dbg.ctg.lay -o dbg.ctg.lay.fa -k 15 2>&1 | tee log.cns.1</span><br><span class="line"><span class="meta">#</span><span class="bash"> dbg.ctg.lay.fa is the polished contigs</span></span><br><span class="line"></span><br><span class="line"><span 
class="meta">#</span><span class="bash"> <span class="keyword">if</span> possible, further polishment</span></span><br><span class="line">minimap -t 96 -L 100 dbg.ctg.lay.fa pb-reads.fa 2> >(tee log.minimap) | best_minimap_hit.pl | awk '{print $6"\t"$8"\t"$9"\t"$1"\t"$5"\t"$3"\t"$4}' >dbg.map</span><br><span class="line">map2dbgcns dbg.ctg.lay.fa pb-reads.fa dbg.map >dbg.map.lay</span><br><span class="line">wtdbg-cns -t 96 -i dbg.map.lay -o dbg.map.lay.fa -k 13 2>&1 | tee log.cns.2</span><br><span class="line"><span class="meta">#</span><span class="bash"> you need to concat all reads into one file <span class="keyword">for</span> minimap and map2dbgcns</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> dbg.map.lay.fa is the final contigs</span></span><br></pre></td></tr></table></figure><h3 id="wtdbg-v128"><a class="markdownIt-Anchor" href="#wtdbg-v128"></a> wtdbg v1.2.8</h3><h4 id="install-2"><a class="markdownIt-Anchor" href="#install-2"></a> Install</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/ruanjue/wtdbg-1.2.8.git && cd wtdbg-1.2.8</span><br><span class="line">make</span><br></pre></td></tr></table></figure><p><strong>For long sequences with a higher error rate</strong></p><ul><li>Decrease <code>-p</code>. Try <code>-p 19</code> or <code>-p 17</code>.</li><li>Decrease <code>-S</code>. Try <code>-S 2</code> or <code>-S 1</code>.</li></ul><p>Both will increase computing time.</p><p><strong>For very high coverage</strong></p><ul><li>Increase <code>--edge-min</code>. Try <code>--edge-min 4</code>, or higher.</li></ul><p><strong>For low coverage</strong></p><ul><li>Decrease <code>--edge-min</code>. Try <code>--edge-min 2 --rescue-low-cov-edges</code>.</li></ul><p><strong>Filter reads</strong></p><ul><li><code>--tidy-reads 5000</code>. Filters out shorter sequences. 
If read names are in the format <code>\/\d+_\d+$</code>, the longest subread is selected.</li></ul><p><strong>Consensus</strong></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa</span><br></pre></td></tr></table></figure><p>The output file <code>dbg.ctg.lay.fa</code> is ready for further polishing with <code>PILON</code> or <code>QUIVER</code>.</p><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><p>I’ve tried two versions of <code>wtdbg</code> and different parameter combinations in two genome assembly projects. The parameters and the resulting stats are as follows:</p><h3 id="an-insect"><a class="markdownIt-Anchor" href="#an-insect"></a> An insect</h3><ul><li>The species: high heterogeneity, high AT content, high repetition.</li><li>Genome size: male 790M, female 830M.</li><li>Data used: about 70X PacBio long reads.</li><li>OS environment: CentOS 6.6 x86_64, glibc-2.12. QSUB grid system. 
15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).</li></ul><h4 id="wtdbg-v11006-2"><a class="markdownIt-Anchor" href="#wtdbg-v11006-2"></a> wtdbg v1.1.006</h4><h5 id="run1-with-h-k-21-s-102-e-3"><a class="markdownIt-Anchor" href="#run1-with-h-k-21-s-102-e-3"></a> run1, with <code>-H -k 21 -S 1.02 -e 3</code>:</h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">total base: 607971510</span><br><span class="line">%GC: 32.21</span><br><span class="line">num: 4655</span><br><span class="line">min: 2700</span><br><span class="line">max: 6594103</span><br><span class="line">avg: 130606</span><br><span class="line">N50: 573208</span><br><span class="line">N90: 46228</span><br></pre></td></tr></table></figure><h4 id="wtdbg-v128-2"><a class="markdownIt-Anchor" href="#wtdbg-v128-2"></a> wtdbg v1.2.8</h4><h5 id="run1-with-defalult-k-0-p-21-s-4"><a class="markdownIt-Anchor" href="#run1-with-defalult-k-0-p-21-s-4"></a> run1, with default <code>-k 0 -p 21 -S 4</code>:</h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">total base: 757804309</span><br><span class="line">%GC: 32.37</span><br><span class="line">num: 20960</span><br><span class="line">min: 2247</span><br><span class="line">max: 3846453</span><br><span class="line">avg: 36154</span><br><span class="line">N50: 103681</span><br><span 
class="line">N90: 12128</span><br></pre></td></tr></table></figure><h5 id="run2-with-edge-min-2-rescue-low-cov-edges-tidy-reads-5000-because-median-node-depth-6-less-than-20"><a class="markdownIt-Anchor" href="#run2-with-edge-min-2-rescue-low-cov-edges-tidy-reads-5000-because-median-node-depth-6-less-than-20"></a> run2, with <code>--edge-min 2 --rescue-low-cov-edges --tidy-reads 5000</code> (because the median node depth is 6, less than 20)</h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">total base: 845834770</span><br><span class="line">%GC: 32.51</span><br><span class="line">num: 19555</span><br><span class="line">min: 2030</span><br><span class="line">max: 2025061</span><br><span class="line">avg: 43254</span><br><span class="line">N50: 158013</span><br><span class="line">N90: 14248</span><br></pre></td></tr></table></figure><h5 id="run3-with-k-15-p-0-s-1-rescue-low-cov-edges-tidy-reads-5000"><a class="markdownIt-Anchor" href="#run3-with-k-15-p-0-s-1-rescue-low-cov-edges-tidy-reads-5000"></a> run3, with <code>-k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000</code></h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 795503989</span><br><span class="line">Size_withoutN 795503989</span><br><span 
class="line">Seq_Num 12557</span><br><span class="line">Mean_Size 63351</span><br><span class="line">Median_Size 15690</span><br><span class="line">Longest_Seq 7257493</span><br><span class="line">Shortest_Seq 2277</span><br><span class="line">GC_Content 32.44</span><br><span class="line">N50 308340</span><br><span class="line">N90 21383</span><br></pre></td></tr></table></figure><h5 id="run4-with-k-0-p-19-s-2-rescue-low-cov-edges-tidy-reads-5000"><a class="markdownIt-Anchor" href="#run4-with-k-0-p-19-s-2-rescue-low-cov-edges-tidy-reads-5000"></a> run4, with <code>-k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000</code></h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 780618272</span><br><span class="line">Size_withoutN 780618272</span><br><span class="line">Seq_Num 11722</span><br><span class="line">Mean_Size 66594</span><br><span class="line">Median_Size 16335</span><br><span class="line">Longest_Seq 8184393</span><br><span class="line">Shortest_Seq 2547</span><br><span class="line">GC_Content 32.4</span><br><span class="line">N50 294217</span><br><span class="line">N90 23008</span><br></pre></td></tr></table></figure><h5 id="run5-with-tidy-reads-5000-k-21-p-0-s-2-rescue-low-cov-edges"><a class="markdownIt-Anchor" href="#run5-with-tidy-reads-5000-k-21-p-0-s-2-rescue-low-cov-edges"></a> run5, with <code>--tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges</code></h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 843085698</span><br><span class="line">Size_withoutN 843085698</span><br><span class="line">Seq_Num 26341</span><br><span class="line">Mean_Size 32006</span><br><span class="line">Median_Size 18982</span><br><span class="line">Longest_Seq 491063</span><br><span class="line">Shortest_Seq 2992</span><br><span class="line">GC_Content 32.51</span><br><span class="line">N50 54544</span><br><span class="line">N90 13737</span><br></pre></td></tr></table></figure><h5 id="run6-with-k-0-p-21-s-4-aln-noskip"><a class="markdownIt-Anchor" href="#run6-with-k-0-p-21-s-4-aln-noskip"></a> run6, with <code>-k 0 -p 21 -S 4 --aln-noskip</code></h5><p>After I discussed the results with the author, he suggested using <code>--aln-noskip</code>.</p><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 726925732</span><br><span class="line">Size_withoutN 726925732</span><br><span class="line">Seq_Num 15983</span><br><span class="line">Mean_Size 45481</span><br><span class="line">Median_Size 12714</span><br><span class="line">Longest_Seq 2523944</span><br><span class="line">Shortest_Seq 2290</span><br><span class="line">GC_Content 32.21</span><br><span class="line">N50 164635</span><br><span class="line">N90 14464</span><br></pre></td></tr></table></figure><h5 id="run7-with-k-15-p-0-s-1-rescue-low-cov-edges-tidy-reads-5000-aln-noskip"><a 
class="markdownIt-Anchor" href="#run7-with-k-15-p-0-s-1-rescue-low-cov-edges-tidy-reads-5000-aln-noskip"></a> run7, with <code>-k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip</code></h5><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 762713695</span><br><span class="line">Size_withoutN 762713695</span><br><span class="line">Seq_Num 9803</span><br><span class="line">Mean_Size 77804</span><br><span class="line">Median_Size 15366</span><br><span class="line">Longest_Seq 11163143</span><br><span class="line">Shortest_Seq 2449</span><br><span class="line">GC_Content 32.22</span><br><span class="line">N50 488952</span><br><span class="line">N90 25913</span><br></pre></td></tr></table></figure><p>After all these experiments, I’m not sure what to do next (try more parameters or move on). As suggested by Jue Ruan, a contig N50 of ~500 kb is good enough for scaffolding and genomic analysis, so I should evaluate the assembly and improve it while trying other tools.</p><h3 id="a-plant"><a class="markdownIt-Anchor" href="#a-plant"></a> A plant</h3><ul><li>The species: high heterogeneity, high repetition.</li><li>Genome size: 2.1G.</li><li>Data used: more than 100X PacBio long reads.</li><li>OS environment: CentOS 6.6 x86_64, glibc-2.12. QSUB grid system. 
15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).</li></ul><h4 id="wtdbg-v11006-3"><a class="markdownIt-Anchor" href="#wtdbg-v11006-3"></a> wtdbg v1.1.006</h4><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1, version 1.1.006</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/wtdbg/wtdbg -t <span class="variable">$PPN</span> -i <span class="variable">$WORKDIR</span>/data/Pacbio/all.fq.gz -o run1 -H -k 21 -S 1.02 -e 3 2>&1 | tee log.run1</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/wtdbg/wtdbg-cns -t <span class="variable">$PPN</span> -i run1.ctg.lay -o run1.ctg.lay.fa -k 15 2>&1 | tee log.run1.cns.1</span></span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 2105945650</span><br><span class="line">Size_withoutN 2105945650</span><br><span class="line">Seq_Num 21871</span><br><span class="line">Mean_Size 96289</span><br><span class="line">Median_Size 48435</span><br><span class="line">Longest_Seq 2968570</span><br><span class="line">Shortest_Seq 2531</span><br><span class="line">GC_Content 38.27</span><br><span class="line">N50 194480</span><br><span class="line">L50 2523</span><br><span class="line">N90 40454</span><br><span 
class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h4 id="wtdbg-v128-3"><a class="markdownIt-Anchor" href="#wtdbg-v128-3"></a> wtdbg v1.2.8</h4><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run2, version 1.2.8</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/wtdbg-1.2.8/wtdbg-1.2.8 -t <span class="variable">$PPN</span> -i <span class="variable">$WORKDIR</span>/data/Pacbio/all.fq.gz -o run2</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/wtdbg-1.2.8/wtdbg-cns -t <span class="variable">$PPN</span> -i run2.ctg.lay -o run2.ctg.lay.fa</span></span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN 1924031835</span><br><span class="line">Size_withoutN 1924031835</span><br><span class="line">Seq_Num 37933</span><br><span class="line">Mean_Size 50721</span><br><span class="line">Median_Size 14836</span><br><span class="line">Longest_Seq 2424157</span><br><span class="line">Shortest_Seq 2006</span><br><span class="line">GC_Content 38.75</span><br><span class="line">N50 184177</span><br><span class="line">L50 2391</span><br><span class="line">N90 17404</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h3 id="where-to-go-next"><a 
class="markdownIt-Anchor" href="#where-to-go-next"></a> Where to go next?</h3><p>I asked Jue Ruan <a href="https://github.com/ruanjue/wtdbg-1.2.8/issues/5" target="_blank" rel="noopener">whether it is necessary to run consensus tools on the results of <code>wtdbg</code> or <code>smartdenovo</code></a>, and he said:</p><blockquote><p>The inside consensus tool <code>wtdbg-cns</code> aims to provide a quick way to reduce sequencing errors. It is suggested to use <code>Quiver</code> and/or <code>Pilon</code> to polish the consensus sequences after you feel happy with the assembly. Usually, <code>wtdbg-cns</code> can reduce error rate down to less than 1%, which can be well-aligned by short reads.</p></blockquote><h2 id="useful-links"><a class="markdownIt-Anchor" href="#useful-links"></a> Useful links</h2><ul><li><a href="https://github.com/ruanjue/wtdbg-1.2.8/issues/2" target="_blank" rel="noopener">Discussions about “Optimisation of parameters”</a></li><li><a href="https://github.com/ruanjue/wtdbg-1.2.8/issues/5" target="_blank" rel="noopener">if it is necessary to run consensus tools on the results of wtdbg or smartdenovo</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180307: create the note.</li><li>20180630: add the ‘A plant’ part.</li></ul>]]></content>
<summary type="html">
<h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2>
<p><code>wtdbg</code> has two git repos: <a h
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="TGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/TGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly/"/>
<category term="TGS genome assembly" scheme="https://yiweiniu.github.io/blog/tags/TGS-genome-assembly/"/>
</entry>
<entry>
<title>Detect Microbial Contamination in Contigs by Kraken</title>
<link href="https://yiweiniu.github.io/blog/2018/06/Detect-Microbial-Contamination-in-Contigs-by-Kraken/"/>
<id>https://yiweiniu.github.io/blog/2018/06/Detect-Microbial-Contamination-in-Contigs-by-Kraken/</id>
<published>2018-06-28T05:45:10.000Z</published>
<updated>2018-08-16T05:24:55.000Z</updated>
<content type="html"><![CDATA[<p>Purpose in short: I want to detect (and remove) potential contaminants in the genome assembly, and Kraken is a tool designed for that.</p><h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2><p>From <a href="http://ccb.jhu.edu/software/kraken" target="_blank" rel="noopener">its webpage</a>:</p><blockquote><p>Kraken is a system for assigning taxonomic labels to <strong>short DNA sequences</strong>, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.<br><br>In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.<br><br>Kraken is written in C++ and Perl, and is designed for use with the Linux operating system. 
We have also successfully compiled and run it under the Mac OS.</p></blockquote><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><p>See the <a href="https://github.com/DerrickWood/kraken/blob/master/docs/MANUAL.markdown" target="_blank" rel="noopener">Kraken manual</a> for full instructions.</p><h3 id="install"><a class="markdownIt-Anchor" href="#install"></a> Install</h3><p>Download the latest <a href="https://github.com/DerrickWood/kraken/releases" target="_blank" rel="noopener">release</a>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> unzip kraken-1.1.zip && <span class="built_in">cd</span> kraken-1.1</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> ./install_kraken.sh <span class="variable">$TOOLDIR</span>/kraken.1.1</span></span><br></pre></td></tr></table></figure><p>If you want to build your own database, <a href="http://www.cbcb.umd.edu/software/jellyfish/" target="_blank" rel="noopener">jellyfish</a> <strong>version 1</strong> should be in your <code>PATH</code>.</p><h3 id="build-standard-kraken-database"><a class="markdownIt-Anchor" href="#build-standard-kraken-database"></a> Build standard Kraken database</h3><p>The standard Kraken database includes the bacterial, archaeal, and viral genomes in RefSeq at the time of the build.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/sh</span></span><br><span 
class="line"></span><br><span class="line">path2kraken=$TOOLDIR/kraken.1.1</span><br><span class="line"></span><br><span class="line">DBNAME=Kraken_DB</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-build --standard --threads 16 --db <span class="variable">$DBNAME</span></span></span><br></pre></td></tr></table></figure><p>If any step (including the initial downloads) fails, the build process will abort. However, kraken-build will produce checkpoints throughout the installation process, and will restart the build at the last incomplete step if you attempt to run the same command again on a partially-built database.</p><h3 id="build-custom-kraken-database"><a class="markdownIt-Anchor" href="#build-custom-kraken-database"></a> Build custom Kraken database</h3><p>The standard Kraken database doesn’t include fungi and protozoa, which I want to include in my analysis, so I built a custom database.</p><p>I found <a href="https://github.com/sschmeier/refseq2kraken" target="_blank" rel="noopener">refseq2kraken</a>, which facilitates downloading and preparing a Kraken database.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span 
class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line">path2kraken=$TOOLDIR/kraken.1.1</span><br><span class="line"></span><br><span class="line">DBNAME=Kraken_DB_201806</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> install refseq2kraken</span></span><br><span class="line">git clone https://github.com/sschmeier/refseq2kraken.git refseq2kraken</span><br><span class="line">cd refseq2kraken</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> Download refseq => here only <span class="string">"Complete Genome"</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> assemblies, e.g. the default</span></span><br><span class="line">python getRefseqGenomic.py -p 8</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> convert to kraken format => again only <span class="string">"Complete Genome"</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> assemblies here, e.g. 
the default</span></span><br><span class="line">python getKrakenFna.py -p 8 $DBNAME</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> build a new minikraken database</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> download taxonomy</span></span><br><span class="line">kraken-build --download-taxonomy --db $DBNAME</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> <span class="keyword">for</span> each branch, add all fna <span class="keyword">in</span> the directory to the database</span></span><br><span class="line">for dir in bacteria viral archaea fungi protozoa; do</span><br><span class="line"> find $DBNAME/$dir/ -name '*.fna' -print0 | xargs -0 -I{} -n1 -P8 kraken-build --add-to-library {} --db $DBNAME;</span><br><span class="line">done</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> build the actual database</span></span><br><span class="line">kraken-build --build --db $DBNAME</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> remove intermediate files</span></span><br><span class="line">kraken-build --clean --db $DBNAME</span><br></pre></td></tr></table></figure><p>In the rest of this note, this library is referred to as the ‘Non-Masked library’.</p><p>This post <a href="http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/" target="_blank" rel="noopener">Download refseq-genomic data and prepare it for Kraken</a> is also helpful.</p><h4 id="mask-low-complexity-regions"><a class="markdownIt-Anchor" href="#mask-low-complexity-regions"></a> Mask low-complexity regions</h4><p>I noticed a pull request to <code>kraken</code>: <a href="https://github.com/DerrickWood/kraken/pull/57" target="_blank" rel="noopener">Fixed human genome downloading and added 
auto-masking feature using dustmasker</a>, and realized that I should mask the library! So I re-generated the kraken library (I don’t know why the author of kraken didn’t mention that in their docs.)</p><p>Here is the new script:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/sh</span></span><br><span class="line"></span><br><span class="line">path2kraken=$TOOLDIR/kraken.1.1</span><br><span class="line">path2refseq2kraken=$TOOLDIR/refseq2kraken</span><br><span class="line">path2dustmasker=$TOOLDIR/ncbi-blast-2.7.1+/bin/dustmasker</span><br><span class="line"></span><br><span class="line">DBNAME=Kraken_DB_abfpv_1806</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> Download refseq => here only <span class="string">"Complete Genome"</span></span></span><br><span class="line">python 
$path2refseq2kraken/getRefseqGenomic.py -p 8</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> convert to kraken format => again only <span class="string">"Complete Genome"</span></span></span><br><span class="line">python $path2refseq2kraken/getKrakenFna.py -p 8 $DBNAME</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> build a new minikraken database</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-build --download-taxonomy --db <span class="variable">$DBNAME</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> <span class="keyword">for</span> each branch</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> filter with Dustmasker and convert low complexity regions to N<span class="string">'s with Sed (skipping headers)</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"> and add all fna to the database</span></span><br><span class="line">for i in `find $DBNAME \( -name '*.fna' -o -name '*.ffn' \)`</span><br><span class="line">do</span><br><span class="line"> $path2dustmasker -in $i -infmt fasta -outfmt fasta | sed -e '/>/!s/a\|c\|g\|t/N/g' > tempfile</span><br><span class="line"> $path2kraken/kraken-build --add-to-library tempfile --db $DBNAME</span><br><span class="line">done</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> qsub <span class="built_in">jobs</span> below to the computer nodes</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> build the actual database</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-build --build --threads 20 --db <span class="variable">$DBNAME</span></span></span><br><span 
class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> remove intermediate files</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-build --clean --db <span class="variable">$DBNAME</span></span></span><br></pre></td></tr></table></figure><p>In the rest of this note, this library is referred to as the ‘Masked library’.</p><h3 id="classify-contigs"><a class="markdownIt-Anchor" href="#classify-contigs"></a> Classify contigs</h3><h4 id="non-masked-library"><a class="markdownIt-Anchor" href="#non-masked-library"></a> Non-Masked library</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> classify</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken --db <span class="variable">$Kraken_DB_201806</span> --threads <span class="variable">$PPN</span> --fasta-input <span class="variable">$WORKDIR</span>/flye/run2/contigs.fasta --classified-out <span class="variable">$WORKDIR</span>/kraken/run2/flye.run2.classified --unclassified-out <span class="variable">$WORKDIR</span>/kraken/run2/flye.run2.unclassified > <span class="variable">$WORKDIR</span>/kraken/run2/flye.run2.kraken</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> report</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-report --db <span class="variable">$Kraken_DB_201806</span> <span class="variable">$WORKDIR</span>/kraken/run2/flye.run2.kraken > <span class="variable">$WORKDIR</span>/kraken/run2/flye.run2.report</span></span><br></pre></td></tr></table></figure><p>The ‘classified’ sequences will be saved in 
‘flye.run2.classified’, and ‘flye.run2.unclassified’ will be the ‘clean’ set, which can be used for downstream analysis.</p><p>But when I looked over the ‘report’, the result was quite disappointing.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> head -5 flye.run2.report</span></span><br><span class="line"> 31.30	4647	4647	U	0	unclassified</span><br><span class="line"> 68.70	10199	435	-	1	root</span><br><span class="line"> 51.55	7653	166	-	131567	  cellular organisms</span><br><span class="line"> 47.44	7043	471	D	2	    Bacteria</span><br><span class="line"> 24.67	3662	19	P	1224	      Proteobacteria</span><br></pre></td></tr></table></figure><p>This means that about 70% of the contigs were flagged as contaminated!</p><p>I was curious about what fraction of each contig’s length was contaminant, so I used a simple <a href="./ratio_of_contaminant_from_kraken.py">Python script</a> to compute it.</p><p>Then I used <code>R</code> to visualize the relationship between contig length and contaminated length. 
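</p><p>The linked script is not reproduced in this note. As a minimal sketch of how such a per-contig ratio could be computed, the functions below read the fifth column of a Kraken output line (space-separated <code>taxid:count</code> pairs, one per run of consecutive k-mers) and return the fraction of k-mers that hit any taxon in the database. The function names, and the choice to count every non-zero taxid (including root, taxid 1) as a contaminant hit, are assumptions for illustration and not necessarily the original script's logic. The sketch also assumes single-sequence (not paired-end) input, as is the case for contigs.</p>

```python
def contaminated_fraction(lca_field, ignore=("0", "A")):
    """Return the fraction of k-mers assigned to any taxon in the database.

    lca_field is the fifth column of a Kraken output line, e.g.
    "0:90 2:5 1:5". Taxid 0 means the k-mer is absent from the database
    and "A" marks k-mers with ambiguous nucleotides; both are counted in
    the total but not as hits.
    """
    hits = 0
    total = 0
    for pair in lca_field.split():
        taxid, count = pair.rsplit(":", 1)
        total += int(count)
        if taxid not in ignore:
            hits += int(count)
    return hits / total if total else 0.0


def parse_kraken_line(line):
    """Split one Kraken output line into (contig ID, length, fraction)."""
    _status, seq_id, _taxid, length, lca_field = line.rstrip("\n").split("\t")
    return seq_id, int(length), contaminated_fraction(lca_field)


def contamination_table(path):
    """Yield (contig ID, length, contaminated fraction) for every record."""
    with open(path) as fin:
        for line in fin:
            if line.strip():
                yield parse_kraken_line(line)
```

<p>Under these assumptions, <code>contaminated_fraction("0:90 2:5 1:5")</code> evaluates to 0.1, since 10 of the 100 k-mers hit the database. Whether root-level hits should count is a judgment call; passing a different <code>ignore</code> tuple excludes them.</p><p>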
There were 14846 contigs, of which 12004 (80.86%), 2788 (18.78%) and 54 (0.36%) had a contamination rate below 1%, between 1% and 10%, and above 10%, respectively.</p><img src="/blog/2018/06/Detect-Microbial-Contamination-in-Contigs-by-Kraken/1534323614_2536.png"><h4 id="masked-library"><a class="markdownIt-Anchor" href="#masked-library"></a> Masked library</h4><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> classify</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken --db Kraken_DB_abfpv_1806 --threads <span class="variable">$PPN</span> --fasta-input <span class="variable">$WORKDIR</span>/flye/flye/run2/contigs.fasta --classified-out <span class="variable">$WORKDIR</span>/flye/run3/flye.run2.classified --unclassified-out <span class="variable">$WORKDIR</span>/flye/run3/flye.run2.unclassified > <span class="variable">$WORKDIR</span>/flye/run3/flye.run2.kraken</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> report</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2kraken/kraken-report --db Kraken_DB_abfpv_1806 <span class="variable">$WORKDIR</span>/flye/run3/flye.run2.kraken > <span class="variable">$WORKDIR</span>/flye/run3/flye.run2.report</span></span><br></pre></td></tr></table></figure><p>And about 17% of contigs were classified as ‘contaminant’.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> head -5 flye.run2.report</span></span><br><span class="line"> 83.49	12395	12395	U	0	unclassified</span><br><span class="line"> 16.51	2451	15	-	1	root</span><br><span class="line"> 13.53	2009	0	D	10239	  Viruses</span><br><span class="line"> 13.46	1999	0	-	35237	    dsDNA viruses, no RNA stage</span><br><span class="line"> 13.44	1996	0	F	10482	      Polydnaviridae</span><br></pre></td></tr></table></figure><p>There were 14846 contigs, of which 14766 (99.46%), 61 (0.41%) and 19 (0.13%) had a contamination rate below 1%, between 1% and 10%, and above 10%, respectively.</p><img src="/blog/2018/06/Detect-Microbial-Contamination-in-Contigs-by-Kraken/1534323254_22359.png"><h3 id="replace-contaminant-regions-with-n"><a class="markdownIt-Anchor" href="#replace-contaminant-regions-with-n"></a> Replace contaminant regions with N</h3><p>My collaborator asked me for a draft assembly for a metagenomic study, and it needed to be a ‘clean’ one. 
Removing all the contaminated sequences was not feasible, since there would be few contigs left.</p><p>Then I checked the output of <code>Kraken</code>, which contained the detailed information of each contig:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> head -3 flye.run2.kraken</span></span><br><span class="line">C	contig_1102	118110	395511	0:71251 28874:1 1:73 35237:1 1241371:1 1:1 694430:4 131567:16 1:15 0:376 158:2 0:1 1:33 2:1 1783272:1 1769:1 0:5975 633697:1 1783272:1 273035:1 1:19 1307:1 0:18402 2:1 1:39 0:14 85655:2 0:152 78219:2 1:18 0:8591 679926:1 131567:2 1:12 1118964:3 0:2539 523841:1 1:31 0:3050 1:27 0:3370 1:26 1267001:1 0:1341 523841:1 1:5 131567:1 1783272:1 2100:2 0:75636 1:38 0:2 864702:2 0:6975 1428:4 131567:1 1:114 1783272:1 203124:1 0:10293 1353243:1 0:1 1:32 2:1 1313292:1 0:48537 1783272:1 273035:1 1:11 2:1 1313292:1 0:10283 1161:1 1507806:1 1:91 131567:1 0:611 1:13 131567:16 694430:4 1:1 1241371:1 35237:1 1:49 35237:3 10401:3 0:628 1605721:1 0:8 1:6 0:20 1605721:8 1:25 0:15 1:83 0:16 1605721:1 0:3 1605721:8 1:47 10239:2 12315:8 0:9 1605721:8 1:100 0:15 1:47 0:15 1:53 0:7 1:22 0:1 1:53 0:7 1:36 35237:1 0:3604 523841:1 1:51 2:2 444612:1 0:667 1654582:4 1:3 118110:9 0:3597 46170:1 1783272:1 1:7 0:25607 118110:7 131567:2 2:1 1898474:2 1783272:2 1:22 2:3 0:732 273035:2 1:35 0:16896 1:23 273035:1 0:5012 28890:2 2207:2 1:30 28890:1 2209:3 0:11916 118110:6 1:3 118110:1 1:13 320432:1 0:13148 39152:3 1:15 0:669 1105113:1 320432:1 1:29 1654582:1 0:6006 10371:3 2:1 1:41 0:5940 85655:2 0:6 864702:2 0:2 1:34 0:14 85655:2 0:301 1307:2 1783272:1 1:2 35237:1 1:2 1654582:1 0:5603 1428:1 1:23 0:4788 456320:1 1:23 0:169 2:2 131567:1 1:10 131567:1 1783272:1 0:1735 10335:1 1:28 118110:1 1:3 118110:4 0:11831 118110:3 131567:2 
2:1 1898474:2 1783272:2 1:19 0:1542 118110:3 1:3 118110:1 1:19 216946:1 0:452 444612:1 2:2 1:15 0:1 66266:1 0:1790 118110:8 131567:2 2:1 1898474:2 1783272:2 1:11 0:1 66266:1 0:319 375175:2 0:328 1:17 131567:1 1769:1 0:2798</span><br><span class="line">U	contig_11021	0	4721	0:4691</span><br><span class="line">C	contig_11022	118110	43806	0:12538 1:19 131567:1 444612:1 0:5305 118110:9 131567:2 2:1 1898474:1 0:149 1:17 39640:1 0:25732</span><br></pre></td></tr></table></figure><p>Because <code>Kraken</code> uses a k-mer mapping strategy to locate potential contaminants, in many contigs only a small part matches bacterial/viral sequences. So I located the exact k-mers that mapped to contaminants and masked them with ‘N’. I discussed this with another user (<a href="https://www.biostars.org/p/321334/#321479" target="_blank" rel="noopener">Questions about de novo genome assembly from mixed DNA samples</a>), and he also thought it was a valid approach.</p><h3 id="finally"><a class="markdownIt-Anchor" href="#finally"></a> Finally</h3><p>But I began to wonder how the contaminated sequences would affect the assembly process; removing contaminants from the raw data may be a better way. 
See discussions here: <a href="https://www.researchgate.net/post/Filtering_for_contamination_when_assembling_a_genome_before_or_after_assemby" target="_blank" rel="noopener">Filtering for contamination when assembling a genome, before or after assemby?</a> and <a href="https://www.biostars.org/p/165059/" target="_blank" rel="noopener">Question: How to remove contamination from the transcriptome assembly</a>.</p><h2 id="useful-links"><a class="markdownIt-Anchor" href="#useful-links"></a> Useful links</h2><ul><li><a href="https://github.com/sschmeier/refseq2kraken" target="_blank" rel="noopener">refseq2kraken: download refseq-genomic data and prepare it for Kraken</a></li><li><a href="http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/" target="_blank" rel="noopener">Download refseq-genomic data and prepare it for Kraken</a></li><li><a href="https://groups.google.com/forum/#!msg/kraken-users/jjRe21-qyvw/Kq8DXY45CQAJ" target="_blank" rel="noopener">Building a low-complexity masked database with dustmasker</a></li><li><a href="https://www.biostars.org/p/321334" target="_blank" rel="noopener">Questions about de novo genome assembly from mixed DNA samples</a></li><li><a href="https://www.molecularecologist.com/2017/01/handling-microbial-contamination-in-ngs-data/" target="_blank" rel="noopener">Handling microbial contamination in NGS data</a></li><li><a href="https://www.biostars.org/p/237168/" target="_blank" rel="noopener">Question: Contamination in assembly</a></li><li><a href="https://www.researchgate.net/post/Filtering_for_contamination_when_assembling_a_genome_before_or_after_assemby" target="_blank" rel="noopener">Filtering for contamination when assembling a genome, before or after assemby?</a></li><li><a href="https://www.biostars.org/p/165059/" target="_blank" rel="noopener">Question: How to remove contamination from the transcriptome assembly</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" 
href="#change-log"></a> Change log</h2><ul><li>20180424: create the note.</li><li>20180620: add the “refseq2kraken” part.</li><li>20180808: add “Mask low-complexity regions” and respective parts.</li></ul>]]></content>
<summary type="html">
<p>Purpose in short: I want to detect (and remove) potential contaminants in the genome assembly, and Kraken is a tool designed for that.</p>
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="contamination" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/contamination/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly/"/>
<category term="contamination" scheme="https://yiweiniu.github.io/blog/tags/contamination/"/>
</entry>
<entry>
<title>Viral Expression in RNA-seq data</title>
<link href="https://yiweiniu.github.io/blog/2018/05/Viral-Expression-in-RNA-seq-data/"/>
<id>https://yiweiniu.github.io/blog/2018/05/Viral-Expression-in-RNA-seq-data/</id>
<published>2018-05-04T02:21:23.000Z</published>
<updated>2018-06-30T07:06:11.000Z</updated>
<content type="html"><![CDATA[<p>Recently I wanted to check viral expression from RNA-seq data.</p><p>I found two good examples:</p><blockquote><p>Cao S, Strong MJ, Wang X, Moss WN, Concha M, Lin Z, O’Grady T, Baddoo M, Fewell C, Renne R, et al. 2015. High-Throughput RNA Sequencing-Based Virome Analysis of 50 Lymphoma Cell Lines from the Cancer Cell Line Encyclopedia Project. J. Virol. 89:713–729. doi:10.1128/JVI.02570-14.<br><br>Wang Zheng, Hao Y, Zhang C, Wang Zhiliang, Liu X, Li G, Sun L, Liang J, Luo J, Zhou D, et al. 2017. The Landscape of Viral Expression Reveals Clinically Relevant Viruses with Potential Capability of Promoting Malignancy in Lower-Grade Glioma. Clinical Cancer Research 23:2177–2185.</p></blockquote><p>Also some useful discussions:</p><ul><li><a href="https://groups.google.com/forum/#!topic/rna-star/QJxXmDzvJXU" target="_blank" rel="noopener">using STAR to map against 100 viral species</a></li><li><a href="https://groups.google.com/forum/#!msg/rna-star/cLpf7BuDnGY/nLXTE_pHDHgJ" target="_blank" rel="noopener">slow mapping to a small genome</a></li></ul><p>Alex (the author of <code>STAR</code>) suggested combining the human and viral genomes into one reference. 
But I had already mapped the <code>FASTQ</code> files to the human genome (hg38) and saved the unmapped reads to separate <code>FASTQ</code> files.</p><p>Step 1, download all viral genomes from <a href="ftp://ftp.ncbi.nih.gov/refseq/release/viral/" target="_blank" rel="noopener">NCBI RefSeq Viral Release</a>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.2.1.genomic.fna.gz</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> gzip -d viral.*.genomic.fna.gz</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> cat viral.1.1.genomic.fna viral.2.1.genomic.fna > viral.refseq.180424.fa</span></span><br></pre></td></tr></table></figure><p>Step 2, build <code>STAR</code> index.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> mkdir STARgenomes</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> /software/STAR-2.5.3a/bin/Linux_x86_64_static/STAR --runThreadN 10 --genomeDir ./STARgenomes --runMode genomeGenerate --genomeFastaFiles viral.refseq.180424.fa</span></span><br></pre></td></tr></table></figure><p>Step 3, align unmapped reads to viral genomes.</p><figure class="highlight shell"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">for sample in x1 x2 ...</span><br><span class="line">do</span><br><span class="line"> $TOOLDIR/STAR-2.5.3a/bin/Linux_x86_64_static/STAR --runMode alignReads --runThreadN 24 --genomeDir $STARindex --outSAMtype BAM SortedByCoordinate --outSAMattributes All --readFilesIn $WORKDIR/STAR_out/${sample}_Unmapped.out.mate1.gz $WORKDIR/STAR_out/${sample}_Unmapped.out.mate2.gz --readFilesCommand zcat --outFileNamePrefix $WORKDIR/Viral_expression/${sample}_</span><br><span class="line"> $TOOLDIR/samtools.1.3.1/bin/samtools index $WORKDIR/Viral_expression/${sample}_Aligned.sortedByCoord.out.bam</span><br><span class="line">done</span><br></pre></td></tr></table></figure><p>Step 4, compute viral expression.</p><p>I wanted to use existing read-counting software to quantify the viruses, so I had to create a fake annotation (a fake GTF file).</p><p>The tiny script to create GTF from FASTA file was like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#!/usr/bin/env python</span></span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">purpose: to calculate read count of virus, I want to make a fake GTF of virus fasta.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">usage: python xxx.py virus.fa > virus.gtf</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="keyword">import</span> sys</span><br><span class="line"></span><br><span class="line"><span class="comment"># step 1: read the fasta and put it into a dict</span></span><br><span class="line">seq_dict = {}</span><br><span class="line"><span class="keyword">with</span> open(sys.argv[<span class="number">1</span>], <span class="string">'r'</span>) <span class="keyword">as</span> fin:</span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> fin:</span><br><span class="line">        line = line.strip()</span><br><span class="line">        <span class="keyword">if</span> line.startswith(<span class="string">'>'</span>):</span><br><span class="line">            seqID = line.split()[<span class="number">0</span>].replace(<span class="string">'>'</span>, <span class="string">''</span>)</span><br><span class="line">            seqName = <span class="string">' '</span>.join(line.split(<span class="string">','</span>)[<span class="number">0</span>].split()[<span class="number">1</span>:])</span><br><span class="line">            seq_dict[seqID] = [seqName, <span class="string">''</span>]</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            seq_dict[seqID][<span class="number">1</span>] 
+= line</span><br><span class="line"></span><br><span class="line"><span class="comment"># step 2: traverse the dict and generate GTF. Use the whole virus as an exon.</span></span><br><span class="line"><span class="keyword">for</span> key <span class="keyword">in</span> seq_dict:</span><br><span class="line">    seqLen = len(seq_dict[key][<span class="number">1</span>])</span><br><span class="line">    tmp_list = [key, <span class="string">'Virus'</span>, <span class="string">'exon'</span>, <span class="string">'1'</span>, str(seqLen), <span class="string">'.'</span>, <span class="string">'+'</span>, <span class="string">'.'</span>]</span><br><span class="line">    <span class="keyword">print</span>(<span class="string">'\t'</span>.join(tmp_list) + <span class="string">'\t'</span> + <span class="string">'gene_id "'</span> + key + <span class="string">'"; gene_name "'</span> + seq_dict[key][<span class="number">0</span>] + <span class="string">'";'</span>)</span><br></pre></td></tr></table></figure><p>And the output looked like this:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> head -3 viral.refseq.180424.fake.gtf</span></span><br><span class="line">NC_003747.2	Virus	exon	1	4212	.	+	.	gene_id "NC_003747.2"; gene_name "Ryegrass mottle virus isolate MAFF. No. 
307043 from Japan";</span><br><span class="line">NC_011500.2	Virus	exon	1	1614	.	+	.	gene_id "NC_011500.2"; gene_name "Rotavirus A segment 5";</span><br><span class="line">NC_007737.1	Virus	exon	1	3055	.	+	.	gene_id "NC_007737.1"; gene_name "Liao ning virus segment 2";</span><br></pre></td></tr></table></figure><p>Then I used the <code>featureCounts</code> function from the <code>Rsubread</code> R package to count the reads of each virus (non-strand-specific, because the transcription direction is unknown), and used the <code>rpkm</code> function of <code>edgeR</code> to normalize the raw counts to viral “FPKM”.</p><p><strong>Note</strong>:</p><ul><li>The expression values are estimates and may contain many errors. Interpret the results with care.</li></ul><h2 id="change-notes"><a class="markdownIt-Anchor" href="#change-notes"></a> Change notes</h2><ul><li>20180424: create the note.</li></ul>]]></content>
<summary type="html">
<p>Recently I wanted to check viral expression from RNA-seq data.</p>
<p>I found two good examples:</p>
<blockquote>
<p>Cao S, Strong MJ, Wa
</summary>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/"/>
<category term="viral expression" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/viral-expression/"/>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/tags/RNA-seq/"/>
<category term="virus" scheme="https://yiweiniu.github.io/blog/tags/virus/"/>
<category term="viral expression" scheme="https://yiweiniu.github.io/blog/tags/viral-expression/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: SMARTdenovo</title>
<link href="https://yiweiniu.github.io/blog/2018/04/Genome-Assembly-Pipeline-SMARTdenovo/"/>
<id>https://yiweiniu.github.io/blog/2018/04/Genome-Assembly-Pipeline-SMARTdenovo/</id>
<published>2018-04-23T11:59:54.000Z</published>
<updated>2018-08-18T04:39:57.000Z</updated>
<content type="html"><![CDATA[<h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2><p>From its <a href="https://github.com/ruanjue/smartdenovo" target="_blank" rel="noopener">Git Repo</a>:</p><blockquote><p>SMARTdenovo is a <em>de novo</em> assembler for PacBio and Oxford Nanopore (ONT) data. It produces an assembly from all-vs-all raw read alignments without an error correction stage. It also provides tools to generate accurate consensus sequences, though a platform dependent consensus polish tools (e.g. Quiver for PacBio or Nanopolish for ONT) are still required for higher accuracy.<br><br>SMARTdenovo consists of several separate command line tools: <strong>wtzmo</strong> for read overlapping, <strong>wtgbo</strong> to rescue missing overlaps, <strong>wtclp</strong> for identifying low-quality regions and chimaera, and <strong>wtcns</strong> or <strong>wtmsa</strong> to produce better unitig consensus. The <code>smartdenovo.pl</code> script provides a convenient interface to call these programs in one go.</p></blockquote><p>This tool has not been published yet. 
(20180313)</p><p>My feelings:</p><ul><li>easy to install/use</li><li>not as fast as <code>wtdbg</code>, but fast</li><li>comparatively good results (at least in my case)</li><li>docs and discussions about this tool are limited.</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"># Download sample PacBio from the PBcR website</span><br><span class="line">wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -</span><br><span class="line">awk 'NR%4==1||NR%4==2' selfSampleData/pacbio_filtered.fastq | sed 's/^@/>/g' > reads.fa</span><br><span class="line"># Install SMARTdenovo</span><br><span class="line">git clone https://github.com/ruanjue/smartdenovo.git && (cd smartdenovo; make)</span><br><span class="line"># Assemble (raw unitigs in wtasm.lay.utg; consensus unitigs: wtasm.cns)</span><br><span class="line">smartdenovo/smartdenovo.pl -c 1 reads.fa > wtasm.mak</span><br><span class="line">make -f wtasm.mak</span><br></pre></td></tr></table></figure><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><h3 id="an-insect"><a class="markdownIt-Anchor" href="#an-insect"></a> An insect</h3><ul><li>The species: high heterogeneity, high AT content, high repeat content.</li><li>Genome size: male 790M, female 830M.</li></ul><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1, 
default</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2perl <span class="variable">$TOOLDIR</span>/smartdenovo/smartdenovo.pl -t <span class="variable">$PPN</span> -c 1 -p run1 <span class="variable">$DATADIR</span>/third/third_all.fasta > run1.mak</span></span><br><span class="line">make -f run1.mak</span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN	756816708</span><br><span class="line">Size_withoutN	756816708</span><br><span class="line">Seq_Num	6135</span><br><span class="line">Mean_Size	123360</span><br><span class="line">Median_Size	55901</span><br><span class="line">Longest_Seq	5704487</span><br><span class="line">Shortest_Seq	10769</span><br><span class="line">GC_Content	31.72</span><br><span class="line">N50	240010</span><br><span class="line">N90	44546</span><br><span class="line">Gap	0.0</span><br></pre></td></tr></table></figure><p><code>SMARTdenovo</code> can also use the <code>zmo</code> overlapper. I also tested this option, but it generated a genome of about 17G! 
(The estimated genome size is about 850M.)</p><h3 id="a-plant"><a class="markdownIt-Anchor" href="#a-plant"></a> A plant</h3><ul><li>The species: high heterogeneity, high repetition.</li><li>Genome size: 2.1G.</li></ul><h4 id="run1-with-about-100x-data"><a class="markdownIt-Anchor" href="#run1-with-about-100x-data"></a> run1, with about 100X data</h4><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1, default</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2perl <span class="variable">$TOOLDIR</span>/smartdenovo/smartdenovo.pl -t 24 -c 1 -p run1 <span class="variable">$WORKDIR</span>/data/Pacbio/all.fq.gz > run1.mak</span></span><br><span class="line">make -f run1.mak</span><br></pre></td></tr></table></figure><p>And the stats I got:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN	2103140368</span><br><span class="line">Size_withoutN	2103140368</span><br><span class="line">Seq_Num	6164</span><br><span class="line">Mean_Size	341197</span><br><span class="line">Median_Size	163362</span><br><span class="line">Longest_Seq	9288681</span><br><span class="line">Shortest_Seq	12171</span><br><span class="line">GC_Content	38.16</span><br><span class="line">N50	703465</span><br><span class="line">L50	809</span><br><span class="line">N90	151138</span><br><span 
class="line">Gap	0.0</span><br></pre></td></tr></table></figure><h4 id="run2-with-about-50x-data"><a class="markdownIt-Anchor" href="#run2-with-about-50x-data"></a> run2, with about 50X data</h4><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run2, 50X</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2perl <span class="variable">$TOOLDIR</span>/smartdenovo/smartdenovo.pl -t <span class="variable">$PPN</span> -c 1 -p run2 <span class="variable">$WORKDIR</span>/data/Pacbio/Pacbio_50x.fasta > run2.mak</span></span><br><span class="line">make -f run2.mak</span><br></pre></td></tr></table></figure><p>And the stats I got:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN	2028605527</span><br><span class="line">Size_withoutN	2028605527</span><br><span class="line">Seq_Num	5811</span><br><span class="line">Mean_Size	349097</span><br><span class="line">Median_Size	170070</span><br><span class="line">Longest_Seq	10046321</span><br><span class="line">Shortest_Seq	24367</span><br><span class="line">GC_Content	38.18</span><br><span class="line">N50	708215</span><br><span class="line">L50	758</span><br><span class="line">N90	147345</span><br><span class="line">Gap	0.0</span><br></pre></td></tr></table></figure><p>This was a very good N50 size! 
And the assembled size was close to the expected one.</p><h2 id="change-notes"><a class="markdownIt-Anchor" href="#change-notes"></a> Change notes</h2><ul><li>20180423: create the note.</li></ul>]]></content>
<summary type="html">
<h2 id="introduction"><a class="markdownIt-Anchor" href="#introduction"></a> Introduction</h2>
<p>From its <a href="https://github.com/ruanj
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="TGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/TGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly/"/>
<category term="long-read genome assembly" scheme="https://yiweiniu.github.io/blog/tags/long-read-genome-assembly/"/>
</entry>
<entry>
<title>Identify circRNAs and Fusions from RNA-seq Using STARChip</title>
<link href="https://yiweiniu.github.io/blog/2018/04/Identify-circRNAs-and-Fusions-from-RNA-seq-Using-STARChip/"/>
<id>https://yiweiniu.github.io/blog/2018/04/Identify-circRNAs-and-Fusions-from-RNA-seq-Using-STARChip/</id>
<published>2018-04-23T08:56:25.000Z</published>
<updated>2018-06-30T07:05:31.000Z</updated>
<content type="html"><![CDATA[<p>I want to identify <a href="https://en.wikipedia.org/wiki/Circular_RNA" target="_blank" rel="noopener">circRNAs</a> from RNA-seq data (total RNA-seq, not poly-A enriched).</p><p>There are many tools for this task. One good review paper <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> covers computational methods for analyzing circRNAs, both identification and downstream analysis, and another review <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> focuses on identifying circRNAs. There are also two evaluation papers for the identification tools <sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup><sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup>.</p><p>Of all the tools I know, <code>CIRCexplorer2</code> <sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup> and <code>CIRI</code> <sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup> are well maintained. But I wanted to try something new: <code>STARChip</code> <sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup>.</p><blockquote><p>STARChip is short for Star Chimeric Post, written by Dr. Nicholas Kipp Akers as part of his work in Bojan Losic’s group at the Icahn Institute of Genomics and Multiscale Biology at Mount Sinai School of Medicine<br><br>This software is designed to take the chimeric output from the STAR alignment tool and discover high confidence <strong>fusions</strong> and <strong>circular RNA</strong> in the data. Before running, you must have used a recent version of STAR with chimeric output turned on, to align your RNA-Seq data.</p></blockquote><p>So, it can <strong>identify fusions and circRNAs at the same time</strong>. According to its paper, for circRNA detection, “STARChip achieves the best precision of all tools tested and nearly the best sensitivity. This does not appear to come at an increased resource cost. 
Both CIRI and CIRCexplorer had competitive precision and sensitivity values; STARChip required 43 and 179% of the runtimes of these packages, respectively, and ∼72% of the memory requirements.”; for fusions, “With STARChip, we have attempted to emphasize precision at the expense of sensitivity in these particular gold-standard studies, reasoning that such hyper-tuning inflates type I error in mining novel datasets.”</p><p>I discussed the precision with the author, Kipp Akers: <a href="https://github.com/LosicLab/starchip/issues/9#issuecomment-381181507" target="_blank" rel="noopener">https://github.com/LosicLab/starchip/issues/9#issuecomment-381181507</a>. He said:</p><blockquote><p>To your final question, my goal with STARChip was to develop a tool that focused on precision. There are a dozen fusion finders out there that sacrifice everything to get the highest sensitivity. For my projects, this was not too helpful. However, STARChip’s read requirement settings can be set manually and because it runs so quickly, it’s easy to play with the settings to turn up sensitivity and turn down precision and see what you get. Feel free to do so, and let me know what you find!</p></blockquote><p>I agree with the design goal of <code>STARChip</code>, so I decided to give it a shot.</p><p>There are two main modules in <code>STARChip</code>:</p><ul><li>starchip-fusions is for fusion detection. It runs on individual samples.<ul><li><code>/path/to/starchip/starchip-fusions.pl output_seed Chimeric.out.junction Parameters.txt</code></li></ul></li><li>starchip-circles is for circRNA detection. It runs on groups of samples.<ul><li><code>/path/to/starchip/starchip-circles.pl STARdirs.txt Parameters.txt</code></li><li><code>/path/to/starchip/starchip-circles.pl fastq_files.txt parameters.txt</code></li></ul></li></ul><p>Notes below are more for my own convenience. 
See its <a href="https://github.com/LosicLab/starchip" target="_blank" rel="noopener">git repo</a> for full usage.</p><h2 id="prepare"><a class="markdownIt-Anchor" href="#prepare"></a> prepare</h2><blockquote><p>STARChip is written to be an extension of the STAR read aligner. It is optional for STARChip to run STAR on your samples. In most instances to run STARChip you must first run star on each of your samples. See the STAR documentation for installation, as well as building or downloading a STAR genome index. It is absolutely critical however, that you follow the STAR manual’s instructions and build a genome using all chromosomes plus unplaced contigs. Not doing so will strongly inflate your false positives rate, because reads that map perfectly to an unplaced contig will instead find the next best alignment, often a chimeric alignment. Run STAR with the following parameters required for chimeric output: --chimSegmentMin X --chimJunctionOverhangMin X (where X is an integer). Your project will have it’s own requirements, but a good starting point for your star alignments might look like:<br><br><code>STAR --genomeDir /path/to/starIndex/ --readFilesIn file1_1.fastq.gz file1_2.fastq.gz --runThreadN 11 --outReadsUnmapped Fastx --quantMode GeneCounts --chimSegmentMin 15 --chimJunctionOverhangMin 15 --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM Unsorted</code></p></blockquote><h3 id="referencebed-files"><a class="markdownIt-Anchor" href="#referencebed-files"></a> reference/BED files</h3><blockquote><p>STARChip makes use of gtf files for annotating fusions and circRNA with gene names.</p></blockquote><p>First, download the package and prepare annotation files:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> git <span 
class="built_in">clone</span> https://github.com/LosicLab/starchip.git && <span class="built_in">cd</span> starchip</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> mkdir starchip_ref && ./setup.sh ~/RefData/Homo_sapiens/GENCODE_v27/gencode.v27.annotation.gtf ~/RefData/Homo_sapiens/GRCh38_no_alt/genome.fa ./starchip_ref</span></span><br></pre></td></tr></table></figure><h3 id="additional-files-for-fusions"><a class="markdownIt-Anchor" href="#additional-files-for-fusions"></a> additional files for Fusions</h3><p><code>starchip-fusions</code> also filters on the locations of known repeats, supplied in BED format. Follow the steps below to download the repeats from the UCSC Genome Browser.</p><ol><li>Go to <a href="http://genome.ucsc.edu/cgi-bin/hgTables" target="_blank" rel="noopener">http://genome.ucsc.edu/cgi-bin/hgTables</a></li><li>Change ‘genome’ to your desired genome</li><li>Change the following settings:<ol><li>group: Repeat</li><li>track: RepeatMasker</li><li>region: genome</li><li>output format: BED</li><li>output file: some reasonable name.bed</li></ol></li><li>Click ‘get output’ to download your bed file.</li><li>On your local machine sort the bed file: <code>sort -k1,1 -k2,2n repeats.bed > repeats.sorted.bed</code></li></ol><p><strong>If you’re working on <code>hg19</code> or <code>hg38</code>, you can skip the steps below</strong>. The required files are already included in the <code>STARChip</code> directory.</p><p><code>starchip-fusions</code> can also make use of known antibody parts, and copy number variants. These files come with starchip for human hg19 and hg38 in the reference directory. 
For other species you can create your own in the simple format: <code>Chromosome StartPosition EndPosition</code></p><p>Finally, starchip-fusions uses known gene families and known/common false-positive pairs to filter out fusions which are likely mapping errors or PCR artifacts. 
Family data can be downloaded from ensembl biomart:</p><ol><li>Go to <a href="http://www.ensembl.org/biomart/martview" target="_blank" rel="noopener">http://www.ensembl.org/biomart/martview</a></li><li>Database: Ensembl Genes</li><li>Dataset: Your species</li><li>Click Attributes on the left hand side.<ol><li>Under GENE dropdown, select only “Gene Name”</li><li>Under PROTEIN FAMILIES AND DOMAINS dropdown select Ensembl Protein Family ID.</li></ol></li><li>Click Results at the top.</li><li>Export the file. It should have two columns, Family ID and Gene ID.</li></ol><p>Known false positives are stored within data/pseudogenes.txt. In practice, we’ve found that pseudogenes and tissue specific highly expressed genes are commonly “fused” via PCR template switching errors. Feel free to add any additional pairs found in your own data to this file in the format: <code>Gene1Name Gene2Name</code></p><h2 id="run-starchip"><a class="markdownIt-Anchor" href="#run-starchip"></a> run STARChip</h2><p>Since my previous <code>STAR</code> run didn’t use the <code>--chimSegmentMin</code> and <code>--chimJunctionOverhangMin</code> parameters, I had to start again from the <code>Fastq</code> files.</p><p><code>starchip-circles</code> can run from <code>Fastq</code> files, but <code>starchip-fusions</code> starts from <code>Chimeric.out.junction</code>. 
I’ll first run <code>starchip-circles</code> then run <code>starchip-fusions</code>.</p><p>First of all, I prepare dirs for <code>STARChip</code> under my WORKDIR like this:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">STARChip/</span><br><span class="line">├── STARChip-circRNA</span><br><span class="line">│ ├── starchip-circles.fastqfiles # the fastq files</span><br><span class="line">│ └── starchip-circles.params # starchip-circles parameters</span><br><span class="line">└── STARChip-fusions</span><br><span class="line"> └── starchip-fusions.param # starchip-fusions parameters</span><br></pre></td></tr></table></figure><h3 id="run-starchip-circles"><a class="markdownIt-Anchor" href="#run-starchip-circles"></a> run starchip-circles</h3><p>The parameter file and <code>Fastq</code> file:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> cat starchip-circles.params </span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">#Parameters for starchimp-circles</span></span></span><br><span class="line">readsCutoff = 5</span><br><span class="line">minSubjectLimit = 10</span><br><span class="line">cpus = 20</span><br><span class="line">do_splice = True</span><br><span class="line">cpmCutoff = 0</span><br><span class="line">subjectCPMcutoff = 0 </span><br><span class="line">annotate = true</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash">Reference Files</span></span><br><span class="line">refbed = /software/starchip/starchip_ref/gencode.v27.annotation.gtf.bed #use setup.sh to create this from a gtf. </span><br><span class="line">refFasta = /RefData/Homo_sapiens/GRCh38_no_alt/genome.fa</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash">STAR Parameters</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Do you use a prefix for your STAR output?</span></span></span><br><span class="line">starprefix = </span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Are you starting from fastq and need to run STAR alignment? </span></span></span><br><span class="line">runSTAR = True</span><br><span class="line">STARgenome = /RefData/Homo_sapiens/GRCh38_no_alt/STARgenomes #not necassary if runSTAR != True</span><br><span class="line">STARreadcommand = zcat #cat for fastq, zcat for fastq.gz etc. not necassaryif runSTAR != True</span><br><span class="line">IDstepsback = 1 ## this is the position from the right of your path of the name of your files. 
</span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">#for example: /path/to/sample1/star/2.4.2/output/Chimeric.out.junction </span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">#sample1 is 4 steps back.</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">#or /path/to/star/2.4.2/sample1/Chimeric.out.junction</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash">sample1 is 1 step back.</span></span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> head -3 starchip-circles.fastqfiles </span></span><br><span class="line">XJ-3-1-25_R1.fastq.gz XJ-3-1-25_R2.fastq.gz</span><br><span class="line">XJ-2-1-25_R1.fastq.gz XJ-2-1-25_R2.fastq.gz</span><br><span class="line">XJ-7-1-25_R1.fastq.gz XJ-7-1-25_R2.fastq.gz</span><br></pre></td></tr></table></figure><p>Then go into the <code>$WORKDIR/STARChip/STARChip-circRNA</code> and run <code>starchip-circles</code> to generate scripts:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> <span class="variable">$path2circles</span> starchip-circles.fastqfiles starchip-circles.params</span></span><br><span class="line">Using the following parameters: Circular RNA must have at least 5 reads in at least 10 subjects/output files. Using 20 CPUs.</span><br><span class="line">Rscript must be callable. 
Requiring 0 subjects/outputs with 0 Counts per million circular reads to count a given circular RNA</span><br><span class="line">Other requirements are bedtools (>= 2.24.0)</span><br><span class="line">You have indicated you would like STARChip to perform STAR alignments. starchip-circles.fastqfiles should contain a list of fastq files; 1 sample per line, multiple files separated by a comma, and paired end files separated by a space.</span><br><span class="line">STARChip run scripts generated, please run ./Step1.sh through Step4.sh to detect and quantify circRNA</span><br></pre></td></tr></table></figure><p>There will be four scripts:</p><ul><li><code>Step1.sh</code>: align</li><li><code>Step2.sh</code>: discover circRNA</li><li><code>Step3.sh</code>: re-align</li><li><code>Step4.sh</code>: quantify/annotate</li></ul><p><code>Step1.sh</code> and <code>Step3.sh</code> call the <code>STAR</code> found in the system <code>PATH</code>, but I wanted to use a different build, so I prepended its directory to the commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sed -i '3,$s|^|/software/STAR-2.5.3a/bin/Linux_x86_64_static/|g' Step1.sh</span><br><span class="line">sed -i '3,$s|^|/software/STAR-2.5.3a/bin/Linux_x86_64_static/|g' Step3.sh</span><br></pre></td></tr></table></figure><p>I’m working on a PBS cluster, so I wrote a script to submit these steps:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/bash</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -V</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -j eo</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -N STARChip-circles</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -q Blade</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -l nodes=1:ppn=20</span></span><br><span class="line"></span><br><span class="line">echo Start time is `date +%Y/%m/%d--%H:%M`</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> work dir</span></span><br><span class="line">WORKDIR=/STARChip/STARChip-circRNA</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> starchip-circles</span></span><br><span class="line">sh Step1.sh</span><br><span class="line">sh Step2.sh</span><br><span class="line">sh Step3.sh</span><br><span class="line">sh Step4.sh</span><br><span class="line"></span><br><span class="line">echo Finish time is `date +%Y/%m/%d--%H:%M`</span><br></pre></td></tr></table></figure><p>In my samples, only four circRNAs were identified by <code>STARChip-circles</code>.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">$ cat circRNA.5reads.10ind.countmatrix</span><br><span 
class="line">8_4-10_R1.fastq.gz8_4-11_R1.fastq.gz8_4-3_R1.fastq.gz8_4-4_R1.fastq.gz8_4-5_R1.fastq.gz8_4-6_R1.fastq.gz8_4-7_R1.fastq.gz8_4-8_R1.fastq.gz 8_4-9_R1.fastq.gzXJ-10-1-25_R1.fastq.gzXJ-11-1-25_R1.fastq.gzXJ-1-1-25_R1.fastq.gzXJ-12-1-25_R1.fastq.gzXJ-13-1-25_R1.fastq.gzXJ-2-1-25_R1.fastq.gzXJ-3-1-25_R1.fastq.gXJ-4-1-25_R1.fastq.gzXJ-5-1-25_R1.fastq.gzXJ-6-1-25_R1.fastq.gzXJ-7-1-25_R1.fastq.gzXJ-8-1-25_R1.fastq.gzXJ-9-1-25_R1.fastq.gz</span><br><span class="line">chr1:117402186-11744232500020040638004322200 3</span><br><span class="line">chr15:101213315-10121667800382080000002600586018122523 0</span><br><span class="line">chr15:90217439-90219891040100001010520102019 4</span><br><span class="line">chr9:111786793-111787947003318623000010100052942141515 0</span><br></pre></td></tr></table></figure><h3 id="run-starchip-fusions"><a class="markdownIt-Anchor" href="#run-starchip-fusions"></a> run starchip-fusions</h3><p>The parameter file:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"><span class="comment">## Parameters for fusions-from-star.pl</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Describing Your Data:</span></span></span><br><span class="line"></span><br><span class="line">pairedend = TRUE #TRUE means paired end data. any other value means single end. $spancutoff should be 0 if data is single end.</span><br><span class="line">consensus = TRUE # anything but TRUE will make this skip the consensus sequence generation for each fusion.</span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Filters, Cutoffs </span></span></span><br><span class="line"></span><br><span class="line">splitReads = auto # number of minimum read support at jxn. Minimum 2. Greatly impacts running time. Other options are: "auto" , "highsensitivity" , and "highprecision" </span><br><span class="line">uniqueReads = 2 # number of unique read support values (higher indicates more likely to be real. lower is more likely amplification artifact).</span><br><span class="line">spancutoff = 1 #minimum number of non-split reads support. If single end data, this must be 0 or auto. 
Other options are: "auto" , "highsensitivity" , and "highprecision"</span><br><span class="line">wiggle = 500 #number of base-pairs of 'wiggle-room' when determining the location of a fusion (for spanning read counts)</span><br><span class="line">overlapLimit = 5 #wiggle room for joining very closely called fusion sites.</span><br><span class="line">samechrom_wiggle = 20000 #this is the distance that fusions have to be from each other if on the same chromosome. Set to 0 if you want no filtering of same-chromosome pr</span><br><span class="line">lopsidedupper = 10 # (topsidereads + 0.1) / (bottomsidereads + 0.1) must be below this value. set very high to disable. Reccomended setting 5</span><br><span class="line">lopsidedlower = 0.1 # (topsidereads + 0.1) / (bottomsidereads + 0.1) must be above this value. set to 0 to disable. Reccomended setting 0.2</span><br><span class="line">cnvwiggle = 1000 #we skip fusions that can be explained by known cnvs. how close to the edges of the cnv must our fusion be?</span><br><span class="line">circlesize = 100000 #we skip fusions that look more like circular rna/backsplices. how big (bp) could a circle be? </span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># Local Reference Files:</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash">refbed is a bed format version of a gtf. This should probably be derived from the same GTF that STAR aligned with using setup.sh. </span></span><br><span class="line">refbed=/software/starchip/starchip_ref/gencode.v27.annotation.gtf.bed</span><br><span class="line"> # a bed format list of known repeats</span><br><span class="line">repeatbed=/RefData/RepeatMasker/hg38.ucsc.180414.rmsk.sorted.bed</span><br><span class="line"> # fasta reference. 
should be indexed (run 'samtools faidx file.fa')</span><br><span class="line">refFasta = /RefData/Homo_sapiens/GRCh38_no_alt/genome.fa</span><br><span class="line">abparts = reference/hg38.abparts</span><br><span class="line">cnvs = reference/conrad_hg38.cnvs</span><br><span class="line">familyfile = reference/ensfams.txt</span><br><span class="line">falsepositives = reference/knownFP.txt </span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash">Scoring Parameters (feel free to tweak).</span></span><br><span class="line">splitscoremod = 10</span><br><span class="line">spanscoremod = 20</span><br><span class="line">skewpenalty = 4</span><br><span class="line">repeatpenalty = 0.5 # score = score*(repeatpenalty^repeats) --> a fusion can have 0,1,or 2 sites fall into repeat regions.</span><br></pre></td></tr></table></figure><p>Based on the output of previous <code>STAR</code> running for <code>starchip-circles</code>, the script to run <code>starchip-fusions</code> contains:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash">!/bin/bash</span></span><br><span class="line"><span 
class="meta">#</span><span class="bash">PBS -V</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -j eo</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -N STARChip-fusions</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -q Blade</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -l nodes=1:ppn=2</span></span><br><span class="line"></span><br><span class="line">echo Start time is `date +%Y/%m/%d--%H:%M`</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> work dir</span></span><br><span class="line">WORKDIR=/STARChip</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> starchip-fusions</span></span><br><span class="line">TOOLDIR=/software</span><br><span class="line">path2fusions=$TOOLDIR/starchip/starchip-fusions.pl</span><br><span class="line"></span><br><span class="line">for sample in XJ-3-1-25 XJ-2-1-25 XJ-7-1-25 XJ-4-1-25 XJ-5-1-25 XJ-6-1-25 XJ-1-1-25 XJ-10-1-25 XJ-9-1-25 XJ-13-1-25 XJ-11-1-25 XJ-12-1-25 XJ-8-1-25 8_4-3 8_4-4 8_4-5 8_4-6 8_4-7 8_4-8 8_4-9 8_4-10 8_4-11</span><br><span class="line">do</span><br><span class="line">/usr/bin/perl $path2fusions $sample $WORKDIR/STARChip-circRNA/STARout/${sample}_R1.fastq.gz/Chimeric.out.junction starchip-fusions.param</span><br><span class="line">done</span><br><span class="line"></span><br><span class="line">echo Finish time is `date +%Y/%m/%d--%H:%M`</span><br></pre></td></tr></table></figure><p>In my samples, no fusions were found by <code>STARChip-fusions</code>, and I don’t want to tweak parameters to improve sensitivity.</p><h2 id="change-notes"><a class="markdownIt-Anchor" href="#change-notes"></a> Change notes</h2><ul><li>20180413: create the note.</li></ul><hr class="footnotes-sep"><section class="footnotes"><ol class="footnotes-list"><li 
id="fn1" class="footnote-item"><p>Gao Y, Zhao F. 2018 Jan 12. Computational Strategies for Exploring Circular RNAs. Trends in Genetics. doi:10.1016/j.tig.2017.12.016. [accessed 2018 Jan 15]. <a href="https://www.sciencedirect.com/science/article/pii/S0168952517302366" target="_blank" rel="noopener">https://www.sciencedirect.com/science/article/pii/S0168952517302366</a>. <a href="#fnref1" class="footnote-backref">↩︎</a></p></li><li id="fn2" class="footnote-item"><p>Szabo L, Salzman J. 2016. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet 17:679–692. doi:10.1038/nrg.2016.114. <a href="#fnref2" class="footnote-backref">↩︎</a></p></li><li id="fn3" class="footnote-item"><p>Hansen TB, Venø MT, Damgaard CK, Kjems J. 2016. Comparison of circular RNA prediction tools. Nucleic Acids Research 44:e58–e58. doi:10.1093/nar/gkv1458. <a href="#fnref3" class="footnote-backref">↩︎</a></p></li><li id="fn4" class="footnote-item"><p>Zeng X, Lin W, Guo M, Zou Q. 2017. A comprehensive overview and evaluation of circular RNA detection tools. PLOS Computational Biology 13:e1005420. doi:10.1371/journal.pcbi.1005420. <a href="#fnref4" class="footnote-backref">↩︎</a></p></li><li id="fn5" class="footnote-item"><p>Zhang X-O, Dong R, Zhang Y, Zhang J-L, Luo Z, Zhang J, Chen L-L, Yang L. 2016. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 26:1277–1287. doi:10.1101/gr.202895.115. <a href="#fnref5" class="footnote-backref">↩︎</a></p></li><li id="fn6" class="footnote-item"><p>Gao Y, Wang J, Zhao F. 2015. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology 16:4. doi:10.1186/s13059-014-0571-3. <a href="#fnref6" class="footnote-backref">↩︎</a></p></li><li id="fn7" class="footnote-item"><p>Akers NK, Schadt EE, Losic B. 2018 Feb 20. STAR Chimeric Post for rapid detection of circular RNA and fusion transcripts. Bioinformatics:bty091–bty091. doi:10.1093/bioinformatics/bty091. 
<a href="#fnref7" class="footnote-backref">↩︎</a></p></li></ol></section>]]></content>
<summary type="html">
<p>I want to find out some <a href="https://en.wikipedia.org/wiki/Circular_RNA" target="_blank" rel="noopener">circRNAs</a> from RNA-seq dat
</summary>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/"/>
<category term="circRNA" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/circRNA/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/tags/RNA-seq/"/>
<category term="circRNA" scheme="https://yiweiniu.github.io/blog/tags/circRNA/"/>
</entry>
<entry>
<title>How to Install Perl Modules</title>
<link href="https://yiweiniu.github.io/blog/2018/04/How-to-install-perl-modules/"/>
<id>https://yiweiniu.github.io/blog/2018/04/How-to-install-perl-modules/</id>
<published>2018-04-13T15:28:04.000Z</published>
<updated>2019-04-02T07:13:05.000Z</updated>
<content type="html"><![CDATA[<p>Installing <code>perl</code> modules can be troublesome, especially when you’re not a ROOT user. After a lot of “pain”, I decided to document two ways to install <code>perl</code> modules (neither is my own invention; this is just a memo).</p><h2 id="check-installed"><a class="markdownIt-Anchor" href="#check-installed"></a> Check installed</h2><p>First, check whether the module is already installed:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> system perl</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> <span class="built_in">which</span> perl</span></span><br><span class="line">/usr/bin/perl</span><br><span class="line"><span class="meta">$</span><span class="bash"> perl -e <span class="string">'use DBD::Oracle; print $DBD::Oracle::VERSION;'</span></span></span><br><span class="line">Can't locate DBD/Oracle.pm in @INC (@INC contains: /home/niuyw/software/perl.5.24.0/lib/site_perl/5.24.0/x86_64-linux /home/software/lib64/perl5 /home/software/share/perl5/ /home/software/vcftools-0.1.15/src/perl /home/niuyw/bin/perl_lib/share/perl5 /home/software/lib64/perl5 /home/software/share/perl5/ /home/software/vcftools-0.1.15/src/perl /home/niuyw/bin/perl_lib/share/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) 
at -e line 1.</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> perl installed under my own directory</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ~/software/perl.5.24.0/bin/perl -e <span class="string">'use DBD::Oracle; print $DBD::Oracle::VERSION;'</span></span></span><br><span class="line">Can't locate DBD/Oracle.pm in @INC (you may need to install the DBD::Oracle module) (@INC contains: /home/niuyw/software/perl.5.24.0/lib/site_perl/5.24.0/x86_64-linux /home/software/lib64/perl5 /home/software/share/perl5/ /home/software/vcftools-0.1.15/src/perl /home/niuyw/bin/perl_lib/share/perl5 /home/software/lib64/perl5 /home/software/share/perl5/ /home/software/vcftools-0.1.15/src/perl /home/niuyw/bin/perl_lib/share/perl5 /home/niuyw/software/perl.5.24.0/lib/site_perl/5.24.0/x86_64-linux /home/niuyw/software/perl.5.24.0/lib/site_perl/5.24.0 /home/niuyw/software/perl.5.24.0/lib/5.24.0/x86_64-linux /home/niuyw/software/perl.5.24.0/lib/5.24.0 .) 
at -e line 1.</span><br><span class="line">BEGIN failed--compilation aborted at -e line 1.</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> <span class="keyword">if</span> the module has been installed</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ~/software/perl.5.24.0/bin/perl -e <span class="string">'use URI::Escape; print $URI::Escape::VERSION;\n'</span></span></span><br><span class="line">3.31</span><br></pre></td></tr></table></figure><h2 id="scenario-1-youre-a-root-user-or-use-your-own-perl"><a class="markdownIt-Anchor" href="#scenario-1-youre-a-root-user-or-use-your-own-perl"></a> Scenario 1: you’re a ROOT user OR use your own perl</h2><p>In this case, the installation is simple.</p><h3 id="use-cpan-i-module_name"><a class="markdownIt-Anchor" href="#use-cpan-i-module_name"></a> Use <code>cpan -i module_name</code></h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> ~/software/perl.5.24.0/bin/cpan -i Net::Server</span></span><br></pre></td></tr></table></figure><h3 id="use-perl-mcpan-e-shell"><a class="markdownIt-Anchor" href="#use-perl-mcpan-e-shell"></a> Use <code>perl -MCPAN -e shell</code></h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> ~/software/perl.5.24.0/bin/perl -MCPAN -e shell</span></span><br><span class="line">Terminal does not support AddHistory.</span><br><span class="line"></span><br><span class="line">cpan shell -- CPAN exploration and modules installation (v2.16)</span><br><span class="line">Enter 'h' for help.</span><br><span class="line"></span><br><span class="line"><span class="meta">cpan[1]></span><span class="bash"> h</span></span><br><span class="line"></span><br><span class="line">Display Information (ver 2.16)</span><br><span class="line"> command argument description</span><br><span class="line"> a,b,d,m WORD or /REGEXP/ about authors, bundles, distributions, modules</span><br><span class="line"> i WORD or /REGEXP/ about any of the above</span><br><span class="line"> ls AUTHOR or GLOB about files in the author's directory</span><br><span class="line"> (with WORD being a module, bundle or author name or a 
distribution</span><br><span class="line"> name of the form AUTHOR/DISTRIBUTION)</span><br><span class="line"></span><br><span class="line">Download, Test, Make, Install...</span><br><span class="line"> get download clean make clean</span><br><span class="line"> make make (implies get) look open subshell in dist directory</span><br><span class="line"> test make test (implies make) readme display these README files</span><br><span class="line"> install make install (implies test) perldoc display POD documentation</span><br><span class="line"></span><br><span class="line">Upgrade installed modules</span><br><span class="line"> r WORDs or /REGEXP/ or NONE report updates for some/matching/all</span><br><span class="line"> upgrade WORDs or /REGEXP/ or NONE upgrade some/matching/all modules</span><br><span class="line"></span><br><span class="line">Pragmas</span><br><span class="line"> force CMD try hard to do command fforce CMD try harder</span><br><span class="line"> notest CMD skip testing</span><br><span class="line"></span><br><span class="line">Other</span><br><span class="line"> h,? display this menu ! 
perl-code eval a perl command</span><br><span class="line"> o conf [opt] set and query options q quit the cpan shell</span><br><span class="line"> reload cpan load CPAN.pm again reload index load newer indices</span><br><span class="line"> autobundle Snapshot recent latest CPAN uploads</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> search modules using keyword</span></span><br><span class="line"><span class="meta">cpan[2]></span><span class="bash"> i /scws/</span></span><br><span class="line">Distribution XUERON/Text-Scws-0.01.tar.gz</span><br><span class="line">Module < Text::Scws (XUERON/Text-Scws-0.01.tar.gz)</span><br><span class="line">2 items found</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> install modules</span></span><br><span class="line"><span class="meta">cpan[3]></span><span class="bash"> install Net::Server</span></span><br><span class="line">Net::Server is up to date (2.009).</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> quit</span></span><br><span class="line"><span class="meta">cpan[5]></span><span class="bash"> q/quit/<span class="built_in">exit</span></span></span><br></pre></td></tr></table></figure><h2 id="scenario-2-youre-a-common-user-but-want-to-use-the-system-perl"><a class="markdownIt-Anchor" href="#scenario-2-youre-a-common-user-but-want-to-use-the-system-perl"></a> Scenario 2: you’re a common user but want to use the system perl</h2><h3 id="install-packages-from-source"><a class="markdownIt-Anchor" href="#install-packages-from-source"></a> Install packages from source</h3><p>This process can be tedious, especially if the modules you want to install depend on other modules.</p><p>Sometimes, even though we have installed our own <code>perl</code>, we still need to use the system <code>perl</code> at <code>/usr/bin/perl</code>.</p><p>First, create a directory for 
<code>perl</code> modules.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir -p /home/niuyw/bin/perl_lib</span><br></pre></td></tr></table></figure><p>Second, download a module and install it locally from source. Go to <a href="https://www.cpan.org/" target="_blank" rel="noopener">CPAN</a> to search for and download the modules you want to install.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> tar zxf Capture-Tiny-0.46.tar.gz && <span class="built_in">cd</span> Capture-Tiny-0.46</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> <span class="built_in">which</span> perl</span></span><br><span class="line">/usr/bin/perl</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> specify the path to install</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> perl Makefile.PL PREFIX=/home/niuyw/bin/perl_lib</span></span><br><span class="line">Checking if your kit is complete...</span><br><span class="line">Looks good</span><br><span class="line">Generating a 
Unix-style Makefile</span><br><span class="line">Writing Makefile for Capture::Tiny</span><br><span class="line">Writing MYMETA.yml and MYMETA.json</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> make && make install</span></span><br><span class="line">cp lib/Capture/Tiny.pm blib/lib/Capture/Tiny.pm</span><br><span class="line">Manifying 1 pod document</span><br><span class="line">Manifying 1 pod document</span><br><span class="line">Appending installation info to /home/niuyw/bin/perl_lib/lib64/perl5/perllocal.pod</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash"> perl -e <span class="string">'use Capture::Tiny; print $Capture::Tiny::VERSION;'</span></span></span><br><span class="line">0.46</span><br></pre></td></tr></table></figure><p>Third, add the install paths above to <code>PERL5LIB</code> in your <code>.bashrc</code>. <strong>Notice the format</strong>.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">export PERL5LIB=$PERL5LIB:/home/niuyw/bin/perl_lib/share/perl5:/home/niuyw/bin/perl_lib/lib64/perl5</span><br></pre></td></tr></table></figure><h3 id="use-cpanm"><a class="markdownIt-Anchor" href="#use-cpanm"></a> Use cpanm</h3><p><a href="https://metacpan.org/pod/distribution/App-cpanminus/bin/cpanm" target="_blank" rel="noopener">cpanm</a> is a script that helps users get, unpack, build, and install modules from CPAN.</p><p>First, install <code>cpanm</code>:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> 
download</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> wget https://cpan.metacpan.org/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7044.tar.gz</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> install</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> tar zxf App-cpanminus-1.7044.tar.gz</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> <span class="built_in">cd</span> App-cpanminus-1.7044</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> perl Makefile.PL PREFIX=/home/niuyw/bin/perl_lib</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> make && make install</span></span><br></pre></td></tr></table></figure><p>Help information.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span 
class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> ~/bin/perl_lib/bin/cpanm -h</span></span><br><span class="line">Usage: cpanm [options] Module [...]</span><br><span class="line"></span><br><span class="line">Options:</span><br><span class="line"> -v,--verbose Turns on chatty output</span><br><span class="line"> -q,--quiet Turns off the most output</span><br><span class="line"> --interactive Turns on interactive configure (required for Task:: modules)</span><br><span class="line"> -f,--force force install</span><br><span class="line"> -n,--notest Do not run unit tests</span><br><span class="line"> --test-only Run tests only, do not install</span><br><span class="line"> -S,--sudo sudo to run install commands</span><br><span class="line"> --installdeps Only install dependencies</span><br><span class="line"> --showdeps Only display direct dependencies</span><br><span class="line"> --reinstall Reinstall the distribution even if you already have the latest version installed</span><br><span class="line"> --mirror Specify the base URL for the mirror (e.g. 
http://cpan.cpantesters.org/)</span><br><span class="line"> --mirror-only Use the mirror's index file instead of the CPAN Meta DB</span><br><span class="line"> -M,--from Use only this mirror base URL and its index file</span><br><span class="line"> --prompt Prompt when configure/build/test fails</span><br><span class="line"> -l,--local-lib Specify the install base to install modules</span><br><span class="line"> -L,--local-lib-contained Specify the install base to install all non-core modules</span><br><span class="line"> --self-contained Install all non-core modules, even if they're already installed.</span><br><span class="line"> --auto-cleanup Number of days that cpanm's work directories expire in. Defaults to 7</span><br><span class="line"></span><br><span class="line">Commands:</span><br><span class="line"> --self-upgrade upgrades itself</span><br><span class="line"> --info Displays distribution info on CPAN</span><br><span class="line"> --look Opens the distribution with your SHELL</span><br><span class="line"> -U,--uninstall Uninstalls the modules (EXPERIMENTAL)</span><br><span class="line"> -V,--version Displays software version</span><br><span class="line"></span><br><span class="line">Examples:</span><br><span class="line"></span><br><span class="line"> cpanm Test::More # install Test::More</span><br><span class="line"> cpanm MIYAGAWA/Plack-0.99_05.tar.gz # full distribution path</span><br><span class="line"> cpanm http://example.org/LDS/CGI.pm-3.20.tar.gz # install from URL</span><br><span class="line"> cpanm ~/dists/MyCompany-Enterprise-1.00.tar.gz # install from a local file</span><br><span class="line"> cpanm --interactive Task::Kensho # Configure interactively</span><br><span class="line"> cpanm . # install from local directory</span><br><span class="line"> cpanm --installdeps . 
# install all the deps for the current directory</span><br><span class="line"> cpanm -L extlib Plack # install Plack and all non-core deps into extlib</span><br><span class="line"> cpanm --mirror http://cpan.cpantesters.org/ DBI # use the fast-syncing mirror</span><br><span class="line"> cpanm -M https://cpan.metacpan.org App::perlbrew # use only this secure mirror and its index</span><br><span class="line"></span><br><span class="line">You can also specify the default options in PERL_CPANM_OPT environment variable in the shell rc:</span><br><span class="line"></span><br><span class="line"> export PERL_CPANM_OPT="--prompt --reinstall -l ~/perl --mirror http://cpan.cpantesters.org"</span><br><span class="line"></span><br><span class="line">Type `man cpanm` or `perldoc cpanm` for the more detailed explanation of the options.</span><br></pre></td></tr></table></figure><p>Install packages.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">~/bin/perl_lib/bin/cpanm -l /home/niuyw/bin/perl_lib Log::Log4perl Math::CDF</span><br></pre></td></tr></table></figure><h2 id="footnote"><a class="markdownIt-Anchor" href="#footnote"></a> Footnote</h2><p>Thanks to Quan Kang for teaching me how to install perl modules from source code.</p><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180413: create the note</li><li>20180414: change the setting of <code>PERL5LIB</code></li><li>20190402: add the section “use cpanm”</li></ul>]]></content>
<summary type="html">
<p>Installing <code>perl</code> modules can be troublesome, especially when you’re not a ROOT user. After a lot of “pain”, I decide to docum
</summary>
<category term="mixture" scheme="https://yiweiniu.github.io/blog/categories/mixture/"/>
<category term="perl" scheme="https://yiweiniu.github.io/blog/tags/perl/"/>
<category term="install" scheme="https://yiweiniu.github.io/blog/tags/install/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: Flye</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-Flye/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-Flye/</id>
<published>2018-03-29T16:33:08.000Z</published>
<updated>2018-07-02T01:34:06.000Z</updated>
<content type="html"><![CDATA[<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><p>From <a href="https://github.com/fenderglass/Flye" target="_blank" rel="noopener">its git repo</a>:</p><blockquote><p>Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.</p></blockquote><p>This tool is now on bioRxiv:</p><blockquote><p>Kolmogorov M, Yuan J, Lin Y, Pevzner P. Assembly of Long Error-Prone Reads Using Repeat Graphs. bioRxiv. 2018 Jan 12:247148. doi:10.1101/247148</p></blockquote><p>My feelings:</p><ul><li>easy to use</li><li>comparatively good results, good N50, good completeness</li><li>not too many parameters to be tested</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br></pre></td><td class="code"><pre><span class="line"># install</span><br><span class="line">git clone https://github.com/fenderglass/Flye Flye-2.3.3</span><br><span class="line">cd Flye-2.3.3</span><br><span class="line">python setup.py build</span><br><span class="line"></span><br><span class="line"># usage</span><br><span class="line">$ ./bin/flye -h</span><br><span class="line">usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |</span><br><span class="line"> --nano-corr | --subassemblies) file1 [file_2 ...]</span><br><span class="line"> --genome-size size --out-dir dir_path [--threads int]</span><br><span class="line"> [--iterations int] [--min-overlap int] 
[--resume]</span><br><span class="line"> [--debug] [--version] [--help]</span><br><span class="line"></span><br><span class="line">Assembly of long and error-prone reads</span><br><span class="line"></span><br><span class="line">optional arguments:</span><br><span class="line"> -h, --help show this help message and exit</span><br><span class="line"> --pacbio-raw path [path ...]</span><br><span class="line"> PacBio raw reads</span><br><span class="line"> --pacbio-corr path [path ...]</span><br><span class="line"> PacBio corrected reads</span><br><span class="line"> --nano-raw path [path ...]</span><br><span class="line"> ONT raw reads</span><br><span class="line"> --nano-corr path [path ...]</span><br><span class="line"> ONT corrected reads</span><br><span class="line"> --subassemblies path [path ...]</span><br><span class="line"> high-quality contig-like input</span><br><span class="line"> -g size, --genome-size size</span><br><span class="line"> estimated genome size (for example, 5m or 2.6g)</span><br><span class="line"> -o path, --out-dir path</span><br><span class="line"> Output directory</span><br><span class="line"> -t int, --threads int</span><br><span class="line"> number of parallel threads (default: 1)</span><br><span class="line"> -i int, --iterations int</span><br><span class="line"> number of polishing iterations (default: 1)</span><br><span class="line"> -m int, --min-overlap int</span><br><span class="line"> minimum overlap between reads (default: auto)</span><br><span class="line"> --resume resume from the last completed stage</span><br><span class="line"> --resume-from stage_name</span><br><span class="line"> resume from a custom stage</span><br><span class="line"> --debug enable debug output</span><br><span class="line"> -v, --version show program's version number and exit</span><br><span class="line"></span><br><span class="line">Input reads could be in FASTA or FASTQ format, uncompressed</span><br><span class="line">or compressed with gz. 
Currenlty, raw and corrected reads</span><br><span class="line">from PacBio and ONT are supported. Additionally, --subassemblies</span><br><span class="line">option does a consensus assembly of high-quality input contigs.</span><br><span class="line">You may specify multiple fles with reads (separated by spaces).</span><br><span class="line">Mixing different read types is not yet supported.</span><br><span class="line"></span><br><span class="line">You must provide an estimate of the genome size as input,</span><br><span class="line">which is used for solid k-mers selection. The estimate could</span><br><span class="line">be rough (e.g. withing 0.5x-2x range) and does not affect</span><br><span class="line">the other assembly stages. Standard size modificators are</span><br><span class="line">supported (e.g. 5m or 2.6g)</span><br><span class="line"></span><br><span class="line"># E. coli P6-C4 PacBio data</span><br><span class="line">wget https://zenodo.org/record/1172816/files/E.coli_PacBio_40x.fasta</span><br><span class="line">flye --pacbio-raw E.coli_PacBio_40x.fasta --out-dir out_pacbio --genome-size 5m --threads 4</span><br><span class="line"></span><br><span class="line"># E. 
coli Oxford Nanopore Technologies data</span><br><span class="line">wget https://zenodo.org/record/1172816/files/Loman_E.coli_MAP006-1_2D_50x.fasta</span><br><span class="line">flye --nano-raw Loman_E.coli_MAP006-1_2D_50x.fasta --out-dir out_nano --genome-size 5m --threads 4</span><br></pre></td></tr></table></figure><p>See the <a href="https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md" target="_blank" rel="noopener">Flye manual</a> for full usage.</p><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><h3 id="an-insect"><a class="markdownIt-Anchor" href="#an-insect"></a> An insect</h3><ul><li>The species: high heterogeneity, high AT content, high repetition.</li><li>Genome size: male 790M, female 830M.</li><li>Data used: about 70X PacBio long reads.</li><li>OS environment: CentOS 6.6 x86_64, glibc-2.12. QSUB grid system. 15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).</li></ul><h4 id="version-232-gd46edb7"><a class="markdownIt-Anchor" href="#version-232-gd46edb7"></a> version 2.3.2-gd46edb7</h4><ul><li><code>Flye</code> version: 2.3.2-gd46edb7</li></ul><p>I didn’t test all the parameters. 
Below are the results with default settings.</p><p>command:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">flye --pacbio-raw $DATADIR/third/third_all.fasta --out-dir run1 --genome-size 850m --threads 24</span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"># contig</span><br><span class="line">Size_includeN 724744485</span><br><span class="line">Size_withoutN 724744485</span><br><span class="line">Seq_Num 17602</span><br><span class="line">Mean_Size 41173</span><br><span class="line">Median_Size 17572</span><br><span class="line">Longest_Seq 1648999</span><br><span class="line">Shortest_Seq 55</span><br><span class="line">GC_Content 31.55</span><br><span class="line">N50 91066</span><br><span class="line">N90 17456</span><br><span class="line">Gap 0.0</span><br><span class="line"></span><br><span class="line"># scaffold</span><br><span class="line">Size_includeN 724753785</span><br><span class="line">Size_withoutN 724744485</span><br><span class="line">Seq_Num 
17509</span><br><span class="line">Mean_Size 41393</span><br><span class="line">Median_Size 17532</span><br><span class="line">Longest_Seq 1648999</span><br><span class="line">Shortest_Seq 55</span><br><span class="line">GC_Content 31.55</span><br><span class="line">N50 92367</span><br><span class="line">N90 17530</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h4 id="version-233-g47cdd0b"><a class="markdownIt-Anchor" href="#version-233-g47cdd0b"></a> Version 2.3.3-g47cdd0b</h4><p><code>Flye</code> 2.3.3 has two updates that appeal to me:</p><ul><li>Automatic selection of minimum overlap parameter based on read length</li><li>Minimap2 updated</li></ul><p>Because I’ve run <code>Canu</code> before, and <code>Flye</code> can start from either raw or corrected reads, I tested <code>Flye</code> on both.</p><h5 id="from-raw-data"><a class="markdownIt-Anchor" href="#from-raw-data"></a> From raw data</h5><p>Commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/Flye-2.3.3/bin/flye --pacbio-raw third_all.fasta --out-dir run2 --genome-size 830m --threads 40</span></span><br></pre></td></tr></table></figure><p>Stats:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> contigs</span></span><br><span class="line">Size_includeN 787629166</span><br><span class="line">Size_withoutN 787629166</span><br><span class="line">Seq_Num 14846</span><br><span class="line">Mean_Size 53053</span><br><span class="line">Median_Size 20542</span><br><span class="line">Longest_Seq 1636300</span><br><span class="line">Shortest_Seq 12</span><br><span class="line">GC_Content 31.6</span><br><span class="line">N50 121564</span><br><span class="line">L50 1699</span><br><span class="line">N90 25419</span><br><span class="line">Gap 0.0</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffolds</span></span><br><span class="line">Size_includeN 787632766</span><br><span class="line">Size_withoutN 787629166</span><br><span class="line">Seq_Num 14810</span><br><span class="line">Mean_Size 53182</span><br><span class="line">Median_Size 20465</span><br><span class="line">Longest_Seq 1692734</span><br><span class="line">Shortest_Seq 12</span><br><span class="line">GC_Content 31.6</span><br><span class="line">N50 122313</span><br><span class="line">L50 1680</span><br><span class="line">N90 25437</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h5 id="from-corrected-data-from-canu-about-33x"><a class="markdownIt-Anchor" href="#from-corrected-data-from-canu-about-33x"></a> From corrected data from <code>Canu</code> (about 33X)</h5><p>Commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span 
class="bash">TOOLDIR/Flye-2.3.3/bin/flye --pacbio-corr canu.correctedReads.fasta.gz --out-dir run3 --genome-size 830m --threads 40</span></span><br></pre></td></tr></table></figure><p>Stats:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> contigs</span></span><br><span class="line">Size_includeN 833065987</span><br><span class="line">Size_withoutN 833065987</span><br><span class="line">Seq_Num 17536</span><br><span class="line">Mean_Size 47506</span><br><span class="line">Median_Size 26593</span><br><span class="line">Longest_Seq 1145994</span><br><span class="line">Shortest_Seq 518</span><br><span class="line">GC_Content 31.47</span><br><span class="line">N50 88680</span><br><span class="line">L50 2594</span><br><span class="line">N90 22129</span><br><span class="line">Gap 0.0</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffolds</span></span><br><span class="line">Size_includeN 833070387</span><br><span class="line">Size_withoutN 
833065987</span><br><span class="line">Seq_Num 17492</span><br><span class="line">Mean_Size 47625</span><br><span class="line">Median_Size 26602</span><br><span class="line">Longest_Seq 1145994</span><br><span class="line">Shortest_Seq 518</span><br><span class="line">GC_Content 31.47</span><br><span class="line">N50 89165</span><br><span class="line">L50 2581</span><br><span class="line">N90 22242</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h3 id="a-plant"><a class="markdownIt-Anchor" href="#a-plant"></a> A plant</h3><ul><li>The species: high heterogeneity, high repetition.</li><li>Genome size: 2.1G.</li><li>Data used: more than 100X PacBio long reads.</li><li>OS environment: CentOS 6.6 x86_64, glibc-2.12. QSUB grid system. 15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).</li></ul><h4 id="run1-more-than-100x-data"><a class="markdownIt-Anchor" href="#run1-more-than-100x-data"></a> run1, more than 100X data</h4><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash">path2flye --pacbio-raw <span class="variable">$WORKDIR</span>/data/Pacbio/all.fasta --out-dir run1 --genome-size 2g --threads 30</span></span><br></pre></td></tr></table></figure><p>But I ran into a memory issue: <a href="https://github.com/fenderglass/Flye/issues/46" target="_blank" rel="noopener">ERROR: Caught unhandled exception: std::bad_alloc in both 2.3.2 and 2.3.3</a>. The author suggested that I downsample the data.</p><p>I asked him what the difference is between using all the raw data (say 100X) and using downsampled data (say the longest 50X). He said “You might have extra connectivity information in these 100x reads (you can resolve more repeats, for example). 
But some studies suggest (Canu paper, for example) that you don’t really need more than 40x in general (but it, of course, also depends on the genome complexity, ploidy etc…). Plus, extra coverage helps to get a good final consensus.”</p><h4 id="run2-with-about-50x-data"><a class="markdownIt-Anchor" href="#run2-with-about-50x-data"></a> run2, with about 50X data</h4><p>I used <a href="https://github.com/yechengxi/AssemblyUtility" target="_blank" rel="noopener">SelectLongestReads</a> to downsample the data to about 50X (keeping the longest reads) and ran <code>Flye</code> again.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1.1, extract 50X data</span></span><br><span class="line">SelectLongestReads sum 100000000000 longest 1 o Pacbio_50x.fasta f all.fasta</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> remove @ from the FASTA headers</span></span><br><span class="line">replace_@_in_fasta_header.py Pacbio_50x.fasta > Pacbio_50x_no_@.fasta</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash">path2flye --pacbio-raw Pacbio_50x_no_@.fasta --out-dir run1.1 --genome-size 2g --threads 30 --resume</span></span><br></pre></td></tr></table></figure><p>I removed the <code>@</code> from the headers because I hit another problem: <a href="https://github.com/fenderglass/Flye/issues/48" target="_blank" rel="noopener">ERROR: parse error in 1-consensus/consensus.fasta on line 1: empty sequence</a>. 
It seemed that <code>Flye</code> would ignore these headers.</p><p>And the stats I got:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> contigs</span></span><br><span class="line">Size_includeN 1640872256</span><br><span class="line">Size_withoutN 1640872256</span><br><span class="line">Seq_Num 9843</span><br><span class="line">Mean_Size 166704</span><br><span class="line">Median_Size 72841</span><br><span class="line">Longest_Seq 8808184</span><br><span class="line">Shortest_Seq 139</span><br><span class="line">GC_Content 37.78</span><br><span class="line">N50 398108</span><br><span class="line">L50 1119</span><br><span class="line">N90 83615</span><br><span class="line">Gap 0.0</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffolds</span></span><br><span class="line">Size_includeN 1640874656</span><br><span class="line">Size_withoutN 1640872256</span><br><span class="line">Seq_Num 9819</span><br><span class="line">Mean_Size 167112</span><br><span 
class="line">Median_Size 72812</span><br><span class="line">Longest_Seq 8808184</span><br><span class="line">Shortest_Seq 139</span><br><span class="line">GC_Content 37.78</span><br><span class="line">N50 399898</span><br><span class="line">L50 1114</span><br><span class="line">N90 83809</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><p>Not bad.</p><h4 id="run3-with-corrected-data-from-canu-about-37x"><a class="markdownIt-Anchor" href="#run3-with-corrected-data-from-canu-about-37x"></a> run3, with corrected data from <code>Canu</code> (about 37X)</h4><p>The <code>Canu</code> version was <code>1.7</code>.</p><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run2, pacbio corrected by canu, default</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2flye --pacbio-corr canu.correctedReads.fasta.gz --out-dir run2 --genome-size 2g --threads 30</span></span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> contigs</span></span><br><span class="line">Size_includeN 1886541389</span><br><span class="line">Size_withoutN 1886541389</span><br><span class="line">Seq_Num 13177</span><br><span class="line">Mean_Size 143169</span><br><span class="line">Median_Size 70083</span><br><span class="line">Longest_Seq 2690265</span><br><span class="line">Shortest_Seq 110</span><br><span class="line">GC_Content 38.01</span><br><span class="line">N50 310885</span><br><span class="line">L50 1669</span><br><span class="line">N90 70697</span><br><span class="line">Gap 0.0</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffolds</span></span><br><span class="line">Size_includeN 1886554189</span><br><span class="line">Size_withoutN 1886541389</span><br><span class="line">Seq_Num 13049</span><br><span class="line">Mean_Size 144574</span><br><span class="line">Median_Size 70202</span><br><span class="line">Longest_Seq 2690265</span><br><span class="line">Shortest_Seq 110</span><br><span class="line">GC_Content 38.01</span><br><span class="line">N50 315687</span><br><span class="line">L50 1648</span><br><span class="line">N90 71378</span><br><span class="line">Gap 0.0</span><br></pre></td></tr></table></figure><h2 id="useful-links"><a class="markdownIt-Anchor" href="#useful-links"></a> Useful links</h2><ul><li><a href="https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md" target="_blank" rel="noopener">Flye manual</a></li></ul><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180314: create the note.</li><li>20180428: test version 2.3.3, and run from corrected reads of <code>Canu</code></li><li>20180630: add the part of ‘A plant’.</li></ul>]]></content>
<summary type="html">
<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2>
<p>From <a href="https://github.com/fenderglass/Flye" target="_bla
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="TGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/TGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="TGS genome assembly" scheme="https://yiweiniu.github.io/blog/tags/TGS-genome-assembly/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: miniasm &amp; Racon</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-miniasm-Racon/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-miniasm-Racon/</id>
<published>2018-03-29T16:23:24.000Z</published>
<updated>2018-04-23T09:06:23.000Z</updated>
<content type="html"><![CDATA[<p><code>miniasm + Racon</code> is a long-read <em>de novo</em> genome assembly pipeline.</p><h2 id="miniasm-racon-assembly-pipeline"><a class="markdownIt-Anchor" href="#miniasm-racon-assembly-pipeline"></a> miniasm + Racon assembly pipeline</h2><p>There are two good examples:</p><ul><li><a href="http://inf-biox121.readthedocs.io/en/2017/Assembly/practicals/07_Assembly_using_minasm+racon.html" target="_blank" rel="noopener">Assembly using miniasm+racon</a></li><li><a href="http://onsnetwork.org/kubu4/2017/10/04/genome-assembly-minimapminiasmracon-overview/" target="_blank" rel="noopener">Genome Assembly – minimap/miniasm/racon Overview</a></li></ul><p>There is also a paper that builds on <code>miniasm</code>; it describes a consensus tool called <a href="https://github.com/isovic/racon" target="_blank" rel="noopener">Racon</a> <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p><p>The <code>miniasm + Racon</code> pipeline consists of the following steps:</p><ul><li>using <code>minimap</code>/<code>minimap2</code> <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> for fast all-vs-all overlap of raw reads (<code>Minimap</code> for overlap detection, Overlap)</li><li>using <code>miniasm</code>, which “simply concatenates pieces of read sequences to generate the final sequences. 
Thus the per-base error rate is similar to the raw input reads.” (<code>Miniasm</code> layout for generating raw contigs, Layout)</li><li>mapping the raw reads back to the assembly using <code>minimap</code> again (<code>Minimap</code> for mapping of raw reads to raw contigs, Consensus)</li><li>using <code>racon</code> (‘rapid consensus’) for consensus calling (<code>Racon</code> for generating high-quality consensus sequences, Consensus)</li></ul><p>Compared with general pipelines, it achieves ‘similar or better quality’ while ‘being an order of magnitude faster’.</p><p>As described in the <code>miniasm</code> paper <sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>, published long-read assembly pipelines all include four stages:</p><ol><li>all-vs-all raw read mapping</li><li>raw read error correction</li><li>assembly of error corrected reads (may involve all-vs-all read mapping again, but as the error rate is much reduced at this step, it is easier and faster than stage 1)</li><li>contig consensus polish</li></ol><h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><h3 id="minimap"><a class="markdownIt-Anchor" href="#minimap"></a> minimap</h3><blockquote><p>Minimap is an experimental tool to efficiently find multiple approximate mapping positions between two sets of long sequences, such as between reads and reference genomes, between genomes and between long noisy reads. By default, it is tuned to have high sensitivity to 2kb matches around 20% divergence but with low specificity. Minimap does not generate alignments as of now and because of this, it is usually tens of times faster than mainstream aligners. With four CPU cores, minimap can map 1.6Gbp PacBio reads to human in 2.5 minutes, 1Gbp PacBio E. 
coli reads to pre-indexed 9.6Gbp bacterial genomes in 3 minutes, to pre-indexed >100Gbp nt database in ~1 hour (of which ~20 minutes are spent on loading index from the network filesystem; peak RAM: 10GB), map 2800 bacteria to themselves in 1 hour, and map 1Gbp E. coli reads against themselves in a couple of minutes.<br><br>Minimap does not replace mainstream aligners, but it can be useful when you want to quickly identify long approximate matches at moderate divergence among a huge collection of sequences. For this task, it is much faster than most existing tools.</p></blockquote><h3 id="minimap2"><a class="markdownIt-Anchor" href="#minimap2"></a> minimap2</h3><blockquote><p>Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.<br><br>For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. 
Detailed evaluations are available from the minimap2 preprint.</p></blockquote><h3 id="miniasm"><a class="markdownIt-Anchor" href="#miniasm"></a> miniasm</h3><p><a href="https://github.com/lh3/miniasm" target="_blank" rel="noopener">miniasm</a> was developed by <a href="https://lh3.github.io/" target="_blank" rel="noopener">Heng Li</a>.</p><p>From its git repo:</p><blockquote><p>Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.<br><br>So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.<br><br>Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. 
Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.</p></blockquote><h4 id="algorithm-overview"><a class="markdownIt-Anchor" href="#algorithm-overview"></a> Algorithm Overview</h4><blockquote><ul><li>Crude read selection. For each read, find the longest contiguous region covered by three good mappings. Get an approximate estimate of read coverage.</li><li>Fine read selection. Use the coverage information to find the good regions again but with more stringent thresholds. Discard contained reads.</li><li>Generate a string graph. Prune tips, drop weak overlaps and collapse short bubbles. These procedures are similar to those implemented in short-read assemblers.</li><li>Merge unambiguous overlaps to produce unitig sequences.</li></ul></blockquote><h4 id="limitations"><a class="markdownIt-Anchor" href="#limitations"></a> Limitations</h4><blockquote><ul><li>Consensus base quality is similar to input reads (may be fixed with a consensus tool).</li><li>Only tested on a dozen of high-coverage PacBio/ONT data sets (more testing needed).</li><li>Prone to collapse repeats or segmental duplications longer than input reads (hard to fix without error correction).</li></ul></blockquote><p><code>miniasm</code> is not a stand-alone genome assembly tool: it depends on <a href="https://github.com/lh3/minimap" target="_blank" rel="noopener">minimap</a> or <a href="https://github.com/lh3/minimap2" target="_blank" rel="noopener">minimap2</a> to compute the read overlaps. <code>minimap</code> has been archived by the author, and <code>minimap2</code> is now its successor, but <code>minimap</code> is still worth a try.</p><p>In this note I only used <code>minimap</code> or <code>minimap2</code> as a read overlapper for assembly. 
Go see the <a href="https://github.com/lh3/minimap2" target="_blank" rel="noopener">docs of minimap2</a> for full instructions.</p><h3 id="racon"><a class="markdownIt-Anchor" href="#racon"></a> Racon</h3><p><a href="https://github.com/isovic/racon" target="_blank" rel="noopener">Racon</a> is a consensus module for raw de novo DNA assembly of long uncorrected reads.</p><blockquote><p>Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.<br><br>Racon can be used as a polishing tool after the assembly with either Illumina data or data produced by third generation of sequencing. The type of data inputed is automatically detected.<br><br>Racon takes as input only three files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. Output is a set of polished contigs in FASTA format printed to stdout. All input files can be compressed with gzip.<br><br>Racon can also be used as a read error-correction tool. In this scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps between reads with dual overlaps.<br><br>A wrapper script is also available to enable easier usage to the end-user for large datasets. It has the same interface as racon but adds two additional features from the outside. Sequences can be subsampled to decrease the total execution time (accuracy might be lower) while target sequences can be split into smaller chunks and run sequentially to decrease memory consumption. 
Both features can be run at the same time as well.</p></blockquote><p>My feelings (about <code>miniasm</code>):</p><ul><li>very fast</li><li>comparatively good results</li><li>the docs are not good enough</li><li>huge memory consumption (may not be suitable for large genomes)</li><li>bugs (at least for <code>miniasm</code>)</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"># Install minimap, minimap2 and miniasm (requiring gcc and zlib)</span><br><span class="line">git clone https://github.com/lh3/minimap && (cd minimap && make)</span><br><span class="line">git clone https://github.com/lh3/minimap2 && (cd minimap2 && make)</span><br><span class="line">git clone https://github.com/lh3/miniasm && (cd miniasm && make)</span><br><span class="line"></span><br><span class="line"># Install Racon (requiring gcc 4.8+ or 
clang 3.4+, and cmake 3.2+)</span><br><span class="line">git clone --recursive https://github.com/isovic/racon.git racon</span><br><span class="line">cd racon</span><br><span class="line">mkdir build</span><br><span class="line">cd build</span><br><span class="line">cmake -DCMAKE_BUILD_TYPE=Release ..</span><br><span class="line">make && cd ../..</span><br><span class="line"></span><br><span class="line"># All-vs-all PacBio read overlap with minimap</span><br><span class="line">minimap/minimap -Sw5 -L100 -m0 -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz</span><br><span class="line"># or minimap2</span><br><span class="line">minimap2/minimap2 -x ava-pb -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz</span><br><span class="line"></span><br><span class="line"># Layout</span><br><span class="line">miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa</span><br><span class="line"></span><br><span class="line"># Consensus</span><br><span class="line">## GFA to fasta</span><br><span class="line">awk '$1 ~/S/ {print ">"$2"\n"$3}' reads.gfa > reads.fasta</span><br><span class="line"></span><br><span class="line">## Correction 1</span><br><span class="line">minimap2/minimap2 -t 8 reads.fasta reads.fq > reads.gfa1.paf</span><br><span class="line">racon -t 8 reads.fq reads.gfa1.paf reads.fasta reads.racon1.fasta</span><br><span class="line"></span><br><span class="line">## Correction 2 (optional)</span><br><span class="line">minimap2/minimap2 -t 8 reads.racon1.fasta reads.fq > reads.gfa2.paf</span><br><span class="line">racon -t 8 reads.fq reads.gfa2.paf reads.racon1.fasta reads.racon2.fasta</span><br></pre></td></tr></table></figure><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><ul><li><p>OS environment: CentOS 6.6 x86_64, glibc-2.12. QSUB grid system. 
15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).</p></li><li><p><code>minimap2</code> version: 2.8-r672</p></li><li><p><code>miniasm</code> version: 0.2-r168-dirty</p></li><li><p><code>Racon</code> version: 0.5.0</p></li></ul><h3 id="run-1-with-about-50x-data"><a class="markdownIt-Anchor" href="#run-1-with-about-50x-data"></a> run 1, with about 50X data</h3><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run1, 171224 third data</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t <span class="variable">$PPN</span> -x ava-pb <span class="variable">$DATADIR</span>/171224.fasta <span class="variable">$DATADIR</span>/171224.fasta | gzip -1 > reads.paf.gz</span></span><br><span class="line">/home/zhangll/software/minimap/miniasm/miniasm -f $DATADIR/171224.fasta reads.paf.gz > reads.gfa</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> racon</span></span><br><span class="line">awk '$1 ~/S/ {print ">"$2"\n"$3}' reads.gfa > reads.fasta</span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t <span class="variable">$PPN</span> reads.fasta <span class="variable">$DATADIR</span>/171224.fasta | /home/zhangll/software/racon/bin/racon -t <span class="variable">$PPN</span> <span class="variable">$DATADIR</span>/171224.fastq - reads.gfa racon1.fasta</span></span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 1475310871</span><br><span class="line">Size_withoutN: 1475310871</span><br><span class="line">Seq_Num: 20186</span><br><span class="line">Mean_Size: 73085</span><br><span class="line">Median_Size: 52525</span><br><span class="line">Longest_Seq: 1183822</span><br><span class="line">Shortest_Seq: 689</span><br><span class="line">GC_Content: 31.66</span><br><span class="line">N50: 97955</span><br><span class="line">L50: 4346</span><br><span class="line">N90: 33406</span><br><span class="line">Gap: 0.0</span><br></pre></td></tr></table></figure><h3 id="run2-with-about-70x-data"><a class="markdownIt-Anchor" href="#run2-with-about-70x-data"></a> run2, with about 70X data</h3><p>Didn’t run <code>Racon</code>.</p><p>commands:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> run2, all third data</span></span><br><span class="line"><span class="meta">$</span><span class="bash">TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t <span class="variable">$PPN</span> -x ava-pb <span class="variable">$DATADIR</span>/third_all.fasta <span class="variable">$DATADIR</span>/third_all.fasta | gzip -1 > reads.paf.gz</span></span><br><span class="line">/home/zhangll/software/minimap/miniasm/miniasm -f $DATADIR/third_all.fasta reads.paf.gz > reads.gfa</span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 1592482491</span><br><span class="line">Size_withoutN: 1592482491</span><br><span class="line">Seq_Num: 23120</span><br><span class="line">Mean_Size: 68879</span><br><span class="line">Median_Size: 49376</span><br><span class="line">Longest_Seq: 1002424</span><br><span class="line">Shortest_Seq: 689</span><br><span class="line">GC_Content: 33.1</span><br><span class="line">N50: 92250</span><br><span class="line">L50: 5029</span><br><span class="line">N90: 31666</span><br><span class="line">Gap: 0.0</span><br></pre></td></tr></table></figure><p>In both runs the assembly size (about 1.48 Gb and 1.59 Gb) was much larger than the estimated genome size (~850M). But this pipeline is very fast.</p><hr class="footnotes-sep"><section class="footnotes"><ol class="footnotes-list"><li id="fn1" class="footnote-item"><p>Vaser R, Sovic I, Nagarajan N, Sikic M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research. 2017 Jan 18:gr.214270.116. doi:10.1101/gr.214270.116 <a href="#fnref1" class="footnote-backref">↩︎</a></p></li><li id="fn2" class="footnote-item"><p>Li H. Minimap2: versatile pairwise alignment for nucleotide sequences. arXiv:1708.01492 [q-bio]. 2017 Aug 4 [accessed 2018 Jan 10]. <a href="http://arxiv.org/abs/1708.01492" target="_blank" rel="noopener">http://arxiv.org/abs/1708.01492</a> <a href="#fnref2" class="footnote-backref">↩︎</a></p></li><li id="fn3" class="footnote-item"><p>Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–2110. 
doi:10.1093/bioinformatics/btw152 <a href="#fnref3" class="footnote-backref">↩︎</a></p></li></ol></section>]]></content>
<summary type="html">
<p><code>miniasm + Racon</code> is a long-read <em>de novo</em> genome assembly pipeline.</p>
<h2 id="miniasm-racon-assembly-pipeline"><a cl
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="TGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/TGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="TGS genome assembly" scheme="https://yiweiniu.github.io/blog/tags/TGS-genome-assembly/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: OPERA-LG</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-OPERA-LG/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-OPERA-LG/</id>
<published>2018-03-29T16:18:15.000Z</published>
<updated>2018-06-30T07:13:53.000Z</updated>
<content type="html"><![CDATA[<h1 id="genome-assembly-pipeline-opera-lg"><a class="markdownIt-Anchor" href="#genome-assembly-pipeline-opera-lg"></a> Genome assembly pipeline: OPERA-LG</h1><p>tags: bio-tools, genome assembly pipeline, hybrid genome assembly, scaffolding</p><p>category: genome assembly, hybrid pipeline</p><h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><p>From <a href="https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/" target="_blank" rel="noopener">The OPERA wiki</a>:</p><blockquote><p>OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (<a href="http://en.wikipedia.org/wiki/Sequence_assembly" target="_blank" rel="noopener">http://en.wikipedia.org/wiki/Sequence_assembly</a>). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).</p></blockquote><blockquote><p>Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). 
The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.</p></blockquote><p>Its paper:</p><blockquote><p>Gao S, Bertrand D, Chia BKH, Nagarajan N. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology. 2016;17:102. doi:10.1186/s13059-016-0951-y</p></blockquote><p>My feelings:</p><ul><li>too many dependencies</li><li>not so easy to use</li><li>has bugs</li><li>supports re-scaffolding</li><li>can’t use NGS reads and long reads simultaneously.</li></ul><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><p>See <a href="https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/" target="_blank" rel="noopener">The OPERA wiki</a> for full docs.</p><p>Scripts used:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="meta">#</span><span class="bash">!/bin/sh</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -N OPERA-LG</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -j eo</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -q Test</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -l nodes=1:ppn=8</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -d /DenovoSeq/OPERA-LG</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -V</span></span><br><span class="line"></span><br><span class="line">echo Start time is `date +%Y/%m/%d--%H:%M`</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffold with short reads</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># preprocess reads</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="keyword">for</span> sample <span class="keyword">in</span> 270B 500B 800B 3k_1 5k-1 5k-2 10k; <span class="keyword">do</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash">perl /software/OPERA-LG_v2.0.6/bin/preprocess_reads.pl --contig /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa --illumina-read1 /DenovoSeq/trimmomatic/<span class="variable">${sample}</span>_R_1P.fastq --illumina-read2 /DenovoSeq/trimmomatic/<span class="variable">${sample}</span>_R_2P.fastq --out <span class="variable">${sample}</span>.map --tool-dir /software/bwa-0.7.15 --samtools-dir /software/samtools-0.1.19</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="keyword">done</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span 
class="comment"># with all libraries</span></span></span><br><span class="line">/software/OPERA-LG_v2.0.6/bin/OPERA-LG /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa 270B.map,500B.map,800B.map,3k_1.map,5k-1.map,5k-2.map,10k.map ./opera /software/samtools-0.1.19</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># without 270 library</span></span></span><br><span class="line">/software/OPERA-LG_v2.0.6/bin/OPERA-LG /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa 500B.map,800B.map,3k_1.map,5k-1.map,5k-2.map,10k.map ./opera.no270 /software/samtools-0.1.19</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> This is the first run of OPERA-LG, with 270 library, and megahit<span class="string">'s contigs</span></span></span><br><span class="line">perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa --illumina-read1 10k_R1.fasta --illumina-read2 10k_R2.fasta --long-read-file av_20k.fasta --output-prefix 10k.lr --output-directory ./ --num-of-processors 40 --blasr /src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> This is the second run of OPERA-LG, re-scaffold the results of SOAP-fusion. 
ins_270 library</span></span><br><span class="line">perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq --illumina-read1 /DenovoSeq/trimmomatic/270B_R_1P.fasta --illumina-read2 /DenovoSeq/trimmomatic/270B_R_2P.fasta --long-read-file /DenovoSeq/Third_rawData/av_20k.fasta --output-prefix 270B.lr --output-directory ./270B --num-of-processors 10 --blasr /software/src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin --samtools-dir /software/samtools-0.1.19/</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> This is the third run of OPERA-LG, re-scaffold the results of SOAP-fusion. ins_500 library</span></span><br><span class="line">perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq --illumina-read1 /DenovoSeq/trimmomatic/500B_R_1P.fasta --illumina-read2 /DenovoSeq/trimmomatic/500B_R_2P.fasta --long-read-file /DenovoSeq/Third_rawData/av_20k.fasta --output-prefix 500B.lr --output-directory ./500B --num-of-processors 10 --blasr /software/src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin --samtools-dir /software/samtools-0.1.19/</span><br><span class="line"></span><br><span class="line">echo Finish time is `date +%Y/%m/%d--%H:%M`</span><br></pre></td></tr></table></figure><p>The stats I got:</p><h3 id="opera"><a class="markdownIt-Anchor" href="#opera"></a> OPERA</h3><p>with all libraries</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 679262960</span><br><span class="line">Size_withoutN: 679262960</span><br><span class="line">Seq_Num: 782765</span><br><span class="line">Mean_Size: 867</span><br><span class="line">Median_Size: 429</span><br><span class="line">Longest_Seq: 61533</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.31</span><br><span class="line">N50: 1529</span><br><span class="line">N90: 348</span><br><span class="line">Gap: 0.0</span><br></pre></td></tr></table></figure><p>without ins_270 library:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 679262960</span><br><span class="line">Size_withoutN: 679262960</span><br><span class="line">Seq_Num: 782765</span><br><span class="line">Mean_Size: 867</span><br><span class="line">Median_Size: 429</span><br><span class="line">Longest_Seq: 61533</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.31</span><br><span class="line">N50: 1529</span><br><span class="line">N90: 348</span><br><span class="line">Gap: 0.0</span><br></pre></td></tr></table></figure><h3 id="opera-lg"><a class="markdownIt-Anchor" href="#opera-lg"></a> OPERA-LG</h3><p>First run with long-reads, with 270 library, and megahit’s contigs:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 767662393</span><br><span class="line">Size_withoutN: 767662393</span><br><span class="line">Seq_Num: 989298</span><br><span class="line">Mean_Size: 775</span><br><span class="line">Median_Size: 428</span><br><span class="line">Longest_Seq: 80889</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.64</span><br><span class="line">N50: 1115</span><br><span class="line">N90: 340</span><br><span class="line">Gap: 0.0</span><br></pre></td></tr></table></figure><p>Second run, re-scaffold the results of SOAP-fusion. ins_270 library</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 765246417</span><br><span class="line">Size_withoutN: 574675978</span><br><span class="line">Seq_Num: 377325</span><br><span class="line">Mean_Size: 2028</span><br><span class="line">Median_Size: 393</span><br><span class="line">Longest_Seq: 436254</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.42</span><br><span class="line">N50: 33478</span><br><span class="line">N90: 439</span><br><span class="line">Gap: 24.9</span><br></pre></td></tr></table></figure><p>Third run, re-scaffold the results of SOAP-fusion. 
ins_500 library</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 765246417</span><br><span class="line">Size_withoutN: 574675978</span><br><span class="line">Seq_Num: 377325</span><br><span class="line">Mean_Size: 2028</span><br><span class="line">Median_Size: 393</span><br><span class="line">Longest_Seq: 436254</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.42</span><br><span class="line">N50: 33478</span><br><span class="line">N90: 439</span><br><span class="line">Gap: 24.9</span><br></pre></td></tr></table></figure><p>What did this software do? The <code>scaffold N50</code> of the SOAPdenovo-fusion input was already 33478, the same as after both re-scaffolding runs … What a waste of time!</p><p>This note can serve as a reference in case I have to use it again…</p>]]></content>
<summary type="html">
<h1 id="genome-assembly-pipeline-opera-lg"><a class="markdownIt-Anchor" href="#genome-assembly-pipeline-opera-lg"></a> Genome assembly pipel
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="TGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/TGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="scaffolding" scheme="https://yiweiniu.github.io/blog/tags/scaffolding/"/>
<category term="hybrid genome assembly" scheme="https://yiweiniu.github.io/blog/tags/hybrid-genome-assembly/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: BESST</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-BESST/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-BESST/</id>
<published>2018-03-29T16:15:04.000Z</published>
<updated>2018-06-30T07:10:32.000Z</updated>
<content type="html"><![CDATA[<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><p>From the introduction of the <a href="https://github.com/ksahlin/BESST" target="_blank" rel="noopener">BESST git repo</a>:</p><blockquote><p>BESST is a package for scaffolding genomic assemblies.</p></blockquote><p>Its paper:</p><blockquote><p>Sahlin K, Chikhi R, Arvestad L. Assembly scaffolding with PE-contaminated mate-pair libraries. Bioinformatics. 2016;32(13):1925–1932. doi:10.1093/bioinformatics/btw064</p></blockquote><p>My feelings:</p><ul><li>too many steps</li><li>awkward</li><li>only supports NGS reads</li><li>not so good results (at least in my case)</li></ul><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><p>The <a href="https://github.com/ksahlin/BESST" target="_blank" rel="noopener">BESST git repo</a> has full docs.</p><p>Scripts used:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="meta">#</span><span class="bash">!/bin/sh</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -N BESST</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -j eo</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -q Test</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -l nodes=1:ppn=20</span></span><br><span class="line"><span class="meta">#</span><span class="bash">PBS -V</span></span><br><span class="line"></span><br><span class="line">echo Start time is `date +%Y/%m/%d--%H:%M`</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> align the PE/MP reads to contigs with BWA MEM</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="keyword">for</span> sample <span class="keyword">in</span> 270B 500B 800B 5k-1 10k; <span class="keyword">do</span></span></span><br><span class="line"><span class="meta">#</span><span class="bash">/software/bwa-0.7.15/bwa mem -t 40 /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa /DenovoSeq/raw_data/<span class="variable">${sample}</span>_R1.fastq /DenovoSeq/raw_data/<span class="variable">${sample}</span>_R2.fastq | samtools view -uS - | samtools sort -@ 8 -m 4G - -T sam_sort_tmp -o ./bwaout/<span class="variable">${sample}</span>.sorted.bam</span></span><br><span class="line"><span class="meta">#</span><span class="bash">samtools index ./bwaout/<span class="variable">${sample}</span>.sorted.bam</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="keyword">done</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> Damn, I forgot why I use repair.sh.</span></span><br><span class="line">for sample in 3k_1 5k-2; do</span><br><span class="line">/software/bbmap/repair.sh 
in1=/DenovoSeq/raw_data/${sample}_R1.fastq in2=/DenovoSeq/raw_data/${sample}_R2.fastq out1=/DenovoSeq/raw_data/${sample}_R1.fixed.fastq out2=/DenovoSeq/raw_data/${sample}_R2.fixed.fastq</span><br><span class="line">/software/bwa-0.7.15/bwa mem -t 40 /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa /DenovoSeq/raw_data/${sample}_R1.fixed.fastq /DenovoSeq/raw_data/${sample}_R2.fixed.fastq | samtools view -uS - | samtools sort -@ 8 -m 4G - -T sam_sort_tmp -o ./bwaout/${sample}.sorted.bam</span><br><span class="line">samtools index ./bwaout/${sample}.sorted.bam</span><br><span class="line">done</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> scaffold the contigs of MEGAHIT</span></span><br><span class="line">export PATH=/software/Python.2.7.13/bin:$PATH</span><br><span class="line"></span><br><span class="line">/software/BESST/runBESST -plots -q -c /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa -f ./bwaout/270B.sorted.bam ./bwaout/500B.sorted.bam ./bwaout/800B.sorted.bam ./bwaout/3k_1.sorted.bam ./bwaout/5k-1.sorted.bam ./bwaout/5k-2.sorted.bam ./bwaout/10k.sorted.bam -orientation fr fr fr rf rf rf rf</span><br><span class="line"></span><br><span class="line">/software/BESST/runBESST -plots -q -c /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa -f ./bwaout/500B.sorted.bam ./bwaout/800B.sorted.bam ./bwaout/3k_1.sorted.bam ./bwaout/5k-1.sorted.bam ./bwaout/5k-2.sorted.bam ./bwaout/10k.sorted.bam -orientation fr fr rf rf rf rf -o ./no270</span><br><span class="line"></span><br><span class="line">echo Finish time is `date +%Y/%m/%d--%H:%M`</span><br></pre></td></tr></table></figure><p>Tested with or without ins_270 library.</p><p>And the stats I got:</p><p>with ins_270 library:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 778878802</span><br><span class="line">Size_withoutN: 755401923</span><br><span class="line">Seq_Num: 811939</span><br><span class="line">Mean_Size: 959</span><br><span class="line">Median_Size: 393</span><br><span class="line">Longest_Seq: 561369</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.65</span><br><span class="line">N50: 2715</span><br><span class="line">N90: 342</span><br><span class="line">Gap: 3.01</span><br></pre></td></tr></table></figure><p>without ins_270 library:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 654690437</span><br><span class="line">Size_withoutN: 626185453</span><br><span class="line">Seq_Num: 618555</span><br><span class="line">Mean_Size: 1058</span><br><span class="line">Median_Size: 438</span><br><span class="line">Longest_Seq: 327376</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.31</span><br><span class="line">N50: 2277</span><br><span class="line">N90: 365</span><br><span class="line">Gap: 4.35</span><br></pre></td></tr></table></figure><p>Although I did not test it thoroughly, <code>BESST</code> gave worse results than <code>SOAPdenovo</code> in my case, so I gave up on this tool.</p><p>This note can serve as a reference in case I have to use it again.</p>]]></content>
<summary type="html">
<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2>
<p>From the introduction of <a href="https://github.com/ksahlin/BE
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="NGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/NGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="NGS genome assembly" scheme="https://yiweiniu.github.io/blog/tags/NGS-genome-assembly/"/>
<category term="scaffolding" scheme="https://yiweiniu.github.io/blog/tags/scaffolding/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: LINKS</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-LINKS/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-LINKS/</id>
<published>2018-03-29T16:07:28.000Z</published>
<updated>2018-08-18T03:32:58.000Z</updated>
<content type="html"><![CDATA[<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><p>From its <a href="http://www.bcgsc.ca/platform/bioinfo/software/links" target="_blank" rel="noopener">Git Repo</a></p><blockquote><p>LINKS is a scalable genomics application for scaffolding or re-scaffolding genome assembly drafts with long reads, such as those produced by Oxford Nanopore Technologies Ltd and Pacific Biosciences. It provides a generic alignment-free framework for scaffolding and can work on any sequences. It is versatile and supports not only long sequences as a source of long-range information, but also MPET pairs and linked-reads, such as those from the 10X Genomics GemCode and Chromium platform, via ARCS (<a href="http://www.bcgsc.ca/platform/bioinfo/software/arcs" target="_blank" rel="noopener">http://www.bcgsc.ca/platform/bioinfo/software/arcs</a>). Fill gaps in LINKS-derived scaffolds using Sealer (<a href="http://www.bcgsc.ca/platform/bioinfo/software/sealer" target="_blank" rel="noopener">http://www.bcgsc.ca/platform/bioinfo/software/sealer</a>).</p></blockquote><p>Its paper:</p><blockquote><p>Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. 
doi:10.1186/s13742-015-0076-3</p></blockquote><p>My feelings:</p><ul><li>easy to use</li><li>supports scaffolding and re-scaffolding</li><li>fast</li><li>comparatively good results (at least in my case, compared to <code>BESST</code> and <code>OPERA-LG</code>)</li><li>huge RAM consumption</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><h3 id="install"><a class="markdownIt-Anchor" href="#install"></a> Install</h3><p>Download the latest version from its <a href="https://github.com/bcgsc/LINKS/releases" target="_blank" rel="noopener">releases</a> page.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> decompress</span></span><br><span class="line">tar -zxvf links_v1-8-6.tar.gz; cd links_v1.8.6</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> build the BloomFilter Perl module (adjust the -I path to your Perl CORE directory)</span></span><br><span class="line">cd lib/bloomfilter/swig</span><br><span class="line">swig -Wall -c++ -perl5 BloomFilter.i</span><br><span class="line">g++ -c BloomFilter_wrap.cxx -I/usr/lib64/perl5/CORE -fPIC -Dbool=char -O3</span><br><span class="line">g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3</span><br></pre></td></tr></table></figure><h3 id="parameter-setting"><a class="markdownIt-Anchor" href="#parameter-setting"></a> Parameter setting</h3><p>The <a href="https://github.com/bcgsc/LINKS#how-it-works" target="_blank" rel="noopener">How it works</a> section of the LINKS README gives a general description of the way <code>LINKS</code> works.</p><p>Setting the parameters is crucial for scaffolding. 
Below I summarize some tips or practices when setting these parameters.</p><ol><li>scaffolding control</li></ol><ul><li><code>-a</code>: the maximum ratio between the best two contig pairs for a given seed/contig being extended (default: 0.3).</li><li><code>-l</code>: the minimum number of links (read pairs) a valid contig pair MUST have to be considered (default: 5).</li><li><blockquote><p>For example, contig A shares 4 links with B and 2 links with C, in this orientation. contig rA (reverse) also shares 3 links with D. When it’s time to extend contig A (with the options -l and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered. Since C (second-best) has 2 links and B (best) has 4 (2/4) = 0.5 below the maximum ratio of 0.7, A will be linked with B in the scaffold and C will be kept for another extension. If AC had 3 links the resulting ratio (0.75), above the user-defined maximum 0.7 would have caused the extension to terminate at A, with both B and C considered for a different scaffold. A maximum links ratio of 1 (not recommended) means that the best two candidate contig pairs have the same number of links – LINKS will accept the first one since both have a valid gap/overlap. When a scaffold extension is terminated on one side, the scaffold is extended on the “left”, by looking for contig pairs that involve the reverse of the seed (in this example, rAD). 
With AB and AC having 4 and 2 links, respectively and rAD being the only pair on the left, the final scaffolds outputted by LINKS would be: rD-A-B and C.</p></blockquote></li></ul><ol start="2"><li>k-mer length: <code>-k</code></li></ol><ul><li><code>-k</code>: k-mer value (default 15).</li><li>LINKS is a k-mer scaffolder, and the <code>-k</code> parameter controls the k-mer length.</li><li>Exploring a wide range of k-mer values is expected to yield better scaffolding results.</li><li>You may increase <code>-k</code> to <code>21</code> when working with PacBio reads.</li><li>I also recommend correcting ONT reads if you can; it will allow you to choose higher k values and increase the specificity.</li><li>The sweet spot will be somewhere around k = 15-19 for a 2 Gb genome (assuming raw ONT reads).</li></ul><ol start="3"><li>k-mer pair extraction: <code>-d</code>, <code>-e</code> and <code>-t</code>. Note that <code>-t</code> and <code>-d</code> strongly affect memory usage.</li></ol><ul><li><code>-e</code>: error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default: 0.1)<ul><li>In theory, the -e parameter will play an important role limiting linkages outside of the target range -d (+/-) -e %. This is especially true when using raw MPET for scaffolding, to limit spurious linkages by contaminating PETs.</li></ul></li><li><code>-d</code>: distance between the 5’-ends of each pair (default: 4000)</li><li><code>-t</code>: sliding window when extracting k-mer pairs from long reads (default: 2)</li><li>Because you want to start with a low -d for scaffolding, you have to estimate how many minimum links (-l) would fit in a -d window +/- error -e given sliding window -t. 
For instance, it may not make sense to use -t 200, -d 500 at low coverages BUT if you have at least 10-fold coverage it might since, in principle, you should be able to derive sufficient k-mer pairs within same locus if there’s no bias in genome sequencing.</li><li>On the data side of things, reducing the coverage (using less long reads), and limiting to only the highest quality reads would help decrease RAM usage.</li><li>WARNING: Specifying many distances will require large amount of RAM, especially with low -t values.</li><li>As <code>-d</code> increases, <code>-t</code> must decrease (otherwise you’ll end up with too few pairs for scaffolding over larger kmer distances).</li></ul><h4 id="an-example"><a class="markdownIt-Anchor" href="#an-example"></a> An example</h4><p>The power of <code>LINKS</code> is in scaffolding using various distance constraints, iteratively.</p><ol><li>Running links multiple times (when working with large long read dataset / genomes are very big)</li></ol><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">nohup ./runIterativeLINKS.sh 17 beluga.fa 5 0.3 &</span><br><span class="line">./LINKS -f $2 -s ont.fof -b links1 -d 1000 -t 10 -k $1 -l $3 -a $4</span><br><span class="line">./LINKS -f links1.scaffolds.fa -s ont.fof -b links2 -d 2500 -t 5 -k $1 -l $3 -a $4 -o 1 -r links1.bloom</span><br><span class="line">./LINKS -f links2.scaffolds.fa -s ont.fof -b links3 -d 5000 -t 5 -k $1 -l $3 -a $4 -o 2 -r links1.bloom</span><br><span class="line">./LINKS -f links3.scaffolds.fa -s ont.fof -b links4 -d 7500 -t 4 -k $1 -l $3 -a $4 -o 3 -r links1.bloom</span><br><span class="line">./LINKS -f 
links4.scaffolds.fa -s ont.fof -b links5 -d 10000 -t 4 -k $1 -l $3 -a $4 -o 4 -r links1.bloom</span><br><span class="line">./LINKS -f links5.scaffolds.fa -s ont.fof -b links6 -d 12500 -t 3 -k $1 -l $3 -a $4 -o 5 -r links1.bloom</span><br><span class="line">./LINKS -f links6.scaffolds.fa -s ont.fof -b links7 -d 15000 -t 3 -k $1 -l $3 -a $4 -o 6 -r links1.bloom</span><br><span class="line">./LINKS -f links7.scaffolds.fa -s ont.fof -b links8 -d 30000 -t 2 -k $1 -l $3 -a $4 -o 7 -r links1.bloom</span><br></pre></td></tr></table></figure><ol start="2"><li>Running LINKS iteratively in a single command (works best when genomes are small/RAM not limiting)</li></ol><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./LINKS -f $2 -s ont.fof -b links1 -d 1000,2500,5000,7500,10000,12500,15000,30000 -t 10,5,5,4,4,3,3,2 -k $1 -l $3 -a $4</span><br></pre></td></tr></table></figure><ul><li>the value of <code>-t</code> determines the number of k-mer pairs extracted and, as a consequence, the RAM used;
this will vary from one dataset to the next.</li></ul><p>The example was from: <a href="https://github.com/bcgsc/LINKS/issues/20" target="_blank" rel="noopener">working with 2 Gb genome; stuck on bloom filter being built</a>.</p><p>It seems <code>LINKS</code> needs huge RAM, and you may also want to try <a href="https://github.com/bcgsc/LINKS/issues/20" target="_blank" rel="noopener">lrscaf</a>, another long reads scaffolding tool.</p><h2 id="in-practice"><a class="markdownIt-Anchor" href="#in-practice"></a> In practice</h2><h3 id="background"><a class="markdownIt-Anchor" href="#background"></a> background</h3><ul><li>An insect</li><li>The species: high heterogeneity, high AT, high repetition.</li><li>Genome size: male 790M, female 830M.</li></ul><h3 id="data"><a class="markdownIt-Anchor" href="#data"></a> data</h3><p>The Illumina data I used:</p><table><thead><tr><th>Source</th><th>Insert size (bp)</th><th>Avg. read size (bp)</th><th>Raw bases (G)</th><th>Raw reads (M)</th><th>Sequencing depth</th></tr></thead><tbody><tr><td>AV1, M</td><td>270</td><td>150</td><td>44.1</td><td>293.6</td><td>55.5</td></tr><tr><td>AV2, F</td><td>500</td><td>150</td><td>24.4</td><td>162.8</td><td>29.4</td></tr><tr><td>AV2, F</td><td>800</td><td>150</td><td>15.8</td><td>105.4</td><td>19.0</td></tr><tr><td>AV2, F</td><td>3k</td><td>114</td><td>10.4</td><td>91.8</td><td>12.5</td></tr><tr><td>AV2, F</td><td>5k</td><td>150</td><td>29.8</td><td>198.7</td><td>35.9</td></tr><tr><td>AV2, F</td><td>5k</td><td>114</td><td>11.5</td><td>101.2</td><td>13.8</td></tr><tr><td>AV2, F</td><td>10k</td><td>150</td><td>17.5</td><td>116.8</td><td>21.1</td></tr><tr><td>Total</td><td>-</td><td>-</td><td>153.5</td><td>1070.3</td><td>187.3</td></tr></tbody></table><p>And the PacBio data:</p><table><thead><tr><th>Source</th><th>Raw bases (G)</th><th>Raw reads (M)</th><th>Sequencing depth</th><th>Avg.read size (bp)</th><th>N50 (bp)</th><th>N90 (bp)</th><th>Note</th></tr></thead><tbody><tr><td>AV3, 
F</td><td>15.2</td><td>20.1</td><td>18.31</td><td>7550</td><td>10046</td><td>4558</td><td>20170111</td></tr><tr><td>AV4, F</td><td>45.2</td><td>4.6</td><td>54.46</td><td>9798</td><td>17348</td><td>5702</td><td>20171224</td></tr><tr><td>Total</td><td>60.4</td><td>24.7</td><td>72.77</td><td>9115</td><td>14630</td><td>5310</td><td>-</td></tr></tbody></table><p>Because we didn’t receive the data at the same time (first all Illumina data, then 20X PacBio data, finally another 50X PacBio data), we tried two families of assembly strategies: Illumina-dominant and PacBio-dominant. We found that ‘PacBio-dominant pipelines’ produced significantly better results, so we gradually gave up on the other pipelines and focused on exploring PacBio assemblers.</p><p>This note records one of the ‘Illumina-dominant pipelines’ attempts: I ran LINKS as a scaffolding tool after running MEGAHIT-SOAPdenovo successfully. I tried many MEGAHIT configurations and ran LINKS after every attempt. See <a href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-MEGAHIT-SOAPdenovo-fusion/">Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion</a>. I put two representative results here as a memo.</p><p>PS: I encountered a problem and solved it with the help of the author: <a href="https://github.com/bcgsc/LINKS/issues/12" target="_blank" rel="noopener">LINKS termineted with no error message when I used hybrid reads to scaffold</a>.</p><p>The <code>-t</code> and <code>-d</code> parameters are important for RAM consumption. I tried <code>-t</code> values of <code>5, 10, 20</code>, and <code>-t 20</code> worked. 
<code>-d</code> didn’t matter in my case, so I simply supplied many <code>-d</code> values.</p><p>Scripts I used and the stats I got:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">LINKS_HOME=/software/links_v1.8.5</span><br><span class="line">WORK_DIR=/DenovoSeq</span><br><span class="line"></span><br><span class="line">echo Start time is `date +%Y/%m/%d--%H:%M`</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> re-scaffold the output of MEGAHIT small insert size and SOAP-fusion. 
use about 20X data</span></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="variable">$LINKS_HOME</span>/LINKS -k 21 -f <span class="variable">$WORK_DIR</span>/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio.fof -b reLINKS2 -t 20 -d 4000,5000,6000,8000,10000,15000,20000</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"> re-scaffold the output of MEGAHIT small insert size and SOAP-fusion, use about 70X data</span></span><br><span class="line"><span class="meta">$</span><span class="bash">LINKS_HOME/LINKS -k 21 -f <span class="variable">$WORK_DIR</span>/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio_171231.fof -b reLINKS_171231 -t 20 -d 4000,5000,6000,8000,10000,15000,20000</span></span><br><span class="line"></span><br><span class="line">echo Finish time is `date +%Y/%m/%d--%H:%M`</span><br></pre></td></tr></table></figure><p>And the input data:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># first batch PB data, about 20X</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> cat Pacbio.fof</span></span><br><span class="line">/DenovoSeq/Third_rawData/av_20k.fasta</span><br><span class="line"></span><br><span class="line"><span class="meta">#</span><span class="bash"><span class="comment"># all PB data, about 70X</span></span></span><br><span class="line"><span class="meta">$</span><span class="bash"> cat Pacbio_171231.fof</span></span><br><span class="line">/DenovoSeq/Third_rawData/third_all.fasta</span><br></pre></td></tr></table></figure><p>The stats 
I got:</p><p>Re-scaffolding the output of MEGAHIT small insert size and SOAP-fusion, using ~20X PB data:</p><table><thead><tr><th></th><th>MEGAHIT</th><th>SOAP-fusion</th><th>LINKS</th></tr></thead><tbody><tr><td>Size_includeN</td><td>582099585</td><td>765246417</td><td>776147223</td></tr><tr><td>Size_withoutN</td><td>582099585</td><td>574675978</td><td>574675978</td></tr><tr><td>Seq_Num</td><td>575808</td><td>377325</td><td>370033</td></tr><tr><td>Mean_Size</td><td>1010</td><td>2028</td><td>2097</td></tr><tr><td>Median_Size</td><td>433</td><td>393</td><td>389</td></tr><tr><td>Longest_Seq</td><td>55064</td><td>436254</td><td>621272</td></tr><tr><td>Shortest_Seq</td><td>200</td><td>200</td><td>200</td></tr><tr><td>GC_Content</td><td>31.43</td><td>31.42</td><td>31.42</td></tr><tr><td>N50</td><td>2158</td><td>33478</td><td>42519</td></tr><tr><td>N90</td><td>358</td><td>439</td><td>444</td></tr><tr><td>Gap</td><td>0</td><td>24.9</td><td>25.96</td></tr></tbody></table><p>Re-scaffolding the output of MEGAHIT small insert size and SOAP-fusion, using ~70X PB data:</p><table><thead><tr><th></th><th>MEGAHIT</th><th>SOAP-fusion</th><th>LINKS</th></tr></thead><tbody><tr><td>Size_includeN</td><td>582099585</td><td>765246417</td><td>798803587</td></tr><tr><td>Size_withoutN</td><td>582099585</td><td>574675978</td><td>574675978</td></tr><tr><td>Seq_Num</td><td>575808</td><td>377325</td><td>359653</td></tr><tr><td>Mean_Size</td><td>1010</td><td>2028</td><td>2221</td></tr><tr><td>Median_Size</td><td>433</td><td>393</td><td>384</td></tr><tr><td>Longest_Seq</td><td>55064</td><td>436254</td><td>732454</td></tr><tr><td>Shortest_Seq</td><td>200</td><td>200</td><td>200</td></tr><tr><td>GC_Content</td><td>31.43</td><td>31.42</td><td>31.42</td></tr><tr><td>N50</td><td>2158</td><td>33478</td><td>68195</td></tr><tr><td>N90</td><td>358</td><td>439</td><td>453</td></tr><tr><td>Gap</td><td>0</td><td>24.9</td><td>28.06</td></tr></tbody></table><p>The last run took about <code>27 
hours</code>.</p><p>As can be seen, adding another 50X of PacBio data did not help much (N50 rose from 42519 to 68195, while the gap fraction grew from 25.96 to 28.06). After this, we switched to ‘PacBio-dominant pipelines’ and tried many different assemblers. Still, LINKS is a good scaffolding tool if you assemble the genome with Illumina data and want to scaffold further with less than 20X of long reads.</p><p>This note can serve as a reference in case I have to use it again.</p><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180307: created the note.</li><li>20180709: updated the ‘background information’.</li><li>20180817: changed the ‘General usage’ part, added content about ‘parameter setting’.</li></ul>]]></content>
<summary type="html">
<h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2>
<p>From its <a href="http://www.bcgsc.ca/platform/bioinfo/software
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="hybrid pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/hybrid-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="scaffolding" scheme="https://yiweiniu.github.io/blog/tags/scaffolding/"/>
<category term="hybrid genome assembly" scheme="https://yiweiniu.github.io/blog/tags/hybrid-genome-assembly/"/>
</entry>
<entry>
<title>Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-MEGAHIT-SOAPdenovo-fusion/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Genome-assembly-pipeline-MEGAHIT-SOAPdenovo-fusion/</id>
<published>2018-03-29T15:54:59.000Z</published>
<updated>2018-07-09T02:32:52.000Z</updated>
<content type="html"><![CDATA[<p><code>MEGAHIT</code> can be used to assemble contigs, and <code>SOAPdenovo-fusion</code> can be used for scaffolding. Since they were developed by the same team, I put them together in one note.</p><p>This note is more about <code>MEGAHIT</code> and its performance, because you can choose not to use <code>SOAPdenovo-fusion</code>. <code>SOAPdenovo-fusion</code> had comparatively good performance in my case, so why not give it a try?</p><h2 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h2><p>From the <a href="https://github.com/voutcn/megahit" target="_blank" rel="noopener">MEGAHIT git repo</a>:</p><blockquote><p>MEGAHIT is a single node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly. MEGAHIT can optionally utilize a CUDA-enabled GPU to accelerate its SdBG construction. The GPU-accelerated version of MEGAHIT has been tested on NVIDIA GTX680 (4G memory) and Tesla K40c (12G memory) with CUDA 5.5, 6.0 and 6.5. MEGAHIT v1.0 or greater also supports IBM Power PC and has been tested on IBM POWER8.</p></blockquote><p>Its paper:</p><blockquote><p>Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–1676. 
doi:10.1093/bioinformatics/btv033</p></blockquote><p>My feelings:</p><ul><li>very easy to use</li><li>fast enough</li><li>better than SOAPdenovo2</li><li>no need to designate k-mer</li></ul><h2 id="general-usage"><a class="markdownIt-Anchor" href="#general-usage"></a> General usage</h2><p>See <a href="https://github.com/voutcn/megahit/wiki" target="_blank" rel="noopener">MEGAHIT wiki</a> for full docs.</p><ul><li><a href="https://github.com/voutcn/megahit/wiki/Assembly-Tips" target="_blank" rel="noopener">Assembly Tips</a></li><li><a href="https://github.com/voutcn/megahit/wiki/An-example-of-real-assembly" target="_blank" rel="noopener">An example of real assembly</a></li></ul><h2 id="in-practice-megahit"><a class="markdownIt-Anchor" href="#in-practice-megahit"></a> In practice - MEGAHIT</h2><ul><li>An insect</li><li>The species: high heterogeneity, high AT, high repetition.</li><li>Genome size: male 790M, female 830M.</li></ul><h3 id="data"><a class="markdownIt-Anchor" href="#data"></a> data</h3><p>The Illumina data I used:</p><table><thead><tr><th>Source</th><th>Insert size (bp)</th><th>Avg. 
read size (bp)</th><th>Raw bases (G)</th><th>Raw reads (M)</th><th>Sequencing depth</th></tr></thead><tbody><tr><td>AV1, M</td><td>270</td><td>150</td><td>44.1</td><td>293.6</td><td>55.5</td></tr><tr><td>AV2, F</td><td>500</td><td>150</td><td>24.4</td><td>162.8</td><td>29.4</td></tr><tr><td>AV2, F</td><td>800</td><td>150</td><td>15.8</td><td>105.4</td><td>19.0</td></tr><tr><td>AV2, F</td><td>3k</td><td>114</td><td>10.4</td><td>91.8</td><td>12.5</td></tr><tr><td>AV2, F</td><td>5k</td><td>150</td><td>29.8</td><td>198.7</td><td>35.9</td></tr><tr><td>AV2, F</td><td>5k</td><td>114</td><td>11.5</td><td>101.2</td><td>13.8</td></tr><tr><td>AV2, F</td><td>10k</td><td>150</td><td>17.5</td><td>116.8</td><td>21.1</td></tr><tr><td>Total</td><td>-</td><td>-</td><td>153.5</td><td>1070.3</td><td>187.3</td></tr></tbody></table><p>I tried <code>MEGAHIT</code> with raw or trimmed data, with or without the ins_270 library, and with all libraries (PE and MPE) or the PE libraries only. The scripts I used and the stats I got are below. I tested with and without the ins_270 library because it came from a male, whereas the other libraries came from females.</p><p>I compared all libraries (PE and MPE) against PE-only because I first ran <code>MEGAHIT</code> with all the data I had, and the author then recommended using only the PE libraries. 
See the discussions with the authors.</p><ul><li><a href="https://github.com/aquaskyline/SOAPdenovo2/issues/27" target="_blank" rel="noopener">Segmentation fault with scaff step when use different MEGAHIT’s output</a></li><li><a href="https://github.com/aquaskyline/SOAPdenovo2/issues/26" target="_blank" rel="noopener">How to use SOAPdenovo-fusion scaffold the output of MEGAHIT?</a></li></ul><h3 id="run1-all-raw-data"><a class="markdownIt-Anchor" href="#run1-all-raw-data"></a> run1, all raw data</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/270B_R1.fastq,/DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/270B_R2.fastq,/DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out1</span><br></pre></td></tr></table></figure><p>and the stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 894816039</span><br><span class="line">Size_withoutN: 894816039</span><br><span class="line">Seq_Num: 1292253</span><br><span class="line">Mean_Size: 692</span><br><span class="line">Median_Size: 418</span><br><span class="line">Longest_Seq: 42542</span><br><span 
class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.9</span><br><span class="line">N50: 834</span><br><span class="line">N90: 333</span><br></pre></td></tr></table></figure><h3 id="run2-all-trimmed-data-by-trimmomatic"><a class="markdownIt-Anchor" href="#run2-all-trimmed-data-by-trimmomatic"></a> run2, all trimmed data (by Trimmomatic)</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 38 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq,/DenovoSeq/trimmomatic/3k_1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-2_R_1P.fastq,/DenovoSeq/trimmomatic/10k_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq,/DenovoSeq/trimmomatic/3k_1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-2_R_2P.fastq,/DenovoSeq/trimmomatic/10k_R_2P.fastq</span><br></pre></td></tr></table></figure><p>and the stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 767662393</span><br><span class="line">Size_withoutN: 767662393</span><br><span class="line">Seq_Num: 989298</span><br><span class="line">Mean_Size: 775</span><br><span class="line">Median_Size: 428</span><br><span class="line">Longest_Seq: 80889</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 
32.64</span><br><span class="line">N50: 1115</span><br><span class="line">N90: 340</span><br></pre></td></tr></table></figure><h3 id="run3-all-raw-data-without-ins_270"><a class="markdownIt-Anchor" href="#run3-all-raw-data-without-ins_270"></a> run3, all raw data, without ins_270</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out.no2701</span><br></pre></td></tr></table></figure><p>and the stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 800012379</span><br><span class="line">Size_withoutN: 800012379</span><br><span class="line">Seq_Num: 1045419</span><br><span class="line">Mean_Size: 765</span><br><span class="line">Median_Size: 422</span><br><span class="line">Longest_Seq: 52915</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.64</span><br><span class="line">N50: 1074</span><br><span class="line">N90: 340</span><br></pre></td></tr></table></figure><h3 id="run4-all-trimmed-data-without-ins_270"><a class="markdownIt-Anchor" 
href="#run4-all-trimmed-data-without-ins_270"></a> run4, all trimmed data, without ins_270</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 679262960</span><br><span class="line">Size_withoutN: 679262960</span><br><span class="line">Seq_Num: 782765</span><br><span class="line">Mean_Size: 867</span><br><span class="line">Median_Size: 429</span><br><span class="line">Longest_Seq: 61533</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 32.31</span><br><span class="line">N50: 1529</span><br><span class="line">N90: 348</span><br></pre></td></tr></table></figure><h3 id="run5-use-only-with-all-pe-libraries"><a class="markdownIt-Anchor" href="#run5-use-only-with-all-pe-libraries"></a> run5, use only with all PE libraries</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert</span><br></pre></td></tr></table></figure><p>and the stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 672319141</span><br><span class="line">Size_withoutN: 672319141</span><br><span class="line">Seq_Num: 777747</span><br><span class="line">Mean_Size: 864</span><br><span class="line">Median_Size: 428</span><br><span class="line">Longest_Seq: 71939</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.93</span><br><span class="line">N50: 1505</span><br><span class="line">N90: 346</span><br></pre></td></tr></table></figure><h3 id="run6-use-only-with-all-pe-libraries-but-ins_270"><a class="markdownIt-Anchor" href="#run6-use-only-with-all-pe-libraries-but-ins_270"></a> run6, use only with all PE libraries but ins_270</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert.no270</span><br></pre></td></tr></table></figure><p>and the stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 582099585</span><br><span class="line">Size_withoutN: 582099585</span><br><span class="line">Seq_Num: 575808</span><br><span class="line">Mean_Size: 1010</span><br><span class="line">Median_Size: 433</span><br><span class="line">Longest_Seq: 55064</span><br><span 
class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.43</span><br><span class="line">N50: 2158</span><br><span class="line">N90: 358</span><br></pre></td></tr></table></figure><h3 id="conclusions"><a class="markdownIt-Anchor" href="#conclusions"></a> conclusions</h3><p>Although this has not been fully tested, I can draw some simple conclusions:</p><ul><li>trimmed data generates better results than raw data (though the way the data is trimmed will influence the results)</li><li>using only PE libraries generates better results than using all libraries (PE, MPE)</li></ul><h2 id="in-practice-soapdenovo-fusion"><a class="markdownIt-Anchor" href="#in-practice-soapdenovo-fusion"></a> In practice - SOAPdenovo-fusion</h2><p>I asked the authors: <a href="https://github.com/aquaskyline/SOAPdenovo2/issues/26" target="_blank" rel="noopener">How to use SOAPdenovo-fusion scaffold the output of MEGAHIT?</a></p><p>I first tried <code>SOAPdenovo-fusion</code> with and without the ins_270 library, and found that leaving out the ins_270 library gave better results (tested with <code>k-mer=63</code>). Then I tested different k-mer sizes: <code>37, 41, 43, 45, 55, 61, 63, 71, 75</code>, and found that <code>kmer=41</code> gave the best results. 
I’ve also tried running with and without the <code>-F</code> parameter, but I don’t fully understand the difference.</p><p>Here are the config, scripts and stats for <code>kmer = 41</code>.</p><p>config</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span 
class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br></pre></td><td class="code"><pre><span class="line">#maximal read length</span><br><span class="line">max_rd_len=151</span><br><span class="line">[LIB]</span><br><span class="line">avg_ins=500</span><br><span class="line">reverse_seq=0</span><br><span class="line">asm_flags=2</span><br><span class="line">#in which order the reads are used while scaffolding</span><br><span class="line">rank=1</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 3 for short insert size)</span><br><span class="line">pair_num_cutoff=3</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)</span><br><span class="line">map_len=32</span><br><span class="line">#a pair of fastq file, read 1 file should always be followed by read 2 file</span><br><span class="line">q1=/DenovoSeq/trimmomatic/500B_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/500B_R_2P.fastq</span><br><span class="line">[LIB]</span><br><span class="line">#average insert size</span><br><span class="line">avg_ins=800</span><br><span class="line">#if sequence needs to be reversed</span><br><span class="line">reverse_seq=0</span><br><span class="line">#in which part(s) the reads are used</span><br><span 
class="line">asm_flags=2</span><br><span class="line">#in which order the reads are used while scaffolding</span><br><span class="line">rank=3</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 3 for short insert size)</span><br><span class="line">pair_num_cutoff=3</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)</span><br><span class="line">map_len=32</span><br><span class="line">#a pair of fastq file, read 1 file should always be followed by read 2 file</span><br><span class="line">q1=/DenovoSeq/trimmomatic/800B_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/800B_R_2P.fastq</span><br><span class="line">[LIB]</span><br><span class="line">avg_ins=3000</span><br><span class="line">reverse_seq=1</span><br><span class="line">asm_flags=2</span><br><span class="line">rank=3</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 5 for large insert size)</span><br><span class="line">pair_num_cutoff=4</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)</span><br><span class="line">map_len=35</span><br><span class="line">q1=/DenovoSeq/trimmomatic/3k_1_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/3k_1_R_2P.fastq</span><br><span class="line">[LIB]</span><br><span class="line">avg_ins=5000</span><br><span class="line">reverse_seq=1</span><br><span class="line">asm_flags=2</span><br><span class="line">rank=4</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 5 for large insert size)</span><br><span class="line">pair_num_cutoff=5</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)</span><br><span class="line">map_len=35</span><br><span 
class="line">q1=/DenovoSeq/trimmomatic/5k-1_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/5k-1_R_2P.fastq</span><br><span class="line">[LIB]</span><br><span class="line">avg_ins=5000</span><br><span class="line">reverse_seq=1</span><br><span class="line">asm_flags=2</span><br><span class="line">rank=4</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 5 for large insert size)</span><br><span class="line">pair_num_cutoff=5</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)</span><br><span class="line">map_len=35</span><br><span class="line">q1=/DenovoSeq/trimmomatic/5k-2_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/5k-2_R_2P.fastq</span><br><span class="line">[LIB]</span><br><span class="line">avg_ins=10000</span><br><span class="line">reverse_seq=1</span><br><span class="line">asm_flags=2</span><br><span class="line">rank=5</span><br><span class="line"># cutoff of pair number for a reliable connection (at least 5 for large insert size)</span><br><span class="line">pair_num_cutoff=5</span><br><span class="line">#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)</span><br><span class="line">map_len=35</span><br><span class="line">q1=/DenovoSeq/trimmomatic/10k_R_1P.fastq</span><br><span class="line">q2=/DenovoSeq/trimmomatic/10k_R_2P.fastq</span><br></pre></td></tr></table></figure><h3 id="run1-without-f"><a class="markdownIt-Anchor" href="#run1-without-f"></a> run1, without <code>-F</code></h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41 -c ../final.contigs.fa</span><br><span 
class="line">/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k41</span><br><span class="line">/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k41</span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 765246417</span><br><span class="line">Size_withoutN: 574675978</span><br><span class="line">Seq_Num: 377325</span><br><span class="line">Mean_Size: 2028</span><br><span class="line">Median_Size: 393</span><br><span class="line">Longest_Seq: 436254</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.42</span><br><span class="line">N50: 33478</span><br><span class="line">N90: 439</span><br><span class="line">Gap: 24.9</span><br></pre></td></tr></table></figure><h3 id="run2-with-f"><a class="markdownIt-Anchor" href="#run2-with-f"></a> run2, with <code>-F</code></h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41_1 -c ../final.contigs.fa</span><br><span class="line">/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k41_1</span><br><span class="line">/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k41_1 -F</span><br></pre></td></tr></table></figure><p>stats:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">Size_includeN: 764869730</span><br><span class="line">Size_withoutN: 579451220</span><br><span class="line">Seq_Num: 377325</span><br><span class="line">Mean_Size: 2027</span><br><span class="line">Median_Size: 393</span><br><span class="line">Longest_Seq: 436104</span><br><span class="line">Shortest_Seq: 200</span><br><span class="line">GC_Content: 31.45</span><br><span class="line">N50: 33463</span><br><span class="line">N90: 439</span><br><span class="line">Gap: 24.24</span><br></pre></td></tr></table></figure><p>It seems that the <code>-F</code> parameter didn’t help much: the gap percentage only dropped from 24.9 to 24.24.</p><p>This note is a reference in case I have to use these tools again.</p><h2 id="change-log"><a class="markdownIt-Anchor" href="#change-log"></a> Change log</h2><ul><li>20180308: create the note.</li></ul>]]></content>
<summary type="html">
<p><code>MEGAHIT</code> can be used to assemble contigs, and <code>SOAPdenovo-fusion</code> can be used for scaffolding. Since they were dev
</summary>
<category term="genome assembly" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/"/>
<category term="NGS pipeline" scheme="https://yiweiniu.github.io/blog/categories/genome-assembly/NGS-pipeline/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="genome assembly pipeline" scheme="https://yiweiniu.github.io/blog/tags/genome-assembly-pipeline/"/>
<category term="NGS genome assembly" scheme="https://yiweiniu.github.io/blog/tags/NGS-genome-assembly/"/>
</entry>
<entry>
<title>Upstream Analysis of TCR/BCR Repertoires in RNA-seq Data</title>
<link href="https://yiweiniu.github.io/blog/2018/03/Upstream-analysis-of-TCR-BCR-repertoires-in-RNA-seq-data/"/>
<id>https://yiweiniu.github.io/blog/2018/03/Upstream-analysis-of-TCR-BCR-repertoires-in-RNA-seq-data/</id>
<published>2018-03-29T15:25:12.000Z</published>
<updated>2018-04-23T10:27:55.000Z</updated>
<content type="html"><![CDATA[<p><strong>Background</strong>: the transcriptome data I recently worked with is closely related to immunity, so I want to dig out some immune-related information from it.</p><p>When I read <a href="https://github.com/crazyhottommy/RNA-seq-analysis#immnune-related" target="_blank" rel="noopener">RNAseq analysis notes from Tommy Tang</a>, I found <a href="https://github.com/mandricigor/imrep/wiki" target="_blank" rel="noopener">ImReP</a>. It’s a tool designed for profiling TCR/BCR repertoires in regular RNA-seq data, something I had never heard of before.</p><p>Then I found more tools through the <code>ImRep</code> paper <sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. They are:</p><ul><li><code>MiXCR</code> <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></li><li><code>TRUST</code> <sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup></li><li><code>TraCeR</code> <sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup></li><li><code>V'DJer</code> <sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup></li><li><code>IgBlast-based pipeline</code> <sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup></li><li><code>iSSAKE</code> <sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup></li></ul><p>The <code>ImRep</code> paper describes them as follows:</p><blockquote><p>TRUST and TraCeR do not support the analysis of BCR sequences and were excluded from the comparison based for the IGH data. iSSAKE is no longer supported and was not recommended for use. Unfortunately, we obtained empty output after running V’DJer, and increasing coverage in the simulated data did not solve the problem. Alternative approaches, such as IMSEQ, cannot be applied directly to RNA-Seq reads because they were originally designed for targeted sequencing of B or T cell receptor loci. 
Thus, to independently assess and compare accuracy with ImReP, we only ran IMSEQ with the simulated reads derived from BCR or TCR transcripts (Figure S1). Scripts and commands to run all tools used in this study are provided in the Extended Experimental Procedures and are available online at <a href="https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing" target="_blank" rel="noopener">https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing</a>. ImReP consistently outperformed existing methods on IGH data in both recall and precision rates for the majority of simulated parameters. ImReP and MiXCR show similar performance on TCRA data and outperform other methods. Notably, ImReP was the only method with acceptable performance on IGH data at 50bp read length, reconstructing with a higher precision rate significantly more CDR3 clonotypes than other methods.</p></blockquote><p>Because I only have regular RNA-seq data (non-enriched and/or randomly sheared (c)DNA libraries), <code>ImRep</code> and <code>MiXCR</code> are the only two tools I want to try.</p><h2 id="mixcr"><a class="markdownIt-Anchor" href="#mixcr"></a> MiXCR</h2><h3 id="intro"><a class="markdownIt-Anchor" href="#intro"></a> Intro</h3><blockquote><p>MiXCR is a universal framework that processes big immunome data from raw sequences to quantitated clonotypes. MiXCR efficiently handles paired- and single-end reads, considers sequence quality, corrects PCR errors and identifies germline hypermutations. The software supports both partial- and full-length profiling and employs all available RNA or DNA information, including sequences upstream of V and downstream of J gene segments. 
(<a href="https://mixcr.readthedocs.io/en/latest/" target="_blank" rel="noopener">https://mixcr.readthedocs.io/en/latest/</a>)</p></blockquote><p><code>MiXCR</code> has very nice docs: <a href="https://mixcr.readthedocs.io/en/latest" target="_blank" rel="noopener">https://mixcr.readthedocs.io/en/latest</a>. See them for full instructions.</p><blockquote><p>Typical MiXCR workflow consists of three main processing steps:</p><ul><li>align: align sequencing reads to reference V, D, J and C genes of T- or B- cell receptors</li><li>assemble: assemble clonotypes using alignments obtained on previous step (in order to extract specific gene regions e.g. CDR3)</li><li>export: export alignment (<code>exportAlignments</code>) or clones (<code>exportClones</code>) to human-readable text file</li></ul></blockquote><img src="/blog/2018/03/Upstream-analysis-of-TCR-BCR-repertoires-in-RNA-seq-data/mixcr_work_1521857914_11153.png" title="mixcr_workflow"><p><strong>Enriched RepSeq Data</strong></p><blockquote><p>Here is a very simple usage example that will extract repertoire data (in the form of clonotypes list) from raw sequencing data of enriched RepSeq library:</p><blockquote><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">mixcr align -r log.txt input_R1.fastq.gz input_R2.fastq.gz alignments.vdjca</span><br><span class="line">mixcr assemble -r log.txt alignments.vdjca clones.clns</span><br><span class="line">mixcr exportClones clones.clns clones.txt</span><br></pre></td></tr></table></figure></blockquote></blockquote><blockquote><p>this will produce a tab-delimited list of clones (<code>clones.txt</code>) assembled by their CDR3 sequences with extensive information on their abundances, V, D and J genes, mutations in germline regions, topology of VDJ junction etc.</p></blockquote><p><strong>Repertoire extraction from 
RNA-Seq</strong></p><blockquote><p>MiXCR is equally effective in extraction of repertoire information from non-enriched data, like RNA-Seq or WGS. This example illustrates usage for RNA-Seq:</p><blockquote><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">mixcr align -p rna-seq -r log.txt input_R1.fastq.gz input_R2.fastq.gz alignments.vdjca</span><br><span class="line">mixcr assemblePartial alignments.vdjca alignment_contigs.vdjca</span><br><span class="line">mixcr assemble -r log.txt alignment_contigs.vdjca clones.clns</span><br><span class="line">mixcr exportClones clones.clns clones.txt</span><br></pre></td></tr></table></figure></blockquote></blockquote><h3 id="install"><a class="markdownIt-Anchor" href="#install"></a> Install</h3><ul><li>download the latest stable <code>MiXCR</code> build from the <a href="https://github.com/milaboratory/mixcr/releases/latest" target="_blank" rel="noopener">release page</a></li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">unzip mixcr-2.1.9.zip && cd mixcr-2.1.9</span><br><span class="line">$ ./mixcr -h</span><br><span class="line">Usage: mixcr [options] [command] [command options]</span><br><span class="line"> Options:</span><br><span class="line"> -h, --help</span><br><span class="line"> Displays this help message.</span><br></pre></td></tr></table></figure><h3 id="run"><a class="markdownIt-Anchor" href="#run"></a> Run</h3><p>Pipelines from <a href="https://github.com/milaboratory/mixcr" target="_blank" rel="noopener">https://github.com/milaboratory/mixcr</a> and <a 
href="http://mixcr.readthedocs.io/en/latest/rnaseq.html" target="_blank" rel="noopener">http://mixcr.readthedocs.io/en/latest/rnaseq.html</a> are not exactly the same. I used the latter.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">path2MiXCR=$TOOLDIR/mixcr-2.1.9/mixcr</span><br><span class="line">MiXCR_output=${WORKDIR}/MiXCR/human/${sample}</span><br><span class="line"></span><br><span class="line">PPN=20</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR align -t <span class="variable">$PPN</span> -r <span class="variable">$MiXCR_output</span>/log.txt -p rna-seq -s hsa -OallowPartialAlignments=<span class="literal">true</span> <span class="variable">$WORKDIR</span>/clean_fastq/human/<span class="variable">${sample}</span>_R1.fastq.gz <span class="variable">$WORKDIR</span>/clean_fastq/human/<span class="variable">${sample}</span>_R2.fastq.gz <span class="variable">$MiXCR_output</span>/alignments.vdjca</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR assemblePartial -r <span class="variable">$MiXCR_output</span>/log.txt <span class="variable">$MiXCR_output</span>/alignments.vdjca <span class="variable">$MiXCR_output</span>/alignments_rescued_1.vdjca</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR assemblePartial -r <span class="variable">$MiXCR_output</span>/log.txt <span class="variable">$MiXCR_output</span>/alignments_rescued_1.vdjca <span 
class="variable">$MiXCR_output</span>/alignments_rescued_2.vdjca</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR extendAlignments -r <span class="variable">$MiXCR_output</span>/log.txt <span class="variable">$MiXCR_output</span>/alignments_rescued_2.vdjca <span class="variable">$MiXCR_output</span>/alignments_rescued_2_extended.vdjca</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR assemble -r <span class="variable">$MiXCR_output</span>/log.txt -t <span class="variable">$PPN</span> <span class="variable">$MiXCR_output</span>/alignments_rescued_2_extended.vdjca <span class="variable">$MiXCR_output</span>/clones.clns</span></span><br><span class="line"><span class="meta">$</span><span class="bash">path2MiXCR exportClones <span class="variable">$MiXCR_output</span>/clones.clns <span class="variable">$MiXCR_output</span>/clones.txt</span></span><br></pre></td></tr></table></figure><p>And the output looks like this:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> head -n 3 clones.txt</span></span><br><span class="line">cloneIdcloneCountcloneFractionclonalSequenceclonalSequenceQualityallVHitsWithScoreallDHitsWithScoreallJHitsWithScoreallCHitsWithScoreallVAlignmentsallDAlignmentsallJAlignmentsallCAlignmentsnSeqFR1minQualFR1nSeqCDR1minQualCDR1nSeqFR2minQualFR2nSeqCDR2minQualCDR2nSeqFR3minQualFR3 nSeqCDR3minQualCDR3nSeqFR4minQualFR4aaSeqFR1aaSeqCDR1aaSeqFR2aaSeqCDR2aaSeqFR3aaSeqCDR3aaSeqFR4refPoints</span><br><span 
class="line">01300.008044056679660912TGCTGCTCATATGCAGGCAGCTACACTTGGGTGTTCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIGLV2-11*00(498.3)IGLJ3*00(155.3)IGLC2*00(628.6),IGLC3*00(628.3),IGLC7*00(558)324|352|374|0|28||140.020|30|58|26|36||50.0;;TGCTGCTCATATGCAGGCAGCTACACTTGGGTGTTC37CCSYAGSYTWVF:::::::::0:-2:28:::::26:0:36:::</span><br><span class="line">1620.003836396262607512TGCAGCTCATATACAAGCAGCAGCACTTTCGTCTTCFFFFFFFFFFFFFFFFFFFFFFFFFNNNNNNNNNNNIGLV2-14*00(599.8)IGLJ1*00(118.3)IGLC1*00(446.2)324|355|374|0|31|SC351T|139.024|30|58|30|36||30.0TGCAGCTCATATACAAGCAGCAGCACTTTCGTCTTC37CSSYTSSSTFVF:::::::::0:1:31:::::30:-4:36:::</span><br></pre></td></tr></table></figure><p>No ideas what to do next… I will try some post-analysis tools to explore the clonetypes.</p><h2 id="imrep"><a class="markdownIt-Anchor" href="#imrep"></a> ImRep</h2><h3 id="intro-2"><a class="markdownIt-Anchor" href="#intro-2"></a> Intro</h3><blockquote><p>ImReP is a method for rapid and accurate profiling of the adaptive immune repertoires from regular RNA-Seq data.</p></blockquote><img src="/blog/2018/03/Upstream-analysis-of-TCR-BCR-repertoires-in-RNA-seq-data/imrep_work_1521858447_27255.png" title="imrep_workflow"><h3 id="install-2"><a class="markdownIt-Anchor" href="#install-2"></a> Install</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/mandricigor/imrep.git && cd imrep</span><br><span class="line">./install.sh</span><br><span class="line"></span><br><span class="line"><span class="meta">$</span><span 
class="bash"> python imrep.py -h</span></span><br><span class="line"></span><br><span class="line">usage: python2 imrep.py [-h] [--fastq] [--bam] [--chrFormat2] [--hg38]</span><br><span class="line"> [-a ALLREADS] [--digGold] [-s SPECIES] [-o OVERLAPLEN]</span><br><span class="line"> [--noOverlapStep] [--extendedOutput] [-c CHAINS]</span><br><span class="line"> [--noCast] [-f FILTERTHRESHOLD]</span><br><span class="line"> [--minOverlap1 MINOVERLAP1]</span><br><span class="line"> [--minOverlap2 MINOVERLAP2] [--misMatch1 MISMATCH1]</span><br><span class="line"> [--misMatch2 MISMATCH2]</span><br><span class="line"> reads_file output_clones</span><br></pre></td></tr></table></figure><h3 id="run-2"><a class="markdownIt-Anchor" href="#run-2"></a> Run</h3><p>Then I ran into an awkward situation.</p><p><code>ImReP</code> is currently designed to handle two cases:</p><ul><li>When you have saved <strong>mapped and unmapped</strong> reads in one <code>BAM</code> file, <code>ImReP</code> can take that single <code>BAM</code> as input.</li></ul><blockquote><p>Given the bam file with mapped and unmapped reads, you can run ImReP using this command.</p><blockquote><p><code>python imrep.py --bam example/toyExample.bam example/toyExample.cdr3</code></p></blockquote></blockquote><ul><li>When you forgot to save unmapped reads, <code>ImReP</code> can accept a <code>BAM</code> file with mapped reads plus a <code>FASTQ</code> file of all raw reads as input.</li></ul><blockquote><p>Forgot to save unmapped reads, we got you covered. Use <code>--digGold</code> and <code>-a</code> options. 
For example:</p><blockquote><p><code>python imrep.py --digGold -a example/toyExample_allReads.fastq example/toyExample_onlyMapped.bam example/toyExample.cdr3</code></p></blockquote></blockquote><p>I’ve aligned the <code>FASTQ</code> files to the genome with <code>STAR</code> and saved the unmapped reads in <code>FASTQ</code> format (using <code>--outReadsUnmapped Fastx</code>).</p><p>The author also notes:</p><blockquote><p>Some mapping tools produce partially-mapped reads (i.e. STAR). In case read is mapped to BCR or TCR genes and is partially mapped to V or J gene, such read may be used to assemble full-length CDR3 sequences. Considering only unmapped reads will result in missing such reads.</p></blockquote><p>My unmapped reads are in a separate <code>FASTQ</code> file rather than in the <code>BAM</code>, so the first case doesn’t suit me; I have to follow the second.</p><p>That leaves two questions:</p><ul><li>Can I feed <code>ImReP</code> the <code>BAM</code> plus only the unmapped reads, rather than all raw reads?</li><li>It seems <code>ImReP</code> accepts only a single <code>FASTQ</code> as input. Should I concatenate the two <code>FASTQ</code> files of paired-end data, or does it only work for single-end data?</li></ul><p>I opened an issue with the author: <a href="https://github.com/mandricigor/imrep/issues/32" target="_blank" rel="noopener">unmapped reads in fastq/fasta format and pair-end data</a>.</p><p>And he suggested:</p><blockquote><p>Please merge PE into one file. Also to use --digGold, you need to provide original reads, not the unmapped reads. Please let me know how it goes. If this doesn’t’ work for you, we can implement the option to allow to supply bam with mapped and FASTQ with unmapped (this is on our TODO list anyway). Thanks, Serghei</p></blockquote><p>From now on I should run <code>STAR</code> with the <code>--outSamUnmapped Within</code> option. 
I don’t plan to re-align the reads for now.</p><p>Moreover, with <code>MiXCR</code> I can easily use post-analysis tools such as <code>VDJtools</code>.</p><h2 id="change-notes"><a class="markdownIt-Anchor" href="#change-notes"></a> Change notes</h2><ul><li>20180324: create the note.</li></ul><hr class="footnotes-sep"><section class="footnotes"><ol class="footnotes-list"><li id="fn1" class="footnote-item"><p>Mangul S, Mandric I, Yang HT, Strauli N, Montoya D, Rotman J, Wey WVD, Ronas JR, Statz B, Zelikovsky A, et al. Profiling adaptive immune repertoires across multiple human tissues by RNA Sequencing. bioRxiv. 2017 Mar 25:089235. doi:10.1101/089235 <a href="#fnref1" class="footnote-backref">↩︎</a></p></li><li id="fn2" class="footnote-item"><p>Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, Chudakov DM. MiXCR: software for comprehensive adaptive immunity profiling. Nature Methods. 2015;12(5):380–381. doi:10.1038/nmeth.3364 <a href="#fnref2" class="footnote-backref">↩︎</a></p></li><li id="fn3" class="footnote-item"><p>Li B, Li T, Wang B, Dou R, Pignon J-C, Choueiri TK, Signoretti S, Liu JS, Liu XS. Ultrasensitive detection of TCR hypervariable region in solid-tissue RNA-seq data. bioRxiv. 2016 Sep 5:073395. doi:10.1101/073395 <a href="#fnref3" class="footnote-backref">↩︎</a></p></li><li id="fn4" class="footnote-item"><p>Stubbington MJT, Lönnberg T, Proserpio V, Clare S, Speak AO, Dougan G, Teichmann SA. T cell fate and clonality inference from single-cell transcriptomes. Nature Methods. 2016;13(4):329–332. doi:10.1038/nmeth.3800 <a href="#fnref4" class="footnote-backref">↩︎</a></p></li><li id="fn5" class="footnote-item"><p>Mose LE, Selitsky SR, Bixby LM, Marron DL, Iglesia MD, Serody JS, Perou CM, Vincent BG, Parker JS. Assembly-based inference of B-cell receptor repertoires from short read RNA sequencing data with V’DJer. Bioinformatics. 2016;32(24):3729–3734. 
doi:10.1093/bioinformatics/btw526 <a href="#fnref5" class="footnote-backref">↩︎</a></p></li><li id="fn6" class="footnote-item"><p>Strauli NB, Hernandez RD. Statistical inference of a convergent antibody repertoire response to influenza vaccine. Genome Medicine. 2016;8:60. doi:10.1186/s13073-016-0314-z <a href="#fnref6" class="footnote-backref">↩︎</a></p></li><li id="fn7" class="footnote-item"><p>Warren RL, Nelson BH, Holt RA. Profiling model T-cell metagenomes with short reads. Bioinformatics (Oxford, England). 2009;25(4):458–464. doi:10.1093/bioinformatics/btp010 <a href="#fnref7" class="footnote-backref">↩︎</a></p></li></ol></section>]]></content>
<summary type="html">
<p><strong>Background</strong>: the transcriptome data I recently worked with is closely related to immunity, so I want to dig out som
</summary>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/"/>
<category term="immune" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/immune/"/>
<category term="TCR/BCR" scheme="https://yiweiniu.github.io/blog/categories/RNA-seq/immune/TCR-BCR/"/>
<category term="bio-tools" scheme="https://yiweiniu.github.io/blog/tags/bio-tools/"/>
<category term="RNA-seq" scheme="https://yiweiniu.github.io/blog/tags/RNA-seq/"/>
<category term="immune" scheme="https://yiweiniu.github.io/blog/tags/immune/"/>
<category term="TCR/BCR" scheme="https://yiweiniu.github.io/blog/tags/TCR-BCR/"/>
</entry>
</feed>