Skip to content

Commit 1e5e7d2

Browse files
committed
CCS 4.0.0
1 parent c4e4e8e commit 1e5e7d2

2 files changed

Lines changed: 229 additions & 19 deletions

File tree

README.md

Lines changed: 229 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
_ccs_ takes multiple (sub)reads of the same SMRTbell molecule and combines
1010
them using a statistical model to produce one highly accurate consensus sequence,
1111
also called HiFi or CCS read, with base quality values.
12-
This tool powers the _Circular Consensus Sequences_ workflow in SMRT Link.
12+
This tool powers the _Circular Consensus Sequencing_ workflow in SMRT Link.
1313

1414
## Availability
1515
Latest `ccs` can be installed via bioconda package `pbccs`.
@@ -18,7 +18,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
1818
for information on Installation, Support, License, Copyright, and Disclaimer.
1919

2020
## Latest Version
21-
Version **3.4.1**: [Full changelog here](#changelog)
21+
Version **4.0.0**: [Full changelog here](#changelog)
2222

2323
## Schematic Workflow
2424
<p align="center"><img width="600px" src="img/ccs-workflow.png"/></p>
@@ -27,24 +27,116 @@ Version **3.4.1**: [Full changelog here](#changelog)
2727
**Input**: Subreads from a single movie in PacBio BAM format (`.subreads.bam`).
2828

2929
**Output**: Consensus reads in a format inferred from the file extension:
30-
unaligned BAM (`.bam`); FASTQ (`.fastq`);
30+
unaligned BAM (`.bam`); bgzipped FASTQ (`.fastq.gz`);
3131
or SMRT Link XML (`.consensusreadset.xml`) which also generates a corresponding
3232
BAM file.
3333

3434
Run on a full movie:
3535

3636
ccs movie.subreads.bam movie.ccs.bam
3737

38+
Parallelize by using `--chunk`.
39+
See [how-to chunk](#how-can-I-parallelize-on-multiple-servers).
40+
3841
## FAQ
3942

4043
### What impacts the number and quality of CCS reads that are generated?
4144
The longer the polymerase read gets, more readouts (passes) of the SMRTbell
4245
are produced and consequently more evidence is accumulated per molecule.
4346
This increase in evidence translates into higher consensus accuracy, as
44-
depicted in the following sketch:
47+
depicted in the following plot:
4548

4649
<p align="center"><img width="600px" src="img/ccs-acc.png"/></p>
4750

51+
### How is number of passes computed?
52+
Each CCS read is annotated with a `np` tag that contains the number of
53+
full-length subreads used for polishing.
54+
Since the first version of _ccs_, number of passes has only accounted for
55+
full-length subreads. In version v3.3.0 windowing has been added, which
56+
takes the minimum number of full-length subreads across all windows.
57+
Starting with version v4.0.0, minimum has been replaced with mode to get a
58+
better representation across all windows.
59+
60+
### Which and in what order are filters applied?
61+
_ccs_ exposes the following filters on input subreads, draft consensus,
62+
and final output consensus:
63+
64+
Input Filter Options:
65+
--min-passes INT Minimum number of full-length subreads required to generate CCS for a ZMW. [3]
66+
--min-snr FLOAT Minimum SNR of subreads to use for generating CCS [2.5]
67+
68+
Draft Filter Options:
69+
--min-length INT Minimum draft length before polishing. [10]
70+
--max-length INT Maximum draft length before polishing. [50000]
71+
72+
Output Filter Options:
73+
--min-rq FLOAT Minimum predicted accuracy in [0, 1]. [0.99]
74+
75+
Data flow how each ZMW gets processed and filtered:
76+
1. Remove subreads with lengths <50% or >200% of the median subread length.
77+
2. Remove subreads with SNR below `--min-snr`.
78+
3. Stop if number of full-length subreads is fewer than `--min-passes`.
79+
4. Generate draft sequence and stop if draft length does not pass `--min-length` and `--max-length`.
80+
5. Polish consensus sequence and only emit CCS read if predicted accuracy is at least `--min-rq`.
81+
82+
### How do I read the ccs_report.txt file?
83+
The `ccs_report.txt` file summarizes (B) how many ZMWs generated CCS reads and
84+
(C) how many failed CCS generation because of the listed causes. For (C), each ZMW
85+
contributes exactly to one reason of failure; percentages are with respect to (C).
86+
87+
The following comments refer to the filters that are explained in the FAQ above.
88+
89+
ZMWs input (A) : 4779
90+
ZMWs generating CCS (B) : 1875 (39.23%)
91+
ZMWs filtered (C) : 2904 (60.77%)
92+
93+
Exclusive ZMW counts for (C):
94+
No usable subreads : 66 (2.27%) <- All subreads were filtered in (1)
95+
Below SNR threshold : 54 (1.86%) <- All subreads were filtered in (2)
96+
Lacking full passes : 2779 (95.70%) <- Less than --min-passes full-length reads (3)
97+
Heteroduplexes : 0 (0.00%) <- Single-strand artifacts
98+
Min coverage violation : 0 (0.00%) <- ZMW is damaged on one strand and can't be polished reliably
99+
Draft generation error : 5 (0.17%) <- Subreads don't agree to generate a draft sequence
100+
Draft above --max-length : 0 (0.00%) <- Draft sequence is longer than --min-length (4)
101+
Draft below --min-length : 0 (0.00%) <- Draft sequence is shorter than --min-length (4)
102+
Lacking usable subreads : 0 (0.00%) <- Too many subreads were dropped while polishing
103+
CCS did not converge : 0 (0.00%) <- Draft has too many errors that can't be polished in time
104+
CCS below minimum RQ : 0 (0.00%) <- Predicted accuracy is below --min-rq (5)
105+
Unknown error : 0 (0.00%) <- Rare implementation errors
106+
107+
### What is the definition of a heteroduplex?
108+
In general, whenever bases on one strand of the SMRTbell are not the
109+
reverse complement of the other strand, as small as a single base `A` with a
110+
matching `G`. _ccs_ would polish this to one of the bases and reflect the
111+
ambiguity in the base QV. In our case, when one strand has more than `20`
112+
additional bases that the other strand does not have, _ccs_ won't be able to
113+
converge to a consensus sequence, consequently will remove the ZMW and
114+
increase the counter for heteroduplexes found in the `ccs_report.txt` file.
115+
116+
### How can I parallelize on multiple servers?
117+
Parallelize by chunking. Since _ccs_ v4.0.0, direct chunking via `--chunk`
118+
is possible. For this, the `.subreads.bam` file must accompanied by a
119+
`.pbi` file. To generate the index `subreads.bam.pbi`, use
120+
`pbindex`, which can be installed with `conda install pbbam`.
121+
122+
pbindex movie.subreads.bam
123+
124+
An example workflow, all ccs invocations can run simultaneously:
125+
126+
ccs movie.subreads.bam movie.ccs.1.bam --chunk 1/10 -j <THREADS>
127+
ccs movie.subreads.bam movie.ccs.2.bam --chunk 2/10 -j <THREADS>
128+
...
129+
ccs movie.subreads.bam movie.ccs.10.bam --chunk 10/10 -j <THREADS>
130+
131+
Merge chunks with `pbmerge` and index with `pbindex`
132+
133+
pbmerge -o movie.ccs.bam movie.ccs.*.bam
134+
pbindex movie.ccs.bam
135+
136+
or use `samtools`
137+
138+
samtools merge -@8 movie.ccs.bam movie.ccs.*.bam
139+
48140
### What happened to unanimity?
49141
Unanimity lives on as a PacBio internal library to generate consensus sequences.
50142
Customer-facing documentation will be limited to _ccs_ distributed via bioconda.
@@ -53,20 +145,31 @@ Customer-facing documentation will be limited to _ccs_ distributed via bioconda.
53145
We have stopped mirroring code changes to GitHub in March 2018.
54146
Instead, we provide binaries on bioconda to ease end-user experience.
55147
If your project relies on outdated unanimity source code,
56-
please use [this commit](https://github.com/PacificBiosciences/ccs/tree/6f11a13e1472b8c00337ba8c5e94bf83bdab31d6).
148+
please use [this commit](https://github.com/PacificBiosciences/unanimity/tree/6f11a13e1472b8c00337ba8c5e94bf83bdab31d6).
57149

58150
### Help! I am getting "Unsupported ..."!
59151
If you encounter the error `Unsupported chemistries found: (...)` or
60-
`unsupported sequencing chemistry combination`, your _ccs_ binaries predates
61-
the used sequencing chemistry kit.
152+
`unsupported sequencing chemistry combination`, your _ccs_ binaries does not
153+
support the used sequencing chemistry kit, from here on refered to as "chemistry".
154+
This may be because we removed support of an older or your binary predates
155+
release of the used chemistry.
62156
This is unlikely to happen with _ccs_ from SMRT Link installations, as SMRT Link
63157
is able to automatically update and install new chemistries.
158+
Thus, easiest solution is to always use _ccs_ from the SMRT Link version that
159+
shipped with the release of the sequencing chemistry kit.
160+
161+
**Old chemistries:**
162+
With _ccs_ 4.0.0, we have removed support for the last RSII chemistry `P6-C4`.
163+
The only option is to downgrade _ccs_ with `conda install pbccs==3.4`.
64164

65-
If you are not an early access user, stop here and solve it via updating your
66-
_ccs_ with `conda update --all`.
165+
**New chemistries:**
166+
It might happen that your _ccs_ version predates the sequencing chemistry kit
167+
and there is an easy fix, install the latest version of _ccs_ with `conda update --all`.
168+
If you are an early access user, follow the [monkey patch tutorial](#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits).
67169

68-
If you are an early access user, please create a directory that is used to
69-
inject new chemistry information into _ccs_:
170+
### Monkey patch _ccs_ to support additional sequencing chemistry kits
171+
Please create a directory that is used to inject new chemistry information
172+
into _ccs_:
70173

71174
```sh
72175
mkdir -p /path/to/persistent/dir/
@@ -75,27 +178,134 @@ export SMRT_CHEMISTRY_BUNDLE_DIR="${PWD}"
75178
mkdir -p arrow
76179
```
77180

78-
To fix `unsupported sequencing chemistry combination` please download the latest
79-
out-of-band `chemistry.xml`:
181+
Execute the following step by step instructions to fix the error you are observing
182+
and afterwards proceed using _ccs_ as you would normally do. Additional chemistry
183+
information is automatically loaded from the `${SMRT_CHEMISTRY_BUNDLE_DIR}`
184+
environmental variable.
185+
186+
#### Error: "unsupported sequencing chemistry combination"
187+
Please download the latest out-of-band `chemistry.xml`:
80188

81189
```sh
82190
wget https://raw.githubusercontent.com/PacificBiosciences/pbcore/develop/pbcore/chemistry/resources/mapping.xml -O "${SMRT_CHEMISTRY_BUNDLE_DIR}"/chemistry.xml
83191
```
84192

85-
To fix `Unsupported chemistries found: (...)` please get the latest consensus
86-
model `.json` from PacBio and copy it to:
193+
#### Error: "Unsupported chemistries found: (...)"
194+
Please get the latest consensus model `.json` from PacBio and
195+
copy it to:
87196

88197
```sh
89198
cp /some/download/dir/model.json "${SMRT_CHEMISTRY_BUNDLE_DIR}"/arrow/
90199
```
91200

92-
Afterwards, proceed using _ccs_ as you would normally do. New chemistry
93-
information is automatically loaded from the `${SMRT_CHEMISTRY_BUNDLE_DIR}`
94-
environmental variable.
201+
### How fast is CCS?
202+
We tested CCS runtime using 1,000 ZMWs per length bin with exactly 10 passes.
203+
204+
<img width="600px" src="img/runtime.png"/>
205+
206+
#### How does that translate into time to result per SMRT Cell?
207+
We will measure time to result for Sequel I and II CCS sequencing collections
208+
on a PacBio recommended HPC, according to the
209+
[Sequel II System Compute Requirements](https://www.pacb.com/wp-content/uploads/SMRT_Link_Installation_v701.pdf)
210+
with 192 physical or 384 hyper-threaded cores.
211+
212+
1) Sequel I: 15 kb insert size, 30-hours movie, 37 GB raw yield, 2.3 GB CCS UMY
213+
2) Sequel II: 15 kb insert size, 30-hours movie, 340 GB raw yield, 24 GB CCS UMY
214+
215+
CCS version | Sequel I | Sequel II
216+
:-: | :-: | :-:
217+
≤3.0.0 | 1 day | >1 week
218+
3.4.1 | 3 hours | >1 day
219+
≥4.0.0 | **40 minutes** | **6 hours**
220+
221+
#### How is CCS speed affected by raw base yield?
222+
Raw base yield is the sum of all polymerase read lengths.
223+
A polymerase read consists of all subreads concatenated
224+
with SMRTbell adapters in between.
225+
226+
Raw base yield can be increased with
227+
1) higher percentage of single loaded ZMWs and
228+
2) longer movie times that lead to longer polymerase read lengths.
229+
230+
Since the first version, _ccs_ scaled linear in (1) the number of single loaded
231+
ZMWs per SMRT Cell.
232+
Starting with version 3.3.0 _ccs_ scaled linear in (2) the polymerase read length
233+
and with version 4.0.0 _ccs_ scales sublinear.
234+
235+
#### What did change in each version?
236+
CCS version | O(insert size) | O(#passes)
237+
:-: | :-: | :-:
238+
≤3.0.0 | quadratic | linear
239+
3.4.1 | **linear** | linear
240+
≥4.0.0 | linear | **sublinear**
241+
242+
#### How can version 4.0.0 be sublinear in the number of passes?
243+
With the introduction of new heuristics, individual draft bases can skip
244+
polishing if they are of sufficient quality.
245+
The more passes a ZMW has, the fewer bases need additional polishing.
246+
247+
### What heuristics are used?
248+
Following heuristics are enabled
249+
- determine which bases need polishing,
250+
- remove ZMWs with single-strand artifacts such as heteroduplexes
251+
- remove large insertions that likely are due to sequencing errors,
252+
- on-the-fly model caching with less SNR resolution,
253+
- adaptive windowing strategy with a target window size of 22 bp with ±2 bp overlaps, avoiding breaks in simple repeats (homopolymers to 4mer repeats)
254+
255+
### Does speed impact quality and yield?
256+
Yes it does. With ~35x speed improvements from version 3.1.0 to 4.0.0 and
257+
consequently reducing CPU time from >60,000 to <2,000 core hours,
258+
heuristics and changes in algorithms lead to slightly lower yield and
259+
accuracy if run head to head on the same data set. Internal tests show
260+
that _ccs_ 4.0.0 introduces no regressions in CCS-only Structural Variant
261+
calling and has minimal impact on SNV and indel calling in DeepVariant.
262+
In contrast, lower DNA quality has a bigger impact on quality and yield.
263+
264+
### Can I tune _ccs_ to get improve results?
265+
No, we optimized _ccs_ such that there is a good balance between speed and
266+
output quality.
267+
268+
### Can I produce one consensus sequence for each strand of a molecule?
269+
Yes, please use `--by-strand`. Make sure that you have sufficient coverage,
270+
as `--min-passes` are per strand in this case. For each strand, _ccs_
271+
generates one consensus read that has to pass all filters.
272+
Read name suffix indicates strand. Example:
273+
274+
m64011_190714_120746/14/ccs/rev
275+
m64011_190714_120746/35/ccs/fwd
276+
277+
### Is there a progress report?
278+
Yes. With `--log-level INFO`, _ccs_ provides status to `stderr` every
279+
`--refresh-rate seconds` (default 30):
280+
281+
#ZMWs, #CCS, #CPM, #CMT, ETA: 2689522, 1056330, 2806, 29.2, 4h 52m
282+
283+
In detail:
284+
* `#ZMWs`, number of ZMWs processed
285+
* `#CCS`, number of CCS reads generated
286+
* `#CPM`, number of CCS reads generated per minute
287+
* `#CMT`, number of CCS reads generated per minute per thread
288+
* `ETA`, estimated processing time left
289+
290+
If there is no `.pbi` file present, ETA will be omitted.
291+
292+
## Licenses
293+
PacBio® tool _ccs_, distributed via Bioconda, is licensed under
294+
[BSD-3-Clause-Clear](https://spdx.org/licenses/BSD-3-Clause-Clear.html)
295+
and statically links GNU C Library v2.29 licensed under [LGPL](https://spdx.org/licenses/LGPL-2.1-only.html).
296+
Per LPGL 2.1 subsection 6c, you are entitled to request the complete
297+
machine-readable work that uses glibc in object code.
95298

96299
## Changelog
97300

98-
* **3.4.1**
301+
* **4.0.0**:
302+
* SMRT Link v8.0 release candidate
303+
* Speed improvements
304+
* Removed support for legacy python Genomic Consensus, please use `conda install pbgcpp`
305+
* New command-line interface
306+
* New report file
307+
* 3.4.1
308+
* Released with SMRT Link v7.0
99309
* Log used chemistry model to INFO level
100310
* 3.4.0
101311
* Fixes to unpolished mode for IsoSeq

img/runtime.png

170 KB
Loading

0 commit comments

Comments
 (0)