Skip to content

Commit beb90e5

Browse files
committed
Update to v0.7.0
1 parent d7a8676 commit beb90e5

30 files changed

Lines changed: 3329 additions & 201 deletions

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,17 @@
11
# Change Log
22

3+
## v0.7.0 - 2025-12-30
4+
5+
- Phasing update to add haplotag annotations to remapped read output (CR-551)
6+
- Automatically set standard phasing `HP` and `PS` tags on remapped reads where a simple diploid contig mapping
7+
pattern occurs on the reference genome.
8+
- Phasing annotation method can be adjusted by setting the input mode with the new `--input-assembly-mode` argument:
9+
- `fully-phased` mode assumes that the input contigs represent a phased haplotype
10+
- `partially-phased` mode assumes that the input contigs contain switch errors, and finds phase-sets from the
11+
underlying read support
12+
- If updating from a previous portello version, note that the read's assembly contig segment information is now moved
13+
to the custom `ZS` tag ( this previously annotated using the `PS` tag)
14+
315
## v0.6.1 - 2025-12-16
416

517
- Fix error message to describe chromosome consistency issues between assembly-to-ref alignments and the reference fasta [PacificBiosciences/portello#1]

Cargo.lock

Lines changed: 27 additions & 4 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "portello"
3-
version = "0.6.1"
3+
version = "0.7.0"
44
authors = ["Chris Saunders <csaunders@pacificbiosciences.com>"]
55
description = "Projects read alignments from personal assembly to reference"
66
edition = "2024"
@@ -10,6 +10,7 @@ license-file="LICENSE.md"
1010
vergen-gitcl = "1"
1111

1212
[dependencies]
13+
bio = "2"
1314
camino = "1"
1415
chrono = "0.4"
1516
clap = { version = "4", features = ["derive", "suggestions"] }
@@ -22,4 +23,5 @@ rayon = "1"
2223
rust-htslib = { version = "0.50", default-features = false, features = ["bzip2", "lzma"] }
2324
rust-vc-utils = { path="lib/rust-vc-utils" }
2425
simple-error = "0.3"
26+
strum = { version = "0.27", features = ["derive"] }
2527
unwrap = "1"

LICENSE-THIRDPARTY.json

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1261,7 +1261,7 @@
12611261
},
12621262
{
12631263
"name": "portello",
1264-
"version": "0.6.1",
1264+
"version": "0.7.0",
12651265
"authors": "Chris Saunders <csaunders@pacificbiosciences.com>",
12661266
"repository": null,
12671267
"license": null,
@@ -1718,6 +1718,15 @@
17181718
"license_file": null,
17191719
"description": "Helpful macros for working with enums and strings"
17201720
},
1721+
{
1722+
"name": "strum",
1723+
"version": "0.27.2",
1724+
"authors": "Peter Glotfelty <peter.glotfelty@microsoft.com>",
1725+
"repository": "https://github.com/Peternator7/strum",
1726+
"license": "MIT",
1727+
"license_file": null,
1728+
"description": "Helpful macros for working with enums and strings"
1729+
},
17211730
{
17221731
"name": "strum_macros",
17231732
"version": "0.26.4",
@@ -1727,6 +1736,15 @@
17271736
"license_file": null,
17281737
"description": "Helpful macros for working with enums and strings"
17291738
},
1739+
{
1740+
"name": "strum_macros",
1741+
"version": "0.27.2",
1742+
"authors": "Peter Glotfelty <peter.glotfelty@microsoft.com>",
1743+
"repository": "https://github.com/Peternator7/strum",
1744+
"license": "MIT",
1745+
"license_file": null,
1746+
"description": "Helpful macros for working with enums and strings"
1747+
},
17301748
{
17311749
"name": "syn",
17321750
"version": "1.0.109",

README.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,13 @@ reads are grouped by their associated assembly contig, to create a phased-like v
3030

3131
See the [user guide](docs/user_guide.md) for details on getting started with portello.
3232

33+
## Reference
34+
35+
Portello is still under development. Some background and performance information is available from our 2025 CSHL Genome
36+
Informatics poster available/citable from Zenodo here:
37+
38+
[Saunders, C., Kronenberg, Z., Holt, J., Rowell, W., & Eberle, M. (2025). Portello: Making global assembly more effective for rare-disease WGS. CSHL Genome Informatics, CSHL. Zenodo.](https://doi.org/10.5281/zenodo.17553274)
39+
3340
## DISCLAIMER
3441

3542
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

docs/user_guide.md

Lines changed: 70 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,9 @@ The expected workflow to use it is:
2222
- Align sample reads directly to consolidated assembly contigs using pbmm2
2323
- Run portello on the above two alignment outputs to lift sample read mappings from the assembly to the reference.
2424

25+
Note that the choice of mappers and mapping parameters can have a substantial impact on downstream small variant calling
26+
quality. It is recommended to carefully review the corresponding [Inputs](#inputs) section before starting.
27+
2528
## Getting Started
2629

2730
### Installation
@@ -32,16 +35,16 @@ release tarball, or via conda as described below.
3235
#### Install from GitHub
3336

3437
To install portello from github, download the latest release tarball compiled for 64-bit Linux on the [github release
35-
channel](https://github.com/PacificBiosciences/portello/releases/latest), then unpack the tar file. Using v0.6.1 as an
38+
channel](https://github.com/PacificBiosciences/portello/releases/latest), then unpack the tar file. Using v0.7.0 as an
3639
example, the tar file can be downloaded and unpacked as follows:
3740

38-
wget https://github.com/PacificBiosciences/portello/releases/download/v0.6.1/portello-v0.6.1-x86_64-unknown-linux-gnu.tar.gz
39-
tar -xzf portello-v0.6.1-x86_64-unknown-linux-gnu.tar.gz
41+
wget https://github.com/PacificBiosciences/portello/releases/download/v0.7.0/portello-v0.7.0-x86_64-unknown-linux-gnu.tar.gz
42+
tar -xzf portello-v0.7.0-x86_64-unknown-linux-gnu.tar.gz
4043

4144
The portello binary is found in the `bin/` directory of the unpacked file distribution. This can be run with the help
4245
option to test the binary and review latest usage details:
4346

44-
portello-v0.6.1-x86_64-unknown-linux-gnu/bin/portello --help
47+
portello-v0.7.0-x86_64-unknown-linux-gnu/bin/portello --help
4548

4649
#### Install from conda
4750

@@ -71,14 +74,16 @@ portello \
7174
--ref $ref_fasta \
7275
--assembly-to-ref $asm_to_ref_bam \
7376
--read-to-assembly $read_to_asm_bam \
77+
--input-assembly-mode partially-phased \
7478
--remapped-read-output - \
7579
--unassembled-read-output unassembled.bam |\
7680
samtools sort -@$threads - --write-index -o remapped.sort.bam
7781
```
7882

79-
When viewing the output `remapped.sort.bam` in IGV, it is recommended to group on the `PS` tag, to see a pseudo-phased
80-
view of the read alignments by grouping the reads to their respective assembly contigs. See the [Tags](#tags) section
81-
below for details of the `PS` tag string.
83+
The output `remapped.sort.bam` will include standard haplotagging annotations using the `PS` and `HP` tags, which should
84+
help to visualize typical diploid regions. For more complicated regions, it is best to group reads in IGV by the `ZS`
85+
tag, which annotates the corresponding contig alignment segment for each read. See the [Tags](#tags) section below for
86+
details of the `ZS` tag string.
8287

8388
## Inputs
8489

@@ -94,6 +99,14 @@ trio-binning. Portello has been tested with such inputs from both [hifiasm](http
9499
For either assembler, the contigs for each haplotype should be extracted to fasta format, and the two sets of
95100
haplotype specific contigs should be concatenated into a single fasta for use in the alignment steps below.
96101

102+
Note that the type of input contigs need to be conveyed to portello as well so that its phasing/read-haplotagging
103+
routines can be appropriately set. Use `--input-assembly-mode partially-phased` for partially-phased/dual assembly
104+
output and `--input-assembly-mode fully-phased` otherwise. See the [Phasing](#phasing) section below for more details.
105+
106+
Note that while portello has been primarily tested with assemlby contigs, it can work with unitig input as well. Phasing
107+
and small variant calling quality appear to be lower from unitig input compared to the partially-phased contig output
108+
from the same assembly.
109+
97110
### Read-to-assembly alignments
98111

99112
For the read-to-assembly alignment input, all sequencing reads should be mapped to the concatenated diploid assembly
@@ -143,7 +156,7 @@ left-shifted indels. Portello will take steps to preserve this left-shifting in
143156
## Outputs
144157

145158
All BAM outputs from portello are unsorted. The two bam output files are for "unassembled" and "remapped" reads. Every
146-
input read should be represented by exactly one primary read in one of the two outputs files. Both of these output files
159+
input read should be represented by exactly one primary read in one of the two output files. Both of these output files
147160
may contain unmapped reads, but they're separated into two separate files because the unmapped reads in each file have
148161
different interpretations relevant to downstream processing steps, as discussed below.
149162

@@ -162,24 +175,31 @@ the liftover reads.
162175

163176
This output contains all reads that mapped to the assembly contigs, and have been lifted over to the target reference
164177
genome. Unmapped reads in this file should correspond to assembly contig regions which did not map to the reference, so
165-
may represent population-specific and large insertion sequences.
178+
may represent population-specific and large non-templated insertion sequences.
166179

167180
## Secondary analysis on portello remapped output
168181

182+
### Small variant calling
183+
169184
Portello remapped read output has been tested on DeepVariant and some other secondary analysis tools designed for
170185
conventional pbmm2-mapped BAM inputs. Results generally seem very good. DeepVariant, run with current best-practice
171186
models for conventionally mapped BAM inputs, typically shows slightly higher accuracy on portello BAMs. Although the
172187
primary benefits of assembly-based liftover are for structural variants and larger-scale events rather than small
173188
variant calling, this DeepVariant performance trend demonstrates that portello alignments are properly aligned in
174189
other contexts as well.
175190

176-
Note that although we recommend viewing portello alignments in IGV with reads grouped by the `PS` tag, we recommend that
177-
you **Interpret the PS tag with care**. The PS tag in the portello output shows which contig each read was mapped to,
178-
and more specifically which split read segment of the contig it was mapped to. When visualizing portello BAMs in IGV it
179-
is extremely useful to show a pseudo-phased view of the reads by grouping them on this tag. Note, however that each
180-
assembly algorithm has a different approach to contig contiguity through LOH regions, so a read segment with one PS tag
181-
is not equivalent to a single haplotype phase block, since it could continue through large LOH regions with possible
182-
phase switching.
191+
### Phasing / IGV visualization
192+
193+
For viewing in IGV, portello provides haplotagged BAM output automatically, using the read to contig mappings to
194+
implement a variation on read-backed phasing, with this logic adjusted appropriately for either partially-phased or
195+
fully-phased contig input. This means that typical phased read viewing patterns relying on `PS` and `HP` tag annotations
196+
on the reads will work in typical diploid regions.
197+
198+
For viewing the reads in IGV in more complex regions, the reads can be grouped by the `ZS` tag, which annotates each read
199+
according to the assembly contig alignment segment it was aligned to. This creates a phasing-like view of the reads, but the viewer
200+
should note the potential for phasing switch errors when these annotation are derived from partially-phased contigs.
201+
202+
See the [Phasing](#phasing) and [Tags](#tags) sections below for more details.
183203

184204
## Other Notes
185205

@@ -213,11 +233,43 @@ The MAPQ value of liftover read alignment segments is taken form the contig spli
213233
segment was mapped to. The read's original MAPQ to the contig alignments is stored in the record under `ZM`. This will
214234
often be zero, which is most often a reflection of ambiguous alignment between the 2 parental contigs.
215235

236+
### Phasing
237+
238+
Portello will automatically add haplotagging annotations (`HP` and `PS` tags) to the remapped read output based on the
239+
read-to-contig alignments. These annotations are currently restricted to simple diploid regions where exactly two contig
240+
alignment segments span the reference. The phasing strategy will change based on whether the input assembly contigs are
241+
fully-phased or partially-phased (as conveyed to portello by the command-line `--input-assembly-mode` setting).
242+
243+
For fully-phased contigs, the phasing annotation will create phase-set blocks spanning each region where two contig
244+
alignment segments span the reference, assuming the contigs themselves are free of switch errors.
245+
246+
For partially-phased contigs, the phasing annotation method will start with the larger phase-set annotations from the
247+
fully-phased annotation mode, but then further subdivide phase-set blocks according to whether any reads span
248+
heterozygous variant sites. The method is similar to mapping-based read backed phasing except that the read support
249+
is evaluated based on alignments to the contigs (the phasing evaluation occurs before read-to-reference re-mapping).
250+
The partially-phased haplotagging mode will slightly increase portello'w runtime and memory usage.
251+
252+
#### Phased heterozygous VCF output
253+
254+
When run with the `partiall-phased` input assembly mode, an optional VCF file can be produced containing all
255+
phase-eligible heterozygous small variants with their corresponding phasing annotations. To produce this VCF, provide a
256+
file prefix using the `--phased-het-vcf-prefix` argument. The VCF output produced by this procedure is only intended for
257+
debugging and benchmarking the BAM haplotag output -- the heterozygous variants provided in this output file (and used
258+
for the phasing procedure) are produced directly from differences in the assembly contig sequencers, so should not be as
259+
accurate as the output of a dedicated small variant calling method such as DeepVariant.
260+
216261
### Tags
217262

218-
The following diagnostic tags added to the remapped read output, these are likely to change in future:
263+
The following standard haplotagging values are added to the remapped read output:
264+
265+
`HP` - The `HP` tag with value 1 or 2 is added in regions with coverage by two contig alignment segments.
266+
267+
`PS` - The `PS` tag provides the phase set id in diploid regions. The PS value corresponds to the start position of the
268+
first heterozygous variant in the phase set block.
269+
270+
The following custom diagnostic tags added to the remapped read output, these could change in future updates:
219271

220-
`PS` - This tag can be used for read grouping in IGV to create a pseudo-phased view of the alignments. It is
272+
`ZS` - This tag can be used for read grouping in IGV to create a pseudo-phased view of the alignments. It is
221273
`{contig_name}_split{split_alignment_no}{strand}`, where `contig_name` is the contig the read aligned to, `split
222274
alignment no` refers to which of that contig's (colinear-joined) split alignment segments the read aligned to, and
223275
`strand` is "+" or "-" for the contig alignments orientation to the reference.

lib/rust-vc-utils/CHANGELOG.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,20 @@
11
## v0.33.0 - 2025-XX-XX
22

33
### Added
4+
- Add DepthTracker from sawfish
5+
- Add RegionMap from sawfish
46
- Add drop_true from sawfish
57
- New methods to remove clipped regions from cigars
68

79
### Changed
8-
910
- In all cigar shift routines, reverse cigar indel cluster ordering such that insertion state is always ordered first
1011
- Add method to restrict GenomeRef sequence characters
1112
- Make reverse comp methods tolerant to unexpected characters
1213
- Still not supporting IUPAC ambiguity codes
1314

15+
### Fixed
16+
- Fix off-by-one issue in IntRange::intersect_range
17+
1418
## v0.32.0 - 2025-08-07
1519

1620
### Added

lib/rust-vc-utils/src/bam_utils/cigar/clip_alignment.rs

Lines changed: 2 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
//! Alignment clipping facilities for read or ref based clip sizes
22
//!
33
4-
use super::{compress_cigar, update_ref_and_read_pos};
4+
use super::{compress_cigar, update_ref_and_read_pos, update_ref_pos};
55
use rust_htslib::bam::record::Cigar;
66

77
/// Modify a cigar string so that the read is soft-clipped on the left side to achieve at least the specified reference
@@ -15,10 +15,7 @@ use rust_htslib::bam::record::Cigar;
1515
pub fn clip_alignment_ref_start(cigar_in: &[Cigar], min_left_ref_clip: i64) -> (Vec<Cigar>, i64) {
1616
use Cigar::*;
1717

18-
let ignore_hard_clip = false;
19-
2018
let mut ref_pos = 0;
21-
let mut read_pos = 0;
2219

2320
let mut cigar_out = Vec::new();
2421
let mut left_ref_clip_shift = 0;
@@ -62,7 +59,7 @@ pub fn clip_alignment_ref_start(cigar_in: &[Cigar], min_left_ref_clip: i64) -> (
6259
cigar_out.push(*c);
6360
}
6461
};
65-
update_ref_and_read_pos(c, &mut ref_pos, &mut read_pos, ignore_hard_clip);
62+
update_ref_pos(c, &mut ref_pos);
6663
}
6764
(cigar_out, left_ref_clip_shift)
6865
}

0 commit comments

Comments
 (0)