Skip to content

Commit 1c265d7

Browse files
authored
Version 3.3.3 (#52)
- Fix bug that `min_variant_frequency` cannot be set lower than the default value - Fix minor bug in `ikbkg` that causes program to error out - Sort reads by name first to remove indeterminism in haplotype names - Do not write VCF when region is clearly not homozygous but no haplotypes are phased (most likely due to low depth) - Add the `phase_region` field in JSON output to report the coordinates of the analysis region and the genome build - Minor improvement on indel detection - Minor improvement on handling `gene1_cn2`, a scenario specified in the config asking Paraphase to assume a paralog group to always have two copies of gene1 - Improve documentation - Update NEB tutorial to clarify on the order of TRI1/2/3 - Update README to clarify that the `fusions_called` field is only reported for four regions - Update the targeted data tutorial to include more details on PureTarget
1 parent dce99f8 commit 1c265d7

19 files changed

Lines changed: 155 additions & 69 deletions

README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ For more details about Paraphase, please check out our latest [paper](https://ww
3535

3636
- Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023. doi:10.1016/j.ajhg.2023.01.001
3737

38-
For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when gene copies are highly similar to each other. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended as the read length is generally shorter than WGS.
38+
Paraphase supports both whole-genome sequencing (WGS) data and targeted sequencing data, including data generated from [PureTarget](https://www.pacb.com/technology/puretarget) panels. For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or shorter read length could result in less accurate phasing, especially when gene copies are highly similar to each other. See our [tutorial](docs/targeted_data.md) for more details on targeted data.
3939

4040
## Contact
4141

@@ -113,6 +113,7 @@ Paraphase produces a few output files in the directory specified by `-o`, with t
113113
- `haplotype_details`: lists information about each haplotype
114114
- `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
115115
- `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
116+
- `fusions_called`: deletions or duplications created by unequal crossing over between paralogous sequences, called by a special step that checks the flanking sequences of phased haplotypes. This step is currently enabled for four regions: CYP2D6, GBA, CYP11B1 and the CFH gene cluster.
116117

117118
Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below:
118119
- [SMN1/SMN2](docs/SMN1_SMN2.md)

docs/NEB.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ Paraphase resolves the triplicate (TRI) repeat region in NEB, where copy number
77
- `total_cn`: total copy number of the triplicate repeat
88
- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
99
- `alleles_final`: when possible, different copies of TRI are phased into alleles with read based phasing.
10+
- `repeat_name`: haplotypes are assigned to TRI1/TRI2/TRI3, which are the three copies of the repeat in the reference genome. Note that this is according to their order in the reference genome, i.e. the first copy of the repeat in the reference genome is TRI1 and the last copy is TRI3. Some studies assign TRI1/TRI2/TRI3 according to their order in the coding sequence, which is on the negative strand of the reference genome, thus a reverse order than what's reported by Paraphase.
1011

1112
## Visualizing haplotypes
1213

66.1 KB
Loading

docs/targeted_data.md

Lines changed: 45 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,53 @@
11
# Running Paraphase on targeted data
22

3-
Paraphase can work with targeted data, such as:
4-
- Hybrid capture based enrichment data
5-
- CRISPR-Cas9 targeted data
6-
- Amplicon data
3+
## Data types
74

8-
The config file may need to be modified based on the design of the target panel. Please reach out to Xiao Chen (xchen@pacificbiosciences.com) if you need assistance.
5+
Paraphase can work with targeted sequencing data, such as:
6+
- Shotgun type enrichment data, i.e. [hybrid capture based enrichment](https://www.pacb.com/wp-content/uploads/Twist-dark-regions-application-brief.pdf) data. For this data type, the read length is generally shorter than WGS, which may result in less accurate phasing, especially when gene copies are highly similar to each other. Therefore, we recommend sequencing to a higher depth (>50X) than WGS.
7+
- Amplicon data or CRISPR-Cas9 targeted data, such as [PureTarget]((https://www.pacb.com/technology/puretarget)). For this data type, the entire target region is fully contained in a read, making it easier to determine haplotypes compared to the shotgun type data. We generally recommend sequencing to per-haplotype depth of 8-15X. However, a higher coverage (>15X) may be desired for short target regions. This is because for short regions it is possible for two haplotypes in the same sample to be identical in sequence, requiring Paraphase to adjust the haplotype copy numbers based on the relative depth difference between these haplotypes (e.g. one haplotype has twice the supporting reads of another haplotype, indicating the presence of two identical copies). The higher depth is needed to perform the copy number adjustment more accurately.
8+
9+
## Config file
10+
11+
The Paraphase config file needs to be modified based on the design of the target panel. Take the panel design for CYP21A2/CYP21A1P as an example.
12+
13+
```yaml
14+
{
15+
cyp21:
16+
realign_region: chr6:32038085-32042687
17+
extract_regions: chr6:31980000-32046800
18+
left_boundary: 32038085
19+
right_boundary: 32042687
20+
}
21+
```
22+
23+
Each target region has an arbitrary name, e.g. `cyp21` here.
24+
25+
`realign_region` specifies the main region of interest and the coordinates should refect the panel design (`realign_region` can be equal to or slightly bigger than the panel design). This is where we want all reads to be realigned to, i.e. one region selected for a homology group (e.g. CYP21A2 region for the CYP21A2/CYP21A1P group).
26+
27+
`extract_regions` specifies the regions where we want to extract relevant reads from the input bam, i.e. all the homologous regions (e.g. CYP21A2 and CYP21A1P) (where the reads might align or misalign in the genome bam). Multiple regions can be provided, separated by space.
28+
29+
`left_boundary` and `right_boundary` are optional. If they are not provided, Paraphase will perform phasing between the start and end coordinates in `realign_region` and all positions between them will be reported in the VCF. If `left_boundary` and `right_boundary` are provided, Paraphase will perform phasing between `left_boundary` and `right_boundary`, reporting all positions between them in the VCF.
30+
31+
Please don't hesitate to reach out to Xiao Chen (xchen@pacificbiosciences.com) if you need assistance with config files.
32+
33+
## Command line options
934

1035
Paraphase provides a few options for users to better work with targeted data:
11-
1) Use the `--targeted` option to drop the assumption of uniform coverage across the genome.
36+
1) Use the `--targeted` option to drop the assumption of uniform coverage across the genome. With the `--targeted` option, Paraphase will not perform any depth normalization against the rest of the genome.
1237
2) Additionally there are two optional parameters designed for targeted data. The default values are expected to work well with high depth data since they are frequency-based. But users can tune them based on the depth of their data and the expected copy numbers of their regions of interest.
1338
- `--min-variant-frequency`: Minimum frequency for a variant to be used for phasing. The cutoff for variant-supporting reads is determined by max(5, total_depth * min_frequency). Note that total_depth is the combined depth of all paralogs for a paralog group. Default is 0.11.
1439
- `--min-haplotype-frequency`: Minimum frequency of unique supporting reads for a haplotype. The cutoff for haplotype-supporting reads is determined by max(4, total_depth * min_frequency). Note that total_depth is the combined depth of all paralogs for a paralog group. Default is 0.03. This cutoff can be increased to filter out spurious and low-frequency haplotypes.
40+
41+
## Examples
42+
43+
Below we will use PureTarget data as an example.
44+
45+
Paraphase can be run with the following command. Note the custom config file and the `--targeted` option.
46+
47+
```bash
48+
paraphase -b input.bam -o output_directory -r genome_fasta -c custom_config.yaml --targeted
49+
```
50+
51+
Paraphase reports four haplotypes for this sample (including CYP21A2 and CYP21A1P copies), and reads are separated by haplotypes in `.paraphase.bam`. Not that only HiFi reads (`rq`>=0.99) are used in Paraphase, even if the input Bam contains reads with lower rqs.
52+
53+
![puretarget example](figures/puretarget_cyp21a2.png)

paraphase/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "3.3.2"
1+
__version__ = "3.3.3"

paraphase/genes/cfc1_phaser.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ def call(self):
2525
genome_depth=self.mdepth,
2626
region_depth=self.region_avg_depth._asdict(),
2727
sample_sex=self.sample_sex,
28+
phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
2829
)
2930
self.get_homopolymer()
3031
self.get_candidate_pos()
@@ -105,4 +106,5 @@ def call(self):
105106
self.region_avg_depth._asdict(),
106107
self.sample_sex,
107108
self.init_het_sites,
109+
f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
108110
)

paraphase/genes/cfhclust.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ def call(self):
7373
None,
7474
None,
7575
None,
76+
f"{self.cfh["phase_region"]},{self.cfhr3["phase_region"]}",
7677
None,
7778
fusions,
7879
)

paraphase/genes/f8_phaser.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,7 @@ def call(self):
8989
genome_depth=self.mdepth,
9090
region_depth=self.region_avg_depth._asdict(),
9191
sample_sex=self.sample_sex,
92+
phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
9293
)
9394

9495
genome_bamh = pysam_handle(self.genome_bam, self.reference_fasta)
@@ -272,4 +273,5 @@ def call(self):
272273
self.region_avg_depth._asdict(),
273274
self.sample_sex,
274275
self.init_het_sites,
276+
f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
275277
)

paraphase/genes/hba_phaser.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,7 @@ def call(self):
6363
genome_depth=self.mdepth,
6464
region_depth=self.region_avg_depth._asdict(),
6565
sample_sex=self.sample_sex,
66+
phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
6667
)
6768
genome_bamh = pysam_handle(self.genome_bam, self.reference_fasta)
6869
surrounding_region_depth = self.get_regional_depth(
@@ -324,5 +325,6 @@ def call(self):
324325
self.region_avg_depth._asdict(),
325326
self.sample_sex,
326327
self.init_het_sites,
328+
f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
327329
alleles,
328330
)

paraphase/genes/ikbkg_phaser.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,7 @@ def call(self):
4646
genome_depth=self.mdepth,
4747
region_depth=self.region_avg_depth._asdict(),
4848
sample_sex=self.sample_sex,
49+
phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
4950
)
5051
self.get_homopolymer()
5152

@@ -135,7 +136,7 @@ def call(self):
135136
hap_name = f"{self.gene}_pseudohap{pseudo_counter}"
136137
elif clip_5p == self.clip_5p_positions[1]:
137138
dup_counter += 1
138-
tmp.setdefault(hap, f"{self.gene}_duphap{dup_counter}")
139+
hap_name = f"{self.gene}_duphap{dup_counter}"
139140
else:
140141
assert clip_5p == 0
141142
gene_counter += 1
@@ -242,5 +243,6 @@ def call(self):
242243
self.region_avg_depth._asdict(),
243244
self.sample_sex,
244245
self.init_het_sites,
246+
f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
245247
linked_haps,
246248
)

0 commit comments

Comments
 (0)