Version 3.3.3 (#52)

xiao-chen-xc · web-flow · commit 1c265d7717e4 · 2025-08-15T14:54:32.000-07:00
- Fix a bug where `min_variant_frequency` could not be set lower than the default value
- Fix a minor bug in `ikbkg` that caused the program to error out
- Sort reads by name first to remove nondeterminism in haplotype names
- Do not write VCF when region is clearly not homozygous but no haplotypes are phased (most likely due to low depth)
- Add the `phase_region` field in JSON output to report the coordinates of the analysis region and the genome build
- Minor improvement to indel detection
- Minor improvement in handling `gene1_cn2`, a scenario specified in the config that tells Paraphase to assume a paralog group always has two copies of gene1
- Improve documentation
  - Update NEB tutorial to clarify the order of TRI1/2/3
  - Update README to clarify that the `fusions_called` field is only reported for four regions
  - Update the targeted data tutorial to include more details on PureTarget
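
For readers consuming the new `phase_region` field, here is a minimal parsing sketch. It assumes the value follows the `{genome_build}:{chrom}:{left_boundary}-{right_boundary}` layout built by the f-strings in this diff, with multiple regions joined by commas as in the CFH cluster. The `PhaseRegion` type, the `parse_phase_region` helper, and the example string are hypothetical and not part of Paraphase.

```python
from typing import List, NamedTuple


class PhaseRegion(NamedTuple):
    """One analysis region as reported in `phase_region` (hypothetical helper type)."""

    genome_build: str
    chrom: str
    start: int
    end: int


def parse_phase_region(value: str) -> List[PhaseRegion]:
    """Split a `phase_region` string into its components.

    Assumes each region looks like "build:chrom:start-end", that contig names
    contain no colons, and that several regions may be joined by commas.
    """
    regions = []
    for item in value.split(","):
        build, chrom, span = item.strip().split(":")
        start, end = span.split("-")
        regions.append(PhaseRegion(build, chrom, int(start), int(end)))
    return regions


# Made-up example value; the actual genome build label may differ.
for region in parse_phase_region("38:chr6:32038085-32042687"):
    print(region.chrom, region.start, region.end)
```

The helper always returns a list so that single-region and comma-joined values can be handled the same way.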
diff --git a/README.md b/README.md
@@ -35,7 +35,7 @@ For more details about Paraphase, please check out our latest [paper](https://ww
 
 - Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023. doi:10.1016/j.ajhg.2023.01.001
 
-For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when gene copies are highly similar to each other. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended as the read length is generally shorter than WGS.
+Paraphase supports both whole-genome sequencing (WGS) data and targeted sequencing data, including data generated from [PureTarget](https://www.pacb.com/technology/puretarget) panels. For WGS data, we recommend >20X, ideally 30X, genome coverage. Low coverage or shorter read length could result in less accurate phasing, especially when gene copies are highly similar to each other. See our [tutorial](docs/targeted_data.md) for more details on targeted data.
 
 ## Contact
 
@@ -113,6 +113,7 @@ Paraphase produces a few output files in the directory specified by `-o`, with t
 - `haplotype_details`: lists information about each haplotype 
   - `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
 - `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
+- `fusions_called`: deletions or duplications created by unequal crossing over between paralogous sequences, called by a special step that checks the flanking sequences of phased haplotypes. This step is currently enabled for four regions: CYP2D6, GBA, CYP11B1 and the CFH gene cluster. 
 
 Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below: 
 - [SMN1/SMN2](docs/SMN1_SMN2.md)
diff --git a/docs/NEB.md b/docs/NEB.md
@@ -7,6 +7,7 @@ Paraphase resolves the triplicate (TRI) repeat region in NEB, where copy number
 - `total_cn`: total copy number of the triplicate repeat
 - `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
 - `alleles_final`: when possible, different copies of TRI are phased into alleles with read based phasing. 
+- `repeat_name`: haplotypes are assigned to TRI1/TRI2/TRI3, which are the three copies of the repeat in the reference genome. Note that this is according to their order in the reference genome, i.e. the first copy of the repeat in the reference genome is TRI1 and the last copy is TRI3. Some studies assign TRI1/TRI2/TRI3 according to their order in the coding sequence, which is on the negative strand of the reference genome, so their order is the reverse of what Paraphase reports.
 
 ## Visualizing haplotypes
 
diff --git a/docs/figures/puretarget_cyp21a2.png b/docs/figures/puretarget_cyp21a2.png
diff --git a/docs/targeted_data.md b/docs/targeted_data.md
@@ -1,14 +1,53 @@
 # Running Paraphase on targeted data
 
-Paraphase can work with targeted data, such as:
-- Hybrid capture based enrichment data
-- CRISPR-Cas9 targeted data
-- Amplicon data
+## Data types
 
-The config file may need to be modified based on the design of the target panel. Please reach out to Xiao Chen (xchen@pacificbiosciences.com) if you need assistance.
+Paraphase can work with targeted sequencing data, such as:
+- Shotgun type enrichment data, i.e. [hybrid capture based enrichment](https://www.pacb.com/wp-content/uploads/Twist-dark-regions-application-brief.pdf) data. For this data type, the read length is generally shorter than WGS, which may result in less accurate phasing, especially when gene copies are highly similar to each other. Therefore, we recommend sequencing to a higher depth (>50X) than WGS.
+- Amplicon data or CRISPR-Cas9 targeted data, such as [PureTarget](https://www.pacb.com/technology/puretarget). For this data type, the entire target region is fully contained in a read, making it easier to determine haplotypes compared to the shotgun type data. We generally recommend sequencing to a per-haplotype depth of 8-15X. However, a higher coverage (>15X) may be desired for short target regions. This is because for short regions it is possible for two haplotypes in the same sample to be identical in sequence, requiring Paraphase to adjust the haplotype copy numbers based on the relative depth difference between these haplotypes (e.g. one haplotype has twice the supporting reads of another haplotype, indicating the presence of two identical copies). The higher depth is needed to perform the copy number adjustment more accurately.
+
+## Config file
+
+The Paraphase config file needs to be modified based on the design of the target panel. Take the panel design for CYP21A2/CYP21A1P as an example.
+
+```yaml
+{
+  cyp21:
+    realign_region: chr6:32038085-32042687
+    extract_regions: chr6:31980000-32046800
+    left_boundary: 32038085
+    right_boundary: 32042687
+}
+```
+
+Each target region has an arbitrary name, e.g. `cyp21` here. 
+
+`realign_region` specifies the main region of interest and the coordinates should reflect the panel design (`realign_region` can be equal to or slightly bigger than the panel design). This is where we want all reads to be realigned to, i.e. one region selected for a homology group (e.g. CYP21A2 region for the CYP21A2/CYP21A1P group).
+
+`extract_regions` specifies the regions where we want to extract relevant reads from the input BAM, i.e. all the homologous regions (e.g. CYP21A2 and CYP21A1P) where the reads might align or misalign in the genome BAM. Multiple regions can be provided, separated by spaces.
+
+`left_boundary` and `right_boundary` are optional. If they are not provided, Paraphase will perform phasing between the start and end coordinates in `realign_region` and all positions between them will be reported in the VCF. If `left_boundary` and `right_boundary` are provided, Paraphase will perform phasing between `left_boundary` and `right_boundary`, reporting all positions between them in the VCF. 
+
+Please don't hesitate to reach out to Xiao Chen (xchen@pacificbiosciences.com) if you need assistance with config files.
+
+## Command line options
 
 Paraphase provides a few options for users to better work with targeted data: 
-1) Use the `--targeted` option to drop the assumption of uniform coverage across the genome.
+1) Use the `--targeted` option to drop the assumption of uniform coverage across the genome. With the `--targeted` option, Paraphase will not perform any depth normalization against the rest of the genome.
 2) Additionally there are two optional parameters designed for targeted data. The default values are expected to work well with high depth data since they are frequency-based. But users can tune them based on the depth of their data and the expected copy numbers of their regions of interest. 
 - `--min-variant-frequency`:  Minimum frequency for a variant to be used for phasing. The cutoff for variant-supporting reads is determined by max(5, total_depth * min_frequency). Note that total_depth is the combined depth of all paralogs for a paralog group. Default is 0.11. 
 - `--min-haplotype-frequency`: Minimum frequency of unique supporting reads for a haplotype. The cutoff for haplotype-supporting reads is determined by max(4, total_depth * min_frequency). Note that total_depth is the combined depth of all paralogs for a paralog group. Default is 0.03. This cutoff can be increased to filter out spurious and low-frequency haplotypes.
+
+## Examples
+
+Below we will use PureTarget data as an example.
+
+Paraphase can be run with the following command. Note the custom config file and the `--targeted` option.
+
+```bash
+paraphase -b input.bam -o output_directory -r genome_fasta -c custom_config.yaml --targeted
+```
+
+Paraphase reports four haplotypes for this sample (including CYP21A2 and CYP21A1P copies), and reads are separated by haplotypes in `.paraphase.bam`. Note that only HiFi reads (`rq`>=0.99) are used in Paraphase, even if the input BAM contains reads with lower `rq` values.
+
+![puretarget example](figures/puretarget_cyp21a2.png)
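
The read-count cutoffs described for `--min-variant-frequency` and `--min-haplotype-frequency` in the tutorial above can be illustrated with a small sketch. It only restates the two documented formulas, `max(5, total_depth * min_frequency)` and `max(4, total_depth * min_frequency)`. The function names are hypothetical, and how Paraphase rounds the result internally is not stated in the source, so no rounding is applied here.

```python
def variant_read_cutoff(total_depth: float, min_frequency: float = 0.11) -> float:
    """Minimum variant-supporting reads: max(5, total_depth * min_frequency).

    total_depth is the combined depth of all paralogs in a paralog group.
    Hypothetical helper; rounding is an open assumption and is not applied.
    """
    return max(5, total_depth * min_frequency)


def haplotype_read_cutoff(total_depth: float, min_frequency: float = 0.03) -> float:
    """Minimum unique haplotype-supporting reads: max(4, total_depth * min_frequency)."""
    return max(4, total_depth * min_frequency)


# Compare cutoffs at a few combined depths.
for depth in (30, 100, 400):
    print(depth, variant_read_cutoff(depth), haplotype_read_cutoff(depth))
```

At a combined depth of 30 both fixed floors apply (5 and 4 reads), because 30 * 0.11 and 30 * 0.03 fall below them. The frequency term only takes over at higher depth, which is consistent with the tutorial's remark that the defaults are expected to work well with high depth data.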
diff --git a/paraphase/__init__.py b/paraphase/__init__.py
@@ -1 +1 @@
-__version__ = "3.3.2"
+__version__ = "3.3.3"
diff --git a/paraphase/genes/cfc1_phaser.py b/paraphase/genes/cfc1_phaser.py
@@ -25,6 +25,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         self.get_candidate_pos()
@@ -105,4 +106,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/genes/cfhclust.py b/paraphase/genes/cfhclust.py
@@ -73,6 +73,7 @@ def call(self):
             None,
             None,
             None,
+            f"{self.cfh['phase_region']},{self.cfhr3['phase_region']}",
             None,
             fusions,
         )
diff --git a/paraphase/genes/f8_phaser.py b/paraphase/genes/f8_phaser.py
@@ -89,6 +89,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
 
         genome_bamh = pysam_handle(self.genome_bam, self.reference_fasta)
@@ -272,4 +273,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/genes/hba_phaser.py b/paraphase/genes/hba_phaser.py
@@ -63,6 +63,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         genome_bamh = pysam_handle(self.genome_bam, self.reference_fasta)
         surrounding_region_depth = self.get_regional_depth(
@@ -324,5 +325,6 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             alleles,
         )
diff --git a/paraphase/genes/ikbkg_phaser.py b/paraphase/genes/ikbkg_phaser.py
@@ -46,6 +46,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
 
@@ -135,7 +136,7 @@ def call(self):
                         hap_name = f"{self.gene}_pseudohap{pseudo_counter}"
                     elif clip_5p == self.clip_5p_positions[1]:
                         dup_counter += 1
-                        tmp.setdefault(hap, f"{self.gene}_duphap{dup_counter}")
+                        hap_name = f"{self.gene}_duphap{dup_counter}"
                     else:
                         assert clip_5p == 0
                         gene_counter += 1
@@ -242,5 +243,6 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             linked_haps,
         )
diff --git a/paraphase/genes/ncf1_phaser.py b/paraphase/genes/ncf1_phaser.py
@@ -41,6 +41,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         pivot_site = self.pivot_site
         for pileupcolumn in self._bamh.pileup(
@@ -169,4 +170,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/genes/neb_phaser.py b/paraphase/genes/neb_phaser.py
@@ -39,6 +39,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         self.get_candidate_pos()
@@ -189,5 +190,6 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             linked_haps,
         )
diff --git a/paraphase/genes/opn1lw_phaser.py b/paraphase/genes/opn1lw_phaser.py
@@ -177,6 +177,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         self.get_candidate_pos(min_vaf=0.08)
@@ -454,5 +455,6 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             alleles_1st_2nd,
         )
diff --git a/paraphase/genes/pms2_phaser.py b/paraphase/genes/pms2_phaser.py
@@ -25,6 +25,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         self.find_big_deletion(min_size=2900)
@@ -158,4 +159,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/genes/rccx_phaser.py b/paraphase/genes/rccx_phaser.py
@@ -350,6 +350,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         self.del2_reads, self.del2_reads_partial = self.get_long_del_reads(
@@ -543,5 +544,6 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             alleles,
         )
diff --git a/paraphase/genes/smn1_phaser.py b/paraphase/genes/smn1_phaser.py
@@ -495,6 +495,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         self.get_homopolymer()
         # find known deletions
@@ -689,4 +690,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/genes/strc_phaser.py b/paraphase/genes/strc_phaser.py
@@ -51,6 +51,7 @@ def call(self):
                 genome_depth=self.mdepth,
                 region_depth=self.region_avg_depth._asdict(),
                 sample_sex=self.sample_sex,
+                phase_region=f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
             )
         genome_bamh = pysam_handle(self.genome_bam, self.reference_fasta)
         intergenic_depth = self.get_regional_depth(genome_bamh, self.depth_region)[
@@ -194,4 +195,5 @@ def call(self):
             self.region_avg_depth._asdict(),
             self.sample_sex,
             self.init_het_sites,
+            f"{self.genome_build}:{self.nchr}:{self.left_boundary}-{self.right_boundary}",
         )
diff --git a/paraphase/phaser.py b/paraphase/phaser.py
diff --git a/paraphase/prepare_bam_and_vcf.py b/paraphase/prepare_bam_and_vcf.py
