
Commit 08fc26d

updated json documentation (#57)
Added a tutorial on JSON and updated tutorials on VCF, HBA, RCCX and PMS2.

Co-authored-by: Xiao Chen <chenxecho@gmail.com>
1 parent ef1f25c commit 08fc26d

10 files changed

Lines changed: 135 additions & 50 deletions

File tree

README.md

Lines changed: 3 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -106,16 +106,9 @@ Paraphase produces a few output files in the directory specified by `-o`, with t
106106

107107
2. `.paraphase.bam`: This BAM file can be loaded into IGV for visualization of haplotypes (group reads by `HP` tag and color alignments by `YC` tag). All haplotypes are aligned against the main gene of interest. Tutorials/Examples are provided for medically relevant genes (See below).
108108

109-
3. `.paraphase.json`: Output file summarizing haplotypes and variant calls for each gene family in each sample. In brief, a few generally used fields are explained below.
110-
- `final_haplotypes`: phased haplotypes for all gene copies in a gene family
111-
- `total_cn`: total copy number of the family (sum of gene and paralog/pseudogene)
112-
- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
113-
- `haplotype_details`: lists information about each haplotype
114-
- `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
115-
- `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
116-
- `fusions_called`: deletions or duplications created by unequal crossing over between paralogous sequences, called by a special step that checks the flanking sequences of phased haplotypes. This step is currently enabled for four regions: CYP2D6, GBA, CYP11B1 and the CFH gene cluster.
117-
118-
Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below:
109+
3. `.paraphase.json`: Output file summarizing copy number and phased haplotypes for each region. Details can be found [here](docs/json.md).
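
As a rough illustration of consuming this file, the sketch below collects `total_cn` per region. It assumes the top level of the JSON is an object keyed by region name, which is not confirmed here (see `docs/json.md` for the real layout); the region keys and numbers in the example string are made up for the demo.

```python
import json


def total_cn_by_region(report):
    """Map each region name to its `total_cn` value (None if the field is absent).

    Assumption: `report` is a dict keyed by region name whose values are dicts.
    """
    return {
        region: fields.get("total_cn")
        for region, fields in report.items()
        if isinstance(fields, dict)
    }


# Hypothetical, heavily trimmed stand-in for a `.paraphase.json` file.
# For a real file, use: report = json.load(open(path_to_json))
example = json.loads('{"region_a": {"total_cn": 4}, "region_b": {}}')
print(total_cn_by_region(example))
```

Using `.get` keeps the sketch tolerant of regions where a field is missing, rather than raising on the first gap.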
110+
111+
Tutorials/Examples are provided for further interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below:
119112
- [SMN1/SMN2](docs/SMN1_SMN2.md)
120113
- [RCCX module (CYP21A2)](docs/RCCX.md)
121114
- [PMS2](docs/PMS2.md)

docs/HBA1_HBA2.md

Lines changed: 39 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,20 +1,53 @@
11
# HBA1/HBA2
22

3-
For this [region](https://www.ncbi.nlm.nih.gov/books/NBK1435/), Paraphase calls the total copy number of HBA1 and HBA2. Variants are called in the VCF, against HBA2 reference sequence.
3+
4+
This [medically relevant region](https://www.ncbi.nlm.nih.gov/books/NBK1435/) is characterised by two regions of homology, represented by "A" and "B" in this simplified schematic of the region:
5+
![HBA1_HBA2 Region](figures/HBA1-HBA2-diagram.png)
6+
7+
Paraphase returns the total copy number of HBA1 and HBA2 in a sample. Variants are called in the VCF relative to the HBA2 reference sequence.
8+
9+
10+
#### Structural Variants
11+
12+
Two well-known structural variants in this region are the **3.7 kb** and **4.2 kb deletions or duplications**, which occur due to unequal crossing-overs between a pair of homology regions:
13+
14+
- **3.7 kb deletion or duplication**
15+
Results from recombination between the "B" boxes, forming a hybrid HBA1/HBA2 gene.
16+
17+
- **4.2 kb deletion or duplication**
18+
Results from recombination between the "A" boxes, leading to a deletion or duplication of HBA2.
19+
420

521
## Fields in the `json` file
622

7-
- `genotype`: reports the genotype of this family. Possible alleles include `aa`, `aaa` (duplication), `-a` (deletion) or `--` (double deletion).
8-
- `alleles_final`: when possible, different copies of HBA are phased into alleles with read based phasing.
9-
- `sv_called`: reports SVs (3p7del, 3p7dup, 4p2del or 4p2dup) and their coordinates.
23+
Fields shared across all genes are defined in the general [json file](json.md). The region includes several unique fields:
24+
25+
- `genotype`: Reports the genotype for this region. Possible alleles include:
26+
- `aa`: wild-type
27+
- `aaa`: duplication
28+
- `-a`: single-gene deletion
29+
- `--`: double-gene deletion
30+
- `surrounding_region_depth`: depth in the regions flanking HBA1 and HBA2. Paraphase uses this depth to infer the presence of double-gene deletion.
31+
- `sv_called`: Reports structural variants along with their genome coordinates:
32+
- `3p7del`: 3.7 kb deletion
33+
- `3p7dup`: 3.7 kb duplication
34+
- `4p2del`: 4.2 kb deletion
35+
- `4p2dup`: 4.2 kb duplication
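
The allele codes above can be turned into copy counts with a few lines of Python. This is a minimal sketch, assuming the `genotype` value is two allele codes joined by `/` (as in the `-a/aa` and `aaa/aa` examples in this tutorial); the function name is hypothetical and not part of Paraphase.

```python
# Allele codes listed for the `genotype` field.
KNOWN_ALLELES = {"aa", "aaa", "-a", "--"}


def hba_copies_per_allele(genotype):
    """Return the number of HBA gene copies on each allele of a genotype string."""
    alleles = genotype.split("/")
    for allele in alleles:
        if allele not in KNOWN_ALLELES:
            raise ValueError(f"unrecognised allele code: {allele!r}")
    # Each "a" is counted as one HBA gene copy; "-" marks a deleted copy.
    return [allele.count("a") for allele in alleles]


print(hba_copies_per_allele("-a/aa"))        # [1, 2]
print(sum(hba_copies_per_allele("aaa/aa")))  # 5
```

Rejecting unknown codes, instead of silently counting characters, makes it obvious when the real output uses a notation this sketch does not cover.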
36+
37+
Haplotype labels are explained in the section below.
1038

1139
## Visualizing haplotypes
1240

1341
To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by the `YC` tag. Green and purple represent two alleles, i.e. all haplotypes in green are on one allele and all haplotypes in purple are on the other allele.
1442

1543
Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray.
1644

45+
Paraphase realigns reads to the HBA2 region in the reference genome, including the first "B" box and the second "A" box in the schematic above. Paraphase assigns a label to each haplotype based on the starting and ending (soft-clipped) positions.
46+
1747
![HBA example](figures/HBA.png)
1848

19-
- The top panel shows a sample with two copies of HBA1 and two copies of HBA2, one on each allele.
20-
- The bottom panel shows a sample with a `-a` allele, where there is a deletion, leaving only one copy of HBA (`hba_del_hap1`).
49+
- The `no SV` panel shows a sample with two copies of HBA1 (short) and two copies of HBA2 (long), one on each allele. In addition, there is a `homology_hap` that derives from the first "A" box in the schematic above. The `homology_hap` does not encode any HBA genes and is only used for inferring 4.2 kb deletions or duplications, as we can see below.
50+
- The `3p7 deletion` panel shows a sample with a `-a` allele (purple), where there is a 3.7 kb deletion, creating a hybrid haplotype that has an HBA2 start and an HBA1 end (`3p7delhap1`, first haplotype). The genotype is `-a/aa`.
51+
- The `3p7 duplication` panel shows a sample with a 3.7 kb duplication, creating a hybrid haplotype that has an HBA1 start and an HBA2 end (`3p7duphap1`, first haplotype). The genotype is `aaa/aa`. `hba1hap1` is present at two copies, as we can tell from the depth.
52+
- The `4p2 deletion` panel shows a sample with a 4.2 kb deletion, creating a hybrid haplotype with a start corresponding to the homology haplotype and an HBA2 end (`4p2delhap1`, first haplotype). It's on the same chromosome as `hba1hap2` (purple). The genotype is `-a/aa`.
53+
- The `4p2 duplication` panel shows a sample with a 4.2 kb duplication, creating a hybrid haplotype with an HBA2 start and an end corresponding to the homology haplotype (`4p2duphap1`, first haplotype). The genotype is `aaa/aa`. `hba1hap1` is present at two copies, as we can tell from the depth.
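
The SV haplotype labels in these panels follow a visible pattern, which the sketch below decodes. It assumes labels end with `<sv>hap<number>` exactly as printed in this tutorial (`3p7delhap1`, `4p2duphap1`); labels in real output may carry a prefix, so the pattern is anchored only at the end. The helper name is hypothetical.

```python
import re

# Pattern inferred from the labels shown in this tutorial; not a Paraphase API.
SV_LABEL = re.compile(r"(3p7|4p2)(del|dup)hap(\d+)$")
SIZES = {"3p7": "3.7 kb", "4p2": "4.2 kb"}
KINDS = {"del": "deletion", "dup": "duplication"}


def describe_sv_haplotype(label):
    """Return e.g. '3.7 kb deletion' for an SV haplotype label, else None."""
    match = SV_LABEL.search(label)
    if match is None:
        return None  # e.g. `hba1hap1` or `homology_hap` carry no SV
    size, kind, _index = match.groups()
    return f"{SIZES[size]} {KINDS[kind]}"


for label in ("3p7delhap1", "4p2duphap1", "hba1hap2", "homology_hap"):
    print(label, "->", describe_sv_haplotype(label))
```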

docs/PMS2.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,9 @@
22

33
## Fields in the `json` file
44

5-
- `total_cn`: total copy number of PMS2 and PMS2CL
6-
- `gene_cn`: copy number of the gene of interest, i.e. PMS2
7-
- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
5+
Fields shared across all genes are defined in the general [json file](json.md). The PMS2 locus does not include unique fields.
6+
7+
PMS2 haplotypes are classified as gene (labeled `pms2_pms2hap#`) or pseudogene (labeled `pms2_pms2clhap#`) based on whether the haplotype extends beyond the homology region into the unique region.
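
The labeling scheme above can be checked programmatically. The following is a small sketch, assuming the `#` in the labels stands for a haplotype number; the function name is hypothetical.

```python
import re

# `cl` in the label marks a PMS2CL (pseudogene) haplotype.
PMS2_LABEL = re.compile(r"^pms2_pms2(cl)?hap\d+$")


def classify_pms2_haplotype(label):
    """Return 'gene', 'pseudogene', or 'unknown' for a PMS2 haplotype label."""
    match = PMS2_LABEL.match(label)
    if match is None:
        return "unknown"
    return "pseudogene" if match.group(1) else "gene"


print(classify_pms2_haplotype("pms2_pms2hap1"))    # gene
print(classify_pms2_haplotype("pms2_pms2clhap2"))  # pseudogene
```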
88

99
## Visualizing haplotypes
1010

docs/RCCX.md

Lines changed: 32 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,35 +1,42 @@
11
# RCCX module
22

3-
Medically relevant genes in this region include:
4-
- [CYP21A2](https://www.ncbi.nlm.nih.gov/books/NBK1171/) (21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia)
5-
- TNXB (Ehlers-Danlos syndrome)
6-
- C4A/C4B (relevant in autoimmune diseases)
7-
8-
## Fields in the `json` file
9-
10-
- `total_cn`: total copy number of RCCX
11-
- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens when (in a small number of cases) two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.
12-
- `alleles_final`: different copies of RCCX are phased into alleles with read based phasing.
13-
- `ending_hap`: the last copy of RCCX on each allele. Only these copies contain parts of TNXB (while the other copies contain TNXA)
14-
- `annotated_alleles`: allele annotation for the CYP21A2 gene. This is a list of two items, each representing one allele in the sample. This is only based on common gene-pseudogene (CYP21A2-CYP21A1P) differences (P31L, IVS2-13A/C>G, G111Vfs, I173N, I237N, V238E, M240K, V282L, Q319X and R357W). Please refer to the VCFs for most thorough variant calling and annotation. Below are a few examples of annotated alleles:
15-
- `WT`: one copy of CYP21A2 and one copy of CYP21A1P (pseudogene) on this allele.
16-
- `pseudogene_duplication`: On this allele, there is an additional copy of the pseudogene.
17-
- `pseudogene_deletion`: On this allele, the pseudogene is deleted.
18-
- `gene_duplication`: On this allele, there is an additional copy of CYP21A2.
19-
- `gene_deletion`: On this allele, CYP21A2 is deleted.
20-
- `deletion_P31L,G111Vfs`: On this allele, there is a deletion of one RCCX copy, creating a fusion gene between CYP21A1P and CYP21A2. This fusion gene carries the variants P31L and G111Vfs (which come from the pseudogene part of the fusion).
21-
- `duplication_WT_plus_Q319X`: On this allele, there is an additional copy of CYP21A2. Among the two copies of CYP21A2, one copy is WT and the other carries Q319X.
22-
- `Q319X`: On this allele, there is no CNV, i.e. there is one copy of CYP21A2 and one copy of CYP21A1P. CYP21A2 carries the variant Q319X. (Other known variants in CYP21A2 are also reported in this way, e.g. `282L`.)
3+
The RCCX module refers to a complex and variable region on chromosome 6, overlapping several medically relevant genes, including:
4+
- [CYP21A2](https://www.ncbi.nlm.nih.gov/books/NBK1171/) (21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia)
5+
- TNXB (Ehlers-Danlos syndrome)
6+
- C4A/C4B (relevant in autoimmune diseases)
7+
8+
Below is a simplified schematic of the region:
9+
![RCCX Region](figures/RCCX-diagram.png)
10+
11+
12+
## Region specific fields in the `json` file
13+
14+
Fields shared across all genes are defined in the general [json file](json.md). The RCCX module includes several unique fields, listed below:
15+
16+
- `ending_hap`: Indicates the last RCCX copy on each allele. These haplotypes have unique sequences from the unique region downstream of RCCX. Only these final copies contain the gene TNXB; all earlier copies on the same allele contain TNXA (the pseudogene). This field can be used to infer the order of RCCX haplotypes on an allele.
17+
- `starting_hap`: Indicates the first RCCX copy on each allele. These haplotypes have unique sequences from the unique region upstream of RCCX. This field can be used to infer the order of RCCX haplotypes on an allele.
18+
- `deletion_hap`: A deletion haplotype has the characteristics of both a starting haplotype and an ending haplotype, meaning it is the only haplotype on its allele. This indicates a deletion of an RCCX copy, leaving just one copy of RCCX on that allele.
19+
- `hap_variants`: Variant calls for common gene-pseudogene (CYP21A2-CYP21A1P) differentiating sites (P31L, IVS2-13A/C>G, G111Vfs, I173N, I237N, V238E, M240K, V282L, Q319X and R357W). This is used for allele annotation of CYP21A2. For comprehensive variant calls of the RCCX module please refer to the vcf file.
20+
- `annotated_alleles`: Provides per-allele annotations of CYP21A2 based on the `hap_variants` field. Possible values may include:
21+
- `WT`: one copy each of CYP21A2 and CYP21A1P (pseudogene) on this allele.
22+
- `pseudogene_duplication`: additional copy of CYP21A1P on this allele.
23+
- `pseudogene_deletion`: CYP21A1P is deleted on this allele.
24+
- `gene_duplication`: additional copy of CYP21A2 on this allele.
25+
- `gene_deletion`: CYP21A2 is deleted on this allele.
26+
- `deletion_P31L,G111Vfs`: Deletion of one RCCX copy on this allele, creating a CYP21A1P–CYP21A2 fusion gene carrying P31L and G111Vfs variants from the pseudogene.
27+
- `duplication_WT_plus_Q319X`: Two copies of CYP21A2 on this allele: one WT, the other carrying Q319X.
28+
- `Q319X`: Single CYP21A2 copy with variant Q319X, no CNV present (other variants like 282L are reported similarly).
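
The annotation strings listed above can be split into an event and a variant list. This is a hedged sketch based only on the example values in this tutorial: it assumes variants are comma-separated, that `_plus_` separates the copies in a duplication, and that any unrecognised string is a plain variant list. The event names `fusion_deletion` and `gene_duplication` in the return value are labels chosen for this sketch, not Paraphase output.

```python
# Annotations that describe a copy number change with no listed variants.
CNV_ONLY = {
    "pseudogene_duplication",
    "pseudogene_deletion",
    "gene_duplication",
    "gene_deletion",
}


def parse_annotated_allele(annotation):
    """Split one `annotated_alleles` entry into an event and its variants."""
    if annotation == "WT":
        return {"event": None, "variants": []}
    if annotation in CNV_ONLY:
        return {"event": annotation, "variants": []}
    prefixes = (("deletion_", "fusion_deletion"), ("duplication_", "gene_duplication"))
    for prefix, event in prefixes:
        if annotation.startswith(prefix):
            rest = annotation[len(prefix):]
            # Each "_plus_"-separated part describes one copy; "WT" adds no variant.
            variants = [
                variant
                for copy in rest.split("_plus_")
                for variant in copy.split(",")
                if variant != "WT"
            ]
            return {"event": event, "variants": variants}
    # No CNV: the annotation is assumed to be the variant list itself.
    return {"event": None, "variants": annotation.split(",")}


print(parse_annotated_allele("deletion_P31L,G111Vfs"))
print(parse_annotated_allele("duplication_WT_plus_Q319X"))
```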
2329

2430
## Visualizing haplotypes
2531

26-
To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Reads are realigned to CYP21A2.
32+
To visualize phased haplotypes, load the output bam file in IGV, group reads by the `HP` tag and color alignments by `YC` tag. Reads are realigned to the CYP21A2 reference.
2733

28-
Green and purple represent two alleles, i.e. all haplotypes in green are on one one allele and all haplotypes in purple are on the other allele. Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray.
34+
Green and purple represent two alleles, i.e. all haplotypes in green are on one allele and all haplotypes in purple are on the other allele. Reads in gray are either unassigned or consistent with more than one possible haplotype. When two haplotypes are identical over a region, there can be more than one haplotype consistent with a read, and the read is randomly assigned to a haplotype and colored in gray.
2935

3036
![RCCX examples](figures/RCCX.png)
3137

32-
- In this set of examples, the top panel shows a sample with no copy number change (both alleles are `WT`). There are four copies of RCCX, two on each allele. On each allele, one copy carries CYP21A2 and the other copy carries CYP21A1P (marked by a cluster of mismatches when aligned to CYP21A2).
33-
- The middle panel shows a sample with a fusion deletion (purple allele `deletion_P31L,G111Vfs`). There is only one copy of RCCX on this allele. The deletion breakpoint is in CYP21A2, creating a fusion gene between CYP21A1P and CYP21A2.
34-
- The bottom panel shows a sample with a CYP21A2 duplication that carries Q319X (purple allele `duplication_WT_plus_Q319X`). On this allele, there are two copies of CYP21A2, among which one copy is WT and the other (the one next to TNXB) carries Q319X.
38+
Examples:
39+
- **Top panel**: Sample with no copy number change (both alleles are `WT`). There are four copies of RCCX, two per allele. On each allele, one copy carries CYP21A2 and the other carries CYP21A1P (the latter marked by a cluster of mismatches when aligned to CYP21A2).
40+
- **Middle panel**: Sample with a fusion deletion on the purple allele (`deletion_P31L,G111Vfs`). This allele has only one RCCX copy. The breakpoint occurs within CYP21A2, creating a CYP21A1P–CYP21A2 fusion gene that includes variants inherited from the pseudogene.
41+
- **Bottom panel**: Sample with a CYP21A2 duplication on the purple allele (`duplication_WT_plus_Q319X`). This allele contains two CYP21A2 copies: one is wild-type; the other (next to TNXB) carries the Q319X variant.
3542

docs/figures/HBA.png

66 KB

docs/figures/HBA1-HBA2-diagram.png

26 KB

docs/figures/RCCX-diagram.png

70.9 KB
