Commit 0b8aa16

Chenghao (Trevor) Zhu authored
Merge pull request #914 from uclahs-cds/czhu-fix-mito
Support Different Codon Tables
2 parents ab02bd8 + 0f36ba6 commit 0b8aa16

64 files changed

Lines changed: 1147 additions & 253 deletions

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -18,4 +18,6 @@ build

 test/files

-venv
+venv
+
+notebooks

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -10,6 +10,34 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

 ## [Unreleased]

+## [1.5.0] - 2025-06-15
+
+- Added `--codon-table` and `--chr-codon-table`. The former sets the codon table to use, and the latter overrides it for specific chromosomes.
+
+- Added `--start-codons` and `--chr-start-codons` to specify the start codons to use.
+
+- Added codon table support to `callVariant`.
+
+- Added codon table support to `callNovelORF` and `callAltTranslation`.
+
+- Added codon table to `downsampleReference`.
+
+- Fixed graph algorithms to use the specified codon table and start codons.
+
+- Fixed `GenomicAnnotationOnDisk` so that the `is_protein_coding` attributes of transcripts are updated correctly when using on-the-fly indices.
+
+- Added `force_init_met` to `PeptideVariantGraph.call_variant_peptide` to control whether the initial amino acid should be forced to Methionine.
+
+- Updated the reference data loading function to directly return a `ReferenceData` object.
+
+- Updated `bruteForce` to accept a codon table and start codons.
+
+- Updated `fuzzTest` to pass the codon table and start codons to `callVariant` and `bruteForce`.
+
+- Fixed `callVariant` where, during TVG alignment, a node merged from multiple frameshifts that together return to the original reading frame was not recognized as `was_brige` in a bubble.
+
+- Fixed `fuzzTest` so that the CDS start is at least 3 nucleotides away from the CDS end.
+
 ## [1.4.6] - 2025-05-21

 - Fixed biopython version #908
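
The codon table and `force_init_met` entries in this changelog can be illustrated with a minimal Python sketch. This is not moPepGen's implementation: `build_table` and `translate` are hypothetical helpers written for illustration, and the only external facts used are the NCBI translation strings for tables 1 (Standard) and 2 (Vertebrate Mitochondrial), written in the usual TCAG codon order. The sketch shows how one sequence yields different peptides under different codon tables, and how forcing the initial residue to Methionine matters when a non-`ATG` start codon such as `ATT` is used.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation tables 1 and 2, amino acids listed in TCAG codon order.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def build_table(amino_acids):
    """Map each of the 64 codons to its amino acid (hypothetical helper)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(seq, table, force_init_met=True):
    """Translate in frame 0 until the first stop codon (hypothetical helper).

    When force_init_met is true, the first residue is replaced with
    Methionine regardless of what the start codon encodes in elongation.
    """
    peptide = []
    for i in range(0, len(seq) - 2, 3):
        aa = table[seq[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    if force_init_met and peptide:
        peptide[0] = "M"
    return "".join(peptide)


std = build_table(STANDARD)
mito = build_table(VERT_MITO)

# ATT starts the ORF; TGA is a stop in the standard table but Trp in table 2.
orf = "ATTGCAATTTGA"
mito_peptide = translate(orf, mito)                        # "MAIW"
mito_raw = translate(orf, mito, force_init_met=False)      # "IAIW"
std_peptide = translate(orf, std)                          # "MAI"
```

The second `ATT` stays Isoleucine in every case, because only the initiating codon is affected by the forced Methionine.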

docs/call-alt-translation.md

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,10 @@

 {% include 'partials/_caution_on_reference_version.md' %}

+{% include 'partials/_args_reference.md' %}
+
+{% include 'partials/_args_codon_table.md' %}
+
 ## Usage

 ```

docs/call-novel-orf.md

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,10 @@

 {% include 'partials/_caution_on_reference_version.md' %}

+{% include 'partials/_args_reference.md' %}
+
+{% include 'partials/_args_codon_table.md' %}
+
 ## Usage

 ```

docs/call-variant.md

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@
 show_root_heading: false
 show_source: false

+{% include 'partials/_args_reference.md' %}
+
+{% include 'partials/_args_codon_table.md' %}
+
 ## Usage

 ```

docs/files/fuzz_test_history.tsv

Lines changed: 3 additions & 0 deletions
@@ -41,3 +41,6 @@ v1.4.5 f46742d 2025-02-24 comprehensive 5902 0 0 0:00:00.375135 1.01575471228857
 v1.4.6-rc1 78e971d 2025-03-04 snv 2710 0 0 0:00:00.163103 0.38221870483830245 0:00:56.301104 116.21712282069569
 v1.4.6-rc1 78e971d 2025-03-04 indel 2850 0 0 0:00:00.191917 0.3938244401664319 0:00:41.244215 95.36212234507254
 v1.4.6-rc1 78e971d 2025-03-04 comprehensive 5310 0 0 0:00:00.395482 1.0924741762245143 0:00:40.031276 179.44714992445952
+v1.4.6-rc4 8f8e871 2025-06-14 snv 3899 0 0 0:00:00.161568 0.37222759958119433 0:00:55.336499 116.08301043924084
+v1.4.6-rc4 8f8e871 2025-06-14 indel 3811 0 0 0:00:00.253212 1.900065095497631 0:00:39.172671 89.15186913116658
+v1.4.6-rc4 8f8e871 2025-06-14 comprehensive 7291 0 0 0:00:00.388579 0.9478303137705371 0:00:42.034437 171.5405703540714

docs/partials/_args_codon_table.md

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
## Codon Table & Start Codon
2+
3+
### Codon Table
4+
5+
The NCBI standard codon table is used by default, which is used for the majority of nulcear gene translation in eukaryote cells. See [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) for a complete list of NCBI codon tables.
6+
7+
The default codon table can be override using `--codon-table`. For example:
8+
9+
```shell
10+
--codon-table ’Ciliate Nuclear'
11+
```
12+
13+
The `--chr-codon-table` can be used to specify the codon table used for a specific chomosome. The example below uses the 'Vertebrate Mitochondrial' (SGC1) codon table for genes from the mitochondria chomosome, and the standard codon table otherwise.
14+
15+
```shell
16+
--codon-table Standard \
17+
--chr-codon-table 'chrM:SGC1'
18+
```
19+
20+
### Start Codons
21+
22+
Stard codons usually do not need to be specified. The standard start codon `ATG` is used by default, and it is translated as Methionine as start codon and in elongation. However, in some cases, for example, mitochondria, `ATA` and `ATT` may also be used as start codon. While `ATT` is translatted into Isoleucine during elongation, Methionine is still used as start codon.
23+
24+
Similar to codon table, the default codon table can be override using `--start-codons`.
25+
26+
```shell
27+
--start-codons ATG
28+
```
29+
30+
The `--chr-start-codon` can also be used to assign start codons to a specific chomosome. The example below assigns `ATG`, `ATA`, and `ATT` to the mitochondrial chromosome.
31+
32+
```shell
33+
--chr-start-codons 'chrM:ATG,ATA,ATT'
34+
```
35+
36+
### Default
37+
38+
The chromosome names must be specified correctly, same as what used in the genome fasta and annotation GTF file. By default, moPepGen infers the reference source of the annotation (*i.e.*, GENCODE or EMSEMBL), and uses the 'SGC1' codon table for mitochondirla chromosome. So the default is equivalent to:
39+
40+
```shell
41+
--reference-source GENCODE \
42+
--codon-table Standard \
43+
--chr-codon-table 'chrM:SGC1' \
44+
--start-codons 'ATG' \
45+
--chr-start-codongs 'chrM:ATG,ATA,ATT
46+
```
47+
48+
or
49+
50+
```shell
51+
--reference-source ENSEMBL \
52+
--codon-table Standard \
53+
--chr-codon-table 'MT:SGC1' \
54+
--start-codons 'ATG' \
55+
--chr-start-codongs 'MT:ATG,ATA,ATT
56+
```
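
The per-chromosome override semantics described in this partial can be sketched in a few lines of Python. This is an illustration only, assuming the `chrom:value` syntax shown in the examples above; `parse_chr_overrides`, `resolve_codon_table`, and `resolve_start_codons` are hypothetical names and are not taken from moPepGen's code base.

```python
def parse_chr_overrides(specs):
    """Parse 'chrom:value' strings into a dict (hypothetical helper)."""
    overrides = {}
    for spec in specs:
        # Split on the first colon only, so values may contain commas.
        chrom, _, value = spec.partition(":")
        if not chrom or not value:
            raise ValueError(f"expected 'chrom:value', got {spec!r}")
        overrides[chrom] = value
    return overrides


def resolve_codon_table(chrom, default, chr_tables):
    """Return the chromosome-specific table if one is set, else the default."""
    return chr_tables.get(chrom, default)


def resolve_start_codons(chrom, default, chr_starts):
    """Return the chromosome-specific start codons, else the defaults."""
    if chrom in chr_starts:
        return chr_starts[chrom].split(",")
    return list(default)


# Mirrors the GENCODE defaults listed above.
chr_tables = parse_chr_overrides(["chrM:SGC1"])
chr_starts = parse_chr_overrides(["chrM:ATG,ATA,ATT"])

table_chr1 = resolve_codon_table("chr1", "Standard", chr_tables)
table_chrM = resolve_codon_table("chrM", "Standard", chr_tables)
starts_chr1 = resolve_start_codons("chr1", ["ATG"], chr_starts)
starts_chrM = resolve_start_codons("chrM", ["ATG"], chr_starts)
```

Because the lookup is keyed on the exact chromosome name, `MT` and `chrM` are treated as different chromosomes, which is why the names must match the genome FASTA and GTF.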

docs/partials/_args_reference.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+## Reference
+
+Reference data, including the reference genome, genome annotation, and protein-coding translation, are required. There are two ways of specifying reference data:
+
+1. Using the index directory created by the [`generateIndex`](/generate-index) command.
+2. Specifying each required reference file individually.
+
+Option 1 is highly recommended, as it is faster and helps ensure that the same reference data are used across the project.

moPepGen/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 from . import constant


-__version__ = '1.4.6'
+__version__ = '1.5.0'

 ## Error messages
 ERROR_INDEX_IN_INTRON = 'The genomic index seems to be in an intron'

moPepGen/aa/AminoAcidSeqRecord.py

Lines changed: 0 additions & 4 deletions
@@ -335,10 +335,6 @@ def find_all_enzymatic_cleave_sites_with_ranges(self, rule:str, exception:str=No
         return list(self.iter_enzymatic_cleave_sites_with_range(rule=rule,
             exception=exception))

-    def find_all_start_sites(self) -> List[int]:
-        """ Find all start positions """
-        return [x.start() for x in re.finditer('M', str(self.seq))]
-
     def find_all_cleave_and_stop_sites(self, rule:str, exception:str=None,
             exception_sites:List[int]=None) -> List[int]:
         """ Find all enzymatic cleave sites and stop sites """
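
The removed `find_all_start_sites` located start positions by searching the amino acid sequence for `M`. The visible part of this diff does not show what replaces it, so the following is only a sketch of one plausible alternative: once start codons such as `ATT` are allowed, a start site can no longer be recognized from the translated residue alone, so the search is done on the nucleotide sequence instead. `find_start_codon_sites` is a hypothetical function, not part of moPepGen.

```python
from typing import Iterable, List


def find_start_codon_sites(seq: str, start_codons: Iterable[str] = ("ATG",),
        frame: int = 0) -> List[int]:
    """ Find nucleotide positions of in-frame start codons (hypothetical) """
    starts = set(start_codons)
    # Step through the sequence codon by codon in the requested frame.
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in starts]


seq = "ATGGCAATTATA"
default_sites = find_start_codon_sites(seq)
mito_sites = find_start_codon_sites(seq, ("ATG", "ATA", "ATT"))
```

With the default start codon only position 0 is reported, while the mitochondrial start codon set also reports the in-frame `ATT` and `ATA` at positions 6 and 9.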
