Annotation generation bug fixes #1215

Open
ainefairbrother wants to merge 5 commits into Ensembl:main from ainefairbrother:spliceai-gencodepri

@ainefairbrother ainefairbrother commented May 15, 2026

Contributor

SpliceAI input processing bug fixes

Summary: off-by-one and multi-transcript handling fixes for annotation file creation and the variant simulation tool.

Description of changes

ensembl-variation/scripts/python/spliceai_annotation_file.py fixes:

  1. In the gff3 branch, GFF3 transcript and exon starts were previously written to the annotation file as-is. This is incorrect, because SpliceAI adds 1 to the transcript start and exon starts when it imports and processes the annotations. In other words, SpliceAI assumes the annotation file has 0-based starts and 1-based ends, so starts are now decremented by 1 and ends are left unchanged.
  2. Handling of multiple transcripts: as per the SpliceAI author, the annotation file should contain one gene-transcript pair per row.
  3. Other small fixes (robustness improvements, extra filters, removal of buggy functionality):
    • Filter for main chromosomes only (1-22, X, Y).
    • Name cleaning and generating functions (for the gene:tx name).
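
As a rough illustration of fixes 1 and 2, the sketch below shifts 1-based GFF3 starts to 0-based starts, leaves ends unchanged, and emits a single gene:transcript row. The function names (`to_spliceai_interval`, `format_row`) and the toy coordinates are hypothetical; they are not taken from spliceai_annotation_file.py.

```python
def to_spliceai_interval(gff3_start, gff3_end):
    """Shift a 1-based inclusive GFF3 interval to a 0-based start, 1-based end."""
    if gff3_start < 1 or gff3_end < gff3_start:
        raise ValueError(f"invalid GFF3 interval: {gff3_start}-{gff3_end}")
    return gff3_start - 1, gff3_end


def format_row(name, chrom, strand, tx_start, tx_end, exons):
    """Build one tab-separated annotation row for a single gene:transcript.

    tx_start, tx_end and exons [(start, end), ...] are GFF3 coordinates.
    """
    tx_start0, tx_end1 = to_spliceai_interval(tx_start, tx_end)
    # Convert every exon and order them by ascending start.
    converted = sorted(to_spliceai_interval(s, e) for s, e in exons)
    # Each list keeps a trailing comma, as in the annotation file format.
    exon_starts = "".join(f"{s}," for s, _ in converted)
    exon_ends = "".join(f"{e}," for _, e in converted)
    return "\t".join(
        [name, chrom, strand, str(tx_start0), str(tx_end1), exon_starts, exon_ends]
    )


# Toy transcript with two exons (made-up coordinates).
row = format_row("GENE1:TX1", "20", "+", 101, 400, [(301, 400), (101, 200)])
print(row)
```

Only the starts move; a gene with several transcripts would call `format_row` once per transcript, giving one row each.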

The annotation file now looks like:

#NAME	CHROM	STRAND	TX_START	TX_END	EXON_START	EXON_END
TOP1:ENST00001107672	20	+	41028814	41033491	41028814,41029430,	41029100,41033491,
APP:ENST00001107674	21	-	25910135	26170830	25910135,25954589,25955626,25975069,25975953,25982343,26000014,26021839,26050999,26053235,26089942,26111978,26170563,	25911962,25954689,25955755,25975228,25976028,25982477,26000182,26022042,26051193,26053348,26090072,26112146,26170830,
APP:ENST00001107675	21	-	25910144	26170830	25910144,25954589,25955626,25975069,25975953,25982343,25997359,26000014,26021839,26050999,26053235,26089942,26111978,26170563,	25911962,25954689,25955755,25975228,25976028,25982477,25997416,26000182,26022042,26051193,26053348,26090072,26112146,26170830,
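
Rows in this layout can be sanity-checked with a few lines of Python. This is a minimal sketch: `parse_row` and `check_row` are hypothetical helpers, and the checks (matching exon counts, start before end, exons inside the transcript bounds) are assumptions about a well-formed row rather than rules stated by SpliceAI.

```python
def parse_row(line):
    """Split one tab-separated annotation row into typed fields."""
    name, chrom, strand, tx_start, tx_end, starts, ends = (
        line.rstrip("\n").split("\t")
    )
    return {
        "name": name,
        "chrom": chrom,
        "strand": strand,
        "tx_start": int(tx_start),
        "tx_end": int(tx_end),
        # The lists carry a trailing comma, so drop empty fragments.
        "exon_starts": [int(x) for x in starts.split(",") if x],
        "exon_ends": [int(x) for x in ends.split(",") if x],
    }


def check_row(row):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if len(row["exon_starts"]) != len(row["exon_ends"]):
        problems.append("exon start/end counts differ")
    for start, end in zip(row["exon_starts"], row["exon_ends"]):
        if not start < end:
            problems.append(f"exon {start}-{end}: start is not before end")
        if start < row["tx_start"] or end > row["tx_end"]:
            problems.append(f"exon {start}-{end}: outside transcript bounds")
    return problems


# The TOP1 row from the example above.
line = (
    "TOP1:ENST00001107672\t20\t+\t41028814\t41033491\t"
    "41028814,41029430,\t41029100,41033491,"
)
print(check_row(parse_row(line)))
```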

tools/variant_simulator/simulate_variation fixes:

  1. In the onlyGencodePrimary case, add all GENCODE-primary protein-coding transcripts to the output list, rather than just the final one.
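
The difference between keeping "just the final one" and keeping every match can be pictured with the Python sketch below. It illustrates the pattern only: it is not the simulate_variation code, and the record fields (`id`, `biotype`, `gencode_primary`) are hypothetical.

```python
def is_wanted(tx):
    """True for protein-coding transcripts flagged as GENCODE primary."""
    return tx["biotype"] == "protein_coding" and tx["gencode_primary"]


def last_match_only(transcripts):
    """Each match overwrites the previous one, so only the last survives."""
    selected = None
    for tx in transcripts:
        if is_wanted(tx):
            selected = tx["id"]
    return [] if selected is None else [selected]


def all_matches(transcripts):
    """Every match is appended, so none are lost."""
    selected = []
    for tx in transcripts:
        if is_wanted(tx):
            selected.append(tx["id"])
    return selected


# Made-up records for one gene.
transcripts = [
    {"id": "TX1", "biotype": "protein_coding", "gencode_primary": True},
    {"id": "TX2", "biotype": "lncRNA", "gencode_primary": True},
    {"id": "TX3", "biotype": "protein_coding", "gencode_primary": True},
]
print(last_match_only(transcripts))
print(all_matches(transcripts))
```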

Removed the scripts/python/filter_annotation_file.py script, which supported the incorrect exon-diffing approach.

Testing

You can generate and inspect the GENCODE primary protein-coding annotation file using:

python3 spliceai_annotation_file.py \
  --gff3 Homo_sapiens.GRCh38.116.gff3.gz \
  --gencode_primary \
  --name_format gene_transcript \
  --release 116 \
  --output_file "gene_annotation_116_gencode_primary_protein_coding.txt"

@ainefairbrother ainefairbrother marked this pull request as ready for review May 20, 2026 15:32

@dglemos dglemos left a comment

Tested against file Homo_sapiens.GRCh38.115.chr.gff3.gz

  • spliceai_annotation_file.py generates the expected output, with start coordinates written as 0-based
  • simulate_variation returns all regions overlapping GENCODE transcripts

Can you also update this example with the correct start coordinates (-1)?

This comment is in line 21.

are aggregated, and the SpliceAI annotation formatted file is written out.
By passing --gff3, the supplied GFF3 is read in and transcript rows are retained. By default
this keeps MANE_Select transcripts; with --gencode_primary it keeps gencode_primary
protein-coding transcripts on main chromosomes.

You could also add some information about the chrs included in the output file, something like this:

In both modes, only chromosomes 1-22, X, and Y are retained; all other contigs (including MT) are excluded from the generated annotation file.
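
A filter matching that description could look like the sketch below. It assumes Ensembl-style sequence names without a "chr" prefix, and `MAIN_CHROMOSOMES` and `is_main_chromosome` are hypothetical names, not identifiers from the script.

```python
# Chromosomes 1-22 plus X and Y, as strings (assumes no "chr" prefix).
MAIN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_main_chromosome(seqid):
    """True only for 1-22, X and Y; MT and other contigs are rejected."""
    return seqid in MAIN_CHROMOSOMES


seqids = ["1", "22", "X", "Y", "MT", "KI270728.1"]
kept = [s for s in seqids if is_main_chromosome(s)]
print(kept)
```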

@ainefairbrother ainefairbrother requested a review from dglemos June 8, 2026 15:51