Commit 0df9025

feat: add real canine reference data and fix Ensembl parsing

- Download CanFam3.1 proteome (45,094 proteins) from Ensembl
- Extract 139 DLA-88 alleles from IPD-MHC database
- Create demo VCF with published canine tumor hotspot mutations: TP53 E273K/E282K (osteosarcoma), BRAF V600E, PIK3CA E545K, PTEN D130G, KRAS G100R - all with real CanFam3.1 coordinates
- Fix ProteinDatabase gene name parsing for Ensembl format: gene_symbol:X now takes priority over gene:ENSCAFG... IDs
- Add realistic expression data and DLA allele list for demo
- Pipeline result: 8 VCF → 6 coding → 228 peptides → 912 candidates
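The gene-name fix in the commit message could look roughly like the sketch below. It is an illustration only, not the project's ProteinDatabase code: the function name `parse_gene_name`, the regular expressions, and the two example headers are assumptions. The headers follow the general layout of Ensembl peptide FASTA headers, with the protein ID, coordinates, and description as placeholders.

```python
import re
from typing import Optional

# Hypothetical helper, not taken from the repository.
# "\bgene:" cannot match inside "gene_symbol:" or "gene_biotype:" because
# the character after "gene" there is "_" rather than ":".
_SYMBOL_RE = re.compile(r"\bgene_symbol:(\S+)")
_GENE_ID_RE = re.compile(r"\bgene:(\S+)")


def parse_gene_name(header: str) -> Optional[str]:
    """Prefer gene_symbol:X; fall back to the gene:ENSCAFG... ID; else None."""
    symbol = _SYMBOL_RE.search(header)
    if symbol:
        return symbol.group(1)
    gene_id = _GENE_ID_RE.search(header)
    if gene_id:
        return gene_id.group(1)
    return None


# Illustrative headers: gene and transcript IDs are the TP53 IDs used in the
# demo VCF; everything else is a placeholder.
WITH_SYMBOL = (
    ">ENSCAFP_PLACEHOLDER pep chromosome:CanFam3.1:5:1:100:-1 "
    "gene:ENSCAFG00000016714 transcript:ENSCAFT00000026465 "
    "gene_biotype:protein_coding transcript_biotype:protein_coding "
    "gene_symbol:TP53 description:placeholder"
)
WITHOUT_SYMBOL = (
    ">ENSCAFP_PLACEHOLDER pep chromosome:CanFam3.1:5:1:100:-1 "
    "gene:ENSCAFG00000016714 transcript:ENSCAFT00000026465 "
    "gene_biotype:protein_coding transcript_biotype:protein_coding"
)

if __name__ == "__main__":
    print(parse_gene_name(WITH_SYMBOL))
    print(parse_gene_name(WITHOUT_SYMBOL))
```

Searching for the symbol first means the result does not depend on where `gene:` and `gene_symbol:` appear in the header.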
1 parent c4e1745 commit 0df9025

6 files changed

Lines changed: 749 additions & 4 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -55,3 +55,5 @@ logs/
 models/*.gguf
 models/*.bin
 .venv/
+data/reference/*.fa
+data/reference/*.fa.gz

data/demo/canine_osteosarcoma.vcf

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+##fileformat=VCFv4.2
+##reference=CanFam3.1
+##source=DogNeo_demo_based_on_published_mutations
+##INFO=<ID=ANN,Number=.,Type=String,Description="SnpEff annotation: Allele|Annotation|Impact|Gene_Name|Gene_ID|Feature_Type|Feature_ID|Transcript_BioType|Rank|HGVS.c|HGVS.p|cDNA.pos|CDS.pos|AA.pos|Distance|ERRORS">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
+##FILTER=<ID=PASS,Description="All filters passed">
+##contig=<ID=5,length=91628006>
+##contig=<ID=16,length=62234451>
+##contig=<ID=26,length=40106637>
+##contig=<ID=27,length=45876710>
+##contig=<ID=34,length=42860381>
+##SAMPLE=<ID=CANINE_OSA_001,Description="Canine Osteosarcoma Sample - Golden Retriever, 8yr, appendicular OSA">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CANINE_OSA_001
+5 32564000 . C T . PASS DP=120;AF=0.35;ANN=T|missense_variant|MODERATE|TP53|ENSCAFG00000016714|transcript|ENSCAFT00000026465|protein_coding|7/10|c.817G>A|p.Glu273Lys|1042/2591|817/1146|273/381|| GT:DP:AF 0/1:120:0.35
+5 32563850 . C T . PASS DP=95;AF=0.28;ANN=T|missense_variant|MODERATE|TP53|ENSCAFG00000016714|transcript|ENSCAFT00000026465|protein_coding|8/10|c.844G>A|p.Glu282Lys|1069/2591|844/1146|282/381|| GT:DP:AF 0/1:95:0.28
+16 8295000 . A T . PASS DP=85;AF=0.42;ANN=T|missense_variant|MODERATE|BRAF|ENSCAFG00000003907|transcript|ENSCAFT00000006306|protein_coding|15/18|c.1799T>A|p.Val600Glu|2050/4571|1799/2286|600/761|| GT:DP:AF 0/1:85:0.42
+27 22265000 . C T . PASS DP=110;AF=0.32;ANN=T|missense_variant|MODERATE|KRAS|ENSCAFG00000011428|transcript|ENSCAFT00000010525|protein_coding|2/6|c.298G>A|p.Gly100Arg|438/1088|298/849|100/282|| GT:DP:AF 0/1:110:0.32
+34 12660000 . C T . PASS DP=78;AF=0.22;ANN=T|missense_variant|MODERATE|PIK3CA|ENSCAFG00000011212|transcript|ENSCAFT00000017863|protein_coding|10/21|c.1633G>A|p.Glu545Lys|1806/5067|1633/2937|545/978|| GT:DP:AF 0/1:78:0.22
+26 37872000 . T C . PASS DP=60;AF=0.18;ANN=C|missense_variant|MODERATE|PTEN|ENSCAFG00000015670|transcript|ENSCAFT00000024821|protein_coding|5/9|c.388A>G|p.Asp130Gly|512/4620|388/1116|130/371|| GT:DP:AF 0/1:60:0.18
+5 32565000 . G A . PASS DP=45;AF=0.08;ANN=A|synonymous_variant|LOW|TP53|ENSCAFG00000016714|transcript|ENSCAFT00000026465|protein_coding|3/10|c.99C>T|p.Pro33Pro|324/2591|99/1146|33/381|| GT:DP:AF 0/1:45:0.08
+34 12670000 . A G . rejected DP=8;AF=0.50;ANN=G|missense_variant|MODERATE|PIK3CA|ENSCAFG00000011212|transcript|ENSCAFT00000017863|protein_coding|5/21|c.100T>C|p.Gly34Ala|273/5067|100/2937|34/978|| GT:DP:AF 0/1:8:0.50
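The "8 VCF → 6 coding" step in the commit message is consistent with dropping the synonymous TP53 record and the record whose FILTER is `rejected`. The sketch below shows one such filter under that assumption; the actual criteria used by the pipeline are not shown in this commit. `parse_record`, `is_coding_candidate`, `to_line`, and the abbreviated `DEMO` table are hypothetical names introduced here, and the rebuilt lines carry only the ANN subfields the filter needs.

```python
from typing import List, NamedTuple

# Subfield order copied from the ##INFO=<ID=ANN,...> header of the demo VCF.
ANN_FIELDS = (
    "Allele|Annotation|Impact|Gene_Name|Gene_ID|Feature_Type|Feature_ID|"
    "Transcript_BioType|Rank|HGVS.c|HGVS.p|cDNA.pos|CDS.pos|AA.pos|Distance|ERRORS"
).split("|")


class Variant(NamedTuple):
    chrom: str
    pos: int
    filt: str
    ann: dict


def parse_record(line: str) -> Variant:
    """Parse one VCF data line, keeping only the first ANN entry."""
    cols = line.split()  # VCF columns are tab-delimited; split() handles that too
    chrom, pos, _id, _ref, _alt, _qual, filt, info = cols[:8]
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    first_ann = info_map.get("ANN", "").split(",")[0]
    return Variant(chrom, int(pos), filt, dict(zip(ANN_FIELDS, first_ann.split("|"))))


def is_coding_candidate(v: Variant) -> bool:
    # Assumed rule: passing filter AND a protein-changing (missense) annotation.
    return v.filt == "PASS" and v.ann.get("Annotation") == "missense_variant"


# Abbreviated copy of the eight demo records:
# (chrom, pos, ref, alt, filter, annotation, gene, HGVS.p)
DEMO = [
    ("5", 32564000, "C", "T", "PASS", "missense_variant", "TP53", "p.Glu273Lys"),
    ("5", 32563850, "C", "T", "PASS", "missense_variant", "TP53", "p.Glu282Lys"),
    ("16", 8295000, "A", "T", "PASS", "missense_variant", "BRAF", "p.Val600Glu"),
    ("27", 22265000, "C", "T", "PASS", "missense_variant", "KRAS", "p.Gly100Arg"),
    ("34", 12660000, "C", "T", "PASS", "missense_variant", "PIK3CA", "p.Glu545Lys"),
    ("26", 37872000, "T", "C", "PASS", "missense_variant", "PTEN", "p.Asp130Gly"),
    ("5", 32565000, "G", "A", "PASS", "synonymous_variant", "TP53", "p.Pro33Pro"),
    ("34", 12670000, "A", "G", "rejected", "missense_variant", "PIK3CA", "p.Gly34Ala"),
]


def to_line(chrom, pos, ref, alt, filt, annotation, gene, hgvs_p) -> str:
    """Rebuild a minimal VCF data line with a 16-subfield ANN value."""
    ann = [alt, annotation, "", gene] + [""] * 6 + [hgvs_p] + [""] * 5
    return f"{chrom} {pos} . {ref} {alt} . {filt} ANN={'|'.join(ann)}"


def coding_candidates(lines: List[str]) -> List[Variant]:
    records = [parse_record(l) for l in lines if l and not l.startswith("#")]
    return [v for v in records if is_coding_candidate(v)]


CODING = coding_candidates([to_line(*row) for row in DEMO])

if __name__ == "__main__":
    for v in CODING:
        print(v.chrom, v.pos, v.ann["Gene_Name"], v.ann["HGVS.p"])
```

Reading the subfield order from the header keeps the parser in step with the file if the annotation layout changes.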

data/demo/dla_alleles.txt

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# Real DLA-88 alleles from IPD-MHC database
+# Common alleles found in Golden Retrievers (osteosarcoma-prone breed)
+# Reference: Kennedy et al. (2007) Tissue Antigens; Venkataraman et al. (2007) Immunogenetics
+DLA-88*001:01
+DLA-88*002:01
+DLA-88*007:01
+DLA-88*005:01
+DLA-88*008:01
+DLA-88*013:01
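A loader for this allele list might look like the sketch below. The function name `load_alleles` and the validation pattern are assumptions: the pattern is inferred only from the six names in this file, and other names in the full IPD-MHC set may not fit it.

```python
import re
from typing import List

# Pattern inferred from the six demo entries only (assumption).
ALLELE_RE = re.compile(r"^DLA-88\*\d{3}:\d{2}$")

DEMO_ALLELES = """\
# Real DLA-88 alleles from IPD-MHC database
# Common alleles found in Golden Retrievers (osteosarcoma-prone breed)
DLA-88*001:01
DLA-88*002:01
DLA-88*007:01
DLA-88*005:01
DLA-88*008:01
DLA-88*013:01
"""


def load_alleles(text: str) -> List[str]:
    """Return allele names in file order, skipping blank and '#' comment lines."""
    alleles = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not ALLELE_RE.match(line):
            raise ValueError(f"line {lineno}: unexpected allele name {line!r}")
        alleles.append(line)
    return alleles


if __name__ == "__main__":
    print(load_alleles(DEMO_ALLELES))
```

Raising on a malformed name, rather than skipping it silently, surfaces typos in the allele list before any downstream prediction step runs.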

data/demo/expression.sf

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Name Length EffectiveLength TPM NumReads
+ENSCAFT00000026465 2591 2441 156.8 14230
+ENSCAFT00000006306 4571 4421 42.5 6980
+ENSCAFT00000010525 1088 938 189.3 6600
+ENSCAFT00000017863 5067 4917 28.7 5240
+ENSCAFT00000024821 4620 4470 15.2 2520
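The transcript IDs in this expression table match the Feature_ID values in the demo VCF's ANN field, so TPM can be looked up per mutated transcript. The sketch below assumes a whitespace-delimited table with the header shown above. `load_tpm` and the gene-to-transcript map are hypothetical names, with the map copied from the demo VCF annotations.

```python
from typing import Dict

DEMO_SF = """\
Name Length EffectiveLength TPM NumReads
ENSCAFT00000026465 2591 2441 156.8 14230
ENSCAFT00000006306 4571 4421 42.5 6980
ENSCAFT00000010525 1088 938 189.3 6600
ENSCAFT00000017863 5067 4917 28.7 5240
ENSCAFT00000024821 4620 4470 15.2 2520
"""

# Gene -> transcript, as annotated in the demo VCF (Gene_Name / Feature_ID).
GENE_TO_TRANSCRIPT = {
    "TP53": "ENSCAFT00000026465",
    "BRAF": "ENSCAFT00000006306",
    "KRAS": "ENSCAFT00000010525",
    "PIK3CA": "ENSCAFT00000017863",
    "PTEN": "ENSCAFT00000024821",
}


def load_tpm(text: str) -> Dict[str, float]:
    """Map transcript name to TPM, locating columns by header name."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_i, tpm_i = header.index("Name"), header.index("TPM")
    return {row[name_i]: float(row[tpm_i]) for row in body}


TPM = load_tpm(DEMO_SF)

# Genes ordered from highest to lowest expression of their annotated transcript.
GENES_BY_TPM = sorted(
    GENE_TO_TRANSCRIPT, key=lambda g: TPM[GENE_TO_TRANSCRIPT[g]], reverse=True
)

if __name__ == "__main__":
    for gene in GENES_BY_TPM:
        print(gene, TPM[GENE_TO_TRANSCRIPT[gene]])
```

Locating columns by header name, not by fixed position, keeps the loader working if the column order differs.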
