Running snpEff Mycobacterium_tuberculosis_h37rv filtered_variants.recode.vcf
produced ERROR_CHROMOSOME_NOT_FOUND for all variant records.
The VCF file (produced by aligning to GCF_000195955.2_ASM19595v2_genomic.fna)
uses the chromosome identifier NC_000962.3. The pre-built SnpEff database
Mycobacterium_tuberculosis_h37rv was built from a different reference that
uses a different chromosome ID, so the identifiers do not match.
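The mismatch described above can be checked before running the annotation. The following is a minimal sketch, not part of the original run: the function names vcf_chroms, fasta_ids and missing_chroms are hypothetical helpers, and it assumes an uncompressed VCF and FASTA. It lists the chromosome IDs used in the VCF records and reports any that are absent from a reference FASTA.

```shell
# List the chromosome IDs used in the variant records of a VCF (header lines skipped).
vcf_chroms() {
  grep -v '^#' "$1" | cut -f1 | sort -u
}

# List the sequence IDs in a FASTA file (text after '>' up to the first whitespace).
fasta_ids() {
  grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u
}

# Print VCF chromosome IDs that do not occur in the FASTA.
# Empty output means every VCF chromosome has a matching sequence.
missing_chroms() {
  ids=$(mktemp)
  fasta_ids "$2" > "$ids"
  vcf_chroms "$1" | grep -Fxv -f "$ids" || true
  rm -f "$ids"
}

# Example call (file names taken from the report):
#   missing_chroms filtered_variants.recode.vcf GCF_000195955.2_ASM19595v2_genomic.fna
```

For this VCF the expected chromosome ID is NC_000962.3, so any sequences.fa placed in a SnpEff genome directory should carry the same ID in its header line.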
A custom SnpEff database build was attempted using the matching .fna and .gff files:
- Added to snpEff.config: MTB.genome : Mycobacterium_tuberculosis_H37Rv
- Created directory: ~/snpeff_custom/data/MTB/ with sequences.fa and genes.gff
- Ran: snpEff build -gff3 -v MTB -dataDir ~/snpeff_custom/data
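The setup steps above can be sketched as a small helper. This is an illustration under assumptions, not the exact commands that were run: prepare_snpeff_genome is a hypothetical function, and the annotation file name and config path in the example call are placeholders that do not appear in the report.

```shell
# Prepare a SnpEff data directory for a custom genome.
# Arguments: genome name, description, reference FASTA, GFF3 annotation,
#            data directory, path to snpEff.config.
prepare_snpeff_genome() {
  name=$1; desc=$2; fasta=$3; gff=$4; data_dir=$5; config=$6
  mkdir -p "$data_dir/$name"
  # SnpEff looks for these fixed file names inside the genome directory.
  cp "$fasta" "$data_dir/$name/sequences.fa"
  cp "$gff" "$data_dir/$name/genes.gff"
  # Register the genome once; skip if the entry is already present.
  grep -q "^$name\.genome" "$config" 2>/dev/null ||
    printf '%s.genome : %s\n' "$name" "$desc" >> "$config"
}

# Example call (annotation.gff and path/to/snpEff.config are placeholders):
#   prepare_snpeff_genome MTB Mycobacterium_tuberculosis_H37Rv \
#     GCF_000195955.2_ASM19595v2_genomic.fna annotation.gff \
#     "$HOME/snpeff_custom/data" path/to/snpEff.config
#   snpEff build -gff3 -v MTB -dataDir ~/snpeff_custom/data
```

Copying the same .fna that was used for alignment keeps the chromosome ID in sequences.fa identical to the one in the VCF.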
SnpEff checks the database against additional validation files (cds.fa and
protein.fa) during the build. These files are separate from the GFF3 annotation
and are often missing or incorrectly formatted for bacterial genomes downloaded
from NCBI. The database build failed at the CDS/protein check step.
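One possible way past the failed check is to skip it for whichever validation file is absent. The sketch below assumes a SnpEff version whose build command accepts -noCheckCds and -noCheckProtein (documented for the 5.x series; confirm against the usage text of the installed version). snpeff_check_flags is a hypothetical helper that only prints the flags; it was not part of the original run, and a database built this way has not been validated against CDS or protein sequences.

```shell
# Print the build flags that skip the checks whose validation files are
# missing or empty in the given genome directory.
snpeff_check_flags() {
  dir=$1
  flags=""
  [ -s "$dir/cds.fa" ] || flags="$flags -noCheckCds"
  [ -s "$dir/protein.fa" ] || flags="$flags -noCheckProtein"
  # Strip the leading space before printing.
  printf '%s\n' "${flags# }"
}

# Example call (unquoted on purpose so the flags split into separate words):
#   snpEff build -gff3 -v $(snpeff_check_flags "$HOME/snpeff_custom/data/MTB") \
#     MTB -dataDir ~/snpeff_custom/data
```

If matching cds.fa and protein.fa files can be supplied instead, keeping the checks enabled is the safer choice, because they catch annotation and sequence mismatches at build time.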
Functional annotation was not completed. The filtered VCF (filtered_variants.recode.vcf)
with 3,525 high-quality variants (Phred Q ≥ 30) is available in results/task5/
and can be annotated with a correctly configured SnpEff database.