Pre-Midterm Project | BS-BI UG-1 | SINES, National University of Sciences & Technology
Author: Saleha Asim
This repository covers Tasks 5 and 6 of a genomics pipeline applied to the complete reference genome of Mycobacterium tuberculosis H37Rv (NCBI accession GCF_000195955.2 / NC_000962.3):
- Task 5 — Read alignment, variant calling, filtering, and functional annotation
- Task 6 — Ab initio gene prediction (AUGUSTUS) and orthology inference (REvolutionH-tl)
- Repository Structure
- How to Reproduce This Pipeline
- Task 5 — Variant Calling and Functional Annotation
- 5.1 Project Setup and Reference Files
- 5.2 Handling Paired-End Reads
- 5.3 Reference Genome Indexing
- 5.4 Read Alignment using BWA-MEM
- 5.5 Alignment Quality Assessment
- 5.6 Converting, Sorting, and Indexing the BAM File
- 5.7 Variant Calling using FreeBayes
- 5.8 Variant Filtering using VCFtools
- 5.9 Variant Annotation using SnpEff
- 5.10 Summary of Results
- Task 6 — Gene Prediction and Orthology Inference
M.tuberculosis-variant-calling-and-gene-prediction/
│
├── README.md
├── .gitignore
│
├── scripts/
│ ├── task5_variant_calling.sh # Full Task 5 pipeline (BWA → FreeBayes → VCFtools)
│ └── task6_gene_prediction.sh # Full Task 6 pipeline (AUGUSTUS → REvolutionH-tl)
│
├── results/
│ ├── task5/
│ │ ├── raw_variants.vcf # 3,646 unfiltered FreeBayes variant calls
│ │ ├── filtered_variants.recode.vcf # 3,525 variants passing Phred Q ≥ 30
│ │ ├── filtered_variants.log # VCFtools filtering summary
│ │ ├── snpEff_summary.html # SnpEff annotation report (see known issue)
│ │ └── snpEff_genes.txt # Per-gene variant counts
│ └── task6/
│ ├── augustus_output.aa # Predicted protein sequences (E. coli K-12 proxy)
│ ├── species_tree.phy # Input species tree in Newick format
│ ├── tl_project.orthogroups.tsv # Orthogroup assignments across 3 species
│ ├── tl_project.orthologs.tsv # Inferred ortholog pairs with bit scores
│ ├── tl_project.distances.tsv # Evolutionary distances between sequences
│ ├── tl_project.gene_trees.tsv # Per-orthogroup gene trees
│ ├── tl_project.best_hits.tsv # Best reciprocal DIAMOND hits
│ ├── tl_project.best_matches.tsv # Best match assignments
│ ├── tl_project.resolved_trees.tsv # Refined gene trees (polytomies resolved)
│ ├── tl_project.corrected_trees.tsv # Duplication-corrected trees
│ ├── tl_project.reconcilied_trees.tsv # Gene–species reconciled trees (S/D nodes)
│ ├── tl_project.labeled_species_tree.nhx
│ └── tl_project.species_tree.nhx # Inferred species tree from gene trees
│
└── docs/
└── snpeff_chromosome_error_notes.md # Documents the SnpEff annotation issue
Why does this differ from the original working directory? During analysis, all intermediate files (
.sam,.bam, reads, reference genome, protein FASTAs) lived in~/Documents/Genomics/mid_project/with subdirectoriesreads/,reference/,results/, andprotein_seqs/. Those large binary and raw data files are not tracked in this repository (see.gitignore). This repo contains only the scripts needed to reproduce the analysis and the final output files — reorganised by task for clarity.
Install all required tools. On Ubuntu/WSL:
# System tools
sudo apt update
sudo apt install bwa samtools seqtk
# Conda tools
conda install -c bioconda freebayes vcftools snpeff
# AUGUSTUS (apt version may be missing config files — clone from source instead)
git clone https://github.com/Gaius-Augustus/Augustus.git ~/Augustus
echo 'export AUGUSTUS_CONFIG_PATH=~/Augustus/config' >> ~/.bashrc
echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc
# REvolutionH-tl + DIAMOND (in a virtual environment)
python -m venv genomics_venv
source genomics_venv/bin/activate
pip install revolutionhtl diamondThese files are not in the repository due to size. Download them manually:
Reference genome — from NCBI RefSeq:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000195955.2/
Download the genomic FASTA. Rename the .fna file to mtb.fna.
Resequencing reads — from NCBI SRA:
fastq-dump --split-files ERR181435
# This produces ERR181435_1.fastq and ERR181435_2.fastq
# Merge them back to interleaved, or use both split files directlyAlternatively, download the interleaved ERR181435.fastq directly from:
https://www.ncbi.nlm.nih.gov/sra/ERR181435
Protein FASTAs for Task 6 — from NCBI Datasets:
https://www.ncbi.nlm.nih.gov/datasets/genome/
Download the Protein (FASTA) file for each assembly:
| Assembly | Species | Save as |
|---|---|---|
ASM19595v2 |
M. tuberculosis H37Rv | mtb.faa |
ASM19583v2 |
M. bovis AF2122/97 | mbovis.faa |
ASM76760v1 |
M. smegmatis MC2 155 | msmeg.faa |
The scripts expect this layout. Create it and place your downloaded files accordingly:
mkdir -p mid_project/reads
mkdir -p mid_project/reference
mkdir -p mid_project/protein_seqs
# Place files:
mv mtb.fna mid_project/reference/
mv ERR181435.fastq mid_project/reads/ # or the split _1/_2 files
mv mtb.faa mbovis.faa msmeg.faa mid_project/protein_seqs/cd mid_project/reads
bash ../../scripts/task5_variant_calling.shExpected outputs in results/task5/:
raw_variants.vcf— 3,646 variant callsfiltered_variants.recode.vcf— 3,525 high-quality variants (Phred Q ≥ 30)
⚠️ SnpEff annotation will produce a chromosome ID error. This is a known issue documented indocs/snpeff_chromosome_error_notes.md. The VCF usesNC_000962.3as the chromosome name but the pre-built SnpEff database expects a different identifier. The filtered VCF is still valid and usable.
cd mid_project
bash ../scripts/task6_gene_prediction.shExpected outputs in results/task6/:
augustus_output.aa— predicted protein sequences- All
tl_project.*files — orthogroups, gene trees, reconciled trees
This task covers aligning resequencing reads using BWA, performing variant calling using FreeBayes, filtering variants using VCFtools, and annotating variants using SnpEff to assess their potential functional effects.
A working directory was organised with three subdirectories:
mid_project/
├── reads/
├── reference/
└── results/
Reference genome: RefSeq assembly GCF_000195955.2 (renamed to mtb.fna). The GCF RefSeq dataset was selected over the raw AL123456.3 download because it is curated, standardised, and directly compatible with the annotation file (.gff), IGV, GATK, and SnpEff. Both files represent the same genome; the RefSeq assembly ensures coordinate consistency across all tools.
Resequencing reads: SRA accession ERR181435 — an Illumina Genome Analyzer IIx paired-end whole-genome sequencing run of M. tuberculosis H37Rv. This dataset was chosen because it is small (~150 MB), publicly available, and is a direct resequencing of the H37Rv reference strain.
Upon inspection, each read ID appeared twice in the file, indicating interleaved paired-end format (forward and reverse reads in the same file).
The interleaved file was split into separate forward and reverse files using seqtk:
seqtk seq -1 ERR181435.fastq > reads_1.fastq
seqtk seq -2 ERR181435.fastq > reads_2.fastqThe -1 flag extracts all forward reads and -2 extracts all reverse reads. Both commands completed successfully.
Before alignment, the reference genome was indexed. Indexing enables rapid and efficient searching during read mapping — without it, the aligner would need to scan the entire reference for every read.
bwa index mtb.fna
samtools faidx mtb.fna- BWA constructs an FM-index based on the Burrows–Wheeler Transform, allowing the aligner to locate seed matches in near-constant time.
- SAMtools
faidxcreates a companion.faiindex for fast random access to specific genomic regions.
The BWA indexing process produced five auxiliary files: .amb, .ann, .bwt, .pac, .sa. SAMtools produced the .fai index file.
Read alignment was performed using BWA-MEM, which uses a seed-and-extend strategy to balance speed and accuracy:
bwa mem ~/Documents/Genomics/mid_project/reference/mtb.fna \
reads_1.fastq reads_2.fastq > aligned.samBWA-MEM took the indexed reference and the two split FASTQ files as input, and produced a SAM (Sequence Alignment/Map) file. Total alignment time: 699.463 seconds.
Alignment statistics were checked using samtools flagstat:
samtools flagstat aligned.sam| Metric | Value | Interpretation |
|---|---|---|
| Total reads | 19,330,668 | All QC-passed and QC-failed reads combined |
| Primary mapped | 18,606,932 (96.39%) | Reads successfully aligned to the reference |
| Properly paired | 18,440,658 (95.52%) | Both mates mapped in expected orientation |
| Singletons | 47,136 (0.24%) | One mate mapped, the other did not |
| Reads mapped to different chr | 0 | No multi-chromosome contamination |
96.39% of reads mapped successfully, indicating high-quality alignment. The very low singleton rate (~0.24%) and complete absence of inter-chromosomal mappings confirm minimal alignment errors.
The SAM output was converted to BAM (binary) format, sorted by genomic coordinate, and indexed. Downstream variant callers require a sorted, indexed BAM.
samtools view -S -b aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned_sorted.bam
samtools index aligned_sorted.bamview -S -b— converts SAM to the more compact BAM formatsort— arranges reads by genomic coordinate (required by most variant callers)index— creates a.bailookup table for rapid retrieval of reads from any genomic region
Variant calling was performed using FreeBayes, a Bayesian haplotype-based variant caller that detects SNPs and small indels without requiring a duplicate-marking step.
conda install -c bioconda freebayes
freebayes -f ~/Documents/Genomics/mid_project/reference/mtb.fna \
aligned_sorted.bam > raw_variants.vcfThe -f flag specifies the reference genome. Output was written to raw_variants.vcf.
Raw variants were filtered with VCFtools to remove low-confidence calls:
vcftools --vcf raw_variants.vcf \
--minQ 30 \
--recode \
--out filtered_variants| Parameter | Purpose |
|---|---|
--vcf |
Input VCF file |
--minQ 30 |
Retain only variants with Phred-scaled quality ≥ 30 (error probability ≤ 1 in 1,000) |
--recode |
Write filtered variants to a new VCF |
--out |
Output file prefix |
Result: 3,525 of 3,646 raw variants passed quality filters — a pass rate of 96.7%. Filtered output: filtered_variants.recode.vcf.
SnpEff was used to annotate the filtered variants with predicted functional effects:
conda install -c bioconda snpeff
snpEff Mycobacterium_tuberculosis_h37rv \
filtered_variants.recode.vcf > annotated.vcfThis produced three output files: annotated.vcf, snpEff_summary.html, and snpEff_genes.txt.
Upon inspection, all variant records contained an ERROR_CHROMOSOME_NOT_FOUND warning. The pre-built SnpEff database for Mycobacterium_tuberculosis_h37rv was not built from the same reference used for alignment — the VCF contained chromosome ID NC_000962.3 while the SnpEff database expected a different identifier.
A custom SnpEff database was constructed using the matching FASTA and GFF files:
-
Edit the SnpEff config — added a new genome entry:
MTB.genome : Mycobacterium_tuberculosis_H37Rv -
Create the directory structure — placed
sequences.fa(renamed frommtb.fna) andgenes.gffin~/snpeff_custom/data/MTB/ -
Build the custom database:
snpEff build -gff3 -v MTB -dataDir ~/snpeff_custom/data
The build process failed because SnpEff requires additional validation files — cds.fa and protein.fa — which are often absent or incompletely formatted in bacterial GFF annotations. As a result, the database was not fully indexed and annotation produced no usable output.
| Output File | Contents | Status |
|---|---|---|
aligned.sam |
Raw alignment output from BWA-MEM | ✅ Completed |
aligned_sorted.bam |
Sorted and indexed BAM file | ✅ Completed |
raw_variants.vcf |
3,646 unfiltered variant calls from FreeBayes | ✅ Completed |
filtered_variants.recode.vcf |
3,525 high-quality variants (Phred Q ≥ 30) | ✅ Completed |
annotated.vcf |
SnpEff functional annotation | ❌ Error: chromosome ID mismatch |
Note: I intend to fix the annotation error soon.
This task involves predicting gene structures using AUGUSTUS, followed by orthogroup identification, gene tree reconstruction, and orthology inference using REvolutionH-tl to analyse gene evolution across species.
Pipeline overview:
Genome (FASTA) → AUGUSTUS → Predicted genes (CDS / proteins) → REvolutionH-tl
Note: Because a proxy species was used for ab initio gene prediction, the AUGUSTUS output was not passed to REvolutionH-tl. Instead, protein FASTA (
.faa) files for M. tuberculosis, M. bovis, and M. smegmatis were downloaded directly from NCBI.
The AUGUSTUS web server was unavailable at the time of analysis (server busy), so the command-line interface was used instead.
sudo apt update
sudo apt install augustus augustus-data
# Verify installation
augustus --helpWorking directory: ~/Documents/Genomics/mid_project/reference
No pre-trained AUGUSTUS model exists for Mycobacterium tuberculosis, so the E. coli K-12 model was used as a proxy under an intronless gene model. This is appropriate because both organisms are bacteria with similar gene structures (no introns) and comparable codon usage patterns.
augustus --species=E_coli_K12 --genemodel=intronless mtb.fna > augustus_output.gffThe AUGUSTUS scripts directory was added to the system PATH, then protein sequences were extracted using getAnnoFasta.pl:
echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc
getAnnoFasta.pl augustus_output.gffThis produced augustus_output.aa — the protein sequences file used as input for Part 2.
Protein FASTA files were downloaded from NCBI Datasets for three species:
| Assembly | Species |
|---|---|
ASM19595v2 |
Mycobacterium tuberculosis H37Rv |
ASM76760v1 |
Mycolicibacterium smegmatis MC2 155 |
ASM19583v2 |
Mycobacterium tuberculosis variant bovis AF2122/97 |
protein_seqs/
├── mtb.faa
├── mbovis.faa
└── msmeg.faa
Raw FASTA headers contained spaces, long descriptions, and unstructured species names, which can break parsing, orthogroup labelling, and species assignment in REvolutionH-tl. Headers were cleaned and species prefixes prepended using sed:
sed 's/^>\([^ ]*\).*/>&mtb_\1/' mtb.faa > mtb_clean.faa
sed 's/^>\([^ ]*\).*/>&mbovis_\1/' mbovis.faa > mbovis_clean.faa
sed 's/^>\([^ ]*\).*/>&msmeg_\1/' msmeg.faa > msmeg_clean.faamkdir fasta_files
mv *_clean.faa fasta_files/A species tree file (species_tree.phy) was created in Newick format:
((mtb,mbovis),msmeg);
This represents the topology:
┌─ mtb
┌───┤
│ └─ mbovis
────┤
└──── msmeg
source ~/Documents/Genomics/lab4/genomics_venv/bin/activatepython -m revolutionhtl \
-F fasta_files \
-S species_tree.phy \
-aligner diamondNote: An initial error occurred because REvolutionH-tl expected
.fafile extensions while the input files had.faaextensions. This was resolved by renaming the files:mv mtb_clean.faa mtb.fa mv mbovis_clean.faa mbovis.fa mv msmeg_clean.faa msmeg.fa
After renaming, the pipeline completed successfully: REvolutionH-tl finished all the tasks without any problem.
The pipeline executed six internal steps:
Step 1 — Pairwise Sequence Similarity
All three proteomes were compared using pairwise sequence similarity. Six alignments were computed (all pairwise combinations among mtb, mbovis, msmeg, and their reciprocals). Results written to tl_project.alignment_all_vs_all/.
Step 2 — Singleton and Orthogroup Identification
3,338 singletons were identified — genes with no detectable homologs in the other species, reflecting species-specific genes, highly diverged sequences, or annotation differences. Remaining genes were grouped into orthogroups (genes descended from a common ancestor), written to tl_project.orthogroups.tsv.
Example orthogroups:
| OG | n_genes | n_species | mbovis | msmeg | mtb |
|---|---|---|---|---|---|
| OG0 | 2 | 2 | mbovis_WP_003900015.1 | — | mtb_YP_009030045.1 |
| OG1 | 2 | 2 | mbovis_WP_003400850.1 | — | mtb_YP_009030027.1 |
| OG8 | 2 | 2 | mbovis_WP_003402602.1 | msmeg_WP_003402602.1 | — |
Step 3 — Gene Tree Construction
3,306 orthogroups contained enough homologous sequences for phylogenetic analysis. A gene tree was constructed per orthogroup (tl_project.gene_trees.tsv), representing evolutionary relationships among gene copies. Orthologs (genes diverged through speciation, not duplication) were written to tl_project.orthologs.tsv.
Step 4 — Gene Tree Refinement
Ambiguities such as polytomies were resolved and duplication inference was improved using evolutionary distances. Output: tl_project.resolved_trees.tsv.
Step 5 — Species Tree Reconstruction
A species tree (tl_project.species_tree.nhx) was reconstructed from the collective signal of all gene trees:
(msmeg,(mbovis,mtb));
This topology matches known biology, serving as a validation of the analysis.
Step 6 — Gene–Species Tree Reconciliation
Individual gene trees were reconciled against the species tree to infer evolutionary events (speciation, duplication, gene loss). Internal nodes in the output (tl_project.reconcilied_trees.tsv) are annotated as S (speciation) or D (duplication), providing a comprehensive view of the evolutionary history of each gene family across the three Mycobacterium species.
| File | Description |
|---|---|
tl_project.alignment_all_vs_all/ |
Pairwise alignment hits |
tl_project.best_hits.tsv |
Best reciprocal hits per gene pair |
tl_project.orthogroups.tsv |
Orthogroup assignments |
tl_project.orthologs.tsv |
Inferred ortholog pairs with normalised bit scores |
tl_project.distances.tsv |
Evolutionary distances between sequences |
tl_project.gene_trees.tsv |
Gene trees per orthogroup |
tl_project.resolved_trees.tsv |
Refined, fully resolved gene trees |
tl_project.species_tree.nhx |
Reconstructed species tree |
tl_project.reconcilied_trees.tsv |
Reconciled gene–species trees (S/D annotations) |
Analysis performed: March 2026 | Reference genome: NC_000962.3 / GCF_000195955.2