Mycobacterium tuberculosis H37Rv — Genomics Pipeline

Pre-Midterm Project | BS-BI UG-1 | SINES, National University of Sciences & Technology
Author: Saleha Asim

This repository covers Tasks 5 and 6 of a genomics pipeline applied to the complete reference genome of Mycobacterium tuberculosis H37Rv (NCBI accession GCF_000195955.2 / NC_000962.3):

Task 5 — Read alignment, variant calling, filtering, and functional annotation
Task 6 — Ab initio gene prediction (AUGUSTUS) and orthology inference (REvolutionH-tl)

Repository Structure

M.tuberculosis-variant-calling-and-gene-prediction/
│
├── README.md
├── .gitignore
│
├── scripts/
│   ├── task5_variant_calling.sh       # Full Task 5 pipeline (BWA → FreeBayes → VCFtools)
│   └── task6_gene_prediction.sh       # Full Task 6 pipeline (AUGUSTUS → REvolutionH-tl)
│
├── results/
│   ├── task5/
│   │   ├── raw_variants.vcf                  # 3,646 unfiltered FreeBayes variant calls
│   │   ├── filtered_variants.recode.vcf      # 3,525 variants passing Phred Q ≥ 30
│   │   ├── filtered_variants.log             # VCFtools filtering summary
│   │   ├── snpEff_summary.html               # SnpEff annotation report (see known issue)
│   │   └── snpEff_genes.txt                  # Per-gene variant counts
│   └── task6/
│       ├── augustus_output.aa                # Predicted protein sequences (E. coli K-12 proxy)
│       ├── species_tree.phy                  # Input species tree in Newick format
│       ├── tl_project.orthogroups.tsv        # Orthogroup assignments across 3 species
│       ├── tl_project.orthologs.tsv          # Inferred ortholog pairs with bit scores
│       ├── tl_project.distances.tsv          # Evolutionary distances between sequences
│       ├── tl_project.gene_trees.tsv         # Per-orthogroup gene trees
│       ├── tl_project.best_hits.tsv          # Best reciprocal DIAMOND hits
│       ├── tl_project.best_matches.tsv       # Best match assignments
│       ├── tl_project.resolved_trees.tsv     # Refined gene trees (polytomies resolved)
│       ├── tl_project.corrected_trees.tsv    # Duplication-corrected trees
│       ├── tl_project.reconcilied_trees.tsv  # Gene–species reconciled trees (S/D nodes)
│       ├── tl_project.labeled_species_tree.nhx
│       └── tl_project.species_tree.nhx       # Inferred species tree from gene trees
│
└── docs/
    └── snpeff_chromosome_error_notes.md      # Documents the SnpEff annotation issue

Why does this differ from the original working directory? During analysis, all intermediate files (.sam, .bam, reads, reference genome, protein FASTAs) lived in ~/Documents/Genomics/mid_project/ with subdirectories reads/, reference/, results/, and protein_seqs/. Those large binary and raw data files are not tracked in this repository (see .gitignore). This repo contains only the scripts needed to reproduce the analysis and the final output files — reorganised by task for clarity.

How to Reproduce This Pipeline

1. Prerequisites

Install all required tools. On Ubuntu/WSL:

# System tools
sudo apt update
sudo apt install bwa samtools seqtk

# Conda tools
conda install -c bioconda freebayes vcftools snpeff

# AUGUSTUS (apt version may be missing config files — clone from source instead)
git clone https://github.com/Gaius-Augustus/Augustus.git ~/Augustus
echo 'export AUGUSTUS_CONFIG_PATH=~/Augustus/config' >> ~/.bashrc
echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc

# REvolutionH-tl + DIAMOND (in a virtual environment)
python -m venv genomics_venv
source genomics_venv/bin/activate
pip install revolutionhtl diamond

2. Download Raw Data

These files are not in the repository due to size. Download them manually:

Reference genome — from NCBI RefSeq:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000195955.2/

Download the genomic FASTA. Rename the .fna file to mtb.fna.

Resequencing reads — from NCBI SRA:

fastq-dump --split-files ERR181435
# This produces ERR181435_1.fastq and ERR181435_2.fastq
# Merge them back to interleaved, or use both split files directly

Alternatively, download the interleaved ERR181435.fastq directly from:

https://www.ncbi.nlm.nih.gov/sra/ERR181435

Protein FASTAs for Task 6 — from NCBI Datasets:

https://www.ncbi.nlm.nih.gov/datasets/genome/

Download the Protein (FASTA) file for each assembly:

Assembly	Species	Save as
`ASM19595v2`	M. tuberculosis H37Rv	`mtb.faa`
`ASM19583v2`	M. bovis AF2122/97	`mbovis.faa`
`ASM76760v1`	M. smegmatis MC2 155	`msmeg.faa`

3. Set Up Working Directories

The scripts expect this layout. Create it and place your downloaded files accordingly:

mkdir -p mid_project/reads
mkdir -p mid_project/reference
mkdir -p mid_project/protein_seqs

# Place files:
mv mtb.fna              mid_project/reference/
mv ERR181435.fastq      mid_project/reads/       # or the split _1/_2 files
mv mtb.faa mbovis.faa msmeg.faa  mid_project/protein_seqs/

4. Run Task 5

cd mid_project/reads
bash ../../scripts/task5_variant_calling.sh

Expected outputs in results/task5/:

raw_variants.vcf — 3,646 variant calls
filtered_variants.recode.vcf — 3,525 high-quality variants (Phred Q ≥ 30)

⚠️ SnpEff annotation will produce a chromosome ID error. This is a known issue documented in docs/snpeff_chromosome_error_notes.md. The VCF uses NC_000962.3 as the chromosome name but the pre-built SnpEff database expects a different identifier. The filtered VCF is still valid and usable.

5. Run Task 6

cd mid_project
bash ../scripts/task6_gene_prediction.sh

Expected outputs in results/task6/:

augustus_output.aa — predicted protein sequences
All tl_project.* files — orthogroups, gene trees, reconciled trees

Task 5 — Variant Calling and Functional Annotation

This task covers aligning resequencing reads using BWA, performing variant calling using FreeBayes, filtering variants using VCFtools, and annotating variants using SnpEff to assess their potential functional effects.

5.1 Project Setup and Reference Files

A working directory was organised with three subdirectories:

mid_project/
├── reads/
├── reference/
└── results/

Reference genome: RefSeq assembly GCF_000195955.2 (renamed to mtb.fna). The GCF RefSeq dataset was selected over the raw AL123456.3 download because it is curated, standardised, and directly compatible with the annotation file (.gff), IGV, GATK, and SnpEff. Both files represent the same genome; the RefSeq assembly ensures coordinate consistency across all tools.

Resequencing reads: SRA accession ERR181435 — an Illumina Genome Analyzer IIx paired-end whole-genome sequencing run of M. tuberculosis H37Rv. This dataset was chosen because it is small (~150 MB), publicly available, and is a direct resequencing of the H37Rv reference strain.

Upon inspection, each read ID appeared twice in the file, indicating interleaved paired-end format (forward and reverse reads in the same file).

5.2 Handling Paired-End Reads

The interleaved file was split into separate forward and reverse files using seqtk:

seqtk seq -1 ERR181435.fastq > reads_1.fastq
seqtk seq -2 ERR181435.fastq > reads_2.fastq

The -1 flag extracts all forward reads and -2 extracts all reverse reads. Both commands completed successfully.

5.3 Reference Genome Indexing

Before alignment, the reference genome was indexed. Indexing enables rapid and efficient searching during read mapping — without it, the aligner would need to scan the entire reference for every read.

bwa index mtb.fna
samtools faidx mtb.fna

BWA constructs an FM-index based on the Burrows–Wheeler Transform, allowing the aligner to locate seed matches in near-constant time.
SAMtools faidx creates a companion .fai index for fast random access to specific genomic regions.

The BWA indexing process produced five auxiliary files: .amb, .ann, .bwt, .pac, .sa. SAMtools produced the .fai index file.

5.4 Read Alignment using BWA-MEM

Read alignment was performed using BWA-MEM, which uses a seed-and-extend strategy to balance speed and accuracy:

bwa mem ~/Documents/Genomics/mid_project/reference/mtb.fna \
    reads_1.fastq reads_2.fastq > aligned.sam

BWA-MEM took the indexed reference and the two split FASTQ files as input, and produced a SAM (Sequence Alignment/Map) file. Total alignment time: 699.463 seconds.

5.5 Alignment Quality Assessment

Alignment statistics were checked using samtools flagstat:

samtools flagstat aligned.sam

Metric	Value	Interpretation
Total reads	19,330,668	All QC-passed and QC-failed reads combined
Primary mapped	18,606,932 (96.39%)	Reads successfully aligned to the reference
Properly paired	18,440,658 (95.52%)	Both mates mapped in expected orientation
Singletons	47,136 (0.24%)	One mate mapped, the other did not
Reads mapped to different chr	0	No multi-chromosome contamination

96.39% of reads mapped successfully, indicating high-quality alignment. The very low singleton rate (~0.24%) and complete absence of inter-chromosomal mappings confirm minimal alignment errors.

5.6 Converting, Sorting, and Indexing the BAM File

The SAM output was converted to BAM (binary) format, sorted by genomic coordinate, and indexed. Downstream variant callers require a sorted, indexed BAM.

samtools view -S -b aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned_sorted.bam
samtools index aligned_sorted.bam

view -S -b — converts SAM to the more compact BAM format
sort — arranges reads by genomic coordinate (required by most variant callers)
index — creates a .bai lookup table for rapid retrieval of reads from any genomic region

5.7 Variant Calling using FreeBayes

Variant calling was performed using FreeBayes, a Bayesian haplotype-based variant caller that detects SNPs and small indels without requiring a duplicate-marking step.

conda install -c bioconda freebayes

freebayes -f ~/Documents/Genomics/mid_project/reference/mtb.fna \
    aligned_sorted.bam > raw_variants.vcf

The -f flag specifies the reference genome. Output was written to raw_variants.vcf.

5.8 Variant Filtering using VCFtools

Raw variants were filtered with VCFtools to remove low-confidence calls:

vcftools --vcf raw_variants.vcf \
    --minQ 30 \
    --recode \
    --out filtered_variants

Parameter	Purpose
`--vcf`	Input VCF file
`--minQ 30`	Retain only variants with Phred-scaled quality ≥ 30 (error probability ≤ 1 in 1,000)
`--recode`	Write filtered variants to a new VCF
`--out`	Output file prefix

Result: 3,525 of 3,646 raw variants passed quality filters — a pass rate of 96.7%. Filtered output: filtered_variants.recode.vcf.

5.9 Variant Annotation using SnpEff

SnpEff was used to annotate the filtered variants with predicted functional effects:

conda install -c bioconda snpeff

snpEff Mycobacterium_tuberculosis_h37rv \
    filtered_variants.recode.vcf > annotated.vcf

This produced three output files: annotated.vcf, snpEff_summary.html, and snpEff_genes.txt.

⚠️ Chromosome Mapping Error

Upon inspection, all variant records contained an ERROR_CHROMOSOME_NOT_FOUND warning. The pre-built SnpEff database for Mycobacterium_tuberculosis_h37rv was not built from the same reference used for alignment — the VCF contained chromosome ID NC_000962.3 while the SnpEff database expected a different identifier.

Attempted Fix: Custom SnpEff Database

A custom SnpEff database was constructed using the matching FASTA and GFF files:

Edit the SnpEff config — added a new genome entry:
```
MTB.genome : Mycobacterium_tuberculosis_H37Rv
```
Create the directory structure — placed sequences.fa (renamed from mtb.fna) and genes.gff in ~/snpeff_custom/data/MTB/

Build the custom database:

snpEff build -gff3 -v MTB -dataDir ~/snpeff_custom/data

The build process failed because SnpEff requires additional validation files — cds.fa and protein.fa — which are often absent or incompletely formatted in bacterial GFF annotations. As a result, the database was not fully indexed and annotation produced no usable output.

5.10 Summary of Results

Output File	Contents	Status
`aligned.sam`	Raw alignment output from BWA-MEM	✅ Completed
`aligned_sorted.bam`	Sorted and indexed BAM file	✅ Completed
`raw_variants.vcf`	3,646 unfiltered variant calls from FreeBayes	✅ Completed
`filtered_variants.recode.vcf`	3,525 high-quality variants (Phred Q ≥ 30)	✅ Completed
`annotated.vcf`	SnpEff functional annotation	❌ Error: chromosome ID mismatch

Note: I intend to fix the annotation error soon.

Task 6 — Gene Prediction and Orthology Inference

This task involves predicting gene structures using AUGUSTUS, followed by orthogroup identification, gene tree reconstruction, and orthology inference using REvolutionH-tl to analyse gene evolution across species.

Pipeline overview:

Genome (FASTA) → AUGUSTUS → Predicted genes (CDS / proteins) → REvolutionH-tl

Note: Because a proxy species was used for ab initio gene prediction, the AUGUSTUS output was not passed to REvolutionH-tl. Instead, protein FASTA (.faa) files for M. tuberculosis, M. bovis, and M. smegmatis were downloaded directly from NCBI.

6.1 Gene Prediction with AUGUSTUS

The AUGUSTUS web server was unavailable at the time of analysis (server busy), so the command-line interface was used instead.

Installation

sudo apt update
sudo apt install augustus augustus-data

# Verify installation
augustus --help

Running AUGUSTUS

Working directory: ~/Documents/Genomics/mid_project/reference

No pre-trained AUGUSTUS model exists for Mycobacterium tuberculosis, so the E. coli K-12 model was used as a proxy under an intronless gene model. This is appropriate because both organisms are bacteria with similar gene structures (no introns) and comparable codon usage patterns.

augustus --species=E_coli_K12 --genemodel=intronless mtb.fna > augustus_output.gff

Extracting Protein Sequences

The AUGUSTUS scripts directory was added to the system PATH, then protein sequences were extracted using getAnnoFasta.pl:

echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc

getAnnoFasta.pl augustus_output.gff

This produced augustus_output.aa — the protein sequences file used as input for Part 2.

6.2 Orthology Inference with REvolutionH-tl

Step 1 — Download Protein (.faa) Files from NCBI

Protein FASTA files were downloaded from NCBI Datasets for three species:

Assembly	Species
`ASM19595v2`	Mycobacterium tuberculosis H37Rv
`ASM76760v1`	Mycolicibacterium smegmatis MC2 155
`ASM19583v2`	Mycobacterium tuberculosis variant bovis AF2122/97

protein_seqs/
├── mtb.faa
├── mbovis.faa
└── msmeg.faa

Step 2 — Clean FASTA Headers

Raw FASTA headers contained spaces, long descriptions, and unstructured species names, which can break parsing, orthogroup labelling, and species assignment in REvolutionH-tl. Headers were cleaned and species prefixes prepended using sed:

sed 's/^>\([^ ]*\).*/>&mtb_\1/'    mtb.faa    > mtb_clean.faa
sed 's/^>\([^ ]*\).*/>&mbovis_\1/' mbovis.faa > mbovis_clean.faa
sed 's/^>\([^ ]*\).*/>&msmeg_\1/'  msmeg.faa  > msmeg_clean.faa

Step 3 — Organise Cleaned Files

mkdir fasta_files
mv *_clean.faa fasta_files/

Step 4 — Create the Species Tree File

A species tree file (species_tree.phy) was created in Newick format:

((mtb,mbovis),msmeg);

This represents the topology:

        ┌─ mtb
    ┌───┤
    │   └─ mbovis
────┤
    └──── msmeg

Step 5 — Activate the Conda Environment

source ~/Documents/Genomics/lab4/genomics_venv/bin/activate

Step 6 — Run REvolutionH-tl

python -m revolutionhtl \
    -F fasta_files \
    -S species_tree.phy \
    -aligner diamond

Note: An initial error occurred because REvolutionH-tl expected .fa file extensions while the input files had .faa extensions. This was resolved by renaming the files:
mv mtb_clean.faa   mtb.fa
mv mbovis_clean.faa mbovis.fa
mv msmeg_clean.faa msmeg.fa

After renaming, the pipeline completed successfully: REvolutionH-tl finished all the tasks without any problem.

Step 7 — REvolutionH-tl Pipeline Breakdown

The pipeline executed six internal steps:

Step 1 — Pairwise Sequence Similarity
All three proteomes were compared using pairwise sequence similarity. Six alignments were computed (all pairwise combinations among mtb, mbovis, msmeg, and their reciprocals). Results written to tl_project.alignment_all_vs_all/.

Step 2 — Singleton and Orthogroup Identification
3,338 singletons were identified — genes with no detectable homologs in the other species, reflecting species-specific genes, highly diverged sequences, or annotation differences. Remaining genes were grouped into orthogroups (genes descended from a common ancestor), written to tl_project.orthogroups.tsv.

Example orthogroups:

OG	n_genes	n_species	mbovis	msmeg	mtb
OG0	2	2	mbovis_WP_003900015.1	—	mtb_YP_009030045.1
OG1	2	2	mbovis_WP_003400850.1	—	mtb_YP_009030027.1
OG8	2	2	mbovis_WP_003402602.1	msmeg_WP_003402602.1	—

Step 3 — Gene Tree Construction
3,306 orthogroups contained enough homologous sequences for phylogenetic analysis. A gene tree was constructed per orthogroup (tl_project.gene_trees.tsv), representing evolutionary relationships among gene copies. Orthologs (genes diverged through speciation, not duplication) were written to tl_project.orthologs.tsv.

Step 4 — Gene Tree Refinement
Ambiguities such as polytomies were resolved and duplication inference was improved using evolutionary distances. Output: tl_project.resolved_trees.tsv.

Step 5 — Species Tree Reconstruction
A species tree (tl_project.species_tree.nhx) was reconstructed from the collective signal of all gene trees:

(msmeg,(mbovis,mtb));

This topology matches known biology, serving as a validation of the analysis.

Step 6 — Gene–Species Tree Reconciliation
Individual gene trees were reconciled against the species tree to infer evolutionary events (speciation, duplication, gene loss). Internal nodes in the output (tl_project.reconcilied_trees.tsv) are annotated as S (speciation) or D (duplication), providing a comprehensive view of the evolutionary history of each gene family across the three Mycobacterium species.

Output Files

File	Description
`tl_project.alignment_all_vs_all/`	Pairwise alignment hits
`tl_project.best_hits.tsv`	Best reciprocal hits per gene pair
`tl_project.orthogroups.tsv`	Orthogroup assignments
`tl_project.orthologs.tsv`	Inferred ortholog pairs with normalised bit scores
`tl_project.distances.tsv`	Evolutionary distances between sequences
`tl_project.gene_trees.tsv`	Gene trees per orthogroup
`tl_project.resolved_trees.tsv`	Refined, fully resolved gene trees
`tl_project.species_tree.nhx`	Reconstructed species tree
`tl_project.reconcilied_trees.tsv`	Reconciled gene–species trees (S/D annotations)

Analysis performed: March 2026 | Reference genome: NC_000962.3 / GCF_000195955.2

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
docs		docs
results		results
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Mycobacterium tuberculosis H37Rv — Genomics Pipeline

Table of Contents

Repository Structure

How to Reproduce This Pipeline

1. Prerequisites

2. Download Raw Data

3. Set Up Working Directories

4. Run Task 5

5. Run Task 6

Task 5 — Variant Calling and Functional Annotation

5.1 Project Setup and Reference Files

5.2 Handling Paired-End Reads

5.3 Reference Genome Indexing

5.4 Read Alignment using BWA-MEM

5.5 Alignment Quality Assessment

5.6 Converting, Sorting, and Indexing the BAM File

5.7 Variant Calling using FreeBayes

5.8 Variant Filtering using VCFtools

5.9 Variant Annotation using SnpEff

⚠️ Chromosome Mapping Error

Attempted Fix: Custom SnpEff Database

5.10 Summary of Results

Task 6 — Gene Prediction and Orthology Inference

6.1 Gene Prediction with AUGUSTUS

Installation

Running AUGUSTUS

Extracting Protein Sequences

6.2 Orthology Inference with REvolutionH-tl

Step 1 — Download Protein (.faa) Files from NCBI

Step 2 — Clean FASTA Headers

Step 3 — Organise Cleaned Files

Step 4 — Create the Species Tree File

Step 5 — Activate the Conda Environment

Step 6 — Run REvolutionH-tl

Step 7 — REvolutionH-tl Pipeline Breakdown

Output Files

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages