Skip to content

saleha-zip/M.tuberculosis-variant-calling-and-gene-prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mycobacterium tuberculosis H37Rv — Genomics Pipeline

Pre-Midterm Project | BS-BI UG-1 | SINES, National University of Sciences & Technology
Author: Saleha Asim

This repository covers Tasks 5 and 6 of a genomics pipeline applied to the complete reference genome of Mycobacterium tuberculosis H37Rv (NCBI accession GCF_000195955.2 / NC_000962.3):

  • Task 5 — Read alignment, variant calling, filtering, and functional annotation
  • Task 6 — Ab initio gene prediction (AUGUSTUS) and orthology inference (REvolutionH-tl)

Table of Contents


Repository Structure

M.tuberculosis-variant-calling-and-gene-prediction/
│
├── README.md
├── .gitignore
│
├── scripts/
│   ├── task5_variant_calling.sh       # Full Task 5 pipeline (BWA → FreeBayes → VCFtools)
│   └── task6_gene_prediction.sh       # Full Task 6 pipeline (AUGUSTUS → REvolutionH-tl)
│
├── results/
│   ├── task5/
│   │   ├── raw_variants.vcf                  # 3,646 unfiltered FreeBayes variant calls
│   │   ├── filtered_variants.recode.vcf      # 3,525 variants passing Phred Q ≥ 30
│   │   ├── filtered_variants.log             # VCFtools filtering summary
│   │   ├── snpEff_summary.html               # SnpEff annotation report (see known issue)
│   │   └── snpEff_genes.txt                  # Per-gene variant counts
│   └── task6/
│       ├── augustus_output.aa                # Predicted protein sequences (E. coli K-12 proxy)
│       ├── species_tree.phy                  # Input species tree in Newick format
│       ├── tl_project.orthogroups.tsv        # Orthogroup assignments across 3 species
│       ├── tl_project.orthologs.tsv          # Inferred ortholog pairs with bit scores
│       ├── tl_project.distances.tsv          # Evolutionary distances between sequences
│       ├── tl_project.gene_trees.tsv         # Per-orthogroup gene trees
│       ├── tl_project.best_hits.tsv          # Best reciprocal DIAMOND hits
│       ├── tl_project.best_matches.tsv       # Best match assignments
│       ├── tl_project.resolved_trees.tsv     # Refined gene trees (polytomies resolved)
│       ├── tl_project.corrected_trees.tsv    # Duplication-corrected trees
│       ├── tl_project.reconcilied_trees.tsv  # Gene–species reconciled trees (S/D nodes)
│       ├── tl_project.labeled_species_tree.nhx
│       └── tl_project.species_tree.nhx       # Inferred species tree from gene trees
│
└── docs/
    └── snpeff_chromosome_error_notes.md      # Documents the SnpEff annotation issue

Why does this differ from the original working directory? During analysis, all intermediate files (.sam, .bam, reads, reference genome, protein FASTAs) lived in ~/Documents/Genomics/mid_project/ with subdirectories reads/, reference/, results/, and protein_seqs/. Those large binary and raw data files are not tracked in this repository (see .gitignore). This repo contains only the scripts needed to reproduce the analysis and the final output files — reorganised by task for clarity.


How to Reproduce This Pipeline

1. Prerequisites

Install all required tools. On Ubuntu/WSL:

# System tools
sudo apt update
sudo apt install bwa samtools seqtk

# Conda tools
conda install -c bioconda freebayes vcftools snpeff

# AUGUSTUS (apt version may be missing config files — clone from source instead)
git clone https://github.com/Gaius-Augustus/Augustus.git ~/Augustus
echo 'export AUGUSTUS_CONFIG_PATH=~/Augustus/config' >> ~/.bashrc
echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc

# REvolutionH-tl + DIAMOND (in a virtual environment)
python -m venv genomics_venv
source genomics_venv/bin/activate
pip install revolutionhtl diamond

2. Download Raw Data

These files are not in the repository due to size. Download them manually:

Reference genome — from NCBI RefSeq:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000195955.2/

Download the genomic FASTA. Rename the .fna file to mtb.fna.

Resequencing reads — from NCBI SRA:

fastq-dump --split-files ERR181435
# This produces ERR181435_1.fastq and ERR181435_2.fastq
# Merge them back to interleaved, or use both split files directly

Alternatively, download the interleaved ERR181435.fastq directly from:

https://www.ncbi.nlm.nih.gov/sra/ERR181435

Protein FASTAs for Task 6 — from NCBI Datasets:

https://www.ncbi.nlm.nih.gov/datasets/genome/

Download the Protein (FASTA) file for each assembly:

Assembly Species Save as
ASM19595v2 M. tuberculosis H37Rv mtb.faa
ASM19583v2 M. bovis AF2122/97 mbovis.faa
ASM76760v1 M. smegmatis MC2 155 msmeg.faa

3. Set Up Working Directories

The scripts expect this layout. Create it and place your downloaded files accordingly:

mkdir -p mid_project/reads
mkdir -p mid_project/reference
mkdir -p mid_project/protein_seqs

# Place files:
mv mtb.fna              mid_project/reference/
mv ERR181435.fastq      mid_project/reads/       # or the split _1/_2 files
mv mtb.faa mbovis.faa msmeg.faa  mid_project/protein_seqs/

4. Run Task 5

cd mid_project/reads
bash ../../scripts/task5_variant_calling.sh

Expected outputs in results/task5/:

  • raw_variants.vcf — 3,646 variant calls
  • filtered_variants.recode.vcf — 3,525 high-quality variants (Phred Q ≥ 30)

⚠️ SnpEff annotation will produce a chromosome ID error. This is a known issue documented in docs/snpeff_chromosome_error_notes.md. The VCF uses NC_000962.3 as the chromosome name but the pre-built SnpEff database expects a different identifier. The filtered VCF is still valid and usable.

5. Run Task 6

cd mid_project
bash ../scripts/task6_gene_prediction.sh

Expected outputs in results/task6/:

  • augustus_output.aa — predicted protein sequences
  • All tl_project.* files — orthogroups, gene trees, reconciled trees

Task 5 — Variant Calling and Functional Annotation

This task covers aligning resequencing reads using BWA, performing variant calling using FreeBayes, filtering variants using VCFtools, and annotating variants using SnpEff to assess their potential functional effects.

5.1 Project Setup and Reference Files

A working directory was organised with three subdirectories:

mid_project/
├── reads/
├── reference/
└── results/

Reference genome: RefSeq assembly GCF_000195955.2 (renamed to mtb.fna). The GCF RefSeq dataset was selected over the raw AL123456.3 download because it is curated, standardised, and directly compatible with the annotation file (.gff), IGV, GATK, and SnpEff. Both files represent the same genome; the RefSeq assembly ensures coordinate consistency across all tools.

Resequencing reads: SRA accession ERR181435 — an Illumina Genome Analyzer IIx paired-end whole-genome sequencing run of M. tuberculosis H37Rv. This dataset was chosen because it is small (~150 MB), publicly available, and is a direct resequencing of the H37Rv reference strain.

Upon inspection, each read ID appeared twice in the file, indicating interleaved paired-end format (forward and reverse reads in the same file).


5.2 Handling Paired-End Reads

The interleaved file was split into separate forward and reverse files using seqtk:

seqtk seq -1 ERR181435.fastq > reads_1.fastq
seqtk seq -2 ERR181435.fastq > reads_2.fastq

The -1 flag extracts all forward reads and -2 extracts all reverse reads. Both commands completed successfully.


5.3 Reference Genome Indexing

Before alignment, the reference genome was indexed. Indexing enables rapid and efficient searching during read mapping — without it, the aligner would need to scan the entire reference for every read.

bwa index mtb.fna
samtools faidx mtb.fna
  • BWA constructs an FM-index based on the Burrows–Wheeler Transform, allowing the aligner to locate seed matches in near-constant time.
  • SAMtools faidx creates a companion .fai index for fast random access to specific genomic regions.

The BWA indexing process produced five auxiliary files: .amb, .ann, .bwt, .pac, .sa. SAMtools produced the .fai index file.


5.4 Read Alignment using BWA-MEM

Read alignment was performed using BWA-MEM, which uses a seed-and-extend strategy to balance speed and accuracy:

bwa mem ~/Documents/Genomics/mid_project/reference/mtb.fna \
    reads_1.fastq reads_2.fastq > aligned.sam

BWA-MEM took the indexed reference and the two split FASTQ files as input, and produced a SAM (Sequence Alignment/Map) file. Total alignment time: 699.463 seconds.


5.5 Alignment Quality Assessment

Alignment statistics were checked using samtools flagstat:

samtools flagstat aligned.sam
Metric Value Interpretation
Total reads 19,330,668 All QC-passed and QC-failed reads combined
Primary mapped 18,606,932 (96.39%) Reads successfully aligned to the reference
Properly paired 18,440,658 (95.52%) Both mates mapped in expected orientation
Singletons 47,136 (0.24%) One mate mapped, the other did not
Reads mapped to different chr 0 No multi-chromosome contamination

96.39% of reads mapped successfully, indicating high-quality alignment. The very low singleton rate (~0.24%) and complete absence of inter-chromosomal mappings confirm minimal alignment errors.


5.6 Converting, Sorting, and Indexing the BAM File

The SAM output was converted to BAM (binary) format, sorted by genomic coordinate, and indexed. Downstream variant callers require a sorted, indexed BAM.

samtools view -S -b aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned_sorted.bam
samtools index aligned_sorted.bam
  • view -S -b — converts SAM to the more compact BAM format
  • sort — arranges reads by genomic coordinate (required by most variant callers)
  • index — creates a .bai lookup table for rapid retrieval of reads from any genomic region

5.7 Variant Calling using FreeBayes

Variant calling was performed using FreeBayes, a Bayesian haplotype-based variant caller that detects SNPs and small indels without requiring a duplicate-marking step.

conda install -c bioconda freebayes

freebayes -f ~/Documents/Genomics/mid_project/reference/mtb.fna \
    aligned_sorted.bam > raw_variants.vcf

The -f flag specifies the reference genome. Output was written to raw_variants.vcf.


5.8 Variant Filtering using VCFtools

Raw variants were filtered with VCFtools to remove low-confidence calls:

vcftools --vcf raw_variants.vcf \
    --minQ 30 \
    --recode \
    --out filtered_variants
Parameter Purpose
--vcf Input VCF file
--minQ 30 Retain only variants with Phred-scaled quality ≥ 30 (error probability ≤ 1 in 1,000)
--recode Write filtered variants to a new VCF
--out Output file prefix

Result: 3,525 of 3,646 raw variants passed quality filters — a pass rate of 96.7%. Filtered output: filtered_variants.recode.vcf.


5.9 Variant Annotation using SnpEff

SnpEff was used to annotate the filtered variants with predicted functional effects:

conda install -c bioconda snpeff

snpEff Mycobacterium_tuberculosis_h37rv \
    filtered_variants.recode.vcf > annotated.vcf

This produced three output files: annotated.vcf, snpEff_summary.html, and snpEff_genes.txt.

⚠️ Chromosome Mapping Error

Upon inspection, all variant records contained an ERROR_CHROMOSOME_NOT_FOUND warning. The pre-built SnpEff database for Mycobacterium_tuberculosis_h37rv was not built from the same reference used for alignment — the VCF contained chromosome ID NC_000962.3 while the SnpEff database expected a different identifier.

Attempted Fix: Custom SnpEff Database

A custom SnpEff database was constructed using the matching FASTA and GFF files:

  1. Edit the SnpEff config — added a new genome entry:

    MTB.genome : Mycobacterium_tuberculosis_H37Rv
    
  2. Create the directory structure — placed sequences.fa (renamed from mtb.fna) and genes.gff in ~/snpeff_custom/data/MTB/

  3. Build the custom database:

    snpEff build -gff3 -v MTB -dataDir ~/snpeff_custom/data

The build process failed because SnpEff requires additional validation files — cds.fa and protein.fa — which are often absent or incompletely formatted in bacterial GFF annotations. As a result, the database was not fully indexed and annotation produced no usable output.


5.10 Summary of Results

Output File Contents Status
aligned.sam Raw alignment output from BWA-MEM ✅ Completed
aligned_sorted.bam Sorted and indexed BAM file ✅ Completed
raw_variants.vcf 3,646 unfiltered variant calls from FreeBayes ✅ Completed
filtered_variants.recode.vcf 3,525 high-quality variants (Phred Q ≥ 30) ✅ Completed
annotated.vcf SnpEff functional annotation ❌ Error: chromosome ID mismatch

Note: I intend to fix the annotation error soon.


Task 6 — Gene Prediction and Orthology Inference

This task involves predicting gene structures using AUGUSTUS, followed by orthogroup identification, gene tree reconstruction, and orthology inference using REvolutionH-tl to analyse gene evolution across species.

Pipeline overview:

Genome (FASTA) → AUGUSTUS → Predicted genes (CDS / proteins) → REvolutionH-tl

Note: Because a proxy species was used for ab initio gene prediction, the AUGUSTUS output was not passed to REvolutionH-tl. Instead, protein FASTA (.faa) files for M. tuberculosis, M. bovis, and M. smegmatis were downloaded directly from NCBI.


6.1 Gene Prediction with AUGUSTUS

The AUGUSTUS web server was unavailable at the time of analysis (server busy), so the command-line interface was used instead.

Installation

sudo apt update
sudo apt install augustus augustus-data

# Verify installation
augustus --help

Running AUGUSTUS

Working directory: ~/Documents/Genomics/mid_project/reference

No pre-trained AUGUSTUS model exists for Mycobacterium tuberculosis, so the E. coli K-12 model was used as a proxy under an intronless gene model. This is appropriate because both organisms are bacteria with similar gene structures (no introns) and comparable codon usage patterns.

augustus --species=E_coli_K12 --genemodel=intronless mtb.fna > augustus_output.gff

Extracting Protein Sequences

The AUGUSTUS scripts directory was added to the system PATH, then protein sequences were extracted using getAnnoFasta.pl:

echo 'export PATH=$PATH:~/Augustus/scripts' >> ~/.bashrc
source ~/.bashrc

getAnnoFasta.pl augustus_output.gff

This produced augustus_output.aa — the protein sequences file used as input for Part 2.


6.2 Orthology Inference with REvolutionH-tl

Step 1 — Download Protein (.faa) Files from NCBI

Protein FASTA files were downloaded from NCBI Datasets for three species:

Assembly Species
ASM19595v2 Mycobacterium tuberculosis H37Rv
ASM76760v1 Mycolicibacterium smegmatis MC2 155
ASM19583v2 Mycobacterium tuberculosis variant bovis AF2122/97
protein_seqs/
├── mtb.faa
├── mbovis.faa
└── msmeg.faa

Step 2 — Clean FASTA Headers

Raw FASTA headers contained spaces, long descriptions, and unstructured species names, which can break parsing, orthogroup labelling, and species assignment in REvolutionH-tl. Headers were cleaned and species prefixes prepended using sed:

sed 's/^>\([^ ]*\).*/>&mtb_\1/'    mtb.faa    > mtb_clean.faa
sed 's/^>\([^ ]*\).*/>&mbovis_\1/' mbovis.faa > mbovis_clean.faa
sed 's/^>\([^ ]*\).*/>&msmeg_\1/'  msmeg.faa  > msmeg_clean.faa

Step 3 — Organise Cleaned Files

mkdir fasta_files
mv *_clean.faa fasta_files/

Step 4 — Create the Species Tree File

A species tree file (species_tree.phy) was created in Newick format:

((mtb,mbovis),msmeg);

This represents the topology:

        ┌─ mtb
    ┌───┤
    │   └─ mbovis
────┤
    └──── msmeg

Step 5 — Activate the Conda Environment

source ~/Documents/Genomics/lab4/genomics_venv/bin/activate

Step 6 — Run REvolutionH-tl

python -m revolutionhtl \
    -F fasta_files \
    -S species_tree.phy \
    -aligner diamond

Note: An initial error occurred because REvolutionH-tl expected .fa file extensions while the input files had .faa extensions. This was resolved by renaming the files:

mv mtb_clean.faa   mtb.fa
mv mbovis_clean.faa mbovis.fa
mv msmeg_clean.faa msmeg.fa

After renaming, the pipeline completed successfully: REvolutionH-tl finished all the tasks without any problem.


Step 7 — REvolutionH-tl Pipeline Breakdown

The pipeline executed six internal steps:

Step 1 — Pairwise Sequence Similarity
All three proteomes were compared using pairwise sequence similarity. Six alignments were computed (all pairwise combinations among mtb, mbovis, msmeg, and their reciprocals). Results written to tl_project.alignment_all_vs_all/.

Step 2 — Singleton and Orthogroup Identification
3,338 singletons were identified — genes with no detectable homologs in the other species, reflecting species-specific genes, highly diverged sequences, or annotation differences. Remaining genes were grouped into orthogroups (genes descended from a common ancestor), written to tl_project.orthogroups.tsv.

Example orthogroups:

OG n_genes n_species mbovis msmeg mtb
OG0 2 2 mbovis_WP_003900015.1 mtb_YP_009030045.1
OG1 2 2 mbovis_WP_003400850.1 mtb_YP_009030027.1
OG8 2 2 mbovis_WP_003402602.1 msmeg_WP_003402602.1

Step 3 — Gene Tree Construction
3,306 orthogroups contained enough homologous sequences for phylogenetic analysis. A gene tree was constructed per orthogroup (tl_project.gene_trees.tsv), representing evolutionary relationships among gene copies. Orthologs (genes diverged through speciation, not duplication) were written to tl_project.orthologs.tsv.

Step 4 — Gene Tree Refinement
Ambiguities such as polytomies were resolved and duplication inference was improved using evolutionary distances. Output: tl_project.resolved_trees.tsv.

Step 5 — Species Tree Reconstruction
A species tree (tl_project.species_tree.nhx) was reconstructed from the collective signal of all gene trees:

(msmeg,(mbovis,mtb));

This topology matches known biology, serving as a validation of the analysis.

Step 6 — Gene–Species Tree Reconciliation
Individual gene trees were reconciled against the species tree to infer evolutionary events (speciation, duplication, gene loss). Internal nodes in the output (tl_project.reconcilied_trees.tsv) are annotated as S (speciation) or D (duplication), providing a comprehensive view of the evolutionary history of each gene family across the three Mycobacterium species.


Output Files

File Description
tl_project.alignment_all_vs_all/ Pairwise alignment hits
tl_project.best_hits.tsv Best reciprocal hits per gene pair
tl_project.orthogroups.tsv Orthogroup assignments
tl_project.orthologs.tsv Inferred ortholog pairs with normalised bit scores
tl_project.distances.tsv Evolutionary distances between sequences
tl_project.gene_trees.tsv Gene trees per orthogroup
tl_project.resolved_trees.tsv Refined, fully resolved gene trees
tl_project.species_tree.nhx Reconstructed species tree
tl_project.reconcilied_trees.tsv Reconciled gene–species trees (S/D annotations)

Analysis performed: March 2026 | Reference genome: NC_000962.3 / GCF_000195955.2

About

Task 5 and 6 of Genomics Mid semester project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors