A comprehensive bioinformatics pipeline for Hepatitis A Virus (HAV) sequence analysis
Specializing in VP1-P2B junction region analysis
- Overview
- Key Features
- Prerequisites
- Installation
- Pipeline Workflow
- Usage
- Output Files
- Test Data
- Troubleshooting
- Contributing
- Contact
- Citation
Daytona_HAV_VPB is a specialized bioinformatics pipeline designed for comprehensive sequence analysis of Hepatitis A Virus (HAV), focusing on the VP1-P2B junction region. The pipeline integrates multiple analytical tools to provide:
- Species Detection: Accurate identification of HAV reads
- Mutation Analysis: Comprehensive variant calling and characterization
- Genotyping: Phylogenetic classification across all 7 HAV genotypes (IA, IB, IC, IIA, IIB, IIIA, IIIB)
- Quality Control: Integrated QC reporting with MultiQC
| Feature | Daytona_HAV | Daytona_HAV_VPB |
|---|---|---|
| Data Type | Any HAV sequencing data | VP1-P2B junction region only |
| Use Case | General HAV analysis | Targeted VP1-P2B analysis |
| References | Variable | 1,773 curated VP1-P2B sequences |
Important: This pipeline requires Illumina paired-end sequencing data and is optimized for VP1-P2B region analysis.
- Automated QC Pipeline: FastQC, Trimmomatic, BBTools, and MultiQC integration
- Kraken2-based Species Detection: Utilizes PlusPF database for accurate HAV identification
- High-Resolution Variant Calling: BWA + SAMtools + iVar workflow for SNP detection
- Comprehensive Phylogenetic Analysis: Integration with Nextclade using 1,773 reference sequences
- Interactive Visualization: Auspice-compatible outputs for dynamic tree exploration
- Production-Ready: Designed for SLURM-based HPC environments (HiPerGator optimized)
The pipeline includes a curated reference dataset of 1,773 VP1-P2B sequences spanning all 7 HAV genotypes:
- Genotype IA, IB, IC
- Genotype IIA, IIB
- Genotype IIIA, IIIB
| Tool | Purpose | Installation |
|---|---|---|
| Nextflow | Workflow manager | Installation Guide |
| Singularity/Apptainer | Container runtime | Installation Guide |
| SLURM | Job scheduler | System-dependent |
| Python 3.10+ | Scripting | System package manager |
pip3 install pandas biopython
- Kraken2 PlusPF Database (pre-configured for HiPerGator users)
HiPerGator Users: All prerequisites are pre-installed. No additional setup required!
git clone https://github.com/BPHL-Molecular/Daytona_HAV_VPB.git
cd Daytona_HAV_VPB
# Create dedicated environment
conda create -n HAV_VPB -c conda-forge python=3.10
# Activate environment
conda activate HAV_VPB
# Install dependencies
pip install pandas biopython
# Check Nextflow
nextflow -version
# Check Singularity
singularity --version
# Check Python packages
python -c "import pandas, Bio; print('Dependencies OK')"
%%{ init: { 'gitGraph': { 'mainBranchName': 'Daytona_HAV_VPB' } } }%%
%%{init: { 'themeVariables': { 'commitLabelFontSize': '20px', 'fontSize': '24px' } } }%%
gitGraph
commit id: "Input Data"
branch QC
checkout QC
commit id: "FastQC"
commit id: "Trimmomatic"
commit id: "BBTools"
commit id: "MultiQC"
checkout Daytona_HAV_VPB
merge QC tag: "QC Complete"
commit id: "Species Detection"
branch Kraken2
checkout Kraken2
commit id: "Kraken2 Classification"
commit id: "Report Generation"
checkout Daytona_HAV_VPB
merge Kraken2 tag: "HAV Identified"
commit id: "Variant Calling"
branch SNP_calling
checkout SNP_calling
commit id: "BWA Alignment"
commit id: "SAMtools Processing"
commit id: "iVar Consensus"
checkout Daytona_HAV_VPB
merge SNP_calling tag: "Variants Called"
commit id: "Consensus Generation"
branch Consensus
checkout Consensus
commit id: "Extract HAV Reads"
commit id: "Re-align to Reference"
commit id: "Generate Consensus"
checkout Daytona_HAV_VPB
merge Consensus tag: "Consensus Built"
commit id: "Phylogenetic Analysis"
branch Phylogeny
checkout Phylogeny
commit id: "Nextclade (1773 refs)"
commit id: "Tree Visualization"
checkout Daytona_HAV_VPB
merge Phylogeny tag: "Analysis Complete"
- Quality Control: Trim adapters, filter low-quality reads, generate QC reports
- Species Detection: Identify HAV reads using Kraken2
- Variant Calling: Align to reference, call SNPs, generate variant reports
- Consensus Building: Extract HAV-specific reads, build consensus sequences
- Phylogenetic Analysis: Genotype assignment, mutation detection, tree generation
CRITICAL: Place your FASTQ files in the correct directory:
# File location (required)
fastqs/hav/
# Naming convention
SampleID_1.fastq.gz # Forward reads
SampleID_2.fastq.gz # Reverse reads
# Example
XZA22002292_1.fastq.gz
XZA22002292_2.fastq.gz
Important: Files MUST be in `fastqs/hav/`; other locations will cause errors!
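The naming convention above can be checked before submitting a job. The following is a minimal sketch, not part of the pipeline: the function name `check_pairs` and the regular expression are illustrative assumptions derived only from the `SampleID_1.fastq.gz` / `SampleID_2.fastq.gz` pattern described here.

```python
import re
from collections import defaultdict

# Pattern taken from the README's naming convention (assumed to be exhaustive):
# SampleID_1.fastq.gz for forward reads, SampleID_2.fastq.gz for reverse reads.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def check_pairs(filenames):
    """Return (complete_samples, problems) for a list of file names."""
    reads = defaultdict(set)
    problems = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            problems.append(f"{name}: does not match SampleID_[1|2].fastq.gz")
            continue
        reads[match["sample"]].add(match["read"])

    # A sample is complete only when both mates are present.
    complete = sorted(s for s, found in reads.items() if found == {"1", "2"})
    for sample, found in sorted(reads.items()):
        if found != {"1", "2"}:
            missing = ({"1", "2"} - found).pop()
            problems.append(f"{sample}: missing _{missing}.fastq.gz")
    return complete, problems


# Example with in-memory names; in practice pass os.listdir("fastqs/hav").
complete, problems = check_pairs(
    ["XZA22002292_1.fastq.gz", "XZA22002292_2.fastq.gz", "SampleID.R1.fastq.gz"]
)
print(complete)
print(problems)
```

Running the check on the real directory listing before `sbatch` catches misnamed or unpaired files without waiting for the job to fail.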
Need to rename files? Use the provided rename script:
./rename.sh
Edit params_hav.yaml with absolute paths:
# Input/Output directories (use absolute paths)
input_dir: "/path/to/your/fastqs/hav"
output_dir: "/path/to/your/output"
reference_dir: "/path/to/reference"
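Because relative paths in `params_hav.yaml` are an easy mistake, a quick pre-flight check may help. This is a hedged sketch that assumes the file contains only flat `key: "value"` lines like the three shown above; the helper names `read_simple_yaml` and `relative_paths` are hypothetical, and a real YAML parser would be the better choice for anything more complex.

```python
import posixpath


def read_simple_yaml(text):
    """Parse flat `key: "value"` lines; blank lines and # comments are skipped.

    Assumption: no nesting, lists, or trailing inline comments.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip().strip("\"'")
    return params


def relative_paths(params, keys=("input_dir", "output_dir", "reference_dir")):
    """Return the keys whose values are not absolute POSIX paths."""
    return [k for k in keys if k in params and not posixpath.isabs(params[k])]


example = '''
# Input/Output directories (use absolute paths)
input_dir: "/path/to/your/fastqs/hav"
output_dir: "output"
reference_dir: "/path/to/reference"
'''
print(relative_paths(read_simple_yaml(example)))
```

In this example `output_dir` is reported because it does not start with `/`.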
# Submit to SLURM scheduler
sbatch Daytona_HAV_VPB_NXC.sh
# Check job status
squeue -u $USER
# View log file
tail -f slurm-<jobid>.out
# Check output directory
ls -lh output/
File: output/sum_report.txt
| Column | Description | Example |
|---|---|---|
| sampleID | Sample identifier | xxx25002686_S1 |
| species/tax_ID/percent(%)/number | Classification results | Hepatitis A/12092/93.58/123349 |
Interpretation: In the example above, 123,349 reads (93.58%) were identified as HAV in sample xxx25002686_S1.
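The slash-separated classification field can be split into its four parts programmatically. The sketch below assumes the field always follows the `species/tax_ID/percent(%)/number` layout shown in the table; `Classification` and `parse_classification` are illustrative names, not part of the pipeline.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    species: str
    tax_id: int
    percent: float
    reads: int


def parse_classification(field):
    """Split 'species/tax_ID/percent/number' into typed values."""
    # Split from the right so a species name containing "/" would still parse.
    species, tax_id, percent, reads = field.rsplit("/", 3)
    return Classification(species, int(tax_id), float(percent), int(reads))


result = parse_classification("Hepatitis A/12092/93.58/123349")
print(f"{result.reads:,} reads ({result.percent}%) classified as {result.species}")
# prints: 123,349 reads (93.58%) classified as Hepatitis A
```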
File: output/variants/*.tsv
| Column | Description | Example |
|---|---|---|
| REGION | Reference genome | NC_001489.1 |
| POS | Position | 2895 |
| REF | Reference base | T |
| ALT | Alternate base | G |
| PVAL | P-value | 0.526316 |
| PASS | QC status | FALSE (p > 0.05) |
Pass Criteria: Variants with `PASS = TRUE` have a p-value ≤ 0.05
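To keep only the variants that meet the pass criterion, the TSV can be filtered on the `PASS` column. This is a minimal sketch that assumes a tab-delimited file with a header row containing at least the columns listed above (real variant files may carry more); `passing_variants` is a hypothetical helper, and the second data row is invented for illustration.

```python
import csv
import io


def passing_variants(tsv_text):
    """Return the rows whose PASS column is TRUE."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["PASS"].strip().upper() == "TRUE"]


# First row is the README example; the second row is made up for illustration.
SAMPLE = (
    "REGION\tPOS\tREF\tALT\tPVAL\tPASS\n"
    "NC_001489.1\t2895\tT\tG\t0.526316\tFALSE\n"
    "NC_001489.1\t3000\tA\tC\t0.001\tTRUE\n"
)

for row in passing_variants(SAMPLE):
    print(row["POS"], row["REF"], row["ALT"], row["PVAL"])
```

For a file on disk, read it with `open(path).read()` and pass the text to the same function.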
File: output/nextclade/genotype_mutation.csv
Key Columns:
- clade: HAV subtype (IA, IB, IC, IIA, IIB, IIIA, IIIB)
- substitutions: Nucleotide changes
- aaSubstitutions: Amino acid changes
- qc.overallStatus: Quality assessment
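A quick per-run summary can be built from the `clade` and `qc.overallStatus` columns. The sketch below assumes a comma-delimited file with a header row; the delimiter is a parameter because it is not stated here, and the example rows are invented. `column_counts` is an illustrative helper name.

```python
import csv
import io
from collections import Counter


def column_counts(csv_text, column, delimiter=","):
    """Count how often each value occurs in one column of a delimited table."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return Counter(row[column] for row in reader)


# Hypothetical rows; only the column names come from the README.
EXAMPLE = (
    "clade,substitutions,aaSubstitutions,qc.overallStatus\n"
    "IA,A100G,VP1:T10A,good\n"
    "IA,C200T,,good\n"
    "IB,G300A,VP1:S20N,mediocre\n"
)

print(column_counts(EXAMPLE, "clade"))
print(column_counts(EXAMPLE, "qc.overallStatus"))
```

Counting by `qc.overallStatus` gives a fast way to spot samples that need a closer look before interpreting their genotype.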
Output Formats:
| Format | File | Use Case |
|---|---|---|
| Newick | nextclade.nwk | Standard phylogenetic software |
| Auspice JSON v2 | nextclade.json | Interactive visualization at auspice.us |
| SVG | tree_with_reference.svg | High-resolution image |
| PDF | tree_with_reference.pdf | Publication-ready |
Figure 1: Complete Tree (Test Samples + References)
Figure 2: Test Samples Only
output/
├── qc/
│   ├── fastqc/            # FastQC reports
│   └── multiqc/           # Aggregated QC report
├── kraken_out/            # Species classification
├── variants/              # SNP calling results
├── extract/               # Consensus sequences
│   ├── *_consensus.fa     # Individual consensus
│   └── sum_consensus.fa   # Combined consensus
├── nextclade/             # Phylogenetic analysis
│   ├── nextclade.tsv      # Genotype table
│   ├── nextclade.nwk      # Newick tree
│   ├── nextclade.json     # Auspice JSON
│   └── genotype_mutation.csv
├── tree_with_reference.svg
├── tree_with_reference.pdf
└── logs/                  # Pipeline logs
Pre-validated test datasets are available:
# Location of test data
/blue/bphl-florida/share/Daytona_HAV_test_sample
# Copy to your working directory
cp /blue/bphl-florida/share/Daytona_HAV_test_sample/* /path/to/your/fastqs/hav/
# Run pipeline
sbatch Daytona_HAV_VPB_NXC.sh
Expected Results:
- Runtime: ~30-45 minutes (depending on data size)
- Output: Complete phylogenetic tree with genotype assignments
- Quality: All samples should pass QC metrics
Error: Files not found in fastqs/hav/
Solution: Verify file location and naming
# Check files exist
ls -l fastqs/hav/
# Verify naming pattern
# Correct: SampleID_1.fastq.gz, SampleID_2.fastq.gz
# Incorrect: SampleID.R1.fastq.gz
Error: No Hepatovirus A detected
Possible Causes:
- Insufficient sequencing depth
- Wrong reference database
- Contamination
Solution:
- Check MultiQC report for read quality
- Verify Kraken2 database configuration
- Inspect raw FASTQ files
Error: Nextclade analysis failed
Solution: Check consensus quality
# Verify consensus sequences exist
ls output/extract/*_consensus.fa
# Check sequence length
# (newlines are removed first so that only bases are counted)
grep -v ">" output/extract/sum_consensus.fa | tr -d '\n' | wc -c
Warning: Low PASS rate in variants
Interpretation: Many variants failing p-value threshold
Solution:
- Increase sequencing depth
- Adjust quality filtering parameters
- Review alignment quality metrics
- Check logs: Review `slurm-*.out` for error messages
- Validate inputs: Ensure FASTQ files are properly formatted
- Resource issues: Verify adequate disk space and memory
- Contact support: Open an issue on GitHub
To receive email updates when your pipeline completes:
- Edit `Daytona_HAV_VPB_NXC.sh`
- Update the mail-user line:
#SBATCH --mail-user=your.email@example.com
#SBATCH --mail-type=END,FAIL
We welcome contributions! Here's how you can help:
- Visit our Issues page
- Check if your issue already exists
- Create a new issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- System information
- Open an issue with the `enhancement` label
- Describe the feature and its benefits
- Provide use cases if applicable
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request with clear description
If you use this pipeline in your research, please cite:
@software{daytona_hav_vpb,
title = {Daytona HAV VPB: A Comprehensive Pipeline for Hepatitis A Virus VP1-P2B Analysis},
author = {BPHL Molecular Biology Division},
year = {2025},
url = {https://github.com/BPHL-Molecular/Daytona_HAV_VPB}
}
This project is licensed under the MIT License - see the LICENSE file for details.


