🧬 Daytona_HAV_VPB

A comprehensive bioinformatics pipeline for Hepatitis A Virus (HAV) sequence analysis
Specializing in VP1-P2B junction region analysis

Features • Installation • Usage • Outputs • Support

📋 Table of Contents

Overview
Key Features
Prerequisites
Installation
Pipeline Workflow
Usage
Output Files
Test Data
Troubleshooting
Contributing
Contact
Citation

🔬 Overview

Daytona_HAV_VPB is a specialized bioinformatics pipeline designed for comprehensive sequence analysis of Hepatitis A Virus (HAV), focusing on the VP1-P2B junction region. The pipeline integrates multiple analytical tools to provide:

🧪 Species Detection - Accurate identification of HAV reads
🧬 Mutation Analysis - Comprehensive variant calling and characterization
🌳 Genotyping - Phylogenetic classification across 7 HAV subgenotypes (IA, IB, IC, IIA, IIB, IIIA, IIIB)
📊 Quality Control - Integrated QC reporting with MultiQC

Daytona_HAV vs Daytona_HAV_VPB

Feature	Daytona_HAV	Daytona_HAV_VPB
Data Type	Any HAV sequencing data	VP1-P2B junction region only
Use Case	General HAV analysis	Targeted VP1-P2B analysis
References	Variable	1,773 curated VP1-P2B sequences

⚠️ Important: This pipeline requires Illumina paired-end sequencing data and is optimized for VP1-P2B region analysis.

✨ Key Features

🎯 Core Capabilities

Automated QC Pipeline: FastQC, Trimmomatic, BBTools, and MultiQC integration
Kraken2-based Species Detection: Utilizes PlusPF database for accurate HAV identification
High-Resolution Variant Calling: BWA + SAMtools + iVar workflow for SNP detection
Comprehensive Phylogenetic Analysis: Integration with Nextclade using 1,773 reference sequences
Interactive Visualization: Auspice-compatible outputs for dynamic tree exploration
Production-Ready: Designed for SLURM-based HPC environments (HiPerGator optimized)

📦 Reference Database

The pipeline includes a curated reference dataset of 1,773 VP1-P2B sequences spanning 7 HAV subgenotypes:

Genotype IA, IB, IC
Genotype IIA, IIB
Genotype IIIA, IIIB

🔧 Prerequisites

Required Software

Tool	Purpose	Installation
Nextflow	Workflow manager	Installation Guide
Singularity/Apptainer	Container runtime	Installation Guide
SLURM	Job scheduler	System-dependent
Python 3.10+	Scripting	System package manager

Python Dependencies

pip3 install pandas biopython

Database Requirements

Kraken2 PlusPF Database (pre-configured for HiPerGator users)

💡 HiPerGator Users: All prerequisites are pre-installed. No additional setup required!

📥 Installation

1. Clone the Repository

git clone https://github.com/BPHL-Molecular/Daytona_HAV_VPB.git
cd Daytona_HAV_VPB

2. Set Up Conda Environment (Recommended)

# Create dedicated environment
conda create -n HAV_VPB -c conda-forge python=3.10

# Activate environment
conda activate HAV_VPB

# Install dependencies
pip install pandas biopython

3. Verify Installation

# Check Nextflow
nextflow -version

# Check Singularity
singularity --version

# Check Python packages
python -c "import pandas, Bio; print('Dependencies OK')"

🔄 Pipeline Workflow

%%{ init: { 'gitGraph': { 'mainBranchName': 'Daytona_HAV_VPB' } } }%%
%%{init: { 'themeVariables': { 'commitLabelFontSize': '20px', 'fontSize': '24px' } } }%%
gitGraph
    commit id: "Input Data"
    
    branch QC
    checkout QC
    commit id: "FastQC"
    commit id: "Trimmomatic"
    commit id: "BBTools"
    commit id: "MultiQC"
    checkout Daytona_HAV_VPB
    merge QC tag: "QC Complete"

    commit id: "Species Detection"
    branch Kraken2
    checkout Kraken2
    commit id: "Kraken2 Classification"
    commit id: "Report Generation"
    checkout Daytona_HAV_VPB
    merge Kraken2 tag: "HAV Identified"
    
    commit id: "Variant Calling"
    branch SNP_calling
    checkout SNP_calling
    commit id: "BWA Alignment"
    commit id: "SAMtools Processing"
    commit id: "iVar Consensus"
    checkout Daytona_HAV_VPB
    merge SNP_calling tag: "Variants Called"
    
    commit id: "Consensus Generation"
    branch Consensus
    checkout Consensus
    commit id: "Extract HAV Reads"
    commit id: "Re-align to Reference"
    commit id: "Generate Consensus"
    checkout Daytona_HAV_VPB
    merge Consensus tag: "Consensus Built"

    commit id: "Phylogenetic Analysis"
    branch Phylogeny
    checkout Phylogeny
    commit id: "Nextclade (1773 refs)"
    commit id: "Tree Visualization"
    checkout Daytona_HAV_VPB
    merge Phylogeny tag: "Analysis Complete"

Workflow Steps

Quality Control: Trim adapters, filter low-quality reads, generate QC reports
Species Detection: Identify HAV reads using Kraken2
Variant Calling: Align to reference, call SNPs, generate variant reports
Consensus Building: Extract HAV-specific reads, build consensus sequences
Phylogenetic Analysis: Genotype assignment, mutation detection, tree generation

🚀 Usage

Step 1: Prepare Input Data

CRITICAL: Place your FASTQ files in the correct directory:

# File location (required)
fastqs/hav/

# Naming convention
SampleID_1.fastq.gz  # Forward reads
SampleID_2.fastq.gz  # Reverse reads

# Example
XZA22002292_1.fastq.gz
XZA22002292_2.fastq.gz

📁 Important: Files MUST be in fastqs/hav/ - other locations will cause errors!
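Before submitting a job, it can help to confirm that every sample has both mates and that every file name follows the convention above. The following is a minimal sketch using only the Python standard library; it is not part of the pipeline, and the helper names `find_pairs` and `check_directory` are hypothetical.

```python
import re
from pathlib import Path

# Naming convention from this README: SampleID_1.fastq.gz / SampleID_2.fastq.gz
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def find_pairs(filenames):
    """Return (samples with both mates, list of problem descriptions)."""
    reads = {}
    problems = []
    for name in sorted(filenames):
        match = PAIR_PATTERN.match(name)
        if match is None:
            problems.append(f"{name}: does not match SampleID_[12].fastq.gz")
            continue
        reads.setdefault(match["sample"], set()).add(match["read"])

    pairs = []
    for sample, found in sorted(reads.items()):
        if found == {"1", "2"}:
            pairs.append(sample)
        else:
            missing = ({"1", "2"} - found).pop()
            problems.append(f"{sample}: missing _{missing}.fastq.gz")
    return pairs, problems


def check_directory(path="fastqs/hav"):
    """Apply find_pairs to the .gz files in a directory (default: the required input folder)."""
    return find_pairs(p.name for p in Path(path).glob("*.gz"))


# Demonstration on file names only, so no files need to exist
pairs, problems = find_pairs([
    "XZA22002292_1.fastq.gz",
    "XZA22002292_2.fastq.gz",
    "SampleID.R1.fastq.gz",
])
print("Complete pairs:", pairs)
print("Problems:", problems)
```

Calling `check_directory()` from the repository root applies the same check to the real input folder.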

Need to rename files? Use the provided rename script:

./rename.sh

Step 2: Configure Parameters

Edit params_hav.yaml with absolute paths:

# Input/Output directories (use absolute paths)
input_dir: "/path/to/your/fastqs/hav"
output_dir: "/path/to/your/output"
reference_dir: "/path/to/reference"

Step 3: Run the Pipeline

# Submit to SLURM scheduler
sbatch Daytona_HAV_VPB_NXC.sh

Step 4: Monitor Progress

# Check job status
squeue -u $USER

# View log file
tail -f slurm-<jobid>.out

# Check output directory
ls -lh output/

📊 Output Files

1. 🧬 HAV Reads Detection

File: output/sum_report.txt

Column	Description	Example
`sampleID`	Sample identifier	xxx25002686_S1
`species/tax_ID/percent(%)/number`	Classification results	Hepatitis A/12092/93.58/123349

Interpretation: In the example above, 123,349 reads (93.58%) were identified as HAV in sample xxx25002686_S1.
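For downstream scripting, the slash-separated classification field can be split into typed values. This is a sketch that assumes the field always has the four parts shown in the table above; the function name `parse_classification` is hypothetical, and the overall layout of sum_report.txt (for example, how the sample ID column is separated) is not assumed here.

```python
def parse_classification(field):
    """Split 'species/tax_ID/percent/number' into a dict of typed values."""
    # Split from the right so a '/' inside the species name would not break parsing
    species, tax_id, percent, reads = field.rsplit("/", 3)
    return {
        "species": species,
        "tax_id": int(tax_id),
        "percent": float(percent),
        "reads": int(reads),
    }


# The example value from the table above
result = parse_classification("Hepatitis A/12092/93.58/123349")
print(result)
```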

2. 🔍 Variant Analysis

File: output/variants/*.tsv

Column	Description	Example
`REGION`	Reference genome	NC_001489.1
`POS`	Position	2895
`REF`	Reference base	T
`ALT`	Alternate base	G
`PVAL`	P-value	0.526316
`PASS`	QC status	FALSE (p > 0.05)

✅ Pass Criteria: Variants with PASS = TRUE have p-value ≤ 0.05
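To keep only variants that meet the pass criterion, the TSV can be filtered on the PASS column. The sketch below uses the standard library and an in-memory table; only the columns listed above are assumed, the second data row is made up for illustration, and `passing_variants` is a hypothetical helper rather than part of the pipeline.

```python
import csv
import io


def passing_variants(handle):
    """Yield rows of a tab-separated variant table whose PASS column is TRUE."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["PASS"].strip().upper() == "TRUE":
            yield row


# In-memory stand-in for a file under output/variants/
example = io.StringIO(
    "REGION\tPOS\tREF\tALT\tPVAL\tPASS\n"
    "NC_001489.1\t2895\tT\tG\t0.526316\tFALSE\n"  # row from the table above
    "NC_001489.1\t3000\tA\tC\t0.001\tTRUE\n"      # hypothetical row
)
kept = list(passing_variants(example))
print([(row["POS"], row["REF"], row["ALT"]) for row in kept])
```

For a real file, pass an open file handle instead, for example `open(path, newline="")`.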

3. 🧪 Genotype & Mutations

File: output/nextclade/genotype_mutation.csv

Key Columns:

clade: HAV subtype (IA, IB, IC, IIA, IIB, IIIA, IIIB)
substitutions: Nucleotide changes
aaSubstitutions: Amino acid changes
qc.overallStatus: Quality assessment
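A quick per-run summary can be built from the key columns above. The sketch below counts samples per clade and lists samples whose QC status is not "good". It assumes a comma-delimited file and a `seqName` column for sample names; both are assumptions to check against your own genotype_mutation.csv, and the rows shown are invented for illustration.

```python
import csv
import io
from collections import Counter


def load_rows(handle, delimiter=","):
    """Read the table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


# In-memory stand-in for output/nextclade/genotype_mutation.csv (hypothetical rows)
example = io.StringIO(
    "seqName,clade,qc.overallStatus\n"
    "sample1,IA,good\n"
    "sample2,IB,good\n"
    "sample3,IA,mediocre\n"
)
rows = load_rows(example)

# Number of samples assigned to each clade
counts = Counter(row["clade"] for row in rows)
# Samples that may need manual review
flagged = [row["seqName"] for row in rows if row["qc.overallStatus"] != "good"]

print(counts.most_common())
print(flagged)
```

If the file turns out to use a different separator, pass it through the `delimiter` argument.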

4. 🌳 Phylogenetic Trees

Output Formats:

Format	File	Use Case
Newick	`nextclade.nwk`	Standard phylogenetic software
Auspice JSON v2	`nextclade.json`	Interactive visualization at auspice.us
SVG	`tree_with_reference.svg`	High-resolution image
PDF	`tree_with_reference.pdf`	Publication-ready

Visualization Examples

Figure 1: Complete Tree (Test Samples + References)

1,773 reference sequences with test samples highlighted

Figure 2: Test Samples Only

Focused view of your analyzed samples

5. 📁 Complete Output Structure

output/
├── qc/
│   ├── fastqc/           # FastQC reports
│   └── multiqc/          # Aggregated QC report
├── kraken_out/           # Species classification
├── variants/             # SNP calling results
├── extract/              # Consensus sequences
│   ├── *_consensus.fa    # Individual consensus
│   └── sum_consensus.fa  # Combined consensus
├── nextclade/            # Phylogenetic analysis
│   ├── nextclade.tsv     # Genotype table
│   ├── nextclade.nwk     # Newick tree
│   ├── nextclade.json    # Auspice JSON
│   └── genotype_mutation.csv
├── tree_with_reference.svg
├── tree_with_reference.pdf
└── logs/                 # Pipeline logs

🧪 Test Data

For HiPerGator Users

Pre-validated test datasets are available:

# Location of test data
/blue/bphl-florida/share/Daytona_HAV_test_sample

# Copy to your working directory
cp /blue/bphl-florida/share/Daytona_HAV_test_sample/* /path/to/your/fastqs/hav/

# Run pipeline
sbatch Daytona_HAV_VPB_NXC.sh

Expected Results:

Runtime: ~30-45 minutes (depending on data size)
Output: Complete phylogenetic tree with genotype assignments
Quality: All samples should pass QC metrics

🔧 Troubleshooting

Common Issues

❌ Error: Files not found in fastqs/hav/

Solution: Verify file location and naming

# Check files exist
ls -l fastqs/hav/

# Verify naming pattern
# Correct: SampleID_1.fastq.gz, SampleID_2.fastq.gz
# Incorrect: SampleID.R1.fastq.gz

❌ Error: No Hepatovirus A detected

Possible Causes:

Insufficient sequencing depth
Wrong reference database
Contamination

Solution:

Check MultiQC report for read quality
Verify Kraken2 database configuration
Inspect raw FASTQ files

❌ Error: Nextclade analysis failed

Solution: Check consensus quality

# Verify consensus sequences exist
ls output/extract/*_consensus.fa

# Check total consensus length in bases (newlines stripped before counting)
grep -v ">" output/extract/sum_consensus.fa | tr -d '\n' | wc -c
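When the combined FASTA holds several samples, a per-sequence length is often more useful than a single total. The following standard-library sketch reports the length of each record; `fasta_lengths` is a hypothetical helper, and the sequences in the demonstration are invented.

```python
import io


def fasta_lengths(handle):
    """Return {record name: sequence length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the full header text (minus '>') as the record name
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# In-memory stand-in for output/extract/sum_consensus.fa (hypothetical sequences)
example = io.StringIO(">sampleA\nACGT\nAC\n>sampleB\nNNNN\n")
lengths = fasta_lengths(example)
for record, length in lengths.items():
    print(record, length)
```

A record that is much shorter than the others, or mostly N, is a likely reason for a Nextclade failure.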

⚠️ Warning: Low PASS rate in variants

Interpretation: Many variants fail the p-value threshold (p > 0.05)

Solution:

Increase sequencing depth
Adjust quality filtering parameters
Review alignment quality metrics

Getting Help

Check logs: Review slurm-*.out for error messages
Validate inputs: Ensure FASTQ files are properly formatted
Resource issues: Verify adequate disk space and memory
Contact support: Open an issue on GitHub

🔔 Email Notifications

To receive email updates when your pipeline completes:

Edit Daytona_HAV_VPB_NXC.sh
Update the mail-user line:

#SBATCH --mail-user=your.email@example.com
#SBATCH --mail-type=END,FAIL

🤝 Contributing

We welcome contributions! Here's how you can help:

Reporting Issues

Visit our Issues page
Check if your issue already exists
Create a new issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- System information

Suggesting Enhancements

Open an issue with the enhancement label
Describe the feature and its benefits
Provide use cases if applicable

Pull Requests

Fork the repository
Create a feature branch
Make your changes
Submit a pull request with a clear description

📚 Citation

If you use this pipeline in your research, please cite:

@software{daytona_hav_vpb,
  title = {Daytona HAV VPB: A Comprehensive Pipeline for Hepatitis A Virus VP1-P2B Analysis},
  author = {BPHL Molecular Biology Division},
  year = {2025},
  url = {https://github.com/BPHL-Molecular/Daytona_HAV_VPB}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
