A Python-based toolkit that checks DNA barcode sequences through structural and taxonomic validation. It helps ensure sequence quality and taxonomic accuracy for submissions to the Barcode of Life Data System (BOLD) and to Naturalis's Core Sequence Cloud (CSC).
Structural validation of DNA barcodes:
- Sequence length requirements
- Ambiguous base detection
- Stop codon analysis for protein-coding markers
- HMM-based alignment for codon phase detection
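The structural checks listed above can be illustrated with a small, standard-library sketch. It is not the toolkit's implementation: the function names, the `min_length` and `max_ambiguous` defaults, and the choice of the invertebrate mitochondrial stop codons (NCBI translation table 5) are all assumptions made for the example, and the reading frame (`phase`) is passed in by hand rather than derived from an HMM alignment.

```python
# Illustrative sketch only. Names, thresholds, and the stop-codon set are
# assumptions for this example, not the toolkit's actual API or defaults.

# Stop codons of the invertebrate mitochondrial code (NCBI translation table 5).
STOP_CODONS = frozenset({"TAA", "TAG"})


def count_ambiguous(seq: str) -> int:
    """Count residues that are not one of A, C, G, or T."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


def count_stop_codons(seq: str, phase: int = 0, stops=STOP_CODONS) -> int:
    """Count stop codons among the complete codons read from offset `phase`."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
    return sum(1 for codon in codons if codon in stops)


def structural_check(seq: str, min_length: int = 500,
                     max_ambiguous: int = 6, phase: int = 0) -> dict:
    """Summarise length, ambiguity, and stop-codon checks for an ungapped sequence."""
    ambiguous = count_ambiguous(seq)
    stop_codons = count_stop_codons(seq, phase)
    return {
        "length": len(seq),
        "ambiguous": ambiguous,
        "stop_codons": stop_codons,
        "passes": (len(seq) >= min_length
                   and ambiguous <= max_ambiguous
                   and stop_codons == 0),
    }
```

The example shows why codon phase matters: the same sequence can contain an in-frame stop codon in one reading frame and none in another, so the frame has to be established before stop codons are counted.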
Taxonomic validation:
- BLAST-based validation against reference databases
- Flexible taxonomy mapping (NSR or BOLD)
- Integration with NCBI taxonomy
- Support for multiple taxonomic ranks
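As a rough illustration of rank-based taxonomic validation, the sketch below compares the expected taxon at a chosen rank with the taxa found at that rank among a set of BLAST hits. The data layout (one dict per lineage, keyed by rank) and the function names are hypothetical. Running BLAST and resolving hits against the NCBI taxonomy are assumed to have happened already and are not shown.

```python
# Illustrative sketch only. The lineage representation and function names are
# hypothetical and do not reflect the toolkit's internal data model.

def taxa_at_rank(hit_lineages, rank):
    """Collect the lower-cased taxon names that the hits carry at `rank`."""
    return {lineage[rank].lower() for lineage in hit_lineages if rank in lineage}


def is_taxonomically_valid(expected_lineage, hit_lineages, rank="family"):
    """Return True if the expected taxon at `rank` occurs among the hits."""
    expected = expected_lineage.get(rank)
    if expected is None:
        # Without an expected taxon at this rank there is nothing to compare.
        return False
    return expected.lower() in taxa_at_rank(hit_lineages, rank)


# Example: a specimen identified as Vanessa cardui, validated at family level.
expected = {"order": "Lepidoptera", "family": "Nymphalidae", "genus": "Vanessa"}
hits = [
    {"order": "Lepidoptera", "family": "Nymphalidae", "genus": "Aglais"},
    {"order": "Lepidoptera", "family": "Pieridae", "genus": "Pieris"},
]
family_ok = is_taxonomically_valid(expected, hits, rank="family")
genus_ok = is_taxonomically_valid(expected, hits, rank="genus")
```

In this example the family matches while the genus does not, which shows how the choice of validation rank controls how strict the check is.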
Input/Output:
- Support for FASTA and tabular input formats
- Detailed validation reports
- Filtered FASTA output for valid sequences
- Integration with Galaxy workflow platform
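The filtered FASTA output can be pictured as a read, test, write loop. The sketch below uses only the standard library. `read_fasta`, `write_valid_fasta`, and the validity predicate are hypothetical names chosen for the example, not functions exported by the toolkit.

```python
# Illustrative sketch only. These helpers are hypothetical and are not part of
# the toolkit's public interface.
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_valid_fasta(records, is_valid, handle, width=60):
    """Write the records whose sequence passes `is_valid`; return how many."""
    written = 0
    for header, seq in records:
        if not is_valid(seq):
            continue
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
        written += 1
    return written


# Example: keep only sequences without ambiguous N residues.
source = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nACNNGT\n")
target = io.StringIO()
kept = write_valid_fasta(read_fasta(source), lambda s: "N" not in s, target)
```

Any predicate can be passed as `is_valid`, so structural and taxonomic results could be combined before deciding which records to emit.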
# Create and activate conda environment
conda env create -f environment.yml
conda activate barcode-validator
# Installation of Python dependencies invoked by conda
# pip install -r requirements.txt
# Install required command line tools
sudo apt-get install hmmer ncbi-blast+
# Install Python package
pip install .
# Using NSR taxonomy (used when validating CSC dumps)
python barcode_validator \
--input_file sequences.tsv \
--exp_taxonomy nsr.zip \
--exp_taxonomy_type nsr \
--config config.yml \
--output_file results.tsv
# Using BOLD taxonomy (the typical process for BGE, where BOLD uploads are prepared)
barcode-validator \
--input_file sequences.fasta \
--exp_taxonomy bold.xlsx \
--exp_taxonomy_type bold \
--config config.yml \
--output_file results.tsv \
--emit_valid_fasta --output_fasta valid.fasta
A Galaxy tool wrapper is also available, enabling web-based use through the Galaxy platform. Users can:
- Upload sequence files to their Galaxy history
- Configure validation parameters through the GUI
- Run validations and view results within Galaxy
- Download validation reports and filtered sequences
The tool uses YAML configuration files for flexible setup:
# Example config.yml
marker: COI-5P
validation_rank: family
taxonomic_backbone: bold
blast_db: /path/to/blast/db
hmm_profile_dir: /path/to/hmm/profiles
log_level: INFO
See config/config.yml for a complete configuration template with documentation.
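A quick sanity check of a loaded configuration might look like the sketch below. It works on a plain dict, such as the result of parsing the file with a YAML library (for example PyYAML's `yaml.safe_load`). Treating every key from the example as required, accepting `nsr` alongside `bold` as a backbone, and using Python's standard logging level names are assumptions made for illustration. The configuration template remains the authoritative reference.

```python
# Illustrative sketch only. Which keys are mandatory and which values are
# accepted are assumptions; consult the configuration template for the rules.

EXPECTED_KEYS = (
    "marker",
    "validation_rank",
    "taxonomic_backbone",
    "blast_db",
    "hmm_profile_dir",
    "log_level",
)
BACKBONES = {"nsr", "bold"}  # assumed from the NSR/BOLD taxonomy options
LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in cfg]
    backbone = cfg.get("taxonomic_backbone")
    if backbone is not None and str(backbone).lower() not in BACKBONES:
        problems.append(f"unknown taxonomic_backbone: {backbone}")
    level = cfg.get("log_level")
    if level is not None and str(level).upper() not in LOG_LEVELS:
        problems.append(f"unknown log_level: {level}")
    return problems


# The example configuration from this README, expressed as a dict.
example = {
    "marker": "COI-5P",
    "validation_rank": "family",
    "taxonomic_backbone": "bold",
    "blast_db": "/path/to/blast/db",
    "hmm_profile_dir": "/path/to/hmm/profiles",
    "log_level": "INFO",
}
example_problems = check_config(example)
```

Returning a list of problems instead of raising on the first one lets all configuration issues be reported in a single pass.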
We welcome contributions! The test suite can be run as follows:
# Run test suite
pytest
# Run with coverage
pytest --cov=barcode_validator
This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.
If you use this software in your research, please cite:
[Citation information to be added]
[Contact information to be added]