A Python-based toolkit for structural and taxonomic validation of DNA barcode sequences. This tool helps ensure sequence quality and taxonomic accuracy for submissions to the Barcode of Life Data System (BOLD) and to Naturalis's Core Sequence Cloud.
Structural validation of DNA barcodes:
- Sequence length requirements
- Ambiguous base detection
- Stop codon analysis for protein-coding markers
- HMM-based alignment for codon phase detection
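The structural checks listed above can be illustrated with a minimal sketch. This is not the toolkit's own implementation: the function and class names, the thresholds, and the choice of stop codons (the invertebrate mitochondrial code, NCBI translation table 5) are assumptions made for illustration, and the codon phase is passed in by hand rather than detected through an HMM alignment.

```python
from dataclasses import dataclass

# Assumption: any symbol other than A, C, G or T counts as ambiguous.
UNAMBIGUOUS = set("ACGT")
# Assumption: stop codons of the invertebrate mitochondrial code (NCBI table 5).
STOP_CODONS = {"TAA", "TAG"}


@dataclass
class StructuralResult:
    """Hypothetical container for the outcome of the structural checks."""
    length: int
    ambiguous: int
    stop_codons: int

    def passes(self, min_length: int, max_ambiguous: int) -> bool:
        # A sequence passes when it is long enough, has few enough
        # ambiguous bases, and contains no in-frame stop codons.
        return (
            self.length >= min_length
            and self.ambiguous <= max_ambiguous
            and self.stop_codons == 0
        )


def check_structure(seq: str, phase: int = 0) -> StructuralResult:
    """Count length, ambiguous bases and in-frame stop codons.

    `phase` is the offset (0, 1 or 2) of the first complete codon.
    """
    seq = seq.upper().replace("-", "")  # normalise case, drop gap characters
    ambiguous = sum(1 for base in seq if base not in UNAMBIGUOUS)
    # Only complete codons are inspected; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
    stops = sum(1 for codon in codons if codon in STOP_CODONS)
    return StructuralResult(len(seq), ambiguous, stops)
```

Calling `check_structure("ATGNTAAGC", phase=1)` reads the codons `TGN` and `TAA`, so the same sequence can be free of stop codons in one phase and contain one in another. This is why the reading frame has to be established before stop codons are counted.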
Taxonomic validation:
- BLAST-based validation against reference databases
- Flexible taxonomy mapping (NSR or BOLD)
- Integration with NCBI taxonomy
- Support for multiple taxonomic ranks
Input/Output:
- Support for FASTA and tabular input formats
- Detailed validation reports
- Filtered FASTA output for valid sequences
- Integration with Galaxy workflow platform
The easiest way to install a self-contained environment for the barcode validator is with bioconda:
conda create -n barcode-validator
conda activate barcode-validator
conda install -c bioconda barcode-validator blast hmmer
When setting up the BLAST environment, the following environment variables should be set correctly:
- BLASTDB: Path to the BLAST database directory. This must be the directory in which the BLAST databases are stored, not the database files themselves. The directory must contain (many) files whose names start with nt. It must also contain the files taxdb.btd, taxdb.bti, and taxonomy4blast.sqlite3. (The nt.* files are the indexed sequences; the other files help BLAST run taxonomically constrained queries. All can be fetched into the correct folder using the command update_blastdb.pl --decompress nt from the NCBI BLAST+ package.)
- BLASTDB_LMDB_MAP_SIZE: Optionally, set the size of the LMDB map for the BLAST database. This is useful for large databases and can be set to a value like 1000G (roughly 1 TB) to ensure sufficient RAM for the initial map of the BLAST database. At Naturalis, we discovered that this mostly functions as a threshold: if you set it too low, BLAST will fail to start. Empirically, the threshold is around 512G. Values above the threshold have no effect on performance; they are simply a means to discover that you don't have enough RAM available for the BLAST database.
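Before starting a long run, it can help to confirm that the directory named by BLASTDB contains the files described above. The following is a minimal sketch under that description; the function name `missing_blastdb_parts` is hypothetical and not part of the toolkit.

```python
from pathlib import Path

# Taxonomy support files that the BLASTDB directory is expected to contain.
REQUIRED_FILES = ("taxdb.btd", "taxdb.bti", "taxonomy4blast.sqlite3")


def missing_blastdb_parts(blastdb_dir: str, prefix: str = "nt") -> list[str]:
    """Return a list of problems found in the BLASTDB directory.

    An empty list means that every expected file was found.
    """
    directory = Path(blastdb_dir)
    if not directory.is_dir():
        return [f"{blastdb_dir} is not a directory"]
    # Report each taxonomy support file that is absent.
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (directory / name).is_file()
    ]
    # The indexed sequences are spread over many files named <prefix>.*
    if not any(directory.glob(f"{prefix}.*")):
        problems.append(f"no files starting with {prefix}.")
    return problems
```

A typical call would be `missing_blastdb_parts(os.environ["BLASTDB"])` after importing `os`. The check only looks at file names, so it cannot tell whether a download is complete or intact.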
python barcode_validator \
--input_file data/BGE00196_MGE-BGE_r1_1.3_1.5_s50_100.fasta \
--exp_taxonomy examples/bold.xlsx \
--exp_taxonomy_type bold \
--config config/config.yml \
--output_format tsv \
--log_level DEBUG > results.tsv
- --input_file: Path to the input FASTA file containing the sequences to validate. The first word in the header line should be the BOLD process ID, followed by an underscore '_', and then a suffix that makes the sequence unique. (The underscore separator can be changed in the configuration file under group_id_separator.)
- --exp_taxonomy: Path to the 'expected taxonomy' file, i.e. what the sequences are expected to be. In this case, this is a BOLD spreadsheet in Excel format.
- --exp_taxonomy_type: Type of expected taxonomy, either nsr (Nederlands Soortenregister) or bold. In this case, we are using bold to validate against the BOLD database.
- --config: Path to the configuration file. Almost certainly, you will want to update the config/config.yml file to specify the BLAST database name, the configuration of the BLAST search, the location of the NCBI taxonomy database (as *.tar.gz), and other parameters.
- --output_format: Format of the output report. Options are tsv (tab-separated values) or fasta (filtered FASTA). In this case, we are generating a tabular (tsv) report.
- --log_level: Set the logging level. Options are DEBUG, INFO, WARNING, ERROR, or CRITICAL. In this case, we are setting it to DEBUG for detailed output.
- > results.tsv: Redirects the output to a file named results.tsv.
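The header convention described for --input_file (process ID, separator, unique suffix) can be sketched as follows. This is an illustration and not the toolkit's own parser: the function name is hypothetical, and the example header is modeled on the input file name used in the command above.

```python
def process_id_from_header(header: str, separator: str = "_") -> str:
    """Extract the BOLD process ID from a FASTA header line.

    The process ID is taken to be everything in the first word of the
    header that precedes the first occurrence of `separator`.
    """
    words = header.lstrip(">").split()
    if not words:
        raise ValueError("empty FASTA header")
    # Split once only: the suffix may itself contain the separator.
    return words[0].split(separator, 1)[0]


# Illustrative header, not taken from a real data file.
example = ">BGE00196_MGE-BGE_r1_1.3_1.5_s50_100 some description"
print(process_id_from_header(example))
```

Splitting only at the first separator matters because the suffix that makes each sequence unique can contain further underscores.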
Note: the config file has a parameter blast_db. This should be set to the name of the BLAST database you want to use. The name of the database is the path to the 'file stem' of the database, without the .nhr, .nin, etc. extensions. So it is not the name of the directory, but that of the indexed sequence files without the extensions.
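As a small illustration of the 'file stem' rule, the sketch below derives a candidate blast_db value from the path of one index file. The function name, the example path, and the set of extensions are assumptions for illustration. For a database split over numbered volumes (for example nt.00.nhr), this simple approach yields the stem of a single volume rather than of the whole database.

```python
from pathlib import PurePosixPath

# Assumption: common extensions of nucleotide BLAST index files.
INDEX_SUFFIXES = {".nhr", ".nin", ".nsq"}


def blast_db_stem(index_file: str) -> str:
    """Return the path of a BLAST index file without its extension."""
    path = PurePosixPath(index_file)
    if path.suffix not in INDEX_SUFFIXES:
        raise ValueError(f"{index_file} does not look like a BLAST index file")
    # Drop only the final extension; the directory part is kept.
    return str(path.with_suffix(""))


# Hypothetical location of a single-volume database named "mydb".
print(blast_db_stem("/data/blast/mydb.nhr"))
```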
The tool is available as a Galaxy tool wrapper, enabling web-based usage through the Galaxy platform. Users can:
- Upload sequence files to their Galaxy history
- Configure validation parameters through the GUI
- Run validations and view results within Galaxy
- Download validation reports and filtered sequences
We welcome contributions! Please see:
# Run test suite
pytest
# Run with coverage
pytest --cov=barcode_validator
This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.
If you use this software in your research, please cite:
[Citation information to be added]
[Contact information to be added]