A bioinformatics pipeline for processing Nanopore barcoding data of fungi.
This pipeline streamlines the conversion of Oxford Nanopore Technologies (ONT) basecaller output into high-quality Internal Transcribed Spacer (ITS) sequences. It is designed to work with demultiplexed basecalling results generated by MinKNOW or Dorado—the latter being the current default basecaller integrated into MinKNOW.
The workflow includes several key steps:
- Quality assessment with NanoPlot
- Clustering of similar reads
- Polishing to improve read accuracy
- ITS extraction (optional)
- Taxonomy assignment
This modular structure enables researchers to efficiently generate and analyze ITS sequences from ONT data with minimal manual intervention.
- Overview
- Data Prerequisites
- Input Data
- Installation
- Running the Pipeline
- Usage
- Results
- Output Files
- Why am I getting more than one record per barcode?
- How many sequences per cluster is enough?
- Each barcode should contain one fungal sample. The pipeline includes logic to account for potential contamination.
- The pipeline automatically discovers and processes all barcode folders located in the pass/ directory of the provided folder structure.
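Before launching a run, it can help to confirm that the input folder has the expected layout. The following is a minimal sketch, not the pipeline's own discovery logic: the function name `find_barcode_dirs` is hypothetical, and it assumes barcode folders are named with a `barcode` prefix (for example `barcode01`) directly under `pass/`.

```python
from pathlib import Path


def find_barcode_dirs(ont_directory):
    """List barcode folders under <ont_directory>/pass, sorted by name.

    Assumption: barcode folders start with the prefix "barcode".
    Other folders (for example "unclassified") are ignored.
    """
    pass_dir = Path(ont_directory) / "pass"
    if not pass_dir.is_dir():
        raise FileNotFoundError(f"Expected a 'pass' directory at: {pass_dir}")
    return sorted(
        p for p in pass_dir.iterdir()
        if p.is_dir() and p.name.startswith("barcode")
    )
```

An empty result suggests that the path points at the wrong level of the basecaller output.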
- The pipeline is designed to be run with Nextflow, a scientific workflow system for bioinformatic data analysis. Thanks to that, it supports scalable, parallel execution of the individual steps on multiple barcodes at once.
conda create -n nf-env -c bioconda -c conda-forge nextflow
git clone git@github.com:mdziurzynski/ont_fungal_barcoding_pipeline.git
cd ont_fungal_barcoding_pipeline
- Download the FASTA release of the UNITE database.
- Unpack the archive and create a BLAST database:
makeblastdb -in <your_unite.fasta> -dbtype nucl -out <unite_blastdb>/db
⚠️ The first run may take longer due to Conda environment setup.
⚠️ All paths MUST be absolute!
conda activate nf-env
nextflow run main.nf \
--ONT_DIRECTORY <FULL PATH to basecalled ONT data (must contain pass/ with barcode01-XX folders)> \
--BLASTDB_PATH <FULL PATH to folder containing unite_blastdb> \
--RUN_ID <your analysis ID>
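Because relative paths are a common source of failed runs, the command can also be assembled programmatically so that both paths are always absolute. This is a sketch only: the helper name `build_nextflow_command` and the example paths are hypothetical, while the three flags are the required arguments shown above.

```python
import os
import shlex


def build_nextflow_command(ont_directory, blastdb_path, run_id):
    """Build the pipeline command line with absolute input paths."""
    return [
        "nextflow", "run", "main.nf",
        "--ONT_DIRECTORY", os.path.abspath(ont_directory),
        "--BLASTDB_PATH", os.path.abspath(blastdb_path),
        "--RUN_ID", run_id,
    ]


if __name__ == "__main__":
    # Hypothetical example paths; replace them with your own locations.
    cmd = build_nextflow_command("data/ont_run", "db/unite", "my_run")
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell with the `nf-env` environment active, or the list can be passed to `subprocess.run`.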
FUNGAL BARCODING WITH ONT: This pipeline streamlines the conversion of Oxford Nanopore Technologies (ONT) basecaller output into high-quality Internal Transcribed Spacer (ITS) sequences.
Required arguments:
--ONT_DIRECTORY Full path to the basecalled ONT data (must contain pass/ with barcode01-XX folders).
--BLASTDB_PATH Full path to the folder containing unite_blastdb.
--RUN_ID Identifier for this analysis run.
Optional arguments:
--MEDAKA_MODEL Medaka inference model. [default: r1041_e82_400bps_hac_variant_v4.3.0]
--USE_ITSX Set to 0 if you want to omit extraction of the full ITS region using ITSx. [default: 1]
--CHOPPER_MIN_READ_LENGTH Reads shorter than this value won't be used for cluster generation. [default: 150]
--CHOPPER_MAX_READ_LENGTH Reads longer than this value won't be used for cluster generation. [default: 1000]
--REL_ABU_THRESHOLD Output only clusters with barcode-wise relative abundance above this value. [default: 10]
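To make the effect of the length and abundance options concrete, here is a small sketch of the filtering logic they describe. It is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, the length bounds are treated as inclusive, and `REL_ABU_THRESHOLD` is assumed to be a percentage of the barcode's filtered reads.

```python
def filter_reads_by_length(read_lengths, min_len=150, max_len=1000):
    """Keep read lengths within [min_len, max_len] (bounds assumed inclusive)."""
    return [n for n in read_lengths if min_len <= n <= max_len]


def clusters_above_threshold(cluster_sizes, total_reads, threshold_percent=10.0):
    """Return {cluster_id: relative abundance in %} for clusters above the threshold.

    Assumption: relative abundance = 100 * cluster size / total filtered reads
    in the barcode.
    """
    if total_reads <= 0:
        return {}
    kept = {}
    for cluster_id, size in cluster_sizes.items():
        rel_abundance = 100.0 * size / total_reads
        if rel_abundance > threshold_percent:
            kept[cluster_id] = rel_abundance
    return kept
```

For example, with 100 filtered reads and clusters of 80, 15, and 5 reads, only the first two clusters would pass the default threshold of 10.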
- NanoPlot report for each barcode, allowing visual quality assessment of the reads.
- Excel summary table listing identified sequences for each barcode. Due to contamination or inherent variability in ONT data, more than one sequence may be identified per barcode.
- "{barcode_name}_NanoPlot_results": Contains NanoPlot output, including NanoPlot-report.html, which should be inspected for read count and overall quality. Verify that the majority of reads align with expected characteristics.
- "{run_id}.results.xlsx": The primary result file containing detailed information about sequence clusters and taxonomic assignments.
This Excel file may contain multiple records per barcode, including entries with the same taxonomic assignment. To select the most representative sequence, consider both the cluster size and its relative abundance within the barcode dataset.
For downstream applications, it is recommended to further align these sequences against diverse reference databases or incorporate them into phylogenetic analyses to confirm and refine taxonomic placement.
- Barcode
- Number of clusters – Clusters in barcode data at 95% identity threshold
- Total reads after filtering
- Cluster ID – Unique identifier for the cluster representative sequence
- Cluster size – Number of sequences in the cluster
- Cluster relative abundance – Proportion of sequences in the cluster relative to the barcode total
- Cluster sequence – Extracted ITS sequence using ITSx
- Cluster sequence untrimmed – Polished sequence prior to ITSx trimming
- BLASTn taxonomy assignment
- BLASTn percent identity
- BLASTn query coverage
- BLASTn query length
- BLASTn subject length
- BLASTn e-value
- BLASTn subject SH
- BLASTn full taxonomy
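Selecting the most representative record per barcode by cluster size and relative abundance can be sketched as follows. This is a hedged illustration: the function name `pick_representatives` is hypothetical, the dictionary keys are assumed to match the column names listed above (the exact headers in the Excel file may differ), and the default minimum cluster size of 20 follows the tentative guidance in the FAQ rather than a validated cutoff.

```python
from collections import defaultdict


def pick_representatives(records, min_cluster_size=20):
    """Return the top-ranked cluster record for each barcode.

    Records are ranked by cluster size, then by relative abundance.
    Clusters smaller than min_cluster_size are skipped.
    """
    by_barcode = defaultdict(list)
    for rec in records:
        if rec["Cluster size"] >= min_cluster_size:
            by_barcode[rec["Barcode"]].append(rec)

    best = {}
    for barcode, candidates in by_barcode.items():
        best[barcode] = max(
            candidates,
            key=lambda r: (r["Cluster size"], r["Cluster relative abundance"]),
        )
    return best
```

Rows can be loaded from the Excel file with any spreadsheet library and passed in as a list of dictionaries. Barcodes missing from the result had no cluster large enough and deserve a manual look.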
There are several possible reasons why your barcode may yield multiple sequence records:
- Sample contamination with other fungi
  - The pipeline is optimized for barcoding fungal fruiting bodies, which are often colonized or contaminated by other fungal species. This is a common issue.
  - To verify the barcode’s accuracy, cross-check the results with a morphological taxonomic assessment of the entire fruiting body. Visual inspection and expert identification can help confirm whether the observed sequences are expected or due to contamination.
- Chimeric sequences
  - Chimeras can form during PCR when the DNA polymerase jumps between different template strands. This typically results from imbalanced PCR conditions, such as incorrect concentrations of primers, polymerase, or template DNA.
  - To identify and remove chimeras:
    - Compare the expected ITS product length for your target species with the observed sequence length.
    - In your BLASTn results, look at the subject length (reference) and query length (your sequence). Discrepancies may indicate chimeric sequences.
    - You can also submit your sequence to UNITE or NCBI BLASTn to see which species it aligns with most closely.
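The query-versus-subject length comparison described above can be automated as a first-pass screen. The sketch below is only a heuristic: the function name `flag_possible_chimeras` and the 20% cutoff are hypothetical choices, not values used or recommended by the pipeline, and a flagged record is a candidate for manual review rather than a confirmed chimera.

```python
def flag_possible_chimeras(records, max_relative_difference=0.2):
    """Return cluster IDs whose query and subject lengths differ markedly.

    The relative difference is |query - subject| / subject. The default
    cutoff of 0.2 is an arbitrary illustrative value.
    """
    flagged = []
    for rec in records:
        query_len = rec["BLASTn query length"]
        subject_len = rec["BLASTn subject length"]
        if subject_len <= 0:
            continue  # skip records without a usable reference length
        difference = abs(query_len - subject_len) / subject_len
        if difference > max_relative_difference:
            flagged.append(rec["Cluster ID"])
    return flagged
```

Reference entries can themselves be partial ITS sequences, so a length mismatch alone is not conclusive. Combine it with the BLASTn checks listed above.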
- As for how many sequences per cluster is enough: this is yet to be confirmed, but you should not use clusters with fewer than 20 sequences.