FunBarONT

A bioinformatics pipeline for processing Nanopore barcoding data of fungi.

If you are looking for an ONT metabarcoding analysis pipeline, check out BaNaNA.

This pipeline streamlines the conversion of Oxford Nanopore Technologies (ONT) basecaller output into high-quality Internal Transcribed Spacer (ITS) sequences. It is designed to work with demultiplexed basecalling results generated by MinKNOW or Dorado—the latter being the current default basecaller integrated into MinKNOW.

The workflow includes several key steps:

Quality assessment with NanoPlot
Clustering of similar reads
Polishing to improve read accuracy
ITS extraction (optional)
Taxonomy assignment

This modular structure enables researchers to efficiently generate and analyze ITS sequences from ONT data with minimal manual intervention.

Data Prerequisites

Each barcode should contain one fungal sample. The pipeline includes logic to account for potential contamination.

Input Data

The pipeline automatically discovers and processes all barcode folders located in the pass directory of the provided folder structure.
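
To preview which folders would be picked up before starting a run, a quick listing of the pass directory can help. The sketch below is not part of FunBarONT: the function name list_barcode_dirs is made up for illustration, and it assumes the barcode folders are named barcode01, barcode02, and so on, as described in the run instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FunBarONT): list barcode folders under pass/.
# Assumption: folders are named barcode01, barcode02, ... inside <ont_dir>/pass.
list_barcode_dirs() {
    # $1 = directory with basecalled ONT data
    if [ ! -d "$1/pass" ]; then
        echo "No pass/ directory found in: $1" >&2
        return 1
    fi
    # Print immediate subdirectories of pass/ whose names start with "barcode".
    find "$1/pass" -mindepth 1 -maxdepth 1 -type d -name 'barcode*' | sort
}

# Example: list_barcode_dirs /absolute/path/to/ont_run
```

Folders such as unclassified are skipped by the name filter, so the output should contain one line per barcode.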

Installation

The pipeline is designed to run with Nextflow, a scientific workflow system for bioinformatic data analysis. This allows scalable, parallel execution of the pipeline steps across multiple barcodes at once.

1. Install Nextflow via Conda

conda create -n nf-env -c bioconda nextflow=24.10.5

2. Clone the Repository

git clone https://github.com/mdziurzynski/FunBarONT.git
cd FunBarONT

3. Prepare the BLAST Database (e.g., using UNITE)

Download the FASTA release of the UNITE database.
Unpack the archive and create a BLAST database:

makeblastdb -in <your_unite.fasta> -dbtype nucl -out <unite_blastdb>/db
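
After makeblastdb finishes, it can be worth confirming that the database files exist. The sketch below is not part of FunBarONT: the function name has_nucl_blastdb is hypothetical, and it assumes that nucleotide databases are written as <prefix>.nsq, or as numbered volumes such as <prefix>.00.nsq for large inputs.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of FunBarONT): confirm that a nucleotide BLAST
# database with the given prefix exists in a folder.
# Assumption: makeblastdb wrote <folder>/<prefix>.nsq, or numbered volumes
# such as <folder>/<prefix>.00.nsq.
has_nucl_blastdb() {
    # $1 = database folder, $2 = database prefix (default: db)
    for f in "$1"/"${2:-db}".nsq "$1"/"${2:-db}".[0-9][0-9].nsq; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example: has_nucl_blastdb /absolute/path/to/unite_blastdb && echo "database found"
```

If BLAST+ is on your PATH, running blastdbcmd -db <unite_blastdb>/db -info is a more thorough check, since it also reports the number of sequences in the database.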

Running the Pipeline

⚠️ The first run may take longer due to Conda environment setup.

⚠️ All paths MUST be absolute!

conda activate nf-env
nextflow run main.nf \
    --ONT_DIRECTORY <FULL PATH to basecalled ONT data (must contain pass/ with barcode01-XX folders)> \
    --BLASTDB_PATH <FULL PATH to folder containing unite_blastdb> \
    --RUN_ID <your analysis ID>
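
Because relative paths are a common cause of failed runs, a small pre-flight check can catch them before Nextflow starts. This is a minimal sketch and not part of FunBarONT: the function names is_absolute_path and check_inputs are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of FunBarONT).

# Succeeds only if the argument starts with "/".
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# $1 = value for --ONT_DIRECTORY, $2 = value for --BLASTDB_PATH
check_inputs() {
    for p in "$1" "$2"; do
        if ! is_absolute_path "$p"; then
            echo "Not an absolute path: $p" >&2
            return 1
        fi
    done
    if [ ! -d "$1/pass" ]; then
        echo "Missing pass/ directory in: $1" >&2
        return 1
    fi
    if [ ! -e "$2" ]; then
        echo "BLAST database path does not exist: $2" >&2
        return 1
    fi
}

# Example: check_inputs /data/ont_run /data/blast && echo "inputs look fine"
```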

Usage

FUNGAL BARCODING WITH ONT: This pipeline streamlines the conversion of Oxford Nanopore Technologies (ONT) basecaller output into high-quality Internal Transcribed Spacer (ITS) sequences.

Required arguments:

  --ONT_DIRECTORY  Full path to the basecalled ONT data (must contain pass/ with barcode folders).

  --BLASTDB_PATH  Full path to the folder containing unite_blastdb.

  --RUN_ID  Identifier for this analysis run.

Optional arguments:

  --MEDAKA_MODEL  Medaka inference model. [default: r1041_e82_400bps_hac_variant_v4.3.0]

  --USE_ITSX  Set to 0 to omit extraction of the full ITS region with ITSx. [default: 1]

  --CHOPPER_MIN_READ_LENGTH  Reads shorter than this value won't be used for cluster generation. [default: 150]

  --CHOPPER_MAX_READ_LENGTH  Reads longer than this value won't be used for cluster generation. [default: 1000]

  --REL_ABU_THRESHOLD  Output only clusters with barcode-wise relative abundance above this value. [default: 10]

  --CPU_THREADS  Number of CPU threads. [default: 8]

Results

NanoPlot report for each barcode, allowing visual quality assessment of the reads.
Excel summary table listing identified sequences for each barcode. Due to contamination or inherent variability in ONT data, more than one sequence may be identified per barcode.

Output Files

"{barcode_name}_NanoPlot_results" — Contains NanoPlot output including NanoPlot-report.html, which should be inspected for read count and overall quality. Verify that the majority of reads align with expected characteristics.
"{run_id}.results.xlsx" — The primary result file containing detailed information about sequence clusters and taxonomic assignments.

This Excel file may contain multiple records per barcode, including entries with the same taxonomic assignment. To select the most representative sequence, consider both the cluster size and its relative abundance within the barcode dataset.

For downstream applications, it is recommended to further align these sequences against diverse reference databases or incorporate them into phylogenetic analyses to confirm and refine taxonomic placement.

Columns in the Results Excel File

Barcode
Number of clusters – Number of clusters in the barcode data at a 95% identity threshold
Total reads after filtering
Cluster ID – Unique identifier for the cluster representative sequence
Cluster size – Number of sequences in the cluster
Cluster relative abundance – Proportion of sequences in the cluster relative to the barcode total
Cluster sequence – ITS sequence extracted using ITSx
Cluster sequence untrimmed – Polished sequence prior to ITSx trimming
BLASTn taxonomy assignment
BLASTn percent identity
BLASTn query coverage
BLASTn query length
BLASTn subject length
BLASTn e-value
BLASTn subject SH
BLASTn full taxonomy

Why am I getting more than one record per barcode?

There are several possible reasons why your barcode may yield multiple sequence records:

Sample contamination with other fungi
- The pipeline is optimized for barcoding fungal fruiting bodies, which are often colonized or contaminated by other fungal species. This is a common issue.
- To verify the barcode’s accuracy, cross-check the results with a morphological taxonomic assessment of the entire fruiting body. Visual inspection and expert identification can help confirm whether the observed sequences are expected or due to contamination.
Chimeric sequences
- Chimeras can form during PCR when the DNA polymerase jumps between different template strands. This typically results from imbalanced PCR conditions, such as incorrect concentrations of primers, polymerase, or template DNA.
- To identify and remove chimeras:
  - Compare the expected ITS product length for your target species with the observed sequence length.
  - In your BLASTn results, look at the subject length (reference) and query length (your sequence). Discrepancies may indicate chimeric sequences.
  - You can also submit your sequence to UNITE or NCBI BLASTn to see which species it aligns with most closely.
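
The length comparison described above can be scripted once the results are available as plain text. The following is only a sketch under stated assumptions: the results sheet has been exported by hand to a tab-separated file with one header row, the columns are in the order listed in this README (query length in column 12, subject length in column 13), and the 0.2 tolerance is an arbitrary illustration rather than a validated cutoff. The function name flag_length_mismatch is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FunBarONT): flag records whose query length
# differs from the BLASTn subject length by more than a chosen fraction.
# Assumptions: hand-made TSV export with one header row; query length in
# column 12 and subject length in column 13; tolerance value is illustrative.
flag_length_mismatch() {
    # $1 = TSV file, $2 = tolerance as a fraction of subject length (default: 0.2)
    awk -F '\t' -v tol="${2:-0.2}" '
        NR == 1 { next }                  # skip the header row
        ($13 + 0) > 0 {
            diff = $12 - $13
            if (diff < 0) diff = -diff
            if (diff / $13 > tol)         # print barcode, cluster ID, both lengths
                print $1 "\t" $4 "\t" $12 "\t" $13
        }
    ' "$1"
}

# Example: flag_length_mismatch results.tsv 0.2
```

A flagged record is only a candidate for closer inspection. Reference sequences can be partial or include flanking regions, so a length difference alone does not prove that a sequence is chimeric.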

How many sequences per cluster is enough?

This is yet to be confirmed, but you should not use clusters with fewer than 20 sequences.
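
To apply this rule of thumb to a results table, small clusters can be filtered out with awk. This is a sketch, not part of FunBarONT: it assumes the results sheet has been exported by hand to a tab-separated file with one header row and that "Cluster size" is column 5, following the column order listed in this README. The function name keep_large_clusters is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FunBarONT): keep only records from clusters
# with at least a minimum number of sequences.
# Assumptions: hand-made TSV export with one header row; cluster size in
# column 5.
keep_large_clusters() {
    # $1 = TSV file, $2 = minimum cluster size (default: 20)
    awk -F '\t' -v min="${2:-20}" '
        NR == 1 { print; next }           # always keep the header row
        ($5 + 0) >= (min + 0) { print }   # keep clusters of sufficient size
    ' "$1"
}

# Example: keep_large_clusters results.tsv 20 > results.filtered.tsv
```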

License and contributions

GNU General Public License v3.0

The pipeline was developed by Mikołaj Dziurzyński within the scope of the FunDive project.

Inspired by BaNaNA
