Version: v20250211
Author: Gentaro Shigita
Open the Terminal app and use the mkdir (= make directory) command to create a new directory (= folder) in which to run your analysis and store the results:
mkdir ~/Desktop/MyAnalysisUse the cd (= change directory) command to move into the directory you just created:
cd ~/Desktop/MyAnalysisUse the pwd (= print working directory) command to check that you are in the right place:
pwdUse the ln (= link) command to create a shortcut to the raw data in the current directory (raw data should be in the minknow_data directory on the Desktop):
ln -fns ../minknow_data/MySequencingRun 00_raw_dataFinally, activate the nanoamp environment so that you have access to all software needed for this pipeline.
conda activate nanoampTip
Did you perform live basecalling during the sequencing run? Then you can skip the next step.
However, you need to prepare your files in a specific directory structure (one FASTQ file per barcode) for the subsequent steps to work.
To do so, run the following:
mkdir 01_basecalled_demultiplexed
for barcode in 00_raw_data/*/*/fastq_*/*; do
cat ${barcode}/*.fastq.gz >> 01_basecalled_demultiplexed/${barcode##*/}.fastq.gz
doneProceed to 2. Quality filtering
The first step is to convert the current signals acquired by the sequencer into DNA sequences (= reads) using Guppy (This process is called "basecalling"). At the same time, reads are sorted according to the barcode sequences attached during the library preparation (this process is called "demultiplexing"). These can be done with the basecall_demultiplex_guppy.sh script.
Check the usage:
basecall_demultiplex_guppy.sh
Argument 1: Input folder path containing FAST5 files
Argument 2: Output folder, destination for FASTQ files
Argument 3: Nanopore barcoding kit used in library, (SQK-RPB004, SQK-PBK004, SQK-RBK110-96)
Usage example:
basecall_demultiplex_guppy.sh 00_raw_data/run001 01_basecalled_demultiplexed/run001 SQK-RBK110-96Following the usage, your command should look like this:
basecall_demultiplex_guppy.sh 00_raw_data 01_basecalled_demultiplexed SQK-RBK110-96This will take some time. You can check the progress in the guppy_basecaller.log file created in the output directory.
As some of the reads are of poor quality and unreliable, the next step is to exclude them and retain decent-quality reads using BBDuk. This can be done with the filter_quality_bbduk.sh script.
Check the usage:
filter_quality_bbduk.sh
Argument 1: Input folder path containing basecalled FASTQ files
Argument 2: Output folder, destination for clean FASTQ files
Usage example:
filter_quality_bbduk.sh 01_basecalled_demultiplexed/run001 02_bbduk/run001Following the usage, your command should look like this:
filter_quality_bbduk.sh 01_basecalled_demultiplexed 02_bbdukThe next step is to assemble the individual reads and reconstruct the amplicon(s) using Canu. This can be done with the assemble_canu.sh script.
Check the usage:
assemble_canu.sh
Argument 1: Input folder path containing clean FASTQ files
Argument 2: Output folder, destination for Canu de novo assemblies
Argument 3: Expected total assembly length in kilobases (approximate sum of amplicon lengths)
Argument 4: Number of CPU cores for minimap2
Usage example:
assemble_canu.sh 02_bbduk/run001 03_canu/run001 3k 16Important
Canu does not accept an expected assembly length (3rd argument) shorter than 1 kb. Put 1k even if your amplicon is supposed to be shorter than 1 kb.
Following the usage, your command should look like this:
assemble_canu.sh 02_bbduk 03_canu 1k 16Since Nanopore sequencing is still less accurate than other sequencing technologies (e.g., Sanger, Illumina, and PacBio HiFi), particularly for homopolymers (e.g., AAAAA...), the next step is to infer and correct sequencing errors in the assembled contigs using Medaka (This process is called "polishing"). This can be done with the polish_raw_medaka.sh script.
Check the usage:
polish_raw_medaka.sh
Argument 1: Input folder path containing clean FASTQ files
Argument 2: Input folder path containing Canu de novo assemblies
Argument 3: Output folder, destination for polished assemblies
Usage example:
polish_raw_medaka.sh 02_bbduk/run001 03_canu/run001 04_medaka/run001Following the usage, your command should look like this:
polish_raw_medaka.sh 02_bbduk 03_canu 04_medakaCanu often merges two or more different but similar haplotypes into a single collapsed contig.
You can separate these collapsed haplotypes using Longshot. This may be of particular interest if you are working on hybrids and/or polyploids and can be done with the haplotype_polished_longshot.sh script.
Check the usage:
haplotype_polished_longshot.sh
Argument 1: Input folder path containing clean FASTQ files
Argument 2: Input folder path containing polished assemblies
Argument 3: Output folder, destination for polished haploid assemblies
Argument 4: Number of CPU cores for minimap2
Usage example:
haplotype_polished_longshot.sh 02_bbduk/run001 04_medaka/run001 05_medaka_haploid 16Following the usage, your command should look like this:
haplotype_polished_longshot.sh 02_bbduk 04_medaka 05_longshot 16The match_db.py script can be used to assign a species identification to each haplotype. This script is also useful for renaming sequences according to the correspondence between barcode and sample name.
Check the usage:
match_db.py
Argument 1: Input folder containing the polished haploid assemblies for the sequencing run
Argument 2: Output folder, destination for final annotated FASTA files and species identifications for the sequencing run
Argument 3: Path to barcodes file used in the sequencing run
Argument 4: Path to database file in FASTA format
Usage example:
match_db.py 05_medaka_haploid/run001 06_final_seqs_ids/run001 barcodes/run001.txt db/Cucurbitaceae.fastaFollowing the usage, your command should look like this:
match_db.py 05_longshot 06_fastas barcodes.txt ../nanoamp_refs/Cucurbitaceae_5markers_v2.fasta