Assembly, variant calling, and annotation of microbial genomes.
This workflow is used to produce long-read de novo assemblies of bacterial and fungal genomes, or to perform reference-based variant calling.
The workflow also performs analysis of the assemblies, such as species identification, antimicrobial resistance (AMR) analysis, and sequence typing through an optional --isolates mode.
In brief, this workflow will perform the following:
- De novo (or reference-based) assembly of bacterial and fungal genomes
- Variant calling (--reference mode only)
- Annotation of regions of interest within the assembly
- Identification of plasmid regions within the assembly
- MLST and genomic similarity based species identification (--isolates mode only)
- Identification of genes and SNVs associated with AMR (--isolates mode only)
Recommended requirements:
- CPUs = 16
- Memory = 64GB
Minimum requirements:
- CPUs = 8
- Memory = 32GB
Approximate run time: 20-40 minutes per sample with ~50x coverage using minimum requirements
ARM processor support: False
These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.
The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.
It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.
The following command can be used to obtain the workflow. This will pull the repository into the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
nextflow run epi2me-labs/wf-bacterial-genomes --help
To update a workflow to the latest version on the command line use the following command:
nextflow pull epi2me-labs/wf-bacterial-genomes
A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-bacterial-genomes/wf-bacterial-genomes-demo.tar.gz
tar -xzvf wf-bacterial-genomes-demo.tar.gz
The workflow can then be run with the downloaded demo data using:
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq 'wf-bacterial-genomes-demo/isolates_fastq' \
--isolates \
--sample_sheet 'wf-bacterial-genomes-demo/isolates_sample_sheet.csv' \
-profile standard
For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.
Optimal DNA extraction will depend on the Gram status of the organism. Suitable extraction protocols, and other related protocols, can be found in the Nanopore community.
This workflow accepts either FASTQ or BAM files as input.
The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
(i)                   (ii)                  (iii)
input_reads.fastq ─── input_directory   ─── input_directory
                      ├── reads0.fastq      ├── barcode01
                      └── reads1.fastq      │   ├── reads0.fastq
                                            │   └── reads1.fastq
                                            ├── barcode02
                                            │   ├── reads0.fastq
                                            │   ├── reads1.fastq
                                            │   └── reads2.fastq
                                            └── barcode03
                                                └── reads0.fastq
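For layout (iii), a minimal sample sheet maps each barcode sub-directory to a sample alias. A sketch of such a CSV (the aliases are illustrative placeholders; extra columns are allowed):
barcode,alias
barcode01,sample01
barcode02,sample02
barcode03,sample03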
The fastcat/bamstats tool is used to concatenate multifile samples to be processed by the workflow. It also outputs per-read statistics, such as read length and quality, together with read counts.
Flye is used to generate a draft assembly from the input reads. By default, Flye runs with the --nano-hq parameter. Additional configuration can be specified using the --flye_opts parameter.
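For example, extra Flye arguments can be passed as a single quoted string (an illustrative command; the input path is a placeholder):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--flye_opts '--meta --iterations 2' \
-profile standard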
Flye produces a draft assembly along with a statistics file containing information on contig lengths, coverage, multiplicity, and circularity. For de novo assemblies, contigs are assigned random names, whereas in reference-based assemblies, contig names are inherited from the reference sequence.
Following de novo assembly, circular contigs are processed with dnaapler, which reorients contigs to a common start position at the origin of replication: dnaA for chromosomes and repA for plasmids. Dnaapler relies on MMseqs2-based search against a database consisting of origin of replication sequences. Information on the database content can be found in the dnaapler documentation.
Following contig reorientation, reads are aligned against the assembly using mini_align. Subsequently, bamstats and mosdepth are used to calculate mapped read and genome coverage statistics.
The draft assembly from Flye is then polished using medaka. This step will attempt to correct any errors that were introduced during the de novo assembly process and generate a FASTA output file.
The workflow selects the appropriate medaka models based on the basecaller configuration that was used to process the signal data.
By default, the workflow will attempt to determine the basecaller model from the input data.
When this fails (or when you wish to override the automatic selection), it can be provided with --override_basecaller_cfg.
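For example, to set the model explicitly (the model name below is only an illustration; supply the exact basecaller model that was used for your data):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--override_basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v4.3.0' \
-profile standard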
Reads are aligned against the provided reference with mini_align.
After alignment, haploid variants are called with medaka (see the Polishing section above for details on Medaka model selection).
The variants passing the depth filter are then incorporated into the reference to create the consensus sequence. Variant statistics are also generated at this point.
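Reference-based mode is enabled with --reference_based_assembly together with a reference FASTA file, for example (an illustrative command; paths are placeholders):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--reference_based_assembly \
--reference '/path/to/reference.fasta' \
-profile standard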
Plasmid identification and characterization is performed using MOB-suite.
This workflow implements mob-recon, which is used to perform plasmid reconstruction and typing.
Contigs are characterized as chromosomal or plasmid.
Plasmid contigs sharing similar replication and mobilization characteristics are grouped into MOB-clusters, representing distinct plasmids that may consist of multiple fragments.
The summary report includes a section listing plasmid contigs and identified characteristics, such as replicon, relaxase, MPF, and oriT typing information, as well as mobility prediction (conjugative, mobilizable, or non-mobilizable), and predicted/observed plasmid host range.
The output directory contains a {alias}_mob_results subdirectory, which includes a mobtyper-results.txt file with additional characteristics for each contig.
If plasmid contigs are identified, this directory will also contain .fasta files grouping plasmid contigs based on MOB-cluster identification, and an additional mobtyper-results.txt file with extended typing information.
Plasmid identification can be disabled by setting the --run_plasmid_id parameter to false.
Additional mob-recon options can be specified using the --plasmid_id_opts parameter.
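For example (an illustrative command; the --filter_db value is a placeholder):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--plasmid_id_opts '--filter_db /path/to/filter_sequences.fasta'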
Consensus genome assemblies can be annotated using Bakta.
By default, Bakta will run with the light database, which will be automatically downloaded and installed when the workflow is executed. To optionally run Bakta with the full database, specify --bakta_db_type=full. This will trigger the installation of the full version of the database. As of version 6.0, the compressed size of the full database is approximately 32 GB, so the download may take a significant amount of time.
Alternatively, users can use the following steps to manually download and install the Bakta database, and specify the custom database location using the --bakta_db parameter.
A compatible database version can be downloaded using the following command:
docker run -v /path/to/desired-db-path:/db --entrypoint /bin/bash ontresearch/bakta:latest -c "bakta_db download --output /db --type full"
Alternatively, you can manually download the database:
wget https://zenodo.org/record/14916843/files/db.tar.xz
Then install it using Docker:
docker run -v /path/to/db:/db --entrypoint /bin/bash ontresearch/bakta:latest -c "bakta_db install -i db.tar.xz"
Bakta relies on the AMRFinderPlus database, which may need to be updated manually (even after a fresh installation of the Bakta database). To update it:
docker run -v /path/to/db:/db --entrypoint /bin/bash ontresearch/bakta:latest -c "amrfinder_update --force_update --database db/amrfinderplus-db/"
To use a locally installed Bakta database, pass the path using --bakta_db. For example:
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--bakta_db '/path/to/db'
By default, the workflow disables circular genome plot generation by Bakta (--skip-plot), but all other settings follow Bakta's defaults. Users can customize Bakta's behavior by supplying additional arguments via --bakta_opts.
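Annotation behaviour can also be adjusted directly from the workflow command line, for example to request the full database download or to skip annotation entirely (illustrative commands; the input path is a placeholder):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--bakta_db_type full

nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--run_bakta false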
MLST is a common technique used to help characterise bacterial isolates by using allelic variation from internal DNA fragments of 6-7 housekeeping genes. Typing schemes for specific species and genera are found on PubMLST and are pre-loaded into this workflow. MLST is used to scan the assembly, automatically infer the correct typing scheme, and subsequently identify the allele variants found.
MLST results determine the species assignment shown in the final report. This information also determines whether the sample will undergo Salmonella typing or point mutation–based antimicrobial resistance prediction.
Sourmash is used to perform taxonomic identification of assemblies.
Sourmash performs a k-mer based sequence comparison against genomes in a reference database collection.
The analysis uses MinHash sketching to rapidly identify the most similar reference genomes and provide species-level classification with confidence metrics.
Sourmash results provide taxonomic assignment from domain to species level, along with Average Nucleotide Identity (ANI) values and coverage statistics, to assess the quality of the match.
The analysis identifies the primary species match as well as secondary matches that may represent genomic variation within the species.
Species identification complements MLST analysis by providing broader taxonomic context and can help resolve cases where MLST schemes may not be available for the organism.
The isolate HTML report includes the following subset of sourmash gather metrics:
| Metric | Description |
|---|---|
| Match rank | Order of matches ranked by quality (0 = best match) |
| Species call | Taxonomic classification of the reference genome |
| Query containment ANI | Average Nucleotide Identity for the query intersect section |
| Intersect (bp) | Number of base pairs that overlap between query and reference |
| Remaining (bp) | Base pairs in query not explained by previous matches |
| Fraction original query | Proportion of the original query sequence that matches this reference |
| Fraction unique | Proportion of query that uniquely matches this reference (not shared with others) |
| Reference Genome | Identifier and name of the matching reference genome in database |
The full set of sourmash gather metrics for all matches with an intersect above the 25,000 bp threshold is output as a {alias}_sourmash_taxonomy.csv file.
The workflow uses the GTDB R226 bacterial and NCBI fungal reference genome databases.
Sourmash behavior can be further modified by providing an exclusion list via --sourmash_db_exclude_list or by modifying the existing exclusion list present in data/sourmash_db_exclude_list.txt.
This parameter accepts a text file containing assembly identifiers (GCA_XXXXXXXXX.V or GCF_XXXXXXXXX.V format) to exclude from the reference databases.
Excluded genomes are filtered out before species identification, allowing for removal of problematic reference sequences.
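For example, an exclusion list is a plain text file with one assembly identifier per line (the identifiers below are placeholders in the expected format):
GCA_000000001.1
GCF_000000002.1
It is then passed to the workflow as follows (an illustrative command; paths are placeholders):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--isolates \
--sourmash_db_exclude_list '/path/to/exclude_list.txt'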
ResFinder is used to detect the presence of acquired resistance genes in the genomes of all bacterial species. In addition, a subset of well-characterised species/genera will be further analysed using integrated PointFinder to identify specific resistance-associated SNVs. The following species/genera are included in PointFinder analysis:
- Campylobacter spp.
- Enterococcus faecalis
- Enterococcus faecium
- Escherichia coli
- Helicobacter pylori
- Klebsiella spp.
- Mycobacterium tuberculosis
- Neisseria gonorrhoeae
- Salmonella spp.
- Staphylococcus aureus
- Aspergillus fumigatus
- Candida albicans
- Candidozyma auris
- Nakaseomyces glabratus
- Saccharomyces cerevisiae
For the purposes of PointFinder analysis, the species/genus of a given assembly will be determined from the results of the MLST or Sourmash analysis.
PointFinder analysis will be selected automatically if applicable for a given species call.
PointFinder behaviour can be additionally modified using --pointfinder_ignore_indels and --pointfinder_ignore_stop_codons parameters.
The role of these options is further described in the ResFinder documentation.
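For example, both options can be enabled together (an illustrative command; the input path is a placeholder):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--isolates \
--pointfinder_ignore_indels \
--pointfinder_ignore_stop_codons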
Optionally, users can replace the default databases with custom ResFinder and PointFinder databases.
Paths to database files can be provided using --resfinder_db and --pointfinder_db.
Both databases should be KMA indexed as described in the ResFinder documentation.
Additionally, when using a database with added PointFinder species, users need to provide a CSV file mapping Sourmash species names to PointFinder species names.
The file path can be specified using --sourmash_pointfinder_mapping.
This CSV file must contain two columns: 'Sourmash species' and 'PointFinder species'.
For example:
Sourmash species,PointFinder species
Klebsiella pneumoniae,klebsiella
Klebsiella aerogenes,klebsiella
Candida albicans,candida_albicans
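The custom databases and mapping file are then supplied together on the command line, for example (an illustrative command; all paths are placeholders):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--isolates \
--resfinder_db '/path/to/resfinder_db' \
--pointfinder_db '/path/to/pointfinder_db' \
--sourmash_pointfinder_mapping '/path/to/species_mapping.csv'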
In cases where custom databases are not specified, the workflow will use ResFinder database v2.6.0, DisinFinder database v2.0.1, and a custom PointFinder database that has been extended to include fungal mutations. Fungal AMR detection has been implemented for five key species using SNVs from the FungAMR database.
Samples identified as Salmonella in the MLST step will undergo serotyping and antigenic profile prediction using SeqSero2. The analysis will give predictions for:
- Serotype
- Antigenic profile
- Sub-species identification
- O antigen
- H1 antigen (fliC)
- H2 antigen (fljB)
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet. | |
| bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet. | |
| reference_based_assembly | boolean | Enable reference guided assembly instead of de novo assembly. | By default de novo assembly will be performed with Flye. Enable this to instead perform a reference-based consensus. A reference must be provided. | False |
| reference | string | Reference sequence FASTA file. | The reference sequence is used when performing reference-based assembly. | |
| analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False |
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named barcode and alias. Extra columns are allowed. A type column is required for certain workflows and should have one of the following values: test_sample, positive_control, negative_control, no_template_control. | |
| sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. | | |
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| isolates | boolean | Run the Isolates pipeline on the assembly results if set to True. | Isolates mode adds further analysis options to the workflow such as multi-locus sequence typing and antimicrobial resistance calling, as well as producing single reports for each sample in the run. | True |
| sourmash_db_exclude_list | string | Path to file containing assembly IDs (GCA_/GCF_) to exclude from Sourmash species identification, one per line. | A text file containing assembly identifiers (starting with GCA_ or GCF_) that should be excluded from the sourmash database during species identification. Each line should contain one assembly ID. Invalid entries will be skipped. | |
| resfinder_threshold | string | Threshold of required identity to report a match between a gene in the ResFinder database and the assembly. Valid interval: 0.00-1.00 | Identity refers to the ratio of base pairs that match between the sequence in your assembly and that of the sequence in the ResFinder database. Increasing the threshold will result in fewer, but more accurate hits against the database. | 0.8 |
| resfinder_coverage | string | Minimum coverage (breadth-of) threshold required to report a match between a gene in the ResFinder database and the assembly. Valid interval: 0.00-1.00 | The amount of an AMR gene that has to be present within the assembly as compared to the reference in the ResFinder database. | 0.6 |
| pointfinder_ignore_indels | boolean | Ignore indels when calling point mutations in PointFinder analysis. | When enabled, PointFinder will skip detection of insertion and deletion mutations that may confer antimicrobial resistance. This is useful when sequencing errors or assembly artifacts might cause false positive indel calls, but may miss genuine resistance-conferring indels. | False |
| pointfinder_ignore_stop_codons | boolean | Ignore stop codons when calling point mutations in PointFinder analysis. | When enabled, PointFinder will not report premature stop codons as resistance mutations. This can be useful to reduce noise from sequencing errors that introduce false stop codons, but may miss genuine resistance mutations that create truncated proteins. | False |
| resfinder_db | string | Path to custom ResFinder database directory. | Override the default ResFinder database with a custom one. This should point to a directory containing the ResFinder database files. If not specified, ResFinder will use its bundled ResFinder database. | |
| pointfinder_db | string | Path to custom PointFinder database directory. | Override the default PointFinder database with a custom one. This should point to a directory containing the PointFinder database files. If not specified, ResFinder will use its bundled PointFinder database. | |
| sourmash_pointfinder_mapping | string | CSV file mapping Sourmash species names to PointFinder species names. Must contain 'Sourmash species' and 'PointFinder species' columns. | Optional CSV file to customize the mapping between Sourmash taxonomic classifications and PointFinder species for antimicrobial resistance analysis. If not provided, default mappings will be used. The CSV should have two columns: 'Sourmash species' and 'PointFinder species'. | |
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| override_basecaller_cfg | string | Override auto-detected basecaller model that processed the signal data; used to select an appropriate Medaka model. | By default, the workflow tries to determine the basecall model from the input data. This parameter can be used to override the detected value (or to provide a model name if none was found in the inputs). However, users should only do this if they know for certain which model was used, as selecting the wrong option might give sub-optimal results. A list of recent models can be found here: https://github.com/nanoporetech/dorado#DNA-models. | |
| run_bakta | boolean | Run Bakta on consensus sequence | Will provide an output file with a list of annotations for your sequence. Optional because it can take some time. | True |
| bakta_opts | string | Command-line arguments for Bakta | Command line arguments which can be used to alter Bakta output annotation files. | |
| bakta_db | string | Path to existing Bakta database directory. If not provided, the workflow will download the database. | Provide the path to a Bakta database directory containing both Bakta and AMRFinderPlus databases. If not specified, the workflow will automatically download the database type specified by bakta_db_type. | |
| bakta_db_type | string | Type of Bakta database to download if bakta_db is not provided. | Choose 'light' for the standard database (~1.3 GB) or 'full' for the complete database (~30 GB). Only used when bakta_db is not specified. | light |
| flye_genome_size | integer | Estimated genome size for de novo assembly in non-SI prefix format (e.g. 5000000 for a 5 Mb genome). | This setting is used in conjunction with flye_asm_coverage to subsample the reads used in the initial disjointig step only; all reads are used in subsequent steps. The values in the two parameters are used to calculate a target yield used to subsample the longest reads in the dataset, see Flye docs for more information. Note: for runs with mixed genome sizes, prefer the larger genome size. | |
| flye_asm_coverage | integer | Target coverage to use for subsampling in de novo assembly | This setting is used in conjunction with flye_genome_size to subsample the reads used in the initial disjointig step only; all reads are used in subsequent steps. The values in the two parameters are used to calculate a target yield used to subsample the longest reads in the dataset, see Flye docs for more information. | |
| flye_opts | string | Command-line arguments for flye | Command line arguments which can be used to alter the de novo assembly process. Enter the command as a quoted string (e.g. '--meta --iterations 2'). Flye's --genome-size and --asm-coverage parameters can be set directly in the workflow with --flye_genome_size and --flye_asm_coverage, respectively. | |
| min_read_length | integer | Reads below this threshold will be removed from analysis. | | 1000 |
| max_coverage_plots | integer | Maximum number of largest contigs for which depth coverage plots will be displayed in the summary report. | When specified, only the N largest contigs (by total length) will be shown in the genome coverage depth plots. Set to 'null' to display all contigs. | 20 |
| run_plasmid_id | boolean | Enable plasmid identification and characterization using MOB-suite. | MOB-suite will identify plasmid sequences in the assembled genomes, reconstruct plasmids, and predict plasmid mobility based on the presence of mobilization and replication genes. Results include plasmid sequences, mobility predictions, and incompatibility group typing. | True |
| plasmid_id_opts | string | Command-line arguments for mob_recon | Additional command-line arguments that can be passed to mob_recon. For example, --filter_db to specify custom databases. See MOB-suite documentation for available options. | |
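As an example of the advanced assembly options above, the Flye genome size and subsampling coverage can be set together (an illustrative command; the values and path are placeholders):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--flye_genome_size 5000000 \
--flye_asm_coverage 50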
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| out_dir | string | Directory for output of all workflow results. | | output |
| Nextflow parameter name | Type | Description | Help | Default |
|---|---|---|---|---|
| threads | integer | Number of CPU threads. | Provided to the alignment, Flye assembly, Dnaapler, Bakta, and MOB-suite workflow steps to improve performance. | 3 |
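For example, output location and threading can be adjusted as follows (an illustrative command; paths are placeholders):
nextflow run epi2me-labs/wf-bacterial-genomes \
--fastq '/path/to/fastq' \
--threads 8 \
--out_dir '/path/to/results'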
Output files are either aggregated, containing information for all samples, or provided per sample. Per-sample files will be prefixed with the respective sample alias, represented below as {{ alias }}.
| Title | File path | Description | Per sample or aggregated |
|---|---|---|---|
| Workflow report | wf-bacterial-genomes-report.html | Report for all samples. | aggregated |
| Workflow results | results.json | Structured workflow results for internal/onward use. | aggregated |
| Workflow checkpoints | checkpoints.json | Structured workflow checkpoints for internal/onward use. | aggregated |
| Concatenated sequence data | {{ alias }}/seqs.fastq.gz | Per-sample reads concatenated into one FASTQ file. | per-sample |
| Per file read stats | {{ alias }}/fastcat_stats/per-file-stats.tsv | A TSV with per file read stats. | per-sample |
| Per read stats | {{ alias }}/fastcat_stats/per-read-stats.tsv.gz | A TSV with per read stats. | per-sample |
| Flye assembly stats | {{ alias }}.flye_stats.tsv | Assembly statistics generated for Flye assemblies. | per-sample |
| Alignment | {{ alias }}.bam | Aligned reads for the sample in BAM format. | per-sample |
| Alignment index | {{ alias }}.bam.bai | An index file for the alignment in BAI format. | per-sample |
| Draft assembly FASTA file | {{ alias }}.medaka.fasta.gz | Consensus file generated from either de novo assembly or reference variant calling pipeline, with contigs oriented to start at replication initiation point. | per-sample |
| Plasmid contig identification results | {{ alias }}_mob_results | Complete MOB-suite output directory containing all plasmid identification and characterisation results. | per-sample |
| Variants VCF file | {{ alias }}.medaka.vcf.gz | VCF file of variants detected against the provided reference (reference mode only). | per-sample |
| Variants summary | {{ alias }}.variants.stats | TSV file of summary statistics for variants in sample (reference mode only). | per-sample |
| Annotations in GFF3 format | {{ alias }}.bakta.gff3 | Annotations of regions of interest in assembly in GFF3 format. | per-sample |
| Annotations in GBFF format | {{ alias }}.bakta.gbff | Annotations of regions of interest in assembly in GBFF format. | per-sample |
| Sequence typing results | {{ alias }}.mlst.json | Sequence typing results in JSON format (isolates mode only). | per-sample |
| Sourmash taxonomic identification results with lineage annotations. | {{ alias }}_sourmash_taxonomy.csv | Sourmash gather hits with taxonomic lineage annotations (isolates mode only). | per-sample |
| Sourmash excluded assemblies | sourmash_picklist_excluded.txt | List of assembly identifiers (GCA_/GCF_) excluded from default Sourmash species identification database (isolates mode only). | aggregated |
| AMR calling results | {{ alias }}_resfinder_results | Complete AMR calling output directory containing detailed ResFinder, PointFinder, and DisinFinder outputs (isolates mode only). | per-sample |
| Isolate per sample report | {{ alias }}-isolates-report.html | Per sample report (isolates mode only). | per-sample |
- If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
- See how to interpret some common Nextflow exit codes here.
This usually occurs if the assembly is incomplete and does not have sufficient coverage to identify the housekeeping genes of the typing scheme. Another, rarer, scenario is if the assembly is from an organism with no available typing scheme. A list of the available typing schemes can be found here. In both scenarios, AMR calling will still be performed, but only for acquired resistance genes.
If your question is not answered here, please report any issues or suggestions on the GitHub issues page or start a discussion on the community.
See the EPI2ME website for lots of other resources and blog posts.