Skip to content

BCCDC-PHL/riboscope

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

279 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

logo

This Nextflow pipeline monitors for mutations that may confer AMR in bacterial ribosomal repeats, specifically the 16S and 23S rRNA genes of Treponema pallidum subsp. pallidum, the etiological agent of syphilis. This workflow calls mutations in reference-guided assemblies of short paired-end reads sequenced from amplicon-based libraries.

Quick-Start

nextflow run BCCDC-PHL/riboscope \
  -r v0.9.9 \
  -profile conda \
  --cache ~/.conda/envs \
  --fastq_input /path/to/fastq_files \
  --ref /path/to/ref.fa \
  --bed /path/to/primer_scheme.bed \
  --search_seqs /path/to/query/seq/multi.fasta \
  --kraken2_db /path/to/kraken2/database \
  --bracken_db /path/to/bracken/database \
  --apply_qc \
  --outdir /path/to/output_dir

Table of Contents

Quick-Start
Workflow
Usage
Input
Output
Parameters
References

Workflow

workflow

Usage

To test the pipeline is functioning as expected, a test profile is used with dataset provided in /tests/data/ with subsampled raw reads of SRR34079400 & 5000 pairs of simulated reads, ribosomal repeat 1 of TPA reference sequence, gff3 file, primer bed file, and sequences to search for in raw reads as proxy for amplicon sequencing success. Results will be available in the directory riboscope is launched from under riboscope_test_results. An example of the output results_report.html has been included in /tests/results for reference. Note that the results are generated from a synthetic test dataset with excessively lenient QC thresholds and are solely meant to verify that the pipeline runs successfully. Results are not biologically meaningful and only intended to illustrate expected output formats.

nextflow run BCCDC-PHL/riboscope -profile conda,test

The following command can be used to run the basic pipeline:

nextflow run BCCDC-PHL/riboscope \
  -profile conda \
  --cache ~/.conda/envs \
  --fastq_input /path/to/fastq_files \
  --ref /path/to/ref.fa \
  --bed /path/to/primer_scheme.bed \
  --search_seqs /path/to/query/seq/multi.fasta \
  --kraken2_db /path/to/kraken2/database \
  --bracken_db /path/to/bracken/database \
  --outdir /path/to/output_dir

To activate quality control (QC) filtering and generation of html report, run with the --apply_qc flag:

nextflow run BCCDC-PHL/riboscope \
  -profile conda \
  --cache ~/.conda/envs \
  --fastq_input /path/to/fastq_files \
  --ref /path/to/ref.fa \
  --bed /path/to/primer_scheme.bed \
  --search_seqs /path/to/query/seq/multi.fasta \
  --kraken2_db /path/to/kraken2/database \
  --bracken_db /path/to/bracken/database \
  --apply_qc \
  --outdir /path/to/output_dir

By default, reads will be trimmed by fastp prior to alignment. To align the untrimmed reads instead, add the --align_untrimmed_reads parameter. If this option is used, the pipeline will proceed as follows, using the original untrimmed reads as input for the bwa alignment process:

flowchart TD
  ref[ref.fa]
  bed[scheme.bed]
  fastq[fastq]
  fastq --> fastp(fastp)
  ref --> bwa(bwa_mem)
  fastq -- untrimmed_reads --> bwa
  bwa -- alignment --> trim_primer_seqs(trim_primer_sequences)
  bed --> trim_primer_seqs
  trim_primer_seqs -- primertrimmed_alignment --> make_consensus(make_consensus)
  make_consensus -- consensus --> align_consensus_to_ref(align_consensus_to_ref)
  ref --> align_consensus_to_ref
Loading

Consensus Sequence

Consensus sequences are generated by applying SNPs called by LoFreq that are present at allele frequencies >= min_vaf to the reference sequence. Regions of the sequence covered by < min_depth in a sample are masked with 'N'. IUPAC bases encode REF + most dominant ALT allele, and do NOT encode for multiple ALT alleles at the same site. For example, at position 963 if there are 5A, 3T, 2G, the resulting IUPAC base will be W (A & T) not D (A & T & G). The parameters min_iupac and max_iupac control the range of frequencies of ALT alleles at which a SNP should be considered mixed. An ALT allele frequency < min_iupac = REF base in the consensus, an ALT allele frequency >= max_iupac = ALT base in the consensus sequence, and ALT allele frequencies between min_iupac and max_iupac result in IUPAC bases (REF + ALT). In the event of a tie in allele frequencies at multiallelic sites, the first in the VCF file is used. In the example of position 963 with 5A, 3T, 2G, if min_iupac is set to 0.1 (10%) and max_iupac is set to 0.9 (90%) the resulting consensus base will be W (A & T). To turn off IUPAC behaviour and generate a majority consensus, set min_iupac to the same value as max_iupac (ex. 50%), meaning allele frequencies below 0.5 will be the REF and those >= 0.5 will be the ALT. Alternatively, to reproduce default bcftools consensus -s - -I bcftools.norm.vcf.gz behaviour, set min_iupac to 0 and max_iupac to 0.5 (ALT allele frequencies < 0.5 will appear as IUPAC base (REF+ALT) and those >= 0.5 will be the dominant allele (ALT) in the consensus sequence). Note that LoFreq splits multiallelic sites into one ALT allele per row and does not produce genotype (GT) column with result per sample, both of which bcftools uses to call the consensus nucleotide (ex. A,T,G --> GT 1/2 --> T/G --> K).

Dehosting

Kraken2 and Bracken provide taxonomic classification and abundance estimation on samples before and after "dehosting". The pipeline does not align to host genome but instead uses pathogen inclusion by aligning reads to the target pathogen sequence and excluding any reads with bit set in the UNMAP flag (samtools view -F4). It is important to note that "other" reads in the kraken2_summary may be from the pathogen but at a higher taxonomic level. The number and percentage of pathogen reads is determined by an exact match, so you may obtain different results depending on the level of resolution you use (ex. 49% of reads may have been classified as "Treponema pallidum" but 90% are classified as "Treponema"). It is recommended to trial Kraken2 and select a taxon that encompasses the majority of reads for your organism of interest to an acceptable resolution. Therefore, the target taxon name chosen effectively determines the taxonomic level at the Kraken2 step. Due to how closely related Treponema pallidum subsp. pallidum is to other Treponema pallidum subspecies (TEN, TPE) the pipeline is designed to use Kraken2 results as a broad check for host or environmental contamination, then uses Bracken at taxonomy level of 'S1' (subspecies) to further resolve subspecies of Treponema pallidum to check specificty of amplification/ sequencing. Host reads (ex. Homo sapiens) are not classified to the subspecies level, which is why the Kraken2 report is used for an initial approximation of host, target pathogen, unclassified, and other reads.

Reporting Mutations According to Position in Genomic Feature(s)

Users may optionally adjust reported positions of mutations according to where they are located in a specified genomic feature by providing a GFF3 file. Mutations are displayed relative to position in each genomic feature. The same mutation may appear in multiple features when there are overlapping features. Coordinates are reported in strand-aware manner. The original position along the reference sequence is found under "snv" when hovering over a mutation in the report, while "feature_snv" shows the position of the mutation relative to the start of a specific feature from a given organism. The genomic features as well as the position are controlled by the user via GFF3 file. For example, a mutation may be reported relative to the start of a Treponema pallidum 23S rRNA gene or relative to an Escherichia coli 23S rRNA gene by simply aligning this gene to the reference and specifying the start and end positions relative to the reference in the GFF3 file. Note that this process is not gap-aware. For example, the position of the mutation relative to the reference sequence (POS = 4718) is converted to the equivalent position in the E.coli 23S rRNA gene (FEATURE_POS = 2100) but does not account for gaps from the E.coli 23S:TPA reference sequence pairwise alignment (42 gaps before position 2100, since E.coli 23S rRNA gene is “stretched” across the ref sequence). This means it is reported as 2100 in E.coli 23S rRNA gene but is actually 2100-42 gaps = 2058. Here is what the entry might look like in the *.feature.annotated.vcf VCF file for this example:

CHROM POS REF ALT AF DP4 SAMPLE_ID FEATURE_NAME FEATURE_STRAND FEATURE_POS NC_000919.1_ribo_rpt1 4718 A G 0.09 4730,7588,37,20 example_sample ecoli_23S + 2100

An example of the SNP plot in the results report with GFF3 file provided that specifies 16S, 23S, and 5S genes: workflow

Quality Control

The --apply_qc parameter generates the results_report.html which reports SNPs across genomic features as well as QC results, aggregates results for all samples, summarizes QC (qc_summary.csv), and filters out SNPs from low quality samples or amplicons that did not pass the minimum count of unique sequences (see --search_seqs). There are a number of parameters that control QC pass/fail thresholds in the nextflow.config file. For example, the --max_qc_flags adjusts the maximum allowable number of "soft fail" QC flags allowable before sample is failed. Currently, samples instantly fail if no amplicons are detected, number of reads are insufficient (< min_map), or there is insufficient pathogen content detected (see filter_qc.py).

Provenance

riboscope uses v1.3.0 of the nf-prov plugin with the legacy format. The inputs, outputs, command line ran, and process name of each task is recorded, along with workflow information such as pipeline version, commit hash, analysis start time, Nextflow session ID. Two processes exist for the sake of printing relevant provenance information (pipeline_provenance & print_hashed_records) like sha256 hashes of input raw sequencing read (FASTQ) files, as an alternative to producing and collating yml files within each process. This approach was taken to simplify code required and ensure script block is recorded dynamically. One caveat of this approach compared to aggregating provenance yaml files is that information is not grouped by sample, though this may be resolved through downstream parsing of provenance.json.

Input

Input Parameter Description Mandatory
Paired-end sequencing reads fastq_input Absolute path to directory containing raw FASTQ reads to be analyzed. Riboscope accepts gzip compressed or uncompressed files (*.fastq.gz, *.fq.gz, *.fastq, *.fq). true
Reference genome ref Reference genome used to align reads to during guided assembly true
BED file bed Primer scheme BED file in the 6 column format. Note that rows must be repeated for each gene copy (ex. 3 gene copies = same row 3 times). The way primers are named should include information about gene (ex. pool) and copy number (ex. rpt1) and match amplicon_groups parameter used to calculate median and minimum amplicon sequence counts. true
MultiFASTA of query sequences search_seqs MultiFASTA file contain short sequences to query raw reads. Used to verify presence of expected gene copies. true
Bracken database bracken_db Path to prebuilt bracken database (ex. 50, 75, 100-mers, etc) true
Kraken2 database kraken2_db Path to prebuilt kraken2 database (ex. standard_16gb) true
GFF3 file gff Path to the GFF3 file containing features user would like to identify mutations in. If provided, mutations are reported according to feature coordinates. If no GFF3 provided, mutations are reported as default coordinates in reference sequence false

Output

Output Description Notes
*_amplicon_coverage.{tsv,png} Average depth of coverage for each amplicon in reference sequence All output files listed with wildcard * are produced per sample in separate directories
*_{pre,post}_dehosting_kraken2.tsv The final formatted Kraken2 report summarizing read classification by pathogen, host, and other -
*_kraken_report_{pre,post}_dehosting.txt The raw Kraken2 report used as input for Bracken -
*_bracken_output_{pre,post}_dehosting.tsv The taxonomic classification of reads at level specified by user -
*_consensus_masked_iupac.fasta Consensus sequence of each sample, including masked regions below specified depth, mutations from *.lofreq.variants.vcf filtered to include those >= minimum VAF (--min_vaf), and adjustable thresholds for multiallelic sites encoded by IUPAC ambiguous bases --min_iupac: if vaf < min_iupac = ref in consensus while sites with alt alleles between min_iupac & max_iupac = mixed. --max_iupac: vaf that an alt is called in consensus.
*_depths.tsv Number of sequencing reads covering each position in the reference sequence none
*_fastp.csv Trimming and filtering read statistics. Uses fastp --trim_poly_g --trim_poly_x
*.expected.snps.vcf Variant Call Format (VCF) file required by LoFreq to call indels SNVs present at 1% used for recalibrating BQ according to GATK best practices, recommended by LoFreq for indel calling. See Jago et al. (2020) reference below.
*.lofreq.variants.vcf Raw SNVs called above minimum depth and including indels pre-filtering 'raw' SNVs
*.lofreq.formatted.vcf Formatted VCF file with SNVs filtered to include those with allele frequency of at least minimum VAF tsv file filtered by --min_vaf
*.feature.annotated.vcf Final processed VCF file containing mutations >= minimum VAF and overlapping genomic features Controlled by --gff genomic features. See Mutation coordinate system section
*.mapped.primertrimmed.sorted.{bam,bai} Sorted alignment of primer-trimmed reads to reference sequence & associated index file -
*_qualimap_alignment_qc.csv Formatted alignment quality report including breadth of coverage statistics -
*_rrna_counts.csv Number of hits for each query sequence in --search_seqs multifasta within forward and reverse raw sequencing read files of each sample. Used as a proxy to detect presence of same amplicons from different repeats by using query sequences unique to each amplicon -
*_samtools_stats_coverage_distribution.tsv Distribution of alignment depth per covered reference site from samtools stats ^COV
*_samtools_stats_insert_sizes.tsv Total read pairs and their orientation (inward vs. outward) per insert fragment size -
*_samtools_stats_summary.csv Formatted read and mapping quality report -
qc_summary.csv Aggregated QC results summary for all samples in directory including QC pass/fail status Only produced if either --collect_outputs or --apply_qc param is included
collected_samtools_stats_summary.csv Formatted read and mapping quality report for all samples in fastq_input dir Only produced if either --collect_outputs or --apply_qc param is included
collected_qualimap_alignment_qc.csv Alignment quality report including breadth of coverage statistics for all samples in run Only produced if either --collect_outputs or --apply_qc param is included
collected_rrna_counts.csv Collective results of "amplicon counts"/ query sequence hits from raw reads. See *_rrna_counts.csv for more details Only produced if either --collect_outputs or --apply_qc param is included
collected_lofreq.vcf Collective mutation results for all samples in --fastq_input directory. Results are aggregated *.feature.annotated.vcf files, produced by processing *.lofreq.formatted.vcf files by adjust_variant_coordinates.py script. Only produced if either --collect_outputs or --apply_qc param is included in command. Annotated with genomic feature(s) coordinates if --gff specified, else mutation coordinates remain the same (relative to reference sequence instead of genes/ features).
collected_fastp.csv Aggregated read trimming and filtering read results of all samples in input directory Only produced if either --collect_outputs or --apply_qc param is active
collected_bracken.tsv Collective Bracken taxonomic classification of Kraken2 reported reads at specified taxonomic level (--taxonomy_level) for all samples pre and post dehosting Only produced if either --collect_outputs or --apply_qc param is included
collected_kraken2_summary.tsv Aggregated Kraken2 classification results summarized by user specified pathogen (see Dehosting section), host and unclassified reads per sample both before and after dehosting Only produced if either --collect_outputs or --apply_qc param is included

Parameters

Note: Pipeline parameters listed below are preceded by a double-dash -- (ex. --fastq_input) while Nextflow related parameters use single dashes (ex. -profile)

Parameter Description Required Default
fastq_input Absolute path to directory containing raw FASTQ reads to be analyzed. Riboscope accepts gzip compressed or uncompressed files (*.fastq.gz, *.fq.gz, *.fastq, *.fq). yes none
outdir Absolute path to directory to write results to. no ./results
apply_qc Activate QC filtering of samples and generation of html results report no off
ref Reference genome used to align reads to during guided assembly yes none
bed Primer scheme BED file in the 6 column format yes none
search_seqs MultiFASTA file contain short sequences to query raw reads. Used to verify presence of expected gene copies. yes none
cache Directory to cache conda environments for future use. no ./work/conda
align_untrimmed_reads Skips read trimming by fastp. Allows alignment of raw untrimmed reads using bwa. no off
min_depth Minimum number of reads covering a genomic position. no 10
collect_outputs Summarize outputs of multiple samples into one. no off
collected_outputs_prefix Prefix to name multi-sample summary files with. no 'collected'
min_vaf Minimum allowable variant allele frequency (VAF) to filter. This porportion is applied to lofreq VCF and resulting *lofreq.formatted.vcf contains only SNPs present above this VAF. no 0
gff Path to the GFF3 file containing features user would like to identify mutations in. If provided, mutations are reported according to feature coordinates. If no GFF3 provided, mutations are reported as default coordinates in reference sequence. The reported features and mutation positions are dictated by the user in the GFF3 file. For example, a mutation may be reported relative to the start of a Treponema pallidum 23S rRNA gene or relative to an Escherichia coli 23S rRNA gene by simply aligning this gene to the reference and specifying the start and end positions in the GFF3 file. no none
amplicon_groups Comma-separated amplicon prefixes used for grouping counts of search seqs in raw reads. Median and minimum are calculated of reads per group of unique search sequences that represent each Treponema pallidum amplicon. Note that these prefixes are used in amplicon_qc df from syphilis_amplicons function & should match the way primers are named in bed file - ie. should contain information about gene and copy number no "rpt1_pool1_,rpt1_pool2_,rpt2_pool1_,rpt2_pool2_"
min_amplicon_count Minimum median query sequence counts to consider an amplicon as present no 300
min_pathogen Minimum percentage of reads classified as target pathogen in a sample no 85.0
max_host Maximum percentage of host reads in a sample no 5.0
min_q20_rate Minimum threshold for Q20 rate no 0.95
min_q30_rate Minimum threshold for Q30 rate no 0.85
min_mean_bq Minimum average base quality threshold no 30
min_mean_depth Minimum threshold for average depth no 100
min_mapped_paired Minimum threshold for number of reads that are mapped and paired no 1000
min_percent_mapped Minimum threshold for percentage of mapped reads no 0.8
min_mapped_reads Minimum threshold for number of mapped reads no 100
min_mean_mq Minimum threshold for average mapping quality no 50
min_10X_cov Minimum threshold for proportion of genome covered over 10X no 0.85
min_50X_cov Minimum threshold for proportion of genome covered over 50X no 0.85
max_secondary Maximum allowable number of secondary alignments no 0
min_pos_count Minimum allowable number of positive control query sequence counts for testing presence of amplicons no 100
max_qc_flags Maximum number of soft fail QC flags allowable before sample is failed no 5
qc_output Name or prefix of output file to write QC summary to no 'qc_summary'
html_template Path to html template to use while generating report no ./assets/report_template.html
kraken2_db Path to prebuilt kraken2 database (ex. standard_16gb) yes none
bracken_db Path to prebuilt bracken database (ex. 50, 75, 100-mers, etc) yes none
taxonomy_level Taxonomy level used by Bracken (ex. subspecies = S1) no S1
read_length Length of sequencing reads used by Bracken no 150
host_name Name of host used to parse Kraken2 results for taxonomic classification section of report and QC filtering (max_host) no Homo sapiens
pathogen_name Name of target pathogen used to parse Kraken2 results for taxonomic classification section of report and QC filtering (min_pathogen) no Treponema
reporting_vaf Minimum variant allele frequency in a given sample to report SNVs no 0.9
reporting_sample_perc Minimum percentage of total successfully sequenced samples containing a given SNV no 0.3

References

  1. Jago, M.J., Soley, J.K., Denisov, S. et al. High-throughput method characterizes hundreds of previously unknown antibiotic resistance mutations. Nat Commun 16, 780 (2025). https://doi.org/10.1038/s41467-025-56050-2
  2. https://github.com/BCCDC-PHL/amplicon-consensus

About

Scoping prokaryotic ribosomal repeat regions for SNVs from short paired-end sequences.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors