Rigorous Alignment Curation: Cleanup Of Outliers and Noise
Raccoon is a lightweight toolkit for post-consensus genomic QC and phylogenetic quality control. It provides modular tools for sequence metadata harmonization, alignment curation, and phylogenetic tree assessment. Raccoon identifies problematic sequences and sites (e.g., clustered SNPs, SNPs near Ns/gaps, frame‑breaking indels, long branches, and convergent mutations) and produces detailed reports, mask files, and curated datasets for downstream analyses.
Rationale: Quality assessment and curation of genomic sequence data is essential for robust phylogenetic inference. By systematically evaluating sequence quality, alignment accuracy, and tree topology, raccoon helps researchers identify and address data issues that could compromise epidemiological or evolutionary conclusions before proceeding with downstream analysis.
- Use cases
- Integrated workflows
- Installation
- Quickstart
- Typical Workflow
- CLI usage
- Output Descriptions
- Mask notes
- Example data
- Tutorial
- Harmonise sequence headers using metadata files (CSV/TSV) with flexible templating.
- Match sequence identifiers to metadata and flag mismatches or missing fields.
- Filter sequences by length and ambiguous base content.
- Generate combined FASTA files with structured, epidemiologically-informative headers.
- Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
- Detect SNPs adjacent to low-coverage regions (Ns) or gaps, which may reflect data quality issues.
- Identify frame-breaking indels in coding regions using a GenBank reference.
- Generate detailed reports on alignment quality with visual summaries of site conservation and N-content.
- Generate mask files to exclude suspect sites or sequences prior to phylogenetic analysis.
- Apply masks to alignments using customizable masking characters.
- Assess tree topology for anomalies including long branches and unexpected clustering.
- Evaluate molecular clock assumptions via root-to-tip regression analysis.
- Detect sources of bias including APOBEC3-mediated and ADAR-induced mutations.
- Flag convergent mutations and reversion events (when ancestral state reconstruction is available).
- Identify sequences for removal prior to downstream temporal or evolutionary analyses.
Running best-practice phylogenetics can be challenging. With raccoon, a simple alignment and phylogenetic workflow can be customised with data quality in mind.
A) Input files
- input sequences (one or more FASTA files, or a directory containing FASTA files)
- input metadata (one or more CSV or TSV metadata files, or a directory containing metadata files)
B) raccoon seq-qc
Outputs:
- a combined fasta file with sequence headers harmonised and populated from the metadata fields
- seq-qc_report.html (a report describing the dataset, the matching, the output and any issues identified with the data)
- seq-qc_filter_failures.csv (sequences that fail the QC filters: maximum N content and minimum length)
- seq-qc_metadata_issues.csv (flagging missing metadata fields or sequences that failed to match metadata)
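The length and N-content filters described above can be sketched in a few lines of Python. This is an illustration of the idea only, not raccoon's implementation: the function names, the failure labels, and the thresholds used in the example are assumptions.

```python
def n_content(seq: str) -> float:
    """Proportion of N characters in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def filter_records(records, min_length=None, max_n_content=None):
    """Split (header, sequence) pairs into kept records and failure rows.

    Hypothetical helper: a filter is skipped when its threshold is None.
    """
    kept, failures = [], []
    for header, seq in records:
        reasons = []
        if min_length is not None and len(seq) < min_length:
            reasons.append("min_length")
        if max_n_content is not None and n_content(seq) > max_n_content:
            reasons.append("max_n_content")
        if reasons:
            failures.append((header, ";".join(reasons)))
        else:
            kept.append((header, seq))
    return kept, failures


records = [
    ("sampleA", "ACGTACGTAC"),  # passes both filters
    ("sampleB", "ACGNNNNNAC"),  # 50% N
    ("sampleC", "ACGT"),        # too short
]
kept, failures = filter_records(records, min_length=8, max_n_content=0.2)
print(kept)      # [('sampleA', 'ACGTACGTAC')]
print(failures)  # [('sampleB', 'max_n_content'), ('sampleC', 'min_length')]
```

The failure rows carry the reason for each rejection, which is the kind of information a file such as seq-qc_filter_failures.csv would need to record.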
C) alignment
Multiple sequence alignment is a key step prior to running phylogenetics. It is the scaffold upon which we can begin to reconstruct the evolutionary relationships between different sequences in the tree. We will run alignment using MAFFT, which is a popular software tool for creating multiple sequence alignments.
Output:
- An aligned fasta file
D) raccoon aln-qc
A high-quality alignment is crucial to generating a good phylogenetic tree. Being able to accurately assess whether there are issues with your multiple sequence alignment is a key skill that we will cover today.
The alignment is checked for various issues that may impact the quality of the phylogenetic inference. Different kinds of SNPs (clustered SNPs, N-adjacent SNPs, gap-adjacent SNPs) are flagged because they may suggest issues with the alignment or with a given sequence. If a sequence has many flagged issues (more than 20 by default), that sequence is flagged for removal from the analysis. A flagged SNP is not necessarily an error; it may reflect genuine biological variation. However, these sites may need to be investigated closely.
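To make the flagging logic concrete, here is a minimal Python sketch of clustered-SNP detection and of the per-sequence removal threshold. It is not raccoon's code: the function names and the window and count defaults are assumptions for illustration; only the removal threshold of more than 20 flags comes from the description above.

```python
from collections import Counter


def clustered_snps(snp_positions, window=10, min_count=3):
    """Return SNP positions lying in any span shorter than `window` bp
    that contains at least `min_count` SNPs (assumed definition)."""
    positions = sorted(snp_positions)
    flagged = set()
    start = 0
    for end, pos in enumerate(positions):
        # Advance the left edge until the span is shorter than the window.
        while pos - positions[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)


def sequences_to_remove(flags, threshold=20):
    """flags: iterable of (sequence_id, site) pairs, one per flagged site.
    A sequence is marked for removal when it has more than `threshold` flags."""
    counts = Counter(seq_id for seq_id, _site in flags)
    return sorted(seq_id for seq_id, n in counts.items() if n > threshold)


print(clustered_snps([5, 100, 103, 107, 500]))  # [100, 103, 107]

flags = [("seq1", i) for i in range(21)] + [("seq2", i) for i in range(20)]
print(sequences_to_remove(flags))  # ['seq1']
```

Note that a sequence with exactly 20 flags is kept under this reading of "default >20".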
Output:
- aln-qc_report.html (a report describing the input alignment, N content and any SNPs that were flagged as possibly problematic)
- mask_sites.csv (describes the sites flagged for investigation or masking and the sequences flagged for removal)
E) tree estimation
Tree building is run using IQTREE. The substitution model used is configurable and an outgroup can optionally be included. If an outgroup is included, ancestral state reconstruction will be run during the tree building process to provide additional checks on the tree, and the outgroup sequence will be pruned off from the final tree. In this case, as we are not yet familiar with the data, we will not select an outgroup as it is not clear what an appropriate outgroup would be.
Key output:
- *.treefile (a maximum likelihood tree file)
F) raccoon tree-qc
Output:
- tree-qc_report.html (report showing the tree, a root-to-tip plot and any issues that were flagged during the tree-qc process)
- *.phylo_flags.csv
- A midpoint rooted tree (if no outgroup provided)
- Branch reconstruction file (if outgroup provided)
- State difference file (if outgroup provided)
A typical raccoon workflow progresses through six stages, one of which is optional:
- Sequence QC – Combine and harmonize sequence metadata across multiple input files
- Alignment – Generate a multiple sequence alignment (using external tools like MAFFT)
- Alignment QC – Flag problematic sites and generate a mask file
- Optional Masking – If the flagged sites appear to need removal from the sequences, the alignment can be masked using this step
- Tree estimation – Estimate a maximum likelihood phylogeny (using external tools like IQTREE)
- Phylogenetic QC – Assess tree quality and identify outlier sequences
Start by harmonizing sequence headers and combining multiple FASTA files:
raccoon seq-qc -f samples_batch1.fasta samples_batch2.fasta \
-m metadata.csv \
--metadata-id-field sample_id \
--metadata-location-field location \
--metadata-date-field collection_date \
-o combined_sequences.fasta

Input:
- Raw FASTA files and metadata CSV/TSV files

Output:
- combined_sequences.fasta with structured headers (e.g., sample_id|location|date)
- A seq-qc HTML report describing the metadata matching, header harmonisation and sequence QC filters applied
Align sequences using MAFFT (or another aligner):
mafft --auto combined_sequences.fasta > alignment.fasta

Input: Combined FASTA file
Output: Multiple sequence alignment in FASTA format
Assess alignment quality and identify suspect sites:
raccoon aln-qc alignment.fasta -d alignment_qc_results \
--reference-id reference_seq_id

Input:
- Aligned FASTA file

Output:
- Mask file, detailed flags, and HTML report (see Output Descriptions)
- aln-qc html report describing the
Exclude flagged sites from downstream analysis:
raccoon mask alignment.fasta \
--mask-file alignment_qc_results/mask_sites.csv \
-d alignment_qc_results \
-o alignment.masked.fasta

Input: Alignment FASTA file and mask CSV
Output: Masked alignment with flagged sites replaced by mask character
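Conceptually, masking replaces the flagged alignment columns with the mask character in every sequence. The Python sketch below illustrates that idea under simplifying assumptions: the alignment is held as a dict of equal-length strings, sites are given as 1-based column numbers, and the function name is hypothetical. It does not parse the real mask_sites.csv layout.

```python
def apply_mask(alignment, sites, mask_char="?"):
    """Return a copy of `alignment` with the given 1-based columns masked.

    alignment: dict mapping sequence name -> aligned sequence string
    sites: iterable of 1-based alignment column numbers to mask
    """
    columns = {pos - 1 for pos in sites}  # convert to 0-based indices
    return {
        name: "".join(
            mask_char if i in columns else base for i, base in enumerate(seq)
        )
        for name, seq in alignment.items()
    }


alignment = {"seqA": "ACGTAC", "seqB": "ACGTTC"}
masked = apply_mask(alignment, sites=[2, 5])
print(masked)  # {'seqA': 'A?GT?C', 'seqB': 'A?GT?C'}
```

Masking a column rather than deleting it keeps the alignment coordinates unchanged, so positions in later reports still refer to the original alignment.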
Build phylogeny (using IQ-TREE or similar):
iqtree -s alignment.masked.fasta -m GTR+G -bb 1000 -alrt 1000 -asr

Output: .treefile (phylogeny) and .state (ancestral state reconstruction, written by IQ-TREE when -asr is set)
Assess tree topology and identify problematic sequences:
raccoon tree-qc --tree alignment.masked.fasta.treefile \
--alignment alignment.masked.fasta \
--asr-state alignment.masked.fasta.state \
-d tree_qc_results \
--run-adar --adar-window 300 --adar-min-count 3 \
--run-apobec

Input: Tree file, alignment FASTA, and ASR state file
Output: Interactive HTML report and flagged sequence list
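The root-to-tip regression used to assess temporal signal is an ordinary least-squares fit of each tip's distance from the root against its sampling date. The Python sketch below shows the calculation on made-up numbers; the function name and the returned keys are assumptions, and raccoon's own report may compute or present this differently.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against decimal sampling date.

    Returns the slope (substitutions/site/year), the intercept, the
    x-intercept (a rough estimate of the root date) and R squared.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    if sxx == 0:
        raise ValueError("all sampling dates are identical; no temporal signal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {
        "slope": slope,
        "intercept": intercept,
        "x_intercept": -intercept / slope if slope else None,
        "r_squared": (sxy ** 2) / (sxx * syy) if syy else 0.0,
    }


# Illustrative values only: four tips sampled one year apart.
dates = [2020.0, 2021.0, 2022.0, 2023.0]
distances = [0.001, 0.002, 0.003, 0.004]
fit = root_to_tip_regression(dates, distances)
print(round(fit["slope"], 6), round(fit["x_intercept"], 3))  # 0.001 2019.0
```

A positive slope with a high R squared suggests clock-like evolution; tips far from the fitted line are candidates for closer inspection.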
For complete phylogenetic quality-control workflows, raccoon-nf integrates raccoon's modular tools with alignment and tree-building software (MAFFT, IQ-TREE) in a production-ready Nextflow pipeline. The raccoon-nf pipeline coordinates all QC steps in sequence:
- Sequence QC – harmonise headers and filter sequences
- Alignment – run MAFFT on combined sequences
- Alignment QC – assess alignment quality and flag problematic sites
- Tree estimation – build phylogenetic tree with IQTREE
- Tree QC – evaluate tree topology and identify outliers
raccoon-nf can be run through the EPI2ME desktop interface for users without command-line expertise. See the tutorial for a complete walkthrough.
From source:
pip install artic-raccoon

Show help:

raccoon --help

Basic usage:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta

With metadata-driven headers:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv other_metadata.csv \
--metadata-id-field sample \
--metadata-location-field location \
--metadata-date-field date \
--header-separator '|'

With a custom header template:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv --header-fields "{id}|{country}|{date}"

Key options:
- -f, --fasta: input FASTA files (one or more) (required)
- -o, --outfile: output FASTA file (default: combined.fasta; use - for stdout)
- -m, --metadata: metadata CSV file(s) for header harmonisation
- --metadata-delimiter: metadata delimiter (default: comma; .tsv auto-detected)
- --metadata-id-field: metadata ID column (default: sample)
- --metadata-location-field: metadata location column (default: location)
- --metadata-date-field: metadata date column (default: date)
- --header-fields: template for custom headers (e.g. {id}|{country}|{date})
- --header-separator: separator used for non-template harmonised headers (default: |)
- --seq-id-delimiter: delimiter for parsing IDs from input headers (default: |)
- --seq-id-field-index: 0-based field index for parsed sequence ID (default: 0)
- --min-length: minimum sequence length to keep
- --max-n-content: maximum N-content proportion to keep
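As a rough illustration of how a header template such as {id}|{country}|{date} can be filled from a metadata row, the Python sketch below uses the standard library's string formatting. The function name is hypothetical, and it assumes the placeholder names match the metadata column names directly, which may not be how raccoon resolves them.

```python
from string import Formatter


def build_header(template, row):
    """Fill `template` from a metadata row (dict of column name -> value).

    Returns (header, missing): header is None when any placeholder has no
    non-empty value in the row, and missing lists those placeholders.
    """
    fields = [name for _, name, _, _ in Formatter().parse(template) if name]
    missing = [name for name in fields if not row.get(name)]
    if missing:
        return None, missing
    return template.format(**row), []


template = "{id}|{country}|{date}"
row = {"id": "sample01", "country": "UK", "date": "2024-01-05"}
print(build_header(template, row))
# ('sample01|UK|2024-01-05', [])

print(build_header(template, {"id": "sample02", "country": ""}))
# (None, ['country', 'date'])
```

Reporting the missing placeholders instead of writing a partial header makes it straightforward to list metadata problems separately from the output FASTA.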
Basic usage:
raccoon aln-qc <alignment.fasta> -d outdir

With GenBank reference for frame-break checks:
raccoon aln-qc <alignment.fasta> -d outdir \
--genbank <reference.gb> --reference-id <ref_id>

Disable selected flag classes:
raccoon aln-qc <alignment.fasta> -d outdir \
--no-flag-n-adjacent --no-flag-gap-adjacent

Key options:
- alignment (positional): input alignment FASTA file (required)
- -d, --outdir: output directory (default: .)
- -t, --sequence-type: sequence type, nt or aa (default: nt)
- --genbank: GenBank file for frame-breaking indel checks
- --reference-id: reference sequence ID in alignment (for GenBank features)
- --max-n-content: N-content threshold for flagging
- --cluster-window: window size (bp) for clustered SNP detection
- --cluster-count: minimum SNPs in-window to mark as clustered
- --no-flag-clustered: skip clustered SNP flagging
- --no-flag-n-adjacent: skip N-adjacent SNP flagging
- --no-flag-gap-adjacent: skip gap-adjacent SNP flagging
- --no-flag-frame-break: skip frame-breaking indel flagging
- --flag-removal-threshold: mark sequence for removal above this flagged-site count
raccoon mask <alignment.fasta> \
--mask-file results/alignment_qc/mask_sites.csv \
-d results/alignment_qc

Key options:
- --mask-file: mask CSV file from aln-qc
- --mask-character: character to use for masking (default: ?)
- -o, --outfile: output masked alignment file name
- -d, --outdir: output directory
- -t, --sequence-type: nt or aa (default: nt)
Basic usage:
raccoon tree-qc --tree <treefile> -d outdir \
--alignment <alignment.fasta> --asr-state <treefile>.state \
--run-adar --adar-window 300 --adar-min-count 3

Key options:
- -t, --tree: input phylogeny file (required)
- -d, --outdir: output directory (default: .)
- --tree-format: auto, newick, or nexus (default: auto)
- --alignment: alignment FASTA used with ASR state file
- --asr-state: ancestral state reconstruction file in IQTREE format
- --assembly-refs: assembly/reference FASTA used for mapping
- --outgroup-ids: comma-separated outgroup sequence IDs
- --mask-file: optional mask CSV with sites to ignore
- --tip-fields: template for parsing tip-label fields
- --tip-field-delimiter: delimiter used for tip field parsing
- --tip-date-field: field name treated as date in tip parsing
- --long-branch-sd: SD threshold for long-branch flagging (default: 3.0)
- --midpoint-root: midpoint-root tree for report visualisation (applied only when --asr-state is not provided)
- --run-apobec: run APOBEC3 checks
- --run-adar: run ADAR checks
- --adar-window: max distance (bp) for ADAR cluster window (default: 300)
- --adar-min-count: min ADAR sites in window to flag branch (default: 3)
- --height: optional figure height
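One plausible reading of the --long-branch-sd threshold is a z-score cut-off on branch lengths: a branch is flagged when its length exceeds the mean by more than the given number of standard deviations. The Python sketch below implements that reading only. The function name and the choice of statistic are assumptions, and raccoon may define long branches differently.

```python
from statistics import mean, stdev


def long_branches(branch_lengths, sd_threshold=3.0):
    """Return the names of branches whose length is more than `sd_threshold`
    sample standard deviations above the mean branch length.

    branch_lengths: dict mapping a tip or node name -> branch length
    """
    values = list(branch_lengths.values())
    if len(values) < 2:
        return []
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return sorted(
        name for name, length in branch_lengths.items()
        if (length - mu) / sd > sd_threshold
    )


# Illustrative values: twenty short branches and one much longer branch.
lengths = {f"tip{i}": 0.001 for i in range(20)}
lengths["outlier"] = 0.05
print(long_branches(lengths))  # ['outlier']
```

With very few branches no value can exceed three sample standard deviations, so a threshold of this kind is only informative on reasonably sized trees.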
See full CLI details in docs/cli.md.
- Combined FASTA file (e.g., combined.fasta): Sequences with harmonized headers, upper-case, single-line format
- Statistics (logged to console): Summary of:
  - Total sequences processed
  - Sequences retained vs filtered
  - Metadata matches/mismatches
  - Sequences failing length and N-content filters
Generated in the specified output directory (default: .):
- mask_sites.csv: Tab-separated file listing flagged sites with flag types (clustered_snps, N_adjacent, gap_adjacent, frame_break)
- alignment_flags.tsv: Detailed per-sequence report showing all flagged sites for each sequence
- alignment_qc_report.html: Interactive HTML report with:
  - Alignment conservation visualization
  - N-content and gap distribution plots
  - Interactive flags table
  - Recommendations for problematic sequences
- alignment_summary.txt: Text summary of flagging statistics
- Masked alignment FASTA (e.g., alignment.masked.fasta): Original alignment with flagged sites replaced by mask character (default: ?)
- Masking report (logged to console): Summary of sites and sequences masked
Generated in the specified output directory (default: .):
- tree_qc_report.html: Interactive HTML report featuring:
  - Phylogenetic tree visualization with branch coloring by flagged issues
  - Long-branch flagging with branch-length deviations
  - APOBEC3-mediated and ADAR-induced mutation clusters (if runs enabled)
  - Convergent mutation and reversion detection
  - Root-to-tip regression plots for temporal signal assessment
- flagged_sequences.txt: List of sequences recommended for removal with justification
- convergent_mutations.csv (if ASR provided): Convergent mutations detected
- tree_summary.txt: Text summary of tree QC findings
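The APOBEC3 and ADAR checks rest on well-known mutation signatures: APOBEC3 editing typically appears as C>T in a TC context (G>A in a GA context on the opposite strand), and ADAR editing as A>G (T>C on the opposite strand). The Python sketch below classifies a single substitution between a parent and child sequence and tests whether ADAR-type sites cluster within a window. The function names are hypothetical, and this is a sketch of the general idea rather than raccoon's implementation; the defaults of 300 bp and 3 sites mirror --adar-window and --adar-min-count.

```python
def classify_mutation(parent, child, i):
    """Classify the substitution at 0-based index i between two aligned,
    ungapped sequences as 'apobec3', 'adar', or None."""
    ref, alt = parent[i], child[i]
    # APOBEC3: TC>TT, or GA>AA when read on the opposite strand.
    if ref == "C" and alt == "T" and i > 0 and parent[i - 1] == "T":
        return "apobec3"
    if ref == "G" and alt == "A" and i + 1 < len(parent) and parent[i + 1] == "A":
        return "apobec3"
    # ADAR: A>G, or T>C when read on the opposite strand.
    if (ref, alt) in {("A", "G"), ("T", "C")}:
        return "adar"
    return None


def has_adar_cluster(positions, window=300, min_count=3):
    """True if at least `min_count` positions fall within `window` bp."""
    positions = sorted(positions)
    for i in range(len(positions) - min_count + 1):
        if positions[i + min_count - 1] - positions[i] <= window:
            return True
    return False


print(classify_mutation("ATCGA", "ATTGA", 2))  # apobec3 (TC>TT)
print(classify_mutation("ACGT", "GCGT", 0))    # adar (A>G)
print(has_adar_cluster([10, 150, 290, 5000]))  # True
print(has_adar_cluster([10, 400, 900]))        # False
```

Individual A>G or T>C changes are common in ordinary evolution, which is why the sketch only treats a cluster of them on one branch as noteworthy.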
Mask output uses the following note values:
| Note | Meaning |
|---|---|
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the CDS frame length. |
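The frame_break note can be illustrated with a small Python sketch: within a coding region, a run of gaps whose length is not a multiple of three shifts the reading frame. The function name is hypothetical, and the sketch assumes that alignment columns coincide with reference coordinates (an ungapped reference) and that the CDS is given as 1-based inclusive positions. Insertions relative to the reference are not handled here.

```python
import re


def frame_breaking_gaps(aligned_seq, cds_start, cds_end):
    """Return (start, length) for each gap run inside the CDS whose length
    is not a multiple of three. Coordinates are 1-based alignment columns."""
    region = aligned_seq[cds_start - 1:cds_end]
    hits = []
    for match in re.finditer(r"-+", region):
        length = match.end() - match.start()
        if length % 3 != 0:
            hits.append((cds_start + match.start(), length))
    return hits


# A 2-base gap (frame-breaking) and a 3-base gap (in frame).
seq = "ATGAAA--CCCGGG---TTTTAA"
print(frame_breaking_gaps(seq, cds_start=1, cds_end=len(seq)))  # [(7, 2)]
```

The three-base deletion is left unflagged because it removes a whole codon and keeps the downstream reading frame intact.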
The examples folder includes a constructed alignment and GenBank reference suitable for quick testing:
A comprehensive tutorial covering sequence metadata harmonisation, multiple sequence alignment, alignment curation, phylogenetic inference, and tree assessment is available at artic.network/tutorials/raccoon.nf. The tutorial includes:
- Step-by-step guidance on preparing sequence and metadata files.
- Instructions for running raccoon-nf through the EPI2ME interface.
- Interpretation of QC reports and identification of common data issues.
- Best practices for curating alignments and assessing phylogenetic results.
- Interactive exercises using provided example datasets.
The tutorial is suitable for both guided workshop delivery and self-paced learning.