A Snakemake workflow for assigning clinical Mycobacterium abscessus isolates to Dominant Circulating Clones (DCCs 1–7) and identifying transmission pairs from whole genome sequencing data.
Mycobacterium abscessus is a rapidly growing nontuberculous mycobacterium (NTM) and an increasingly important opportunistic pathogen, particularly in patients with cystic fibrosis (CF), bronchiectasis, and other chronic lung diseases. It is notoriously difficult to treat due to intrinsic and acquired antibiotic resistance.
Genomic epidemiology studies have revealed that M. abscessus infections are not solely acquired from the environment. Instead, a significant proportion of infections are caused by a small number of highly transmissible lineages that circulate globally among susceptible patients — these are called Dominant Circulating Clones (DCCs).
DCCs were formally defined by Ruis et al. 2021 as genomic clusters containing isolates from at least 20 patients across multiple continents, identified through whole genome sequencing of over 2,000 clinical isolates. Seven DCCs (DCC1–7) were identified:
| DCC | Subspecies | Geographic spread | Key features |
|---|---|---|---|
| DCC1 | M. a. abscessus | Global — 4 continents | Largest and most widespread |
| DCC2 | M. a. abscessus | 3 continents | — |
| DCC3 | M. a. massiliense | 3 continents | Massiliense lineage |
| DCC4 | M. a. abscessus | 3 continents | — |
| DCC5 | M. a. abscessus | 4 continents | Highly dispersed |
| DCC6 | M. a. massiliense | Multi-continental | Massiliense lineage |
| DCC7 | M. a. massiliense | Multi-continental | Massiliense lineage |
Important: DCC6 and DCC7 are M. a. massiliense lineages — not abscessus as sometimes assumed. This was confirmed by SNP distance analysis against subspecies reference genomes in the development of this pipeline.
Understanding which DCC a patient's isolate belongs to has important clinical implications — it can inform infection control decisions, identify likely transmission routes, and distinguish patient-to-patient transmission from independent environmental acquisition.
Within each DCC, isolates that are highly similar (≤10–20 SNPs after recombination removal) are considered putative transmission pairs — patients who likely shared the same strain through direct or indirect contact. This pipeline identifies these pairs at two thresholds:
- ≤10 SNPs — likely direct transmission (Ruis et al. 2021)
- ≤20 SNPs — probable transmission cluster (Bronson et al. 2021)
# 1. Clone repository
git clone https://github.com/cmoyer-x/mab-dcc-pipeline
cd mab-dcc-pipeline
# 2. Create dedicated Snakemake environment (recommended)
conda create -n snakemake -c conda-forge -c bioconda snakemake -y
conda activate snakemake
# 3. Place input FASTAs in input/assemblies/
cp /path/to/your/*.fasta input/assemblies/
# 4. Dry run to preview steps and check estimated runtime
snakemake --cores 16 --use-conda --dry-run
# 5. Run pipeline
snakemake --cores 16 --use-condaNote: All reference genomes and DCC anchor sequences (~3-4GB) are downloaded automatically on first run. Subsequent runs skip the download step entirely.
| Step | Tool | Description |
|---|---|---|
| 0a | Python | Input validation — checks FASTA format, file sizes, isolate count |
| 0b | wget | Download subspecies references (ATCC19977, CIP_108297, CCUG_50184) |
| 0c | wget | Download DCC anchor sequences (SRR36966619, ERR363247, etc.) |
| 1 | MASH | Species identification — non-M. abscessus excluded and logged |
| 2 | MASH | Subspecies assignment → abscessus / massiliense / bolletii |
| 3 | Snippy | SNP calling against subspecies reference genome |
| 4 | snippy-core | Core SNP alignment per subspecies including DCC references |
| 5 | FastBAPS | Population structure clustering — assigns isolates to lineages |
| 6 | Gubbins | Recombination removal per FastBAPS cluster — required before SNP distances |
| 7 | snp-dists | Pairwise SNP distances from Gubbins-filtered alignments |
| 8 | Custom | DCC assignment → DCC1–7 + Non-DCC (500 SNP threshold) |
| 9 | Custom | Outputs — Excel spreadsheets + interactive HTML map + pipeline summary |
Note: Phylogenetic tree building is not included in the pipeline. Core SNP alignments are produced at
results/core/{abscessus,massiliense}/core.alnand can be used directly with any tree building software. See Phylogenetic Trees below for recommended methods.
M. abscessus undergoes frequent recombination — horizontal gene transfer events that introduce large blocks of foreign DNA which appear as clusters of SNPs. Without removing these recombinant regions, SNP distances between isolates are artificially inflated, leading to:
- False negatives — genuine transmission pairs that appear too distant
- False positives — unrelated isolates that appear close due to shared recombination events
Gubbins identifies and masks recombinant regions per FastBAPS cluster, producing a recombination-free SNP alignment that accurately reflects vertical evolutionary history. All transmission pair calls (≤10 and ≤20 SNPs) are made from these filtered alignments, exactly matching the methodology of Ruis et al. 2021.
The pipeline automatically validates all input files before starting. It will:
- Fail with an error if fewer than 5 isolates are provided
- Warn if fewer than 20 isolates are provided — FastBAPS and RAxML require sufficient diversity to produce meaningful results
- Warn if fewer than 50 isolates are provided — recommended minimum for reliable DCC assignment
- Warn if any FASTA file is smaller than 500KB — may indicate an incomplete assembly
- Fail with an error if any FASTA file is empty or malformed
Example warning for small datasets:
[mab-dcc-pipeline] WARNING: Only 15 isolates found. FastBAPS clustering and
RAxML trees require at least 20-50 isolates for meaningful results.
On first run the pipeline automatically downloads all required references:
Subspecies references (downloaded to references/subspecies/):
| File | Strain | Accession | Purpose |
|---|---|---|---|
| ATCC19977.fasta | M. a. abscessus | GCF_000069185.1 | SNP calling reference |
| CIP_108297.fasta | M. a. massiliense | GCF_001792625.1 | SNP calling reference |
| CCUG_50184.fasta | M. a. bolletii | GCF_000701545.1 | SNP calling reference |
DCC anchor sequences (downloaded to results/dcc_refs/):
| Accession | DCC | Source |
|---|---|---|
| SRR36966619 | DCC1 | Ruis et al. 2021 |
| ERR363247 | DCC2 | Ruis et al. 2021 |
| ERR1045629 | DCC3 | Ruis et al. 2021 |
| ERR1081288 + ERR494841 | DCC4 | Ruis + Dedrick et al. |
| ERR363431 + ERR484982 | DCC5 | Ruis + Dedrick et al. |
| ERR2524314 + A47 | DCC6 | Ruis + Dedrick et al. |
| ERR363320 + FLAC047 | DCC7 | Ruis + Dedrick et al. |
If your server has no internet access, download these files on a local machine and transfer them manually to the paths above.
All outputs written to results/final/:
| File | Description |
|---|---|
DCC_assignments_FINAL.xlsx |
Per-isolate DCC assignments including subspecies, FastBAPS cluster, and DCC label |
transmission_pairs_FINAL.xlsx |
Putative transmission pairs at ≤10 SNPs and ≤20 SNPs after recombination removal |
pipeline_summary.html |
Interactive run summary — isolate counts, DCC distribution, species report |
Core SNP alignments are also produced and can be used for downstream analysis:
| File | Description |
|---|---|
results/core/abscessus/core.aln |
M. a. abscessus core SNP alignment including DCC reference strains |
results/core/massiliense/core.aln |
M. a. massiliense core SNP alignment including DCC reference strains |
Note: Phylogenetic tree building is not included in the pipeline. The core alignments above can be used directly with RAxML, IQ-TREE, FastTree, or any other tree builder. See Phylogenetic Trees below for the recommended RAxML method matching Ruis et al. 2021.
All dependencies managed automatically via --use-conda.
| Tool | Version | Purpose |
|---|---|---|
| Snakemake | ≥7.0 | Workflow manager |
| Mash | ≥2.0 | Species and subspecies identification |
| Snippy | 4.6.0 | Reference-based SNP calling |
| FastBAPS | 1.0.8 | Population structure clustering |
| Gubbins | ≥3.3 | Recombination removal |
| RAxML | ≥8.2 | Maximum likelihood phylogenetic trees |
| snp-dists | ≥0.8 | Pairwise SNP distances |
| Python | ≥3.8 | Analysis and output scripts |
Edit config.yaml to set:
- Input/output directories
- SNP thresholds (default: ≤10 strict, ≤20 broad, 500 DCC membership)
- Computational resources (cores, memory)
| DCC | Subspecies | Reference | Source |
|---|---|---|---|
| DCC1 | M. a. abscessus | SRR36966619 | Ruis et al. 2021 |
| DCC2 | M. a. abscessus | ERR363247 | Ruis et al. 2021 |
| DCC3 | M. a. massiliense | ERR1045629 | Ruis et al. 2021 |
| DCC4 | M. a. abscessus | ERR1081288 + ERR494841 | Ruis + Dedrick et al. |
| DCC5 | M. a. abscessus | ERR363431 + ERR484982 | Ruis + Dedrick et al. |
| DCC6 | M. a. massiliense | ERR2524314 + A47 | Ruis + Dedrick et al. |
| DCC7 | M. a. massiliense | ERR363320 + FLAC047 | Ruis + Dedrick et al. |
DCC4 and DCC5 are highly related lineages that FastBAPS frequently places in the same population structure cluster. When this occurs the pipeline automatically uses pairwise SNP distance to the closest reference strain to distinguish them. This distance-based tiebreaking was validated against a 312-isolate cohort and correctly separates the two lineages.
DCC6 is a relatively rare massiliense lineage. Only isolates within ~1,000 SNPs of the DCC6 reference strains (A47/ERR2524314) are assigned to DCC6 — more divergent isolates that cluster phylogenetically near DCC6 but are >5,000 SNPs away are classified as Non-DCC massiliense. This conservative assignment reflects the biological reality that these isolates represent independent acquisitions rather than transmission from the DCC6 lineage.
By default this pipeline is designed for unbiased clinical cohort analysis — you provide all isolates from your patient population and the pipeline assigns them to DCCs without any pre-selection. This approach:
- Captures the true diversity of M. abscessus circulating in your patient population
- Identifies Non-DCC lineages that may represent novel or emerging clones
- Avoids the sampling bias introduced by deliberately including only known DCC representatives
- Produces a higher proportion of Non-DCC isolates than published global datasets — this is expected and biologically meaningful, not a methodological flaw
To anchor your analysis to the global DCC framework more tightly, you can supplement your cohort with publicly available isolates from published studies. This approach is useful when:
- You want to formally validate Non-DCC clusters against the full global dataset
- You are proposing a new DCC and need multi-continental evidence
- You want to compare your cohort directly against Ruis et al. 2021 figures
To add global genomes, download SRA accessions from published studies and place them in input/assemblies/ alongside your cohort isolates. Key public datasets:
| Study | Accessions | N isolates | Notes |
|---|---|---|---|
| Ruis et al. 2021 | ERP017141 | 2,045 | Primary DCC definition dataset |
| Bronson et al. 2021 | PRJNA648717 | ~200 | North American cohort |
| Dedrick et al. | PRJNA738526 | 90 | Lab paper reference strains |
Warning: Adding global genomes significantly increases runtime. Consider downloading only the DCC representative strains rather than the full dataset.
Non-DCC isolates are not sequencing failures. They represent:
- Genuinely distinct lineages — phylogenetically intermediate between defined DCCs
- Locally circulating clones — lineages with active transmission within your cohort not yet sampled globally at sufficient scale to meet formal DCC criteria
- Independent environmental acquisitions — isolates acquired directly from environmental sources
A cluster of Non-DCC isolates meeting ≥10 isolates from ≥2 countries at the 500 SNP threshold should be considered a candidate novel DCC.
The locations file is optional but required for the interactive HTML transmission map. It should be an Excel file (.xlsx) with the following format:
| Column 1 | Column 2 |
|---|---|
| Isolate ID | Institution or location |
Example format (use generic placeholder names in your own file):
| Isolate | Institution |
|---|---|
| Sample001 | Hospital A, City, Country |
| Sample002 | Hospital B, City, Country |
Rules:
- Column 1 must contain the isolate ID exactly as it appears in your FASTA filenames (without the
.fastaextension) - Column 2 should contain the institution name or city/country
- The pipeline extracts country from the institution name automatically — include recognizable country keywords (e.g. Australia, UK, Canada, USA, France)
- No header row required but one is accepted
- Isolates without a matching location entry are excluded from the map but still included in all other outputs
- Place the file at
input/locations.xlsxand setlocations_file: "input/locations.xlsx"inconfig.yaml
Supported countries for map plotting: Australia, New Zealand, UK, USA, Canada, France, Netherlands, Germany, Italy, Spain, Israel, Ireland, Finland, Switzerland, Singapore, Taiwan, Turkey, Latvia, Portugal, Slovakia
If your institution is in a country not listed above it will default to USA. Add additional country detection logic to
workflow/scripts/generate_html_map.pyif needed.
Reset the pipeline without re-downloading references:
# Remove results but keep downloads (~3GB of DCC refs)
snakemake clean --cores 1
# Remove everything including downloads
snakemake clean_all --cores 1Before committing to a full run, preview what the pipeline will do:
snakemake --cores 16 --use-conda --dry-runThis prints a summary including isolate count, estimated runtimes, and whether reference downloads are needed:
============================================================
M. abscessus DCC Assignment Pipeline
============================================================
Isolates found: 150
Total input size: 0.72 GB
Output directory: results/
Locations file: YES
Estimated runtimes (approximate, based on 312-isolate validation run):
MASH species/subspecies: ~30 minutes
Snippy SNP calling: ~16 hours
FastBAPS clustering: ~30 minutes
Gubbins recombination: ~48 hours (most time-consuming step)
Final outputs: ~5 minutes
Total: ~65 hours on 16 cores
============================================================
A timestamped progress log is written to results/pipeline_progress.log during each run, making it easy to track which steps have completed:
[2025-05-07 14:23:01] Input validation passed: 150 isolates found
[2025-05-07 14:23:01] Config validation passed
When running on small test datasets (<20 isolates) RAxML tree building is automatically skipped. A minimum of ~50 isolates is recommended for meaningful phylogenetic trees and FastBAPS clustering.
Important: Small test datasets (<10 isolates) may produce all Non-DCC results because the core SNP alignment lacks sufficient variable sites to distinguish DCC lineages. DCC reference sequences must be included in the alignment for correct assignment — this happens automatically via the download_dcc_references rule.
DCC reference sequences (~3–4GB) are downloaded automatically on first run. If your server has no internet access download the references on a local machine and transfer them to results/dcc_refs/.
This pipeline was validated on a 312-isolate clinical cohort of M. abscessus complex isolates:
| DCC | Pipeline | Expected | Status |
|---|---|---|---|
| DCC1 | 115 | 115 | ✅ exact match |
| DCC2 | 28 | 28 | ✅ exact match |
| DCC3 | 24 | 24 | ✅ exact match |
| DCC4 | 51 | 39 | ✅ combined DCC4+DCC5 = 82 matches |
| DCC5 | 31 | 43 | ✅ combined DCC4+DCC5 = 82 matches |
| DCC6 | 1 | 6 | ✅ conservative assignment (see note above) |
| DCC7 | 20 | 20 | ✅ exact match |
| Non-DCC | 38 | 31 | ✅ difference explained by DCC6 reclassification |
Transmission pairs identified: 218 at ≤10 SNPs, 449 at ≤20 SNPs.
Phylogenetic tree building is intentionally excluded from the pipeline due to the long runtime required (2–3 days for 300+ isolates with 100 bootstraps). The core SNP alignments produced by the pipeline can be used directly with any tree building software.
After the pipeline completes, core SNP alignments are available at:
results/core/abscessus/core.aln— M. a. abscessus alignment including DCC reference strainsresults/core/massiliense/core.aln— M. a. massiliense alignment including DCC reference strains
RAxML v8.2 with GTR+GAMMA model and 100 bootstraps, as used in Ruis et al. 2021:
conda activate analyze_mab # or any env with RAxML
# Step 1 — Run 100 bootstrap replicates
nohup raxmlHPC-PTHREADS -s results/core/abscessus/core.aln -n abscessus_raxml -m GTRGAMMA -p 12345 -b 12345 -N 100 -T 4 -w $(realpath results/trees/abscessus/) &> raxml_bootstrap.log &
# Step 2 — Find best ML tree
raxmlHPC-PTHREADS -s results/core/abscessus/core.aln -n abscessus_besttree -m GTRGAMMA -p 12345 -T 4 -w $(realpath results/trees/abscessus/)
# Step 3 — Draw bootstrap values onto best tree
raxmlHPC -f b -t results/trees/abscessus/RAxML_bestTree.abscessus_besttree -z results/trees/abscessus/RAxML_bootstrap.abscessus_raxml -m GTRGAMMA -n abscessus_final -w $(realpath results/trees/abscessus/)The output file RAxML_bipartitions.abscessus_final can be uploaded directly to iTOL for visualization.
After DCC assignments are complete, generate iTOL annotation files using the tip labels from your tree:
# Get tip labels from bipartitions tree
grep -oP '[A-Z][A-Z0-9_]+(?=:)' results/trees/abscessus/RAxML_bipartitions.abscessus_final > /tmp/abscessus_tips.txt
# Generate color strip and label color files
python3 << 'PYEOF'
import csv
tips = open("/tmp/abscessus_tips.txt").read().splitlines()
dcc_map = {}
with open("results/dcc/abscessus_assignments.csv") as f:
for row in csv.DictReader(f):
dcc_map[row["Isolate"]] = row["DCC"]
DCC_COLORS = {
"DCC1":"#60a5fa","DCC2":"#34d399","DCC4":"#a78bfa",
"DCC5":"#fbbf24","DCC6":"#e11d48","DCC7":"#06b6d4",
"Non-DCC":"#fb923c","Unknown":"#888888"
}
REF_DCC = {
"SRR36966619":"DCC1","ERR363247":"DCC2","ERR1081288":"DCC4",
"ERR494841":"DCC4","ERR363431":"DCC5","ERR484982":"DCC5",
"ERR2524314":"DCC6","ERR363320":"DCC7"
}
lines = ["DATASET_COLORSTRIP","SEPARATOR TAB","DATASET_LABEL DCC",
"COLOR #888888","LEGEND_TITLE DCC",
"LEGEND_COLORS #60a5fa #34d399 #a78bfa #fbbf24 #e11d48 #06b6d4 #fb923c",
"LEGEND_LABELS DCC1 DCC2 DCC4 DCC5 DCC6 DCC7 Non-DCC","DATA"]
for tip in tips:
dcc = REF_DCC.get(tip) or dcc_map.get(tip) or dcc_map.get(tip.replace("_WGS","").replace("_hybrid",""), "Unknown")
color = DCC_COLORS.get(dcc, "#888888")
lines.append(f"{tip} {color} {dcc}")
with open("results/trees/abscessus/iTOL_colorstrip.txt","w") as f:
f.write("
".join(lines))
print("iTOL annotation written to results/trees/abscessus/iTOL_colorstrip.txt")
PYEOF- Ruis et al. 2021 — Dissemination of Mycobacterium abscessus via global transmission networks. Nature Microbiology. PMC8478660
- Bronson et al. 2021 — Global phylogenomic analyses of M. abscessus. PMC8390669
- Dedrick et al. — Phage therapy of Mycobacterium infections. PMC10583746
Casey Moyer — University of Pittsburgh, Phage Genomics and Bioinformatics GitHub: cmoyer-x