This R workflow automates the organization of genomic data (GD) files into species and subspecies folders based on MASH analysis results. It also generates visualizations showing the distribution of genomes across different taxonomic levels.
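The MASH analysis uses Mash, available from https://github.com/marbl/Mash/releases. First sketch the reference genomes, then compute distances for the new genomes: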
mash sketch -o reference path/to/each/genome
mash dist reference.msh NEW_GD_Genomes/*.fasta | sed 's/\t/,/g' > MASHOutSpecies.csv
This writes the results to a comma-separated file for easy viewing. The same steps can then be repeated for subspecies identification.
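As a rough illustration of how the resulting CSV might be read into R, the sketch below loads `mash dist` output and keeps the closest reference for each query. `mash dist` prints reference, query, distance, p-value, and matching hashes in that order with no header row; the column names used here are chosen to line up with the Query_ID and Reference_ID fields mentioned later. The sample rows and file paths are made up, and the "closest hit" filter is an assumption about how the filtered assignment table is produced, not the workflow's confirmed method.

```r
# Field order follows what `mash dist` prints; the names are our own choice.
mash_cols <- c("Reference_ID", "Query_ID", "Distance", "P_value", "Matching_hashes")

# Stand-in for MASHOutSpecies.csv (illustrative values only).
csv_text <- paste(c(
  "reference/M_abscessus.fasta,NEW_GD_Genomes/GD001.fasta,0.0012,0,950/1000",
  "reference/M_avium.fasta,NEW_GD_Genomes/GD001.fasta,0.2100,1e-10,12/1000",
  "reference/M_abscessus.fasta,NEW_GD_Genomes/GD002.fasta,0.1900,1e-12,20/1000",
  "reference/M_avium.fasta,NEW_GD_Genomes/GD002.fasta,0.0030,0,900/1000"
), collapse = "\n")

# The file has no header row, so column names are supplied explicitly.
mash <- read.csv(text = csv_text, header = FALSE,
                 col.names = mash_cols, stringsAsFactors = FALSE)

# Assumed filter: keep the reference with the smallest distance per query.
best <- do.call(rbind, lapply(split(mash, mash$Query_ID), function(d) {
  d[which.min(d$Distance), ]
}))
rownames(best) <- NULL

print(best[, c("Query_ID", "Reference_ID", "Distance")])
```

With a real file, `read.csv("MASHOutSpecies.csv", header = FALSE, col.names = mash_cols)` would take the place of the `text =` argument.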
Key Inputs:
- MASHOutSpecies.csv: CSV file containing species assignments (NewFilteredStrains_Species.csv)
- source_dir: Directory containing all GD files (~/Desktop/MASH_New/ALL_GD)
- target_dir: Output directory for sorted files (~/Desktop/MASH_New/Species_Sort)
Process:
- Reads species assignment data from MASH results
- Creates regex patterns from file prefixes (Query_ID)
- Matches each GD file to its corresponding species (Reference_ID)
- Copies files to species-specific subdirectories
- Generates an assignment_summary.csv log file
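The matching steps above can be sketched in base R as follows. This is a minimal illustration, not the workflow's actual script: the assignment table, the `.gd` file names, the `escape_regex` and `match_species` helpers, and the rule that a prefix must be followed by `.`, `_`, `-`, or the end of the name are all assumptions made for the example.

```r
# Hypothetical assignment table: Query_ID is a file prefix, Reference_ID a species.
assignments <- data.frame(
  Query_ID     = c("GD001", "GD002", "GD1.5"),
  Reference_ID = c("M_abscessus", "M_avium", "M_chelonae"),
  stringsAsFactors = FALSE
)

# Escape regex metacharacters so a prefix such as "GD1.5" is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Anchor each prefix at the start of the file name. The trailing group is an
# assumed delimiter rule that stops "GD001" from also matching "GD0010".
patterns <- paste0("^", escape_regex(assignments$Query_ID), "([._-]|$)")

# Return the species for one file name, or NA when zero or several prefixes match.
match_species <- function(file_name) {
  hit <- which(vapply(patterns, grepl, logical(1), x = file_name, perl = TRUE))
  if (length(hit) == 1) assignments$Reference_ID[hit] else NA_character_
}

# Hypothetical file names standing in for the contents of source_dir.
files <- c("GD001.gd", "GD002_rerun.gd", "GD1.5.gd", "GD0010.gd", "GD1X5.gd")

summary_tbl <- data.frame(
  File    = files,
  Species = vapply(files, match_species, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Write the log to a temporary directory for this demonstration.
write.csv(summary_tbl, file.path(tempdir(), "assignment_summary.csv"),
          row.names = FALSE)
print(summary_tbl)
```

Files that match no prefix are left as `NA` in the summary, so they can be reviewed before anything is copied.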
Configuration Options:
- dry_run = FALSE: Set to TRUE to preview assignments without moving files
- move_files = FALSE: Set to TRUE to move files instead of copying
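To show how the two options might interact, here is a small sketch of a copy step that honours `dry_run` and `move_files`. The `sort_file` helper, the demo directory under `tempdir()`, and the `GD001.gd` file are hypothetical; the real script may organise this differently.

```r
# Copy (or move) one file into its species folder.
# dry_run = TRUE only reports the planned action; move_files = TRUE removes
# the source after a successful copy.
sort_file <- function(src, species_dir, dry_run = FALSE, move_files = FALSE) {
  dest <- file.path(species_dir, basename(src))
  if (dry_run) {
    message("[dry run] ", src, " -> ", dest)
    return(invisible(dest))
  }
  dir.create(species_dir, recursive = TRUE, showWarnings = FALSE)
  ok <- file.copy(src, dest, overwrite = FALSE)
  # Delete the source only when the copy succeeded, so no file is lost.
  if (ok && move_files) file.remove(src)
  invisible(dest)
}

# Demonstration in a throwaway directory (paths are placeholders).
demo_root  <- file.path(tempdir(), "sort_demo")
unlink(demo_root, recursive = TRUE)
source_dir <- file.path(demo_root, "ALL_GD")
target_dir <- file.path(demo_root, "Species_Sort")
dir.create(source_dir, recursive = TRUE)

src <- file.path(source_dir, "GD001.gd")
writeLines("placeholder", src)

# Preview only: nothing is created under target_dir.
sort_file(src, file.path(target_dir, "M_abscessus"), dry_run = TRUE)

# Default behaviour: copy the file and leave the source in place.
sort_file(src, file.path(target_dir, "M_abscessus"))
```

Copying and then removing the source, rather than renaming, is a deliberate choice in this sketch: it also works when source and target sit on different file systems.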
Purpose: Sorts GD files into subspecies folders (specifically for M. abscessus complex).
Key Inputs:
- mash_csvSub: CSV file containing subspecies assignments (FilteredSubSpeciesNOV11.csv)
- source_dir: Same source directory as species sorting
- target_dir: Output directory for subspecies sorting (~/Desktop/MASH_New/SubSpecies_Sort)
Process:
- Follows the same workflow as species sorting (M. abscessus, M. avium, M. intracellulare, M. chimaera, M. chelonae, M. bovis, M. xenopi, M. terrae, M. simiae, M. szulgai, M. haemophilum, M. fortuitum, M. immunogenum, M. malmoense, M. heckeshornense, M. arupense), but at the subspecies level
- Matches files to subspecies classifications (M. abscessus, M. massiliense, M. bolletii)
Creates bar charts showing the number of genomes per species.
Two versions are generated:
- With Abscessus: Complete species distribution
- Without Abscessus: Filtered view excluding M. abscessus (useful when this species dominates the dataset)
Output:
- Species_Dist.png: Full species distribution
- Optional: Species_Dist_wo_ABSCESSUS.png
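The two charts could be produced along these lines with ggplot2. This is a sketch under stated assumptions: the `assignments` data frame and its counts are invented for illustration, `plot_species` is a hypothetical helper, and the images are written to `tempdir()` rather than the workflow's real output directory.

```r
library(ggplot2)

# Invented assignments: one row per genome, with its species label.
assignments <- data.frame(
  Species = c(rep("M_abscessus", 6), rep("M_avium", 2), "M_chelonae"),
  stringsAsFactors = FALSE
)

# Count genomes per species.
counts <- as.data.frame(table(Species = assignments$Species),
                        responseName = "Genomes", stringsAsFactors = FALSE)

# Bar chart of genome counts, with the most frequent species first.
plot_species <- function(df, title) {
  ggplot(df, aes(x = reorder(Species, -Genomes), y = Genomes)) +
    geom_col() +
    labs(x = "Species", y = "Number of genomes", title = title) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Full distribution, and a filtered view without the dominant species.
p_all <- plot_species(counts, "Species distribution")
p_wo  <- plot_species(counts[counts$Species != "M_abscessus", ],
                      "Species distribution (without M. abscessus)")

out_dir <- tempdir()
ggsave(file.path(out_dir, "Species_Dist.png"), p_all, width = 8, height = 5)
ggsave(file.path(out_dir, "Species_Dist_wo_ABSCESSUS.png"), p_wo,
       width = 8, height = 5)
```

Dropping M. abscessus before plotting lets the y-axis rescale, which makes the smaller groups readable when one species dominates.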
Creates a bar chart showing the distribution of M. abscessus subspecies:
- Abscessus
- Massiliense
- Bolletii
Output:
- Optional: Subspecies_sort.png
library(dplyr)    # Data manipulation
library(readr)    # CSV reading/writing
library(stringr)  # String operations
library(ggplot2)  # Visualization
- Species_Sort/assignment_summary.csv: Complete log of species assignments
- SubSpecies_Sort/assignment_summary.csv: Complete log of subspecies assignments
- Species_Dist.png: Species distribution chart
- Subspecies_sort.png: Subspecies distribution chart