Species-and-Subspecies-from-MASH

This R workflow automates the organization of genomic data (GD) files into species and subspecies folders based on MASH analysis results. It also generates visualizations showing the distribution of genomes across different taxonomic levels.

MASH is available from https://github.com/marbl/Mash/releases

The analysis was run with the following commands:

mash sketch -o reference path/to/each/genome

mash dist reference.msh NEW_GD_Genomes/*.fasta | sed 's/\t/,/g' > MASHOutSpecies.csv

This writes the results to a comma-separated CSV file for easy viewing. The same steps can then be repeated for subspecies identification.
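As a rough illustration of what can be done with that CSV in R, the sketch below reads a headerless file and keeps the closest reference for each query genome. `mash dist` writes five columns (reference ID, query ID, distance, p-value, shared hashes) with no header row. The names `Distance`, `P_value` and `Shared_Hashes` are assumptions made for this example, as are the toy rows and the `refs/` paths; only `Query_ID` and `Reference_ID` appear elsewhere in this README. The sketch uses base R so it runs without extra packages.

```r
# Toy rows standing in for MASHOutSpecies.csv (illustrative values only).
# Column order follows `mash dist`: reference, query, distance, p-value, hashes.
demo_lines <- c(
  "refs/M_abscessus.fasta,NEW_GD_Genomes/GD01.fasta,0.0012,0,950/1000",
  "refs/M_avium.fasta,NEW_GD_Genomes/GD01.fasta,0.1500,1e-20,40/1000",
  "refs/M_abscessus.fasta,NEW_GD_Genomes/GD02.fasta,0.1400,1e-22,45/1000",
  "refs/M_avium.fasta,NEW_GD_Genomes/GD02.fasta,0.0020,0,920/1000"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(demo_lines, csv_path)

# The file has no header, so column names are supplied here (assumed names)
mash <- read.csv(
  csv_path,
  header = FALSE,
  stringsAsFactors = FALSE,
  col.names = c("Reference_ID", "Query_ID", "Distance", "P_value", "Shared_Hashes")
)

# Sort by query and distance, then keep the first (closest) row per query
mash <- mash[order(mash$Query_ID, mash$Distance), ]
best_hits <- mash[!duplicated(mash$Query_ID), ]
rownames(best_hits) <- NULL

print(best_hits[, c("Query_ID", "Reference_ID", "Distance")])
```

Any distance cut-off or other filtering applied before sorting is not shown here and would need to be chosen for the dataset at hand.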

1. Species-Level File Sorting

Key Inputs:

  • MASHOutSpecies.csv: CSV file containing species assignments (NewFilteredStrains_Species.csv)
  • source_dir: Directory containing all GD files (~/Desktop/MASH_New/ALL_GD)
  • target_dir: Output directory for sorted files (~/Desktop/MASH_New/Species_Sort)

Process:

  1. Reads species assignment data from MASH results
  2. Creates regex patterns from file prefixes (Query_ID)
  3. Matches each GD file to its corresponding species (Reference_ID)
  4. Copies files to species-specific subdirectories
  5. Generates an assignment_summary.csv log file
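
The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the workflow's actual script: the temporary directories, the `.gd` file names, the toy assignment table and the `file`/`species`/`copied` summary columns are all hypothetical. The pattern quotes each prefix literally and refuses a trailing digit, so a prefix such as `GD1` would not also match `GD10`.

```r
# Hypothetical sandbox so the sketch never touches real data
source_dir <- tempfile("ALL_GD_")
target_dir <- tempfile("Species_Sort_")
dir.create(source_dir)
file.create(file.path(source_dir, c("GD01_output.gd", "GD02_output.gd", "GD10_output.gd")))

# Toy assignments: Query_ID is the file prefix, Reference_ID the species
assignments <- data.frame(
  Query_ID = c("GD01", "GD02"),
  Reference_ID = c("M_abscessus", "M_avium"),
  stringsAsFactors = FALSE
)

gd_files <- list.files(source_dir)

summary_rows <- lapply(seq_len(nrow(assignments)), function(i) {
  # \Q...\E treats the prefix as literal text; (?![0-9]) blocks longer IDs
  pattern <- paste0("^\\Q", assignments$Query_ID[i], "\\E(?![0-9])")
  hits <- gd_files[grepl(pattern, gd_files, perl = TRUE)]
  if (length(hits) == 0) return(NULL)

  # One subdirectory per species, created on demand
  species_dir <- file.path(target_dir, assignments$Reference_ID[i])
  dir.create(species_dir, recursive = TRUE, showWarnings = FALSE)
  copied <- file.copy(file.path(source_dir, hits), species_dir, overwrite = TRUE)

  data.frame(file = hits, species = assignments$Reference_ID[i],
             copied = copied, stringsAsFactors = FALSE)
})

# Combine the per-species rows and write the log file
summary_df <- do.call(rbind, summary_rows)
write.csv(summary_df, file.path(target_dir, "assignment_summary.csv"), row.names = FALSE)
print(summary_df)
```

Files with no matching prefix (here `GD10_output.gd`) are left in place and do not appear in the summary.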

Configuration Options:

  • dry_run = FALSE: Set to TRUE to preview assignments without moving files
  • move_files = FALSE: Set to TRUE to move files instead of copying
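
One way these two flags could gate the file operation is sketched below. The helper name `place_file()` and the demo paths are hypothetical; only `dry_run` and `move_files` come from the workflow itself. A dry run reports the planned destination and returns before anything is created, while `move_files` switches from `file.copy()` to `file.rename()`.

```r
# Hypothetical helper: place one file according to dry_run and move_files
place_file <- function(src, dest_dir, dry_run = FALSE, move_files = FALSE) {
  dest <- file.path(dest_dir, basename(src))
  if (dry_run) {
    # Preview only: nothing is created, copied or moved
    message("[dry run] ", src, " -> ", dest)
    return(invisible(FALSE))
  }
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  ok <- if (move_files) {
    file.rename(src, dest)                  # removes the source file
  } else {
    file.copy(src, dest, overwrite = TRUE)  # leaves the source file in place
  }
  invisible(ok)
}

# Demo in a temporary sandbox (hypothetical file name)
demo_root <- tempfile("gd_demo_")
dir.create(file.path(demo_root, "ALL_GD"), recursive = TRUE)
src_file <- file.path(demo_root, "ALL_GD", "GD01_output.gd")
file.create(src_file)

preview <- place_file(src_file, file.path(demo_root, "preview"), dry_run = TRUE)
copied  <- place_file(src_file, file.path(demo_root, "copied"))
moved   <- place_file(src_file, file.path(demo_root, "moved"), move_files = TRUE)
```

Note that `file.rename()` can fail when source and destination sit on different file systems, which is one reason to keep copying as the default.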

2. Subspecies-Level File Sorting

Purpose: Sorts GD files into subspecies folders (specifically for M. abscessus complex).

Key Inputs:

  • mash_csvSub: CSV file containing subspecies assignments (FilteredSubSpeciesNOV11.csv)
  • source_dir: Same source directory as species sorting
  • target_dir: Output directory for subspecies sorting (~/Desktop/MASH_New/SubSpecies_Sort)

Process:

  • Follows the same workflow as species sorting (M. abscessus, M. avium, M. intracellulare, M. chimaera, M. chelonae, M. bovis, M. xenopi, M. terrae, M. simiae, M. szulgai, M. haemophilum, M. fortuitum, M. immunogenum, M. malmoense, M. heckeshornense, M. arupense), but at the subspecies level
  • Matches files to subspecies classifications (M. abscessus, M. massiliense, M. bolletii)

3. Visualizations

Species Distribution Plot

Creates bar charts showing the number of genomes per species. Two versions are generated:

  1. With Abscessus: Complete species distribution
  2. Without Abscessus: Filtered view excluding M. abscessus (useful when this species dominates the dataset)

Output:

  • Species_Dist.png: Full species distribution
  • Optional: Species_Dist_wo_ABSCESSUS.png
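
A minimal sketch of how the two versions might be produced is shown below. The toy species vector, the helper names `count_species()` and `plot_counts()`, the plot titles and the temporary output directory are assumptions for illustration. Counting uses base R, and the plotting part only runs when ggplot2 is installed.

```r
# Toy species assignments, one entry per genome (illustrative values only)
species <- c("M. abscessus", "M. abscessus", "M. abscessus",
             "M. avium", "M. chelonae", "M. avium")

# Count genomes per species, optionally dropping some species first
count_species <- function(x, exclude = character(0)) {
  x <- x[!x %in% exclude]
  counts <- as.data.frame(table(Species = x), responseName = "Genomes",
                          stringsAsFactors = FALSE)
  counts[order(-counts$Genomes), ]
}

all_counts <- count_species(species)
no_abscessus_counts <- count_species(species, exclude = "M. abscessus")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Bar chart with species ordered from most to fewest genomes
  plot_counts <- function(counts, title) {
    ggplot(counts, aes(x = reorder(Species, -Genomes), y = Genomes)) +
      geom_col() +
      labs(title = title, x = "Species", y = "Number of genomes") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  }

  out_dir <- tempdir()  # hypothetical output location
  ggsave(file.path(out_dir, "Species_Dist.png"),
         plot_counts(all_counts, "Genomes per species"),
         width = 7, height = 5)
  ggsave(file.path(out_dir, "Species_Dist_wo_ABSCESSUS.png"),
         plot_counts(no_abscessus_counts, "Genomes per species (without M. abscessus)"),
         width = 7, height = 5)
}
```

Dropping the dominant species before counting lets the y-axis rescale to the remaining species, which is what makes the second chart readable.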

Subspecies Distribution Plot

Creates a bar chart showing the distribution of M. abscessus subspecies:

  • Abscessus
  • Massiliense
  • Bolletii

Output:

  • Optional: Subspecies_sort.png

Required R Packages

library(dplyr)    # Data manipulation
library(readr)    # CSV reading/writing
library(stringr)  # String operations
library(ggplot2)  # Visualization

Output Files

Assignment Logs

  • Species_Sort/assignment_summary.csv: Complete log of species assignments
  • SubSpecies_Sort/assignment_summary.csv: Complete log of subspecies assignments

Visualizations

  • Species_Dist.png: Species distribution chart
  • Subspecies_sort.png: Subspecies distribution chart