cmoyer-x/Species-and-Subspecies-from-MASH

Species-and-Subspecies-from-MASH

This R workflow automates the organization of genomic data (GD) files into species and subspecies folders based on MASH analysis results. It also generates visualizations showing the distribution of genomes across different taxonomic levels.

The MASH analysis uses Mash, available from https://github.com/marbl/Mash/releases

The analysis was run with the following commands:

mash sketch -o reference path/to/each/genome

mash dist reference.msh NEW_GD_Genomes/*.fasta | sed 's/\t/,/g' > MASHOutSpecies.csv

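This writes the results to a comma-separated file for easy viewing. Each row holds the reference ID, query ID, Mash distance, p-value, and shared-hash count; mash dist writes no header row. The same steps can then be repeated for subspecies identification.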
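A minimal sketch of how the resulting CSV could be loaded in R and reduced to one assignment per genome. The column names, the `best_hits` helper, and the "smallest distance wins" rule are assumptions made for illustration; the repository's own filtering step is not described here. The inline text stands in for MASHOutSpecies.csv and its values are made up.

```r
# Toy stand-in for MASHOutSpecies.csv (values are made up for illustration).
mash_text <- "ref/M_abscessus.fasta,NEW_GD_Genomes/GD001.fasta,0.012,0,780/1000
ref/M_avium.fasta,NEW_GD_Genomes/GD001.fasta,0.210,1e-10,12/1000
ref/M_abscessus.fasta,NEW_GD_Genomes/GD002.fasta,0.195,1e-12,15/1000
ref/M_avium.fasta,NEW_GD_Genomes/GD002.fasta,0.008,0,850/1000"

# The file has no header row, so column names are assigned here (assumed names).
mash <- read.csv(
  text = mash_text, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("Reference_ID", "Query_ID", "Distance", "P_value", "Shared_hashes")
)

# Keep, for each query genome, the reference with the smallest Mash distance.
best_hits <- function(df) {
  ordered <- df[order(df$Query_ID, df$Distance), ]
  ordered[!duplicated(ordered$Query_ID), ]
}

hits <- best_hits(mash)
print(hits[, c("Query_ID", "Reference_ID", "Distance")])
```

To run this against a real file, replace `text = mash_text` with the path to the CSV.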

1. Species-Level File Sorting

Key Inputs:

  • MASHOutSpecies.csv: CSV file containing species assignments (NewFilteredStrains_Species.csv)
  • source_dir: Directory containing all GD files (~/Desktop/MASH_New/ALL_GD)
  • target_dir: Output directory for sorted files (~/Desktop/MASH_New/Species_Sort)

Process:

  1. Reads species assignment data from MASH results
  2. Creates regex patterns from file prefixes (Query_ID)
  3. Matches each GD file to its corresponding species (Reference_ID)
  4. Copies files to species-specific subdirectories
  5. Generates an assignment_summary.csv log file
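Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: Query_ID is assumed to hold a file prefix and Reference_ID a species name, the file names are hypothetical, and the rule that a prefix must be followed by a non-alphanumeric character is added here so that `GD001` does not also match `GD0010`.

```r
# Hypothetical assignments table: Query_ID is a file prefix, Reference_ID a species.
assignments <- data.frame(
  Query_ID = c("GD001", "GD002"),
  Reference_ID = c("M_abscessus", "M_avium"),
  stringsAsFactors = FALSE
)

# Hypothetical GD file names as they might be listed from source_dir.
gd_files <- c("GD001_contigs.fasta", "GD002.fasta", "GD0010.fasta", "notes.txt")

# Escape regex metacharacters so each prefix is matched literally.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Assign each file the species of the first prefix pattern it matches.
match_species <- function(files, assignments) {
  species <- rep(NA_character_, length(files))
  for (i in seq_len(nrow(assignments))) {
    # Prefix at the start of the name, followed by a separator or end of string.
    pattern <- paste0("^", escape_regex(assignments$Query_ID[i]), "([^[:alnum:]]|$)")
    hit <- grepl(pattern, files) & is.na(species)
    species[hit] <- assignments$Reference_ID[i]
  }
  data.frame(file = files, species = species, stringsAsFactors = FALSE)
}

summary_tbl <- match_species(gd_files, assignments)
print(summary_tbl)
```

Files left with `NA` in the species column matched no prefix and would stay unsorted.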

Configuration Options:

  • dry_run = FALSE: Set to TRUE to preview assignments without moving files
  • move_files = FALSE: Set to TRUE to move files instead of copying
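One way these two flags might interact is sketched below. The `sort_files` helper, its argument names other than `dry_run` and `move_files`, and the choice to skip writing assignment_summary.csv during a dry run are assumptions for illustration. The demo uses a temporary directory rather than the ~/Desktop paths listed above.

```r
# Copy (or move) each matched file into target_dir/<species>/.
# With dry_run = TRUE nothing is written; the planned moves are only returned.
sort_files <- function(summary_tbl, source_dir, target_dir,
                       dry_run = FALSE, move_files = FALSE) {
  matched <- summary_tbl[!is.na(summary_tbl$species), ]
  matched$destination <- file.path(target_dir, matched$species, matched$file)
  if (!dry_run) {
    dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
    for (i in seq_len(nrow(matched))) {
      dir.create(dirname(matched$destination[i]), recursive = TRUE, showWarnings = FALSE)
      from <- file.path(source_dir, matched$file[i])
      if (move_files) {
        file.rename(from, matched$destination[i])
      } else {
        file.copy(from, matched$destination[i])
      }
    }
    write.csv(matched, file.path(target_dir, "assignment_summary.csv"), row.names = FALSE)
  }
  matched
}

# Demo in a throwaway directory with empty placeholder files.
base_dir <- tempfile("mash_demo_")
source_dir <- file.path(base_dir, "ALL_GD")
target_dir <- file.path(base_dir, "Species_Sort")
dir.create(source_dir, recursive = TRUE)
file.create(file.path(source_dir, c("GD001.fasta", "GD002.fasta")))

plan <- data.frame(
  file = c("GD001.fasta", "GD002.fasta"),
  species = c("M_abscessus", "M_avium"),
  stringsAsFactors = FALSE
)

preview <- sort_files(plan, source_dir, target_dir, dry_run = TRUE)
target_exists_after_preview <- dir.exists(target_dir)
copied <- sort_files(plan, source_dir, target_dir)
```

Note that `file.rename` can fail when source and target sit on different file systems, so copying is the safer default.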

2. Subspecies-Level File Sorting

Purpose: Sorts GD files into subspecies folders (specifically for M. abscessus complex).

Key Inputs:

  • mash_csvSub: CSV file containing subspecies assignments (FilteredSubSpeciesNOV11.csv)
  • source_dir: Same source directory as species sorting
  • target_dir: Output directory for subspecies sorting (~/Desktop/MASH_New/SubSpecies_Sort)

Process:

  • Follows the same workflow as species sorting (M. abscessus, M. avium, M. intracellulare, M. chimaera, M. chelonae, M. bovis, M. xenopi, M. terrae, M. simiae, M. szulgai, M. haemophilum, M. fortuitum, M. immunogenum, M. malmoense, M. heckeshornense, M. arupense), but at the subspecies level
  • Matches files to subspecies classifications (M. abscessus, M. massiliense, M. bolletii)

3. Visualizations

Species Distribution Plot

Creates bar charts showing the number of genomes per species. Two versions are generated:

  1. With Abscessus: Complete species distribution
  2. Without Abscessus: Filtered view excluding M. abscessus (useful when this species dominates the dataset)

Output:

  • Species_Dist.png: Full species distribution
  • Optional: Species_Dist_wo_ABSCESSUS.png
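A minimal ggplot2 sketch of the two plots. The data frame, the `count_species` and `plot_species` helpers, the plot titles, and the image dimensions are all hypothetical; only the output file names come from this README. The files are written to a temporary directory here.

```r
library(ggplot2)

# Hypothetical assignment log: one row per sorted genome.
assignments <- data.frame(
  file = paste0("GD", 1:6, ".fasta"),
  species = c("M_abscessus", "M_abscessus", "M_abscessus",
              "M_avium", "M_avium", "M_chelonae"),
  stringsAsFactors = FALSE
)

# Count genomes per species, optionally dropping some species first.
count_species <- function(df, exclude = character(0)) {
  kept <- df[!df$species %in% exclude, , drop = FALSE]
  counts <- as.data.frame(table(species = kept$species),
                          responseName = "n", stringsAsFactors = FALSE)
  counts[order(-counts$n), ]
}

# Bar chart with species ordered from most to least frequent.
plot_species <- function(counts, title) {
  ggplot(counts, aes(x = reorder(species, -n), y = n)) +
    geom_col() +
    labs(title = title, x = "Species", y = "Number of genomes") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

all_counts <- count_species(assignments)
no_abscessus <- count_species(assignments, exclude = "M_abscessus")

out_dir <- tempdir()
ggsave(file.path(out_dir, "Species_Dist.png"),
       plot_species(all_counts, "Genomes per species"),
       width = 6, height = 4)
ggsave(file.path(out_dir, "Species_Dist_wo_ABSCESSUS.png"),
       plot_species(no_abscessus, "Genomes per species (M. abscessus excluded)"),
       width = 6, height = 4)
```

Dropping the dominant species before counting lets the y-axis rescale, so the less frequent species remain readable.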

Subspecies Distribution Plot

Creates a bar chart showing the distribution of M. abscessus subspecies:

  • Abscessus
  • Massiliense
  • Bolletii

Output:

  • Optional: Subspecies_sort.png

Required R Packages

library(dplyr)    # Data manipulation
library(readr)    # CSV reading/writing
library(stringr)  # String operations
library(ggplot2)  # Visualization

Output Files

Assignment Logs

  • Species_Sort/assignment_summary.csv: Complete log of species assignments
  • SubSpecies_Sort/assignment_summary.csv: Complete log of subspecies assignments

Visualizations

  • Species_Dist.png: Species distribution chart
  • Subspecies_sort.png: Subspecies distribution chart

About

This code takes MASH output and sorts FASTA files into species and subspecies folders to determine assignments and their frequencies. At the end of the script, visualizations are generated for viewing the outputs.