
inrae/MSpangepop


mspangenome

mspangenome is a workflow for simulating pangenome variation graphs from coalescent simulations.
A simplified description of the algorithm can be found here.

The official mspangenome repository can be found at the INRAE forge.
A GitHub mirror can be found at INRAE GitHub.
The mirror is especially useful for people with no Renater account to submit issues.

Documentation

  • Master Configuration: how to set up configuration files and parameters
  • Demographic model: adapt or create a model
  • Output Files: description of generated results and outputs
  • Advanced Topics: in-depth information for power users

Workflow Stages

| Stage | Process | Scripts | Rules |
|---|---|---|---|
| 1. Setup | Validate FASTA/YAML → Expand configs → Create index | input_index.py, sample_ranges.py, recap.py | setup |
| 2. Msprime Simulation | Build demographic model → Run msprime → Generate visualizations | msprime_simulation.py, visualizer_arg.py, visualizer_tree.py | msprime_simulation, visualization |
| 3. Preprocessing | Split by locus → Preorder traverse trees → Define SV type, length and position | coalescent_traversal.py, draw_variants.py, split_recombination.py | coalescent_traversal, draw_variants, split_recombination |
| 4. Graph Creation | Initialize: Build locus ancestral graphs. Mutate: Apply variants using mspangenome library. Save: Assign IDs → Merge subgraphs → Lint → Export chopped graph | graph_creation.py, graph_classes.py, graph_utils.py, matrix.py | graph_creation |
| 5. Unchop | VG unchop command | - | graph_merging |
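The "Preorder traverse trees" step in stage 3 can be illustrated with a small, generic sketch. It is not taken from coalescent_traversal.py: the tree is a hypothetical plain dictionary mapping each node to its children, and `preorder` is an illustrative name. A parent-first order has the property that every ancestral node is visited before its descendants.

```python
from typing import Dict, Iterator, List


def preorder(children: Dict[int, List[int]], root: int) -> Iterator[int]:
    """Yield node IDs parent-first, visiting children left to right."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(children.get(node, [])))


# Hypothetical toy tree: node 6 is the root, nodes 0-3 are sampled lineages.
toy_tree = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
order = list(preorder(toy_tree, 6))
print(order)
```

An explicit stack is used instead of recursion so that deep trees do not hit Python's recursion limit.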

How to Use

1. Set up

  • Clone the Git repository:
git clone https://forge.inrae.fr/pangepop/MSpangepop
  • Create an environment for Snakemake (from the provided environment file):
conda env create -n wf_env -f dependencies/wf_env.yaml

2. Configure the pipeline for your data

Three elements are needed to run the simulation:

For a quick test:

Edit the masterconfig file in the .config/ directory with your sample information. (Master Configuration)

nano .config/masterconfig.yaml

Example config:

samples:
  test_run:
    model: "simulation_data/Panmictic_Model.json"
    replicates: 1
  • model is the demographic scenario the simulation will run on. You can create your own or tailor the ones in ./simulation_data (Demographic model configuration)

  • ⚠️ Don't want to create your own model? Use the provided Panmictic_Model.json: edit it to specify your genome (fasta_gz), then adjust mutation_rate and recombination_rate (start with low values).
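Before launching a run, it can help to sanity-check the samples block. The sketch below is an illustration, not the workflow's own validation (the Setup stage performs that). It assumes the configuration has already been loaded into a dictionary shaped like the example above, and the rules it applies (model is a .json path, replicates is a positive integer) are inferred from that example only. `check_samples` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def check_samples(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    samples = config.get("samples")
    if not isinstance(samples, dict) or not samples:
        return ["'samples' must be a non-empty mapping"]

    problems: List[str] = []
    for name, entry in samples.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be a mapping")
            continue
        model = entry.get("model")
        # Assumption: models are JSON files, as in the example config.
        if not isinstance(model, str) or not model.endswith(".json"):
            problems.append(f"{name}: 'model' should be a path to a .json file")
        replicates = entry.get("replicates")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (isinstance(replicates, bool) or not isinstance(replicates, int)
                or replicates < 1):
            problems.append(f"{name}: 'replicates' should be a positive integer")
    return problems


# Same content as the example masterconfig, written as a Python dict.
example = {
    "samples": {
        "test_run": {
            "model": "simulation_data/Panmictic_Model.json",
            "replicates": 1,
        }
    }
}
print(check_samples(example))
```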

3. Run the workflow

On the cluster

  • Run the workflow:
sbatch mspangenome dry # Dry run: check for warnings
sbatch mspangenome run # Then launch the workflow

Note: If your account name can't be determined automatically, add it to the .config/snakemake/profiles/slurm/config.yaml file.

Note: Use the command squeue --format="%.10i %.9P %.6j %.10k %.8u %.2t %.10M %.6D %.20R" -A $user to see job names.

Locally

./mspangenome dry # Dry run: check for warnings
./mspangenome local-run # Then launch the workflow

Other running options

mspangenome [dry|run|local-run|dag|rulegraph|unlock|touch] [additional snakemake args]
    dry - run in dry-run mode
    run - run the workflow with SLURM
    local-run - run the workflow locally (on a single node)
    dag - generate the directed acyclic graph for the workflow
    rulegraph - generate the rulegraph for the workflow
    unlock - unlock the directory if snakemake crashed
    touch - tell snakemake that all files are up to date (use with caution)
    [additional snakemake args] - for any snakemake arg, like --until hifiasm

Path Operations in mspangenome

mspangenome implements graph path operations to add variants by modifying how lineages traverse the graph.

Core Features:

  • Multi-path targeting - Operations apply to single or multiple lineage paths simultaneously, enabling both unique and shared variants
  • Orientation-aware - All operations preserve node directionality using edges that track exit and entry node sides, creating orientation-aware links (++, +-, -+, --)
  • Composable - Operations can be nested and overlapping (e.g., deletion within inversion), representing complex compound variants

These operations modify paths through existing nodes rather than altering the graph structure, maintaining shared sequences while creating alternative routes for different lineages. New nodes (e.g., for insertions) are generated using a first-order Markov model to produce realistic sequences.
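As an illustration of what a first-order Markov model means here, the following minimal sketch draws each base conditionally on the previous one. It is not the mspangenome implementation: estimating transition counts from a reference string, the add-one smoothing, and the names `fit_transitions` and `generate` are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict


def fit_transitions(reference: str) -> Dict[str, Dict[str, int]]:
    """Count how often each base follows each other base in `reference`."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(reference, reference[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: Dict[str, Dict[str, int]], length: int,
             rng: random.Random, alphabet: str = "ACGT") -> str:
    """Sample a sequence where each base depends only on the previous base."""
    if length <= 0:
        return ""
    seq = [rng.choice(alphabet)]  # first base: uniform (an assumption)
    while len(seq) < length:
        row = counts.get(seq[-1], {})
        # Add-one smoothing keeps unseen transitions possible.
        weights = [row.get(base, 0) + 1 for base in alphabet]
        seq.append(rng.choices(alphabet, weights=weights, k=1)[0])
    return "".join(seq)


counts = fit_transitions("ACGTACGTACGGTTAACC")  # toy reference sequence
insertion = generate(counts, 20, random.Random(42))
print(insertion)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible for a given seed.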

| Operation | Function | Used For | Path Change |
|---|---|---|---|
| bypass(a,b) | Skip nodes a to b | Deletions | Creates shortcut edge |
| loop(a,b) | Duplicate nodes a to b | Tandem duplications | Adds loop-back + repeat |
| invert(a,b) | Reverse nodes a to b | Inversions | Flips path direction |
| swap(a,node) | Replace node at position a | SNPs | Substitutes single node |
| paste(a,a+1,nodes) | Insert between adjacent nodes | Insertions | Adds new node sequence |
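The operations in the table can be sketched on a toy data model. The code below is not the mspangenome library: it assumes a lineage path is a list of (node ID, orientation) steps, that a and b are inclusive positions in that list, and that new nodes enter in forward orientation. The `links` helper is hypothetical and only shows how oriented links (++, +-, -+, --) follow from consecutive steps.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (node ID, orientation), orientation is "+" or "-"
Path = List[Step]


def bypass(path: Path, a: int, b: int) -> Path:
    """Deletion: skip steps a..b, leaving a shortcut from a-1 to b+1."""
    return path[:a] + path[b + 1:]


def loop(path: Path, a: int, b: int) -> Path:
    """Tandem duplication: traverse steps a..b a second time."""
    return path[:b + 1] + path[a:b + 1] + path[b + 1:]


def invert(path: Path, a: int, b: int) -> Path:
    """Inversion: reverse the order of steps a..b and flip each orientation."""
    flipped = [(n, "-" if o == "+" else "+") for n, o in reversed(path[a:b + 1])]
    return path[:a] + flipped + path[b + 1:]


def swap(path: Path, a: int, node: str) -> Path:
    """SNP: replace the step at position a with a new node."""
    return path[:a] + [(node, "+")] + path[a + 1:]


def paste(path: Path, a: int, b: int, nodes: List[str]) -> Path:
    """Insertion: place new nodes between adjacent positions a and b."""
    if b != a + 1:
        raise ValueError("paste expects adjacent positions")
    return path[:b] + [(n, "+") for n in nodes] + path[b:]


def links(path: Path) -> List[Tuple[str, str, str, str]]:
    """Oriented links between consecutive steps of a path."""
    return [(u, uo, v, vo) for (u, uo), (v, vo) in zip(path, path[1:])]


ref = [("n1", "+"), ("n2", "+"), ("n3", "+"), ("n4", "+")]
inverted = invert(ref, 1, 2)
nested = bypass(invert(ref, 0, 3), 1, 1)  # deletion within an inversion
print(inverted, links(inverted), nested)
```

Each function returns a new list and leaves the input path untouched, so the same reference path can be reused to derive several lineages, some sharing a variant and some not.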

Support and Development

mspangenome is developed at INRAE as part of the PangenOak project.

