mspangenome is a workflow for simulating pangenome variation graphs from coalescent simulations.
A simplified description of the algorithm can be found here.
The official mspangenome repository can be found at the INRAE forge.
A GitHub mirror can be found at INRAE GitHub.
The mirror is especially useful for submitting issues if you do not have a Renater account.
| Documentation | Description |
|---|---|
| Master Configuration | How to set up configuration files and parameters |
| Demographic model | Adapt or create a model |
| Output Files | Description of generated results and outputs |
| Advanced Topics | In-depth information for power users |
| Stage | Process | Scripts | Rules |
|---|---|---|---|
| 1. Setup | → Validate FASTA/YAML → Expand configs → Create index | `input_index.py`, `sample_ranges.py`, `recap.py` | `setup` |
| 2. Msprime Simulation | → Build demographic model → Run msprime → Generate visualizations | `msprime_simulation.py`, `visualizer_arg.py`, `visualizer_tree.py` | `msprime_simulation`, `visualization` |
| 3. Preprocessing | → Split by locus → Preorder traverse trees → Define SV type, length and position | `coalescent_traversal.py`, `draw_variants.py`, `split_recombination.py` | `coalescent_traversal`, `draw_variants`, `split_recombination` |
| 4. Graph Creation | Initialize: Build locus ancestral graphs. Mutate: Apply variants using the mspangenome library. Save: Assign IDs → Merge subgraphs → Lint → Export chopped graph | `graph_creation.py`, `graph_classes.py`, `graph_utils.py`, `matrix.py` | `graph_creation` |
| 5. Unchop | VG unchop command | - | `graph_merging` |
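The "Preorder traverse trees" step of stage 3 can be illustrated with a small, self-contained sketch. It assumes a coalescent tree stored as a plain parent-to-children dictionary (a hypothetical stand-in, not the data structure used by `coalescent_traversal.py`) and shows why visiting parents before children is convenient: a variant placed on an ancestral branch is inherited by every sampled lineage below it.

```python
from collections import defaultdict

def preorder(children, root):
    """Yield the nodes of a rooted tree in preorder (parent before children)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(children.get(node, [])))

def lineages_below(children, root):
    """Map every node to the sorted leaf lineages that descend from it."""
    order = list(preorder(children, root))
    leaves = defaultdict(list)
    # Walk the preorder backwards so children are handled before parents.
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            leaves[node].append(node)
        for kid in kids:
            leaves[node].extend(leaves[kid])
    return {node: sorted(found) for node, found in leaves.items()}

# Hypothetical toy tree: node 6 is the root, nodes 0-3 are sampled lineages.
toy_tree = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
order = list(preorder(toy_tree, 6))
affected = lineages_below(toy_tree, 6)
print(order)        # visiting order, root first
print(affected[4])  # lineages that would share a variant placed above node 4
```

In this toy tree, a variant drawn on the branch above node 4 would be shared by lineages 0 and 1, while a variant on a leaf branch would be unique to that lineage.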
- Clone the Git repository: `git clone https://forge.inrae.fr/pangepop/MSpangepop`
- Create an environment for Snakemake (from the provided envfile): `conda env create -n wf_env -f dependencies/wf_env.yaml`

Three elements are needed to run the simulation:

- The `masterconfig` -> Master Configuration
- The `demographic_file` -> Demographic model configuration
- A reference genome `fasta_gz` (must be telomere-to-telomere; to test the configuration, you can run on a single chromosome)
Edit the masterconfig file in the `.config/` directory with your sample information (see Master Configuration):

    nano .config/masterconfig.yaml

Example config:

    samples:
      test_run:
        model: "simulation_data/Panmictic_Model.json"
        replicates: 1

- `model` is the demographic scenario the simulation will run on. You can create your own or tailor the ones in `./simulation_data` (see Demographic model configuration).
- ⚠️ Don't want to create your own model? ⚠️ Use the provided `Panmictic_Model.json`: simply edit it to specify your genome (`fasta_gz`), then adjust `mutation_rate` and `recombination_rate` (start with low values).
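If you prefer to script the edit of the model file rather than open it in an editor, something like the following sketch could work. It assumes that `fasta_gz`, `mutation_rate` and `recombination_rate` are top-level keys of the JSON file, which should be checked against the provided `Panmictic_Model.json`. The genome file name and the rate values are purely illustrative.

```python
import json

def tailor_model(model, fasta_gz, mutation_rate, recombination_rate):
    """Return a copy of a demographic model dict with three fields updated."""
    if mutation_rate < 0 or recombination_rate < 0:
        raise ValueError("rates must be non-negative")
    updated = dict(model)  # shallow copy, the template is left untouched
    # Assumption: these three keys live at the top level of the model file.
    updated["fasta_gz"] = fasta_gz
    updated["mutation_rate"] = mutation_rate
    updated["recombination_rate"] = recombination_rate
    return updated

# Hypothetical stand-in for the parsed contents of Panmictic_Model.json.
template = json.loads(
    '{"fasta_gz": "", "mutation_rate": 0.0, "recombination_rate": 0.0}'
)
# Illustrative file name and low starting rates.
model = tailor_model(template, "my_genome.fa.gz", 1e-9, 1e-9)
print(json.dumps(model, indent=2))
```

To apply this to a real file, read it with `json.load`, pass the resulting dictionary to `tailor_model`, and write the result back with `json.dump`.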
- Run the workflow (with SLURM):

      sbatch mspangenome dry # Check for warnings
      sbatch mspangenome run # Then

Nb: If your account name can't be determined automatically, add it to the `.config/snakemake/profiles/slurm/config.yaml` file.

Nb: Use the command `squeue --format="%.10i %.9P %.6j %.10k %.8u %.2t %.10M %.6D %.20R" -A $user` to see job names.

To run the workflow locally (on a single node) instead:

    ./mspangenome dry # Check for warnings
    ./mspangenome local-run # Then

Usage:

    mspangenome [dry|run|local-run|dag|rulegraph|unlock|touch] [additional snakemake args]
        dry - run in dry-run mode
        run - run the workflow with SLURM
        local-run - run the workflow locally (on a single node)
        dag - generate the directed acyclic graph for the workflow
        rulegraph - generate the rulegraph for the workflow
        unlock - unlock the directory if snakemake crashed
        touch - tell snakemake that all files are up to date (use with caution)
        [additional snakemake args] - for any snakemake arg, like --until hifiasm
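To make the relation between the wrapper modes and Snakemake more concrete, here is a hypothetical sketch of how such a wrapper could translate a mode plus pass-through arguments into a `snakemake` command line. The flags used are existing Snakemake options, but the mapping itself (in particular the core count for `local-run` and the profile path for `run`) is an assumption and may not match what the `mspangenome` script actually does.

```python
# Hypothetical mapping from wrapper modes to Snakemake flags. The flags are
# real Snakemake options, but the actual mspangenome script may differ.
MODE_FLAGS = {
    "dry": ["--dry-run"],
    "run": ["--profile", ".config/snakemake/profiles/slurm"],
    "local-run": ["--cores", "1"],
    "dag": ["--dag"],
    "rulegraph": ["--rulegraph"],
    "unlock": ["--unlock"],
    "touch": ["--touch"],
}

def build_command(mode, extra_args=()):
    """Translate a wrapper mode plus pass-through args into a snakemake argv."""
    if mode not in MODE_FLAGS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_FLAGS)}"
        )
    # Extra arguments are appended unchanged, as in "[additional snakemake args]".
    return ["snakemake", *MODE_FLAGS[mode], *extra_args]

# Dry run restricted to one rule of the workflow.
command = build_command("dry", ["--until", "graph_creation"])
print(" ".join(command))
```

The point of the sketch is the pass-through: anything after the mode is handed to Snakemake untouched, so options such as `--until <rule>` can be combined with any mode.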
mspangenome implements graph path operations to add variants by modifying how lineages traverse the graph.
Core Features:
- Multi-path targeting - Operations apply to single or multiple lineage paths simultaneously, enabling both unique and shared variants
- Orientation-aware - All operations preserve node directionality using edges that track exit and entry node sides, creating orientation-aware links (++, +-, -+, --)
- Composable - Operations can be nested and overlapping (e.g., deletion within inversion), representing complex compound variants
These operations modify paths through existing nodes rather than altering the graph structure, maintaining shared sequences while creating alternative routes for different lineages. New nodes (e.g., for insertions) are generated using a first-order Markov model to produce realistic sequences.
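As an illustration of how a Markov model of order 1 can generate sequence for new nodes, the sketch below estimates base-to-base transition counts from a training sequence and then samples each base conditioned only on the previous one. How mspangenome actually estimates its transition probabilities is not described here, so the training string, the pseudocount and the function names are all assumptions.

```python
import random

def fit_order1(sequence, alphabet="ACGT"):
    """Count how often each base follows each base (pseudocount of 1)."""
    counts = {a: {b: 1 for b in alphabet} for a in alphabet}
    for prev, nxt in zip(sequence, sequence[1:]):
        if prev in counts and nxt in counts[prev]:
            counts[prev][nxt] += 1
    return counts

def sample_sequence(counts, length, rng):
    """Draw a sequence in which each base depends only on the previous base."""
    if length <= 0:
        return ""
    alphabet = sorted(counts)
    seq = [rng.choice(alphabet)]  # uniform choice for the first base
    while len(seq) < length:
        row = counts[seq[-1]]
        weights = [row[b] for b in alphabet]
        seq.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(seq)

rng = random.Random(42)                 # fixed seed for reproducibility
counts = fit_order1("ACGTACGTTTGACA")   # hypothetical training sequence
insertion = sample_sequence(counts, 20, rng)
print(insertion)
```

Compared with drawing bases independently, conditioning on the previous base preserves dinucleotide frequencies of the training sequence, which is one reason such a model gives more realistic inserted sequence.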
| Operation | Function | Used For | Path Change |
|---|---|---|---|
| `bypass(a,b)` | Skip nodes a to b | Deletions | Creates shortcut edge |
| `loop(a,b)` | Duplicate nodes a to b | Tandem duplications | Adds loop-back + repeat |
| `invert(a,b)` | Reverse nodes a to b | Inversions | Flips path direction |
| `swap(a,node)` | Replace node at position a | SNPs | Substitutes single node |
| `paste(a,a+1,nodes)` | Insert between adjacent nodes | Insertions | Adds new node sequence |
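The five operations in the table can be sketched on a simplified representation where a lineage path is a list of `(node_id, orientation)` steps and `a`, `b` are inclusive positions along that path. This is a minimal illustration, not the mspangenome library: the real implementation works on graph objects with explicit edges, and the function signatures below (including the extra `path` argument) are assumptions.

```python
def bypass(path, a, b):
    """Deletion: skip steps a..b, leaving a shortcut from a-1 to b+1."""
    return path[:a] + path[b + 1:]

def loop(path, a, b):
    """Tandem duplication: traverse steps a..b twice in a row."""
    return path[:b + 1] + path[a:b + 1] + path[b + 1:]

def invert(path, a, b):
    """Inversion: walk steps a..b in reverse order with flipped orientation."""
    flipped = [(node, "-" if strand == "+" else "+")
               for node, strand in reversed(path[a:b + 1])]
    return path[:a] + flipped + path[b + 1:]

def swap(path, a, node):
    """SNP: replace the step at position a with a new node."""
    return path[:a] + [(node, "+")] + path[a + 1:]

def paste(path, a, b, nodes):
    """Insertion: add new nodes between the adjacent positions a and b."""
    if b != a + 1:
        raise ValueError("paste expects adjacent positions (b == a + 1)")
    return path[:b] + [(n, "+") for n in nodes] + path[b:]

def links(path):
    """Orientation-aware links (++, +-, -+, --) between consecutive steps."""
    return [(u, v, su + sv) for (u, su), (v, sv) in zip(path, path[1:])]

ancestral = [(n, "+") for n in (1, 2, 3, 4, 5)]
inverted = invert(ancestral, 1, 2)
# Compound variant: a deletion nested inside an inversion.
compound = bypass(invert(ancestral, 1, 3), 2, 2)
print(links(inverted))
print(compound)
```

Each function returns a new path and leaves the ancestral path untouched, which mirrors the idea that variants change how a lineage traverses shared nodes rather than the nodes themselves. Applying the same operation to several lineage paths would give a shared variant, and applying it to one path a unique variant.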
mspangenome is developed at INRAE as part of the PangenOak project.
