Skip to content

A collection of Python scripts to facilitate the preprocessing of input sequences to dynamic homology.

License

Notifications You must be signed in to change notification settings

dnakamuraz/prepDyn

Repository files navigation

Preprocessing dynamic characters for tree-alignment

language author license

prepDyn is a collection of Python scripts to facilitate the preprocessing of input sequences for dynamic homology.

In dynamic homology, data should be preprocessed to distinguish differences in sequence length resulting from missing data or insertion-deletion events to avoid grouping from artifacts. However, previous empirical studies using POY (Wheeler et al. 2015) and PhyG (Wheeler et al. 2024) manually preprocessed data with varying approaches (e.g. Grant et al. 2006; Nakamura et al. 2025). Here we present prepDyn, a collection of Python scripts to facilitate the preprocessing of input sequences to POY/PhyG. The main script prepDyn.py comprises four steps: (1) data collection from GenBank, (2) trimming, (3) identification of missing data, and (4) partitioning.

Copyright (C) Daniel Y. M. Nakamura 2025

Installation

The two dependencies that should be installed beforehand by the user are:

  • Python v. 3.10.9 (or newer), including argparse, ast, csv, importlib, re, StringIO, subprocess, sys, tempfile, and time, which are usually part of recent versions of Python.
  • MAFFT v. 7.5.2 (or newer), installed in $PATH as 'mafft'.
# Create a conda environmnt called 'prepdyn'
conda create -n prepdyn python=3.10 --yes

# Inside the newly created environment, install 'mafft'
conda activate prepdyn
conda install bioconda::mafft

Other dependencies are Python modules that will be automatically installed by prepDyn when you run it for the first time:

  • Bio v. 1.73 (or newer), including AlignIO, Entrez, SeqIO, Align, Seq, and SeqRecord.
  • matplotlib v. 3.7.0 (or newer)
  • numpy v. 1.23.5 (or newer)
  • termcolor

If the modules are not installed automatically, try:

conda install conda-forge::biopython
conda install conda-forge::matplotlib
conda install anaconda::numpy
conda install conda-forge::termcolor

Finally, clone the prepDyn repository using the command:

git clone https://github.com/danimelsz/PrepDyn.git

Usage

prepDyn is organized in four Python scripts in the directory src:

  • prepDyn.py: the main script integrating the pipeline.
  • GB2MSA.py: script to download sequences from GenBank and identify internal missing data.
  • addSeq.py: script to align one or a few sequence(s) to a previously preprocessed (profile) alignment.
  • UP2AP.py: script to align sequences containing pound signs.

A summary of parameters used in prepDyn.py:

Parameter Type Default Description
input_file str Path to an alignment file or directory of alignments.
GB_input str Path to a CSV/TSV of GenBank accessions. Overrides input_file.
input_format str "fasta" Format of input file(s) (e.g., "phylip", "clustal").
MSA bool False If True, perform multiple sequence alignment on unaligned input sequences.
output_file str Custom prefix for output files.
output_format str "fasta" Format for the output alignment.
log bool False Write a detailed log file.
sequence_names bool True Write a file with all unique sequence names.
orphan_method str None Method to trim orphan nucleotides: 'percentile', 'integer', or None.
orphan_threshold int 10 Manual length threshold for orphan_method='integer'.
percentile float 25 Percentile for orphan_method='auto'.
del_inv bool True Trim invariant columns from alignment ends.
internal_method str None Method to replace internal gaps with ?: 'manual', 'semi', or None.
internal_column_ranges list Column ranges (e.g., [[10, 20]]) for 'manual' method.
internal_leaves str or list "all" Sequences to apply internal gap replacement to.
internal_threshold int Gap length threshold for 'semi' method.
n2question str, list, None None Replace ambiguous 'N' with ?. Options: 'all', list of names, or None.
partitioning_method str "balanced" Method to insert # markers: 'balanced', 'conservative', 'equal', 'max'.
partitioning_round int 0 Number of partitions/rounds for relevant partitioning methods.
partitioning_size int Partition size for partitioning_method='equal'.

Parameters can be either specified with long or short options. For more information:

python src/prepDyn.py -h
python src/GB2MSA.py -h
python src/addSeq.py -h
python src/UP2AP.py -h

The following examples are designed for users with little experience on Unix. If you have questions, send a message using GitHub issues. Do not move the scripts from the directory src, otherwise the modular structure will break.

Example 1: Simple example

The basic use of prepDyn.py is running all four steps using a single command. Given an input CSV, whose first column is called Terminals and the other columns are the names of genes (each cell containing the correspondent GenBank accession number), the following command will download sequences, trim invariants and orphan nucleotides <10 bp in terminal positions, and identify missing data as ? (all differences in sequence length in terminal positions are missing data). In the CSV file, if more than one GenBank accession number is specified in the same cell refering to non-overlapping fragments of the same gene (e.g. JX155817/JX155844), the space between them is automatically identified as internal missing data (?).

python src/prepDyn.py \
    --GB_input test_data/tutorial/ex1.1/ex1.1_input.csv \
    --output_file test_data/tutorial/ex1.1/ex1.1 \
    --del_inv T \
    --orphan_method integer \
    --orphan_threshold 10 \
    --partitioning_method None \
    --log T 

We specified paritioning_method None, which means that partitioning was not performed. As a heuristic, we recommend testing the impact of adding pound signs to the tree optimality scores using a successive partitioning strategy. For instance, if you specify partitioning_method conservative and partitioning_round 1, the largest block(s) of contiguous invariants will be partitioned.

python src/prepDyn.py \
    --input_file test_data/tutorial/ex1.2/ex1.2_input.fasta \
    --output_file test_data/tutorial/ex1.2/ex1.2 \
    --partitioning_method balanced \
    --partitioning_round 1 \
    --log T

This process can continue until tree costs reported by POY/PhyG remain stationary. For instance, partitioning_round 2 inserts pound signs in the 1- and 2-largest block(s) of contiguous invariants.

python src/prepDyn.py \
    --input_file test_data/tutorial/ex1.2/ex1.2_input.fasta \
    --output_file test_data/tutorial/ex1.3/ex1.3 \
    --partitioning_method balanced \
    --partitioning_round 2 \
    --log T

After tree-alignment, if the tree cost from partitioning_round 2 is higher than that from partitioning_round 1, the analysis can stop. Otherwise, the analysis can proceed with partitioning_round 3 (and so on).

python src/prepDyn.py \
    --input_file test_data/tutorial/ex1.2/ex1.2_input.fasta \
    --output_file test_data/tutorial/ex1.4/ex1.4 \
    --partitioning_method balanced \
    --partitioning_round 3 \
    --log T

Other methods of partitioning are also available (conservative, equal, max) and the user should explore whether they can reduce tree costs. Given the same parameters used for data collection, trimming, and identification of missing data, the different partitioning methods are comparable via tree costs.

Example 2: Multiple alignments

Suppose you have a phylogenomic dataset with hundreds of gene alignmens in the directory ./data/. Phylogenomic datasets are usually unavailable in GenBank, but are available in repositories like Dryad and Zenodo. You can preprocess all unaligned gene files in FASTA format using a single command:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.1/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.1/out \
    --MSA T \
    --del_inv T \
    --orphan_method integer --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --partitioning_method max

If the input files are already aligned, change the boolean parameter MSA to False to accelerate preprocessing:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.2/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.2/ \
    --MSA F \
    --del_inv T \
    --orphan_method integer --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --log T

Example 3: Appending new sequences to a profile alignment

MAFFT is unable to align sequences if pound signs or question marks are present. This is a problem when we try to align new sequences to a prevously preprocessed alignment for dynamic homology. To avoid manual alignment by eye, addSeq.py allows aligning new sequences to profile alignments. Gaps, missing data, and pound signs are not modified for the sequences present in the profile alignment--they are only inserted in the new sequences.

A simple example:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.1/ex4.1_aln.fas \
    --new_seqs test_data/tutorial/ex4.1/ex4.1_new_seqs.fas \
    --output test_data/tutorial/ex4.1/ex4.1_out.fas \
    --log True

A more complex example, where new sequences were preprocessed using trimming of blocks of orphan nucleotides of length lesser than 45 bp, replacement of internal blocks of gaps longer than 20 with question marks, and replacement of all IUPAC N with question marks in the sequence Thoropa_miliaris_CFBH10125:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.2/ex4.2_aln.fas \
    --new_seqs test_data/tutorial/ex4.2/ex4.2_new_seqs.fas \
    --output test_data/tutorial/ex4.2/ex4.2_out.fas \
    --orphan_threshold 45 \
    --gaps2question 20 \
    --n2question Thoropa_miliaris_CFBH10125 \
    --write_names True \
    --log True

Warning: The input new_seqs cannot be longer than the input profile alignment.

When the input sequences in the parameter alignment of addSeq.py are not aligned and contain pound signs to delimit partitions, MAFFT and other aligners crash. In this case, use UP2AP.py to convert unaligned sequences containing pound signs (UP) into aligned sequences with pound signs (AP). Finally, use the output from UP2AP.py as input alignment of addSeq.py.

Cite

If you use prepDyn in your research, cite this repository.

References

Grant T et al. (2006) Phylogenetic systematics of dart-poison frogs and their relatives (Amphibia: Athesphatanura: Dendrobatidae). Bull Am Mus Nat Hist 2006(299):1-262.

Nakamura DYM et al. (2025) Museomics reduces taxonomic inflation in the Dendropsophus araguaya complex (Hylinae: Dendropsophini) from the Cerrado. Journal of Vertebrate Biology 74:24112.

Wheeler WC et al. (2015) POY version 5: phylogenetic analysis using dynamic homologies under multiple optimality criteria. Cladistics 31:189-196.

Wheeler WC et al. (2024) PhylogeneticGraph (PhyG) a new phylogenetic graph search and optimization program. Cladistics 40(1):97-105.

About

A collection of Python scripts to facilitate the preprocessing of input sequences to dynamic homology.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages