This project compares a structural guide tree generated by FoldMason against an amino acid maximum-likelihood tree inferred with IQ-TREE. The objective is to evaluate whether protein structural similarity recapitulates sequence-based phylogeny.
The workflow produces:
- A publication-quality tanglegram (PNG)
- Quantitative tree congruence metrics (CSV)
- Fully reproducible outputs parameterized by a single prefix variable
.
├── data/
│ ├── lysB_structure.nw # FoldMason structural guide tree
│ └── lysB_AA.contree # IQ-TREE AA consensus tree
├── results/ # Auto-created; outputs written here
├── scripts/
│ └── tree_comparison.R
└── README.md
install.packages(c("ape", "phangorn", "dendextend", "viridisLite"))

Open scripts/tree_comparison.R and set the parameters at the top of the file:
prefix <- "lysB" # shared filename prefix for inputs and outputs
data_dir <- "data" # directory containing input trees
out_dir <- "results" # directory for outputs (created automatically)
h_rel <- 0.25 # relative height for cluster cutoff (0–1)
random_seed <- 1 # for reproducible untangling

Then source the script:
source("scripts/tree_comparison.R")

| File | Description |
|---|---|
| `results/<prefix>_AA_vs_Struct_tanglegram.png` | High-resolution tanglegram (300 dpi) |
| `results/<prefix>_AA_vs_Struct_tree_metrics.csv` | Normalized RF distance and entanglement score |
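
As a rough illustration of how a single prefix can drive every input and output path, the sketch below rebuilds the file names from the layout and the table above. The variable names `struct_tree_file`, `aa_tree_file`, `tanglegram_file`, and `metrics_file` are hypothetical and may not match those used in scripts/tree_comparison.R.

```r
# Parameter values as listed in this README
prefix   <- "lysB"
data_dir <- "data"
out_dir  <- "results"

# Input trees, named after the repository layout (variable names are hypothetical)
struct_tree_file <- file.path(data_dir, paste0(prefix, "_structure.nw"))
aa_tree_file     <- file.path(data_dir, paste0(prefix, "_AA.contree"))

# Output files, named after the outputs table (variable names are hypothetical)
tanglegram_file <- file.path(out_dir, paste0(prefix, "_AA_vs_Struct_tanglegram.png"))
metrics_file    <- file.path(out_dir, paste0(prefix, "_AA_vs_Struct_tree_metrics.csv"))

# Create the output directory if it does not exist yet
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
```

Under this scheme, switching to another protein family would only require changing `prefix`, provided the input files follow the same naming pattern.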
Trees are read with ape::read.tree(). Only taxa present in both trees are retained (shared tip intersection). Both trees are midpoint-rooted, polytomies are resolved only if necessary (multi2di()), and trees are ladderized for consistent visual ordering.
This ensures topological comparability without altering underlying evolutionary relationships.
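
The preprocessing steps could look roughly like the sketch below. It uses two small made-up Newick strings instead of the real input files, a hypothetical helper `prep_tree()`, and assumes that midpoint rooting is done with `phangorn::midpoint()`; the actual script may differ in all three respects.

```r
library(ape)
library(phangorn)

# Toy trees standing in for the structural and AA trees (hypothetical data)
t_struct <- read.tree(text = "((A:1,B:2):1,(C:1,D:3):2,E:4);")
t_aa     <- read.tree(text = "(((A:1,C:1):1,B:2):1,D:3,F:2);")

# Hypothetical helper: prune to shared taxa, root, resolve, ladderize
prep_tree <- function(tree, keep) {
  tree <- keep.tip(tree, keep)                   # retain shared taxa only
  tree <- midpoint(tree)                         # midpoint rooting (assumed: phangorn)
  if (!is.binary(tree)) tree <- multi2di(tree)   # resolve polytomies only if present
  ladderize(tree)                                # consistent visual ordering
}

shared <- intersect(t_struct$tip.label, t_aa$tip.label)
tL <- prep_tree(t_struct, shared)   # left tree (structure)
tR <- prep_tree(t_aa, shared)       # right tree (amino acid)
```

Here the tips E and F are dropped because each occurs in only one of the two toy trees.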
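Maximum-likelihood trees are generally not ultrametric, so they cannot be converted directly into the `dendrogram` objects that `dendextend` requires. To work around this:
- Cophenetic (patristic) distance matrices are computed from each tree
- Hierarchical clustering (`hclust(..., method = "average")`) is applied
- The clustering results are converted to `dendrogram` objects for use with `dendextend`

This yields ultrametric dendrograms suitable for a tanglegram. Average-linkage clustering approximates each tree's relative topology but can regroup tips when branch lengths are very unequal, so the dendrograms are a visualization aid rather than a replacement for the original trees.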
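
A minimal sketch of the conversion described above, using a made-up four-tip tree and a hypothetical helper name `phylo_to_dend()`:

```r
library(ape)

# Toy non-ultrametric tree (hypothetical data)
tree <- read.tree(text = "((A:1,B:2):1,(C:1,D:3):2);")

# Hypothetical helper: phylo -> patristic distances -> average linkage -> dendrogram
phylo_to_dend <- function(tree) {
  d  <- as.dist(cophenetic(tree))        # patristic (tip-to-tip path length) distances
  hc <- hclust(d, method = "average")    # average linkage (UPGMA)
  as.dendrogram(hc)
}

dend <- phylo_to_dend(tree)
```

The resulting dendrogram carries the same tip labels as the input tree, but its node heights come from the clustering rather than from the original branch lengths.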
Clusters are derived exclusively from the structural (left) dendrogram using a relative height cutoff:
cut height = h_rel × max(dendrogram height)
These structural clusters determine:
- Tip label colors (applied to both trees)
- Tanglegram connecting line colors
Branch edges are rendered in neutral gray (gray20) to keep visual focus on cross-tree correspondence. Colors are drawn from the viridis palette for colorblind accessibility.
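
The cluster cutoff and coloring might be sketched as follows. The distance matrix is invented for illustration, and the cut is applied to the `hclust` object with `stats::cutree()`, which is equivalent to cutting the dendrogram built from it; the script itself may cut the dendrogram directly.

```r
library(dendextend)
library(viridisLite)

h_rel <- 0.25

# Toy distance matrix standing in for the structural patristic distances (hypothetical)
m <- matrix(c(0, 1, 8, 8,
              1, 0, 8, 8,
              8, 8, 0, 1.5,
              8, 8, 1.5, 0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
hc          <- hclust(as.dist(m), method = "average")
dend_struct <- as.dendrogram(hc)

# cut height = h_rel x max(dendrogram height)
cut_h    <- h_rel * max(hc$height)
clusters <- stats::cutree(hc, h = cut_h)     # named vector: tip label -> cluster id

# One viridis color per cluster, arranged in the dendrogram's plotting order
pal      <- viridis(length(unique(clusters)))
tip_cols <- setNames(pal[clusters[labels(dend_struct)]], labels(dend_struct))
labels_colors(dend_struct) <- tip_cols
```

Because `tip_cols` is named by tip label, the same colors can be reused for the other dendrogram by indexing with its own label order, for example `tip_cols[labels(dend_other)]`, where `dend_other` is a hypothetical second dendrogram.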
The dendrograms are untangled with dendextend::untangle(..., method = "step2side") to minimize line crossings, then rendered to a 12 × 9 inch PNG at 300 dpi.
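
An illustrative version of the untangle-and-render step is shown below. The two dendrograms are built from invented one-dimensional data, the output goes to a temporary file rather than `results/`, and the plot titles and line width are assumptions, not settings taken from the script.

```r
library(dendextend)

random_seed <- 1
out_file <- file.path(tempdir(), "demo_tanglegram.png")   # hypothetical output path

# Toy dendrograms sharing the same labels (hypothetical data)
dend_L <- as.dendrogram(hclust(dist(c(A = 1, B = 2, C = 6, D = 7, E = 12)), method = "average"))
dend_R <- as.dendrogram(hclust(dist(c(A = 1, B = 9, C = 2, D = 10, E = 5)), method = "average"))

# Neutral gray branches on both sides
dend_L <- dendextend::set(dend_L, "branches_col", "gray20")
dend_R <- dendextend::set(dend_R, "branches_col", "gray20")

# Rotate branches to reduce line crossings
set.seed(random_seed)
dl <- untangle(dendlist(dend_L, dend_R), method = "step2side")

# Render to a 12 x 9 inch PNG at 300 dpi
png(out_file, width = 12, height = 9, units = "in", res = 300)
tanglegram(dl,
           highlight_distinct_edges    = FALSE,   # keep the gray branch color
           common_subtrees_color_lines = FALSE,
           main_left = "Structure", main_right = "AA", lwd = 2)
dev.off()
```

Untangling only rotates branches around internal nodes, so it changes the tip order used for drawing but not the clustering itself.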
| Metric | Method | Interpretation |
|---|---|---|
| Normalized Robinson–Foulds | `phangorn::RF.dist(unroot(tL), unroot(tR), normalize = TRUE)` | 0 = identical topology · 1 = maximally different |
| Entanglement | `dendextend::entanglement()` | 0 = perfect visual alignment · 1 = maximal crossing |
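
Both metrics can be computed along these lines. The trees are small made-up examples, the helper `to_dend()` and the CSV column names are hypothetical, and the file is written to a temporary location rather than `results/`.

```r
library(ape)
library(phangorn)
library(dendextend)

# Toy rooted trees with identical tip sets (hypothetical data)
tL <- read.tree(text = "((A:1,B:2):1,(C:1,D:3):2);")
tR <- read.tree(text = "(((A:1,C:1):1,B:2):1,D:3);")

# Normalized Robinson-Foulds distance on the unrooted topologies
rf_norm <- RF.dist(unroot(tL), unroot(tR), normalize = TRUE)

# Entanglement is computed on dendrograms, built here via average linkage
to_dend <- function(tree) {
  as.dendrogram(hclust(as.dist(cophenetic(tree)), method = "average"))
}
ent <- entanglement(to_dend(tL), to_dend(tR))

# Collect both metrics in a one-row table (column names are assumptions)
metrics <- data.frame(normalized_RF = as.numeric(rf_norm), entanglement = ent)
metrics_file <- file.path(tempdir(), "demo_tree_metrics.csv")   # hypothetical path
write.csv(metrics, metrics_file, row.names = FALSE)
```

Note that entanglement depends on the current tip order of the two dendrograms, so in a real run it would normally be computed after untangling.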
This workflow addresses a core evolutionary question:
Do structural similarity relationships mirror amino acid phylogeny?
| Result | Interpretation |
|---|---|
| Low RF + low entanglement | Strong congruence between structural and sequence evolution |
| High RF | Potential structural convergence, evolutionary decoupling, domain rearrangement, or alignment artifacts |
| Low RF + high entanglement | Topologically similar trees whose tip ordering differs; consider re-running with a different random_seed |
| Package | Purpose |
|---|---|
| `ape` | Tree I/O and manipulation (pruning, polytomy resolution, ladderizing) |
| `phangorn` | Midpoint rooting and Robinson–Foulds distance |
| `dendextend` | Tanglegram construction and entanglement scoring |
| `viridisLite` | Colorblind-safe cluster palette |