- What is Phyla?
- Disclaimer
- What is in this repo?
- What is the difference between Phyla-α and Phyla-β?
- Getting started with Phyla
- System Requirements and Scalability
- Tree Reasoning Benchmark
- Evaluation Instructions
- Training Instructions
- Citation
Phyla is a protein language model designed to model both intra-sequence and inter-sequence relationships. It leverages a hybrid state-space transformer architecture and is trained with a tree-based loss function. Phyla enables rapid construction of phylogenetic trees using protein sequences, offering insights that differ from classical methods in potentially functionally significant ways.
We are excited to introduce Phyla-β, an early-stage version of our model that is still under active development. Future iterations will incorporate methodological improvements and additional training data as we continue refining the model. Please note that this work is ongoing, and updates will be released as progress is made.
This repo provides a way to perform inference with the Phyla-α/Phyla-β model for your application. After following the steps below, you will be able to give Phyla a fasta file and quickly get back a phylogenetic tree. We are working on providing training code as well.
| Shorthand | Name in code | Dataset | Description |
|---|---|---|---|
| Phyla-α | phyla-alpha | 13,696 trees from OpenProteinSet | Alpha release of Phyla meant as a proof of concept of ongoing work. |
| Phyla-β | phyla-beta | 3,321 high-quality trees from OpenProteinSet | Beta release of Phyla improving on Phyla-α by training on better data and an improved tree loss. |
After releasing Phyla-α we revised our tree loss and retrained our model on a cleaned version of OpenProteinSet, using the MSA-cleaning methodology introduced by EVE. We also found, surprisingly, that masked language modeling decreased performance, so it was removed from training. In benchmarking (see the Tree Reasoning Benchmark section) Phyla-β outperforms Phyla-α and should be used for all applications. It is also more lightweight than Phyla-α, which allows for longer inputs.
First you need to create an environment for Mamba, following the instructions from their GitHub, including the causal-conv1d package. Installing on a machine with a GPU helps get around some problems during installation. Once you can run this import without errors:

```python
from mamba_ssm import Mamba
```

then build the rest of the environment from the YAML file provided in the envs folder in the phyla folder.
Run

```
pip install -e .
```

from within this directory to install the Phyla package into your environment.
Run `run_phyla_test.py`, and if you get a tree printed out then everything is set up correctly!
Once that is done, just point the fasta file path in the run_phyla_test script at a fasta file containing the protein sequences you want a tree for, and it will generate one.
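For reference, FASTA input is plain text with `>` name lines followed by sequences. A toy input file (the names and sequences here are made up for illustration) can be generated like so:

```python
# Write a tiny example FASTA file (hypothetical sequences) that the
# run_phyla_test script could be pointed at instead of its bundled file.
records = {
    "seq_A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "seq_B": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
    "seq_C": "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
}

with open("example_input.fasta", "w") as handle:
    for name, seq in records.items():
        handle.write(f">{name}\n{seq}\n")
```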
This script has been tested on an H100 Nvidia GPU and is expected to work on a 32 GB V100 as well. Greater GPU memory capacity allows for generating trees for a larger number of sequences. Reconstructing the tree of life with 3,084 sequences required running Phyla on CPUs with approximately 1 TB of memory. For those interested in running Phyla on a CPU to handle more sequences, raising an issue will help prioritize the addition of that functionality.
The Tree Reasoning Benchmark consists of two tasks across three datasets. It evaluates a model's ability to perform:

- Phylogenetic Tree Reconstruction, measured by normalized Robinson-Foulds distance (norm-RF).
- Taxonomic Clustering, measured by cluster completeness and Normalized Mutual Information (NMI).
Benchmarking Approach:
- Compute the pairwise distance matrix from protein embeddings.
- Use the Neighbor Joining algorithm to construct a phylogenetic tree.
- Compare against the ground truth using norm-RF.
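The first step above can be sketched as follows. The embeddings here are random stand-ins for real protein embeddings, and the neighbor-joining construction itself (available in phylogenetics libraries such as scikit-bio or DendroPy) is omitted:

```python
import numpy as np

# Stand-in for per-sequence protein embeddings: n sequences, d dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 16))

# Pairwise Euclidean distance matrix via broadcasting.
diff = embeddings[:, None, :] - embeddings[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# A valid input for neighbor joining: symmetric with a zero diagonal.
assert np.allclose(dist, dist.T)
assert np.allclose(np.diag(dist), 0.0)
```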
Datasets Used:
- TreeFam, found in a pickle file here: https://dataverse.harvard.edu/api/access/datafile/11564365
- TreeBASE, found in a zip file here: https://dataverse.harvard.edu/api/access/datafile/11564367
- After downloading and unzipping, you'll find two directories: `sequences/` and `trees/`.
- Filenames are aligned: for example, the tree for `TB2/S137_processed.fa` in `sequences/` is `TB2/S137_processed_tree.nh` in `trees/`.
- After unzipping, you receive a pickle file.
- Each key corresponds to a sequence/tree name.
- Each entry contains:
  - `sequences`: the protein sequences
  - `tree_newick`: the Newick-formatted tree
⚠️ Note: Trees from TreeFam require formatting before use. A formatting script is provided in the evaluation code. If there is enough interest in the benchmark, preprocessed trees can be generated to avoid this step.
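Based on the entry structure described above, accessing the data might look like this round-trip sketch; the family name, sequences, and tree here are toy values, not real TreeFam contents:

```python
import pickle

# Build a toy pickle mimicking the described TreeFam structure
# (the key "TF000001" and the sequences are made up).
toy = {
    "TF000001": {
        "sequences": {"A": "MKV", "B": "MKL"},
        "tree_newick": "(A:0.1,B:0.2);",
    }
}
with open("treefam_toy.pickle", "wb") as handle:
    pickle.dump(toy, handle)

# Each key is a sequence/tree name; each entry holds the sequences
# and the Newick-formatted ground-truth tree.
with open("treefam_toy.pickle", "rb") as handle:
    data = pickle.load(handle)
for name, entry in data.items():
    sequences = entry["sequences"]
    newick = entry["tree_newick"]
```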
Benchmarking Approach:
- Perform k-means clustering on protein embeddings.
- Evaluate using:
  - Cluster Completeness
  - Normalized Mutual Information (NMI)
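A minimal sketch of this procedure with scikit-learn, using synthetic blob data in place of real protein embeddings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score, normalized_mutual_info_score

# Stand-in embeddings: 3 well-separated groups play the role of taxa.
X, true_labels = make_blobs(n_samples=150, centers=3,
                            cluster_std=0.5, random_state=0)

# k-means with k equal to the number of taxonomic labels.
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both scores are in [0, 1]; higher means the clusters better
# recover the ground-truth grouping.
completeness = completeness_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
```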
Dataset Used:
- GTDB (Genome Taxonomy Database), found in a tar file here: https://dataverse.harvard.edu/api/access/datafile/11564368
- After extracting the tar file, you'll find a series of `.tsv` and `.pickle` files with names like:
  - `sampled_bac120_taxonomy_class_0.tsv`
  - `sampled_bac120_taxonomy_class_sequences_0.pickle`
- The filenames follow the format `sampled_bac120_taxonomy_[level]_[replicate].tsv` and `sampled_bac120_taxonomy_[level]_sequences_[replicate].pickle`, where `[level]` refers to the taxonomic rank (e.g., class, order, family) and `[replicate]` is a numeric index indicating a random sampling replicate.
- Each `.tsv` file contains:
  - Sequence names
  - Taxonomic labels at the specified level
- The corresponding `.pickle` file contains the actual sequences for those entries.

Each replicate includes random groupings of 50 distinct labels, with 10 sequences per label. Use the taxonomic column in the `.tsv` and the sequence names to extract the clustering labels.
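As a sketch of that extraction, here is a toy `.tsv` round trip using only the standard library; the column headers (`accession`, `class`) and filename are illustrative placeholders, not the actual GTDB headers:

```python
import csv

# Build a toy .tsv in the described shape (hypothetical headers).
rows = [
    ("accession", "class"),
    ("seq_001", "Clostridia"),
    ("seq_002", "Clostridia"),
    ("seq_003", "Bacilli"),
]
with open("toy_taxonomy_class_0.tsv", "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Map each sequence name to its taxonomic label at this level;
# these labels are what k-means predictions are scored against.
with open("toy_taxonomy_class_0.tsv", newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    labels = {row["accession"]: row["class"] for row in reader}
```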
This task evaluates how well a model can predict functional impacts of protein variants using data from the ProteinGym DMS Substitution Benchmark. We use 83 datasets selected to fit within the memory limits of a single H100 GPU, with performance measured by Spearman correlation.
Benchmarking Approach:
- For all baseline protein language models, a linear probe is trained on the model embeddings to predict variant effects.
- For Phyla, the process involves:
- Constructing a phylogenetic tree from the protein sequences.
- Injecting known functional labels into the corresponding tree leaves.
- Using TreeCluster to cluster the tree into clades.
- Assigning predicted labels to unlabeled leaves by averaging the known labels in their clade.
This tree-based propagation strategy yields the best Spearman correlation for Phyla in functional prediction.
We use the TreeCluster toolkit to perform tree clustering.
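The propagation step above can be sketched as follows; in practice the clade assignments come from TreeCluster, but here the clades and known labels are toy values:

```python
# Toy clade assignments (leaf -> clade id) and partial functional labels.
clades = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
known = {"a": 1.2, "b": 0.8, "e": -0.5}  # measured variant effects

# Average the known labels within each clade...
clade_sums = {}
for leaf, score in known.items():
    cid = clades[leaf]
    total, count = clade_sums.get(cid, (0.0, 0))
    clade_sums[cid] = (total + score, count + 1)
clade_means = {cid: total / count for cid, (total, count) in clade_sums.items()}

# ...and assign each unlabeled leaf the mean of its clade.
predicted = {
    leaf: clade_means[clades[leaf]]
    for leaf in clades if leaf not in known
}
```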
To evaluate tree reconstruction, taxonomic clustering, or functional prediction, run:
```
python -m phyla.eval.evo_reasoning_eval configs/sample_eval_config.yaml
```

Open and modify `configs/sample_eval_config.yaml` as needed:
Choose the model to run:
- `Phyla-beta` (default Phyla model)
- `ESM2` (ESM2 650M)
- `ESM2_3B` (ESM2 3B)
- `ESM3`
- `EVO`
- `PROGEN2_LARGE`
- `PROGEN2_XLARGE`
Set to `None` to download and use the default published weights, or set to a specific path to use a trained checkpoint.
Set one of the following datasets:
- `treebase` – For tree reconstruction (TreeBASE)
- `treefam` – For tree reconstruction (TreeFam)
- `GTB` – For taxonomic clustering (GTDB)
- `protein_gym` – For functional prediction
Required files will be downloaded automatically.
Set the GPU device to use (e.g., "cuda:0", "cuda:5").
Set this to true to evaluate a randomly initialized model (default is false).
By default, the output of an eval run is saved to eval/eval_preds/{dataset.dataset}/{dataset.dataset}_results_{trainer.model_type}_{eval.extra_name}.csv; setting an extra name lets you record additional information about the benchmarking run in the filename.
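Putting the options above together, an eval config sketch might look like the following. The `model_type`, `dataset`, and `extra_name` keys are taken from the output-path template, while the `checkpoint`, `device`, and `random_init` key names are guesses, so the shipped file may differ:

```yaml
trainer:
  model_type: Phyla-beta   # or ESM2, ESM2_3B, ESM3, EVO, PROGEN2_LARGE, PROGEN2_XLARGE
  checkpoint: None         # None downloads the published weights
dataset:
  dataset: treebase        # treebase | treefam | GTB | protein_gym
eval:
  device: "cuda:0"
  random_init: false       # true evaluates a randomly initialized model
  extra_name: demo_run     # appended to the results filename
```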
We have built and stress-tested a full training pipeline for Phyla to ensure reproducibility. If your environment is set up correctly, running `run.sh` from within the phyla directory is all you need to launch training. All training configuration lives in `configs/sample_train_config.yaml`:
Below is an explanation of the key parameters you may want to modify when retraining Phyla.
- `trainer.lr`: Learning rate. The default value was selected via grid search and is recommended unless you are experimenting.
- `trainer.record`: Set to `True` to log training runs to Weights & Biases.
- `trainer.save_path`: Path where model checkpoints will be saved.
- `dataset.dataset_directories`: A list of directories containing the cleaned OpenProteinSet trees. To train Phyla-β, download the cleaned dataset (~396 MB) from the Harvard Dataverse: https://dataverse.harvard.edu/api/access/datafile/13167774
- `dataset.dataset_size`: Number of trees to use for training. Set to `None` to train on the full dataset.
- `model.d_model`: Dimensionality of the model’s hidden representations.
- `model.n_layers`: Number of Bi-Mamba layers inside each Phyla block.
- `model.num_blocks`: Number of Phyla blocks. Each block consists of `n_layers` Bi-Mamba layers followed by a sparsified attention layer.
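As a sketch, a minimal training config using the parameters above might look like this; the key names match the list above, but all values are placeholders, not recommended settings:

```yaml
trainer:
  lr: 1e-4                  # placeholder; keep the shipped default unless experimenting
  record: True              # log to Weights & Biases
  save_path: checkpoints/phyla_run
dataset:
  dataset_directories:
    - data/cleaned_openproteinset
  dataset_size: None        # None trains on the full dataset
model:
  d_model: 512              # placeholder dimensions
  n_layers: 4
  num_blocks: 2
```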
If you find the Phyla paper or codebase useful, please cite our work!
```
@inproceedings{phyla,
  title={Evolutionary Reasoning Does Not Arise in Standard Usage of Protein Language Models},
  author={Yasha Ektefaie and Andrew Shen and Lavik Jain and Maha Farhat and Marinka Zitnik},
  booktitle={NeurIPS},
  year={2025}
}
```
