|
1 | 1 | # GPN (Genomic Pre-trained Network) |
2 | 2 | [](https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis) |
3 | 3 |
|
4 | | -Code and resources from [GPN](https://doi.org/10.1073/pnas.2311219120) and related genomic language models. |
| 4 | +Code and resources for genomic language models [GPN](https://doi.org/10.1073/pnas.2311219120), [GPN-MSA](https://www.nature.com/articles/s41587-024-02511-w), [PhyloGPN](https://link.springer.com/chapter/10.1007/978-3-031-90252-9_7) and [GPN-Star](...). |
5 | 5 |
|
6 | 6 | ## Table of contents |
7 | 7 | - [Installation](#installation) |
8 | | -- [Quick start](#quick-start) |
9 | 8 | - [Modeling frameworks](#modeling-frameworks) |
10 | 9 | - [Applications of the models](#applications-of-the-models) |
11 | 10 | - [GPN](#gpn) |
12 | 11 | - [GPN-MSA](#gpn-msa) |
13 | 12 | - [PhyloGPN](#phylogpn) |
| 13 | +- [GPN-Star](#gpn-star) |
14 | 14 | - [Citation](#citation) |
15 | 15 |
|
16 | 16 | ## Installation |
17 | 17 | ```bash |
18 | 18 | pip install git+https://github.com/songlab-cal/gpn.git |
19 | 19 | ``` |
20 | 20 |
|
21 | | -## Quick start |
22 | | -```python |
23 | | -import gpn.model |
24 | | -from transformers import AutoModelForMaskedLM, AutoModel |
25 | | - |
26 | | -model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales") |
27 | | -# or |
28 | | -model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens") |
29 | | -# or |
30 | | -model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True) |
31 | | -``` |
32 | | - |
33 | 21 | ## Modeling frameworks |
34 | 22 | | Model | Paper | Notes | |
35 | 23 | | --------- | --- | ----------- | |
36 | 24 | | GPN | [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | Requires unaligned genomes | |
37 | | -| GPN-MSA | [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | Requires aligned genomes for both training and inference | |
| 25 | +| GPN-MSA | [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | Requires aligned genomes for both training and inference [deprecated in favor of GPN-Star] | |
38 | 26 | | PhyloGPN | [Albors et al. 2025](https://link.springer.com/chapter/10.1007/978-3-031-90252-9_7) | Uses an alignment during training, but does not require it for inference or fine-tuning | |
| 27 | +| GPN-Star | Upcoming | Requires aligned genomes for both training and inference | |
39 | 28 |
|
40 | 29 | ## Applications of the models |
41 | 30 | | Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 | |
42 | 31 | | -- | --- | ------- | ---- | -------------- | |
43 | | -| [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | GPN | Arabidopsis and other Brassicale plants | [analysis/gpn_arabidopsis](analysis/gpn_arabidopsis) | [Model, dataset, intermediate results](https://huggingface.co/collections/songlab/gpn-653191edcb0270ed05ad2c3e) | |
44 | | -| [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | GPN-MSA | Human and other vertebrates | [analysis/gpn-msa_human](analysis/gpn-msa_human) | [Model, dataset, benchmarks, predictions](https://huggingface.co/collections/songlab/gpn-msa-65319280c93c85e11c803887) | |
| 32 | +| [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | GPN | *A. thaliana* and other Brassicales plants | [analysis/gpn_arabidopsis](analysis/gpn_arabidopsis) | [Model, dataset, intermediate results](https://huggingface.co/collections/songlab/gpn-653191edcb0270ed05ad2c3e) | |
| 33 | +| [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | GPN-MSA | Human | [analysis/gpn-msa_human](analysis/gpn-msa_human) | [Model, dataset, benchmarks, predictions](https://huggingface.co/collections/songlab/gpn-msa-65319280c93c85e11c803887) | |
45 | 34 | | [Benegas et al. 2025b](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1) | GPN | Animal promoters | [analysis/gpn_animal_promoter](analysis/gpn_animal_promoter) | [Model, dataset, benchmarks](https://huggingface.co/collections/songlab/traitgym-6796d4fbb825d5b94e65d30f) | |
| 35 | +| Upcoming | GPN-Star | Human, Mouse, Chicken, Fruit fly, *C. elegans*, *A. thaliana* | [analysis/gpn-star](analysis/gpn-star) | [Model, dataset, benchmarks](https://huggingface.co/collections/songlab/gpn-star-68c0c055acc2ee51d5c4f129) | |
46 | 36 | | Upcoming | GPN | Sorghum gene expression | [analysis/gpn_sorghum_expression](analysis/gpn_sorghum_expression) | [Model, dataset](https://huggingface.co/collections/songlab/sorghum-gene-expression-prediction-68963dd31658bfb98c07ae1b) | |
47 | 37 |
|
48 | 38 | ## GPN |
@@ -106,6 +96,24 @@ torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') |
106 | 96 | ## PhyloGPN |
107 | 97 | PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for [Felsenstein's 1981 model](https://en.wikipedia.org/wiki/Models_of_DNA_evolution#F81_model_(Felsenstein_1981)) (the F81 model, for short). It was trained to maximize the likelihood of columns in the [Zoonomia alignment](https://cglgenomics.ucsc.edu/november-2023-nature-zoonomia-with-expanded-primates-alignment/) given a phylogenetic tree. The stationary distribution of the substitution process described by the F81 model indicates the relative viability of each allele at any given locus. As a result, PhyloGPN is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction. It is especially useful for sequences that are not directly in the human reference genome. |
108 | 98 |
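To make the role of the F81 stationary distribution concrete, here is a minimal sketch in plain Python. It does not use PhyloGPN or its API: the stationary distribution `pi` is a made-up example, and `f81_transition_matrix` and `log_odds_score` are hypothetical helper names introduced only for illustration. Scoring a variant as the log ratio of stationary probabilities is an assumption about how such a distribution could be used, not necessarily the exact score from the paper.

```python
import math

NUCLEOTIDES = "ACGT"


def f81_transition_matrix(pi, t, mu=1.0):
    """Transition probabilities under F81 after time t.

    With rate matrix Q_ij = mu * pi_j for i != j, the closed form is
    P_ij(t) = exp(-mu * t) * [i == j] + (1 - exp(-mu * t)) * pi_j.
    """
    decay = math.exp(-mu * t)
    return [
        [decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j] for j in range(4)]
        for i in range(4)
    ]


def log_odds_score(pi, ref, alt):
    """Hypothetical zero-shot score: log(pi[alt] / pi[ref]).

    Negative values mean the alternate allele is less favored than the
    reference under the stationary distribution.
    """
    return math.log(pi[NUCLEOTIDES.index(alt)] / pi[NUCLEOTIDES.index(ref)])


# Made-up stationary distribution for one locus, ordered A, C, G, T.
pi = [0.7, 0.1, 0.1, 0.1]

P = f81_transition_matrix(pi, t=0.5)
score = log_odds_score(pi, ref="A", alt="G")
```

As `t` grows, every row of `P` converges to `pi`, which is why the stationary distribution can be read as the long-run preference for each allele at the locus.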
|
| 99 | +* Quick start |
| 100 | +```python |
| 101 | +import gpn.model |
| 102 | +from transformers import AutoModel |
| 103 | + |
| 104 | +model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True) |
| 105 | +``` |
| 106 | + |
| 107 | +## GPN-Star |
| 108 | +*Under construction* |
| 109 | +### Examples |
| 110 | +* Play with the model: `examples/star/demo.ipynb` |
| 111 | +### Analyses |
| 112 | +* Main results on variant effect prediction: `analysis/gpn-star/train_and_eval/workflow/notebooks` |
| 113 | +* Complex trait heritability analysis (S-LDSC): `analysis/gpn-star/s-ldsc` |
| 114 | + |
| 115 | +More coming soon! |
| 116 | + |
109 | 117 | ## Citation |
110 | 118 | [GPN](https://doi.org/10.1073/pnas.2311219120): |
111 | 119 | ```bibtex |
|