Commit b5cb2be

GPN-Star (#56)
Initial version of GPN-Star.
1 parent 52c0c6c commit b5cb2be

97 files changed, +34586 -21 lines changed

README.md

Lines changed: 25 additions & 17 deletions
@@ -1,48 +1,38 @@
 # GPN (Genomic Pre-trained Network)
 [![hgt_genome_392c4_a47ce0](https://github.com/user-attachments/assets/282b6204-156b-4b6d-83ff-2f4a53a9bb2e)](https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis)

-Code and resources from [GPN](https://doi.org/10.1073/pnas.2311219120) and related genomic language models.
+Code and resources for genomic language models [GPN](https://doi.org/10.1073/pnas.2311219120), [GPN-MSA](https://www.nature.com/articles/s41587-024-02511-w), [PhyloGPN](https://link.springer.com/chapter/10.1007/978-3-031-90252-9_7) and [GPN-Star](...).

 ## Table of contents
 - [Installation](#installation)
-- [Quick start](#quick-start)
 - [Modeling frameworks](#modeling-frameworks)
 - [Applications of the models](#applications-of-the-models)
 - [GPN](#gpn)
 - [GPN-MSA](#gpn-msa)
 - [PhyloGPN](#phylogpn)
+- [GPN-Star](#gpn-star)
 - [Citation](#citation)

 ## Installation
 ```bash
 pip install git+https://github.com/songlab-cal/gpn.git
 ```

-## Quick start
-```python
-import gpn.model
-from transformers import AutoModelForMaskedLM, AutoModel
-
-model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")
-# or
-model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")
-# or
-model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)
-```
-
 ## Modeling frameworks
 | Model | Paper | Notes |
 | --------- | --- | ----------- |
 | GPN | [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | Requires unaligned genomes |
-| GPN-MSA | [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | Requires aligned genomes for both training and inference |
+| GPN-MSA | [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | Requires aligned genomes for both training and inference [deprecated in favor of GPN-Star] |
 | PhyloGPN | [Albors et al. 2025](https://link.springer.com/chapter/10.1007/978-3-031-90252-9_7) | Uses an alignment during training, but does not require it for inference or fine-tuning |
+| GPN-Star | Upcoming | Requires aligned genomes for both training and inference |

 ## Applications of the models
 | Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 |
 | -- | --- | ------- | ---- | -------------- |
-| [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | GPN | Arabidopsis and other Brassicale plants | [analysis/gpn_arabidopsis](analysis/gpn_arabidopsis) | [Model, dataset, intermediate results](https://huggingface.co/collections/songlab/gpn-653191edcb0270ed05ad2c3e) |
-| [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | GPN-MSA | Human and other vertebrates | [analysis/gpn-msa_human](analysis/gpn-msa_human) | [Model, dataset, benchmarks, predictions](https://huggingface.co/collections/songlab/gpn-msa-65319280c93c85e11c803887) |
+| [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | GPN | *A. Thaliana* and other Brassicale plants | [analysis/gpn_arabidopsis](analysis/gpn_arabidopsis) | [Model, dataset, intermediate results](https://huggingface.co/collections/songlab/gpn-653191edcb0270ed05ad2c3e) |
+| [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | GPN-MSA | Human | [analysis/gpn-msa_human](analysis/gpn-msa_human) | [Model, dataset, benchmarks, predictions](https://huggingface.co/collections/songlab/gpn-msa-65319280c93c85e11c803887) |
 | [Benegas et al. 2025b](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1) | GPN | Animal promoters | [analysis/gpn_animal_promoter](analysis/gpn_animal_promoter) | [Model, dataset, benchmarks](https://huggingface.co/collections/songlab/traitgym-6796d4fbb825d5b94e65d30f) |
+| Upcoming | GPN-Star | Human, Mouse, Chicken, Fruit fly, *C. elegans*, *A. thaliana* | [analysis/gpn-star](analysis/gpn-star) | [Model, dataset, benchmarks](https://huggingface.co/collections/songlab/gpn-star-68c0c055acc2ee51d5c4f129) |
 | Upcoming | GPN | Sorghum gene expression | [analysis/gpn_sorghum_expression](analysis/gpn_sorghum_expression) | [Model, dataset](https://huggingface.co/collections/songlab/sorghum-gene-expression-prediction-68963dd31658bfb98c07ae1b) |

 ## GPN
@@ -106,6 +96,24 @@ torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}')
 ## PhyloGPN
 PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for [Felsenstein's 1981 model](https://en.wikipedia.org/wiki/Models_of_DNA_evolution#F81_model_(Felsenstein_1981)) (the F81 model, for short). It was trained to maximize the likelihood of columns in the [Zoonomia alignment](https://cglgenomics.ucsc.edu/november-2023-nature-zoonomia-with-expanded-primates-alignment/) given a phylogenetic tree. The stationary distribution of the substitution process described by the F81 model indicates the relative viability of each allele at any given locus. As a result, PhyloGPN is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction. It is especially useful for sequences that are not directly in the human reference genome.
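
The link between an F81 rate matrix and its stationary distribution, as described in the PhyloGPN paragraph, can be checked numerically. The following is a minimal sketch in plain Python and is not taken from the PhyloGPN code: the helper names, the distribution `pi`, and the rate scale `mu` are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: a hand-built F81 rate matrix, not PhyloGPN code.
# The values of `pi` and `mu` below are made up for the example.

NUCLEOTIDES = "ACGT"


def f81_rate_matrix(pi, mu=1.0):
    """Return the F81 rate matrix Q with Q[i][j] = mu * pi[j] for i != j.

    Diagonal entries are set so that every row sums to zero.
    """
    n = len(pi)
    q = [[mu * pi[j] for j in range(n)] for _ in range(n)]
    for i in range(n):
        q[i][i] = -mu * (1.0 - pi[i])
    return q


def left_multiply(pi, q):
    """Compute the row vector pi @ Q."""
    n = len(pi)
    return [sum(pi[i] * q[i][j] for i in range(n)) for j in range(n)]


# Hypothetical stationary distribution over A, C, G, T at a single locus.
pi = [0.7, 0.1, 0.1, 0.1]
Q = f81_rate_matrix(pi)

row_sums = [sum(row) for row in Q]
stationarity = left_multiply(pi, Q)  # numerically zero when pi is stationary

for base, value in zip(NUCLEOTIDES, stationarity):
    print(f"(pi Q)[{base}] = {value:.2e}")
```

Because `pi @ Q` vanishes, `pi` is the stationary distribution of the process, which is why a model that outputs F81 parameters can be read as assigning a probability to each allele.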

+* Quick start
+```python
+import gpn.model
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)
+```
+
+## GPN-Star
+*Under construction*
+### Examples
+* Play with the model: `examples/star/demo.ipynb`
+### Analyses
+* Main results on variant effect prediction: `analysis/gpn-star/train_and_eval/workflow/notebooks`
+* Complex trait heritability analysis (S-LDSC): `analysis/gpn-star/s-ldsc`
+
+More coming soon!
+
 ## Citation
 [GPN](https://doi.org/10.1073/pnas.2311219120):
 ```bibtex
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+.snakemake
+results
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Interpretation analysis
+
+```bash
+snakemake --cores all
+```
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+models:
+  human:
+    - V_small
+    - V_medium
+    - V_large
+    - M_small
+    - M_medium
+    - M_large
+    - P_small
+    - P_medium
+    - P_large
+  human_M_P:
+    - M_small
+    - M_medium
+    - M_large
+    - P_small
+    - P_medium
+    - P_large
+
+gpn_star:
+  V_small:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.2/multiz100way/100/128/64/True/defined.phastCons100way.percentile-75_0.05_0.001/small/0.1/42/150000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/multiz100way/100
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/multiz100way/100/phylo_dist/0.2
+    window_size: 128
+
+  V_medium:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.2/multiz100way/100/128/64/True/defined.phastCons100way.percentile-75_0.05_0.001/medium/0.1/42/150000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/multiz100way/100
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/multiz100way/100/phylo_dist/0.2
+    window_size: 128
+
+  V_large:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.2/multiz100way/100/128/64/True/defined.phastCons100way.percentile-75_0.05_0.001/large/0.1/42/200000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/multiz100way/100
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/multiz100way/100/phylo_dist/0.2
+    window_size: 128
+
+  M_small:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/447/256/128/True/defined.phastCons470way.percentile-75_0.05_0.001/small/0.1/42/150000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/447
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/447/phylo_dist/0.05
+    window_size: 256
+
+  M_medium:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/447/256/128/True/defined.phastCons470way.percentile-75_0.05_0.001/medium/0.1/42/150000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/447
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/447/phylo_dist/0.05
+    window_size: 256
+
+  M_large:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/447/256/128/True/defined.phastCons470way.percentile-75_0.05_0.001/large/0.1/42/200000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/447
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/447/phylo_dist/0.05
+    window_size: 256
+
+  P_small:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/243/256/128/True/defined.phastCons43way.percentile-75_0.05_0.001/small/0.1/42/150000/True/True/0.2
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/243
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/243/phylo_dist/0.05
+    window_size: 256
+
+  P_medium:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/243/256/128/True/defined.phastCons43way.percentile-75_0.05_0.001/medium/0.1/42/150000/True/True/0.2
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/243
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/243/phylo_dist/0.05
+    window_size: 256
+
+  P_large:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/hg38/fire_1/0.05/cactus447way/243/256/128/True/defined.phastCons43way.percentile-75_0.05_0.001/large/0.1/42/200000/True/True/0.2
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/hg38/cactus447way/243
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/hg38/cactus447way/243/phylo_dist/0.05
+    window_size: 256
+
+  dm6:
+    model_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/checkpoints/dm6/fire_1/0.2/multiz124way/124/128/64/True/defined.phastCons124way.percentile-75_0.05_0.001/medium/0.1/42/50000/True/True/0.1
+    msa_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/msa/dm6/multiz124way/124
+    phylo_info_path: /scratch/users/czye/GPN/egpn/analysis/human/egpn/workflow/results/phylo_info/dm6/multiz124way/124/phylo_dist/0.2
+    window_size: 128
+
+nuc_dep:
+  TH:
+    chrom: "11"
+    start: 2171682
+    end: 2171868
+    strand: "-"
+  LDLR:
+    chrom: "19"
+    start: 11089299
+    end: 11089425
+    strand: "+"
+  HBA1:
+    chrom: "16"
+    start: 176699
+    end: 176955
+    strand: "+"
+  GRIA4:
+    chrom: "11"
+    start: 105609444
+    end: 105609472
+    strand: "+"
+  drosophila_MSE:
+    chrom: "2R"
+    start: 9977838
+    end: 9977966
+    strand: "+"
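
A quick sanity check of the `nuc_dep` regions in this config might look like the sketch below. It is not part of the commit. The dictionary is copied by hand from the config above instead of being loaded from YAML, the helper name `validate_regions` is hypothetical, and treating the coordinates as 0-based half-open (so that length is `end - start`) is an assumption the config does not state.

```python
# Sketch under assumptions: the dictionary mirrors the `nuc_dep` block of the
# config; reading the coordinates as 0-based half-open is an assumption.
NUC_DEP = {
    "TH": {"chrom": "11", "start": 2171682, "end": 2171868, "strand": "-"},
    "LDLR": {"chrom": "19", "start": 11089299, "end": 11089425, "strand": "+"},
    "HBA1": {"chrom": "16", "start": 176699, "end": 176955, "strand": "+"},
    "GRIA4": {"chrom": "11", "start": 105609444, "end": 105609472, "strand": "+"},
    "drosophila_MSE": {"chrom": "2R", "start": 9977838, "end": 9977966, "strand": "+"},
}


def validate_regions(regions):
    """Check strand and coordinate order, then return each region's length."""
    lengths = {}
    for name, region in regions.items():
        if region["strand"] not in {"+", "-"}:
            raise ValueError(f"{name}: unexpected strand {region['strand']!r}")
        if region["end"] <= region["start"]:
            raise ValueError(f"{name}: end must be greater than start")
        # Half-open interval assumption: length is simply end - start.
        lengths[name] = region["end"] - region["start"]
    return lengths


lengths = validate_regions(NUC_DEP)
for name, length in lengths.items():
    print(f"{name}: {length} bp")
```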
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+configfile: "config/config.yaml"
+
+
+include: "rules/common.smk"
+include: "rules/nuc_dep.smk"
+include: "rules/umap.smk"
+
+
+models_human = config["models"]["human"]
+models_human_M_P = config["models"]["human_M_P"]
+
+
+rule all:
+    input:
+        expand(
+            "results/plots/umap/{intervals}/{model}/region.svg",
+            intervals=[
+                "v8_0_subsamplestratified_20000",
+            ],
+            model=models_human,
+        ),
+        expand("results/plots/nuc_dep/TH/{model}.svg", model=models_human_M_P),
+        expand("results/plots/nuc_dep/LDLR/{model}.svg", model=models_human),
+        expand("results/plots/nuc_dep/HBA1/{model}.svg", model=models_human_M_P),
+        expand("results/plots/nuc_dep/GRIA4/{model}.svg", model=models_human),
+        expand("results/plots/nuc_dep/drosophila_MSE/{model}.svg", model=["dm6"]),
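
To see which files `rule all` requests, the `expand` calls can be imitated in plain Python. This is a rough sketch and not Snakemake itself: `expand_sketch` is a hypothetical, simplified stand-in that only reproduces the default behaviour of taking every combination of the supplied wildcard values, and the model lists are copied by hand from the config.

```python
# Sketch only: a simplified stand-in for Snakemake's `expand`, used to list
# the targets requested by `rule all`. The real helper has more features.
from itertools import product

MODELS_HUMAN = [
    "V_small", "V_medium", "V_large",
    "M_small", "M_medium", "M_large",
    "P_small", "P_medium", "P_large",
]
MODELS_HUMAN_M_P = [m for m in MODELS_HUMAN if not m.startswith("V_")]


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    keys = list(wildcards)
    value_lists = [
        list(v) if isinstance(v, (list, tuple)) else [v]
        for v in wildcards.values()
    ]
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*value_lists)
    ]


targets = (
    expand_sketch(
        "results/plots/umap/{intervals}/{model}/region.svg",
        intervals=["v8_0_subsamplestratified_20000"],
        model=MODELS_HUMAN,
    )
    + expand_sketch("results/plots/nuc_dep/TH/{model}.svg", model=MODELS_HUMAN_M_P)
    + expand_sketch("results/plots/nuc_dep/LDLR/{model}.svg", model=MODELS_HUMAN)
    + expand_sketch("results/plots/nuc_dep/HBA1/{model}.svg", model=MODELS_HUMAN_M_P)
    + expand_sketch("results/plots/nuc_dep/GRIA4/{model}.svg", model=MODELS_HUMAN)
    + expand_sketch("results/plots/nuc_dep/drosophila_MSE/{model}.svg", model=["dm6"])
)

print(f"{len(targets)} target files")
```

In the actual workflow, `snakemake --cores all` resolves these targets through the rules in the included `.smk` files, which are not shown in this part of the diff.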
