22 changes: 19 additions & 3 deletions README.md
@@ -15,7 +15,7 @@ There are two games to benchmark: supervised and zero-shot. Each game has its se
- Supervised game is defined in this [dvc.yaml](supervised/dvc.yaml)
- Zero-shot game is defined in this [dvc.yaml](zero_shot/dvc.yaml)

The models and datasets are defined in `vars` at the top, and DVC translates `vars` into a matrix, which is namely a loop defined as the below pseudo-code:
The models and datasets are defined in `vars` at the top, and DVC translates `vars` into a matrix, which is namely a loop defined as the following pseudo-code:

```python
for dataset in datasets:
    ...
```
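As a concrete illustration of that expansion (the dataset and model names below are hypothetical stand-ins, not this repo's actual `vars`):

```python
# DVC's matrix expands the `vars` lists into one stage per (dataset, model)
# combination; the names here are illustrative only.
datasets = ["neime", "ranganathan"]  # hypothetical entries
models = ["esm", "pls"]              # hypothetical entries

stages = [f"benchmark@{dataset}-{model}" for dataset in datasets for model in models]
print(stages)
```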

### Supervised

You can benchmark for a group of supervised models:
You can benchmark a group of supervised models:
```shell
cd supervised && dvc repro
```

### Zero-shot

You can benchmark for a group of zero-shot models:
You can benchmark a group of zero-shot models:
```shell
cd zero_shot && dvc repro
```

## AWS

There are two environments in which to run the benchmark: the local environment and the AWS environment.

The AWS environment differs in these ways:
* You need to upload the data and model TOML files and the actual data to S3.
* You need to build and push your Docker image to ECR.
* You need to use a SageMaker training job to either train or score a model.

> [!IMPORTANT]
> In order to use the AWS environment, you need to set up your AWS profile with the following steps:
> 1. Execute `aws configure sso`.
> 2. Fill in the required fields; in particular, set "Default client Region" to "us-east-1".
> 3. You can find your account ID and profile by executing `cat ~/.aws/config`.
> 4. Finally, you can run `dvc repro` in each game with the environment variables set: `AWS_ACCOUNT_ID=xxx AWS_PROFILE=yyy dvc repro`
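The three AWS prerequisites above can be sketched as shell commands. This is a sketch only: the bucket name, image name, and account ID are placeholders, not values from this repo.

```shell
# Sketch only: resource names below are placeholders.
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1

# 1. Upload the manifest TOML files and the actual data to S3.
aws s3 cp datasets/neime/dataset.zip s3://my-benchmark-bucket/datasets/neime/dataset.zip
aws s3 cp models/esm/manifest.toml s3://my-benchmark-bucket/models/esm/manifest.toml

# 2. Build and push the Docker image to ECR.
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
docker build -t pg2-model-esm models/esm
docker tag pg2-model-esm "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/pg2-model-esm:latest"
docker push "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/pg2-model-esm:latest"

# 3. A SageMaker training job (launched via dvc repro) then trains or scores the model.
```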

## Generate dummy data

You can generate dummy data by the following command:
496 changes: 0 additions & 496 deletions datasets/dummy/charge_ladder.csv

This file was deleted.

11 changes: 0 additions & 11 deletions datasets/dummy/charge_ladder.toml

This file was deleted.

Binary file added datasets/dummy/dataset.zip
923 changes: 0 additions & 923 deletions datasets/neime/A0A1I9GEU1_NEIME_Kennouche_2019.csv

This file was deleted.

Binary file added datasets/neime/dataset.zip
9 changes: 0 additions & 9 deletions datasets/neime/neime.toml

This file was deleted.

4,997 changes: 0 additions & 4,997 deletions datasets/ranganathan/BLAT_ECOLX_Ranganathan2015.csv

This file was deleted.

Binary file added datasets/ranganathan/dataset.zip
10 changes: 0 additions & 10 deletions datasets/ranganathan/ranganathan.toml

This file was deleted.

6 changes: 3 additions & 3 deletions models/esm/esm.toml → models/esm/manifest.toml
@@ -1,6 +1,6 @@
[hyper_params]
name = "esm"
offset_idx = 24
location = "esm2_t30_150M_UR50D"
scoring_strategy = "wt-marginals"

[hyper_params]
offset_idx = 24
nogpu = false
57 changes: 34 additions & 23 deletions models/esm/src/pg2_model_esm/__main__.py
@@ -1,11 +1,12 @@
import torch
import typer
from pathlib import Path
from rich.console import Console
from pg2_dataset.dataset import Manifest
from pg2_dataset.dataset import Dataset
from tqdm import tqdm
from esm import pretrained
from pg2_model_esm.utils import compute_pppl, label_row
from pg2_model_esm.manifest import Manifest as ModelManifest
from pg2_model_esm.manifest import Manifest


app = typer.Typer(
err_console = Console(stderr=True)
console = Console()

prefix = Path("/opt/ml")
training_data_path = prefix / "input" / "data" / "training" / "dataset.zip"
manifest_path = prefix / "input" / "data" / "manifest" / "manifest.toml"
params_path = prefix / "input" / "config" / "hyperparameters.json"
output_path = prefix / "model"

model_path = Path("/model.pkl")


@app.command()
def train(
dataset_toml_file: str = typer.Option(help="Path to the dataset TOML file"),
model_toml_file: str = typer.Option(help="Path to the model TOML file"),
nogpu: bool = typer.Option(False, help="GPUs available"),
dataset_zip_file: str = typer.Option(
default="", help="Path to the dataset ZIP file"
> **JCZuurmond (Contributor, Jul 29, 2025):** This option is required, right? Also, could you update the syntax to the annotated version where the option is on the left side of the equals? And update both types to `Path`?
> Suggested change: `default="", help="Path to the dataset ZIP file"` → `help="Path to the dataset ZIP file"`
> **tintinrevient (Author):** The option is not required, because in AWS there are no paths passed from user input to use a local file path.
),
model_toml_file: str = typer.Option(default="", help="Path to the model TOML file"),
):
console.print(f"Loading {dataset_toml_file} and {model_toml_file}...")
console.print(f"Loading {dataset_zip_file} and {model_toml_file}...")

manifest = Manifest.from_path(dataset_toml_file)
dataset_name = manifest.name
dataset = manifest.ingest()
dataset_zip_file = dataset_zip_file or training_data_path
> **JCZuurmond (Contributor):** Why introduce this `or`?
> **tintinrevient (Author):** Because in the AWS environment there is no `dataset_file` passed by the user; the SageMaker training job automatically mounts the S3 path at a fixed location inside the container.

dataset = Dataset.from_path(dataset_zip_file)
dataset_name = dataset.name

model_toml_file = model_toml_file or manifest_path
> **JCZuurmond (Contributor):** Similar question about the `or` statement.

hyper_params = Manifest.from_path(model_toml_file).hyper_params
model_name = hyper_params["name"]

assays = dataset.assays.meta.assays
targets = list(dataset.assays.meta.assays.keys())

console.print(f"Loaded {len(df)} records.")

model_manifest = ModelManifest.from_path(model_toml_file)

model_name = model_manifest.name
location = model_manifest.location
scoring_strategy = model_manifest.scoring_strategy
hyper_params = model_manifest.hyper_params

model, alphabet = pretrained.load_model_and_alphabet(location)
model, alphabet = pretrained.load_model_and_alphabet(hyper_params["location"])
model.eval()

console.print(
f"Loaded the model from {location} with scoring strategy {scoring_strategy}."
f"Loaded the model from {hyper_params['location']} with scoring strategy {hyper_params['scoring_strategy']}."
)

if torch.cuda.is_available() and not nogpu:
if torch.cuda.is_available() and not hyper_params["nogpu"]:
model = model.cuda()
print("Transferred model to GPU")


batch_labels, batch_strs, batch_tokens = batch_converter(data)

match scoring_strategy:
match hyper_params["scoring_strategy"]:
case "wt-marginals":
with torch.no_grad():
token_probs = torch.log_softmax(model(batch_tokens)["logits"], dim=-1)
)

case _:
err_console.print(f"Error: Invalid scoring strategy: {scoring_strategy}")
err_console.print(
f"Error: Invalid scoring strategy: {hyper_params['scoring_strategy']}"
)

df.rename(columns={targets[0]: "test"}, inplace=True)
df.to_csv(f"/output/{dataset_name}_{model_name}.csv", index=False)
df.to_csv(f"{output_path}/{dataset_name}_{model_name}.csv", index=False)
> **JCZuurmond (Contributor):** Move path to variable.

console.print(f"Saved the metrics in CSV in output/{dataset_name}_{model_name}.csv")
console.print(
f"Saved the metrics in CSV in {output_path}/{dataset_name}_{model_name}.csv"
)
console.print("Done.")


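The core pattern this diff introduces is a CLI option that falls back to SageMaker's fixed mount points when left empty. A minimal sketch follows: the `/opt/ml` paths are copied from the diff, while the `resolve` helper is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

# SageMaker mounts input channels and the model output directory under /opt/ml,
# so inside the container the code works without any user-supplied paths.
PREFIX = Path("/opt/ml")
TRAINING_DATA_PATH = PREFIX / "input" / "data" / "training" / "dataset.zip"
MANIFEST_PATH = PREFIX / "input" / "data" / "manifest" / "manifest.toml"
OUTPUT_PATH = PREFIX / "model"

def resolve(cli_value: str, sagemaker_default: Path) -> Path:
    """Prefer an explicit CLI path (local runs); else fall back to the mounted default (AWS runs)."""
    return Path(cli_value) if cli_value else sagemaker_default

print(resolve("", TRAINING_DATA_PATH))                            # AWS run
print(resolve("datasets/neime/dataset.zip", TRAINING_DATA_PATH))  # local run
```

This mirrors the `dataset_zip_file = dataset_zip_file or training_data_path` lines in the diff, just factored into one helper.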
8 changes: 3 additions & 5 deletions models/esm/src/pg2_model_esm/manifest.py
Original file line number Diff line number Diff line change
@@ -1,15 +1,13 @@
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, ConfigDict
from pathlib import Path
from typing import Self, Any
import toml


class Manifest(BaseModel):
name: str = ""
hyper_params: dict[str, Any] = Field(default_factory=dict)
model_config = ConfigDict(extra="allow")

location: str = ""
scoring_strategy: str = ""
hyper_params: dict[str, Any] = Field(default_factory=dict)

@classmethod
def from_path(cls, toml_file: Path) -> Self:
3 changes: 1 addition & 2 deletions models/pls/pls.toml → models/pls/manifest.toml
@@ -1,6 +1,5 @@
name = "pls"

[hyper_params]
name = "pls"
n_components = 2
aa_alphabet = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
aa_alphabet_length = 20
45 changes: 30 additions & 15 deletions models/pls/src/pg2_model_pls/__main__.py
@@ -1,10 +1,10 @@
import polars as pl
> **JCZuurmond (Contributor):** Similar comments as for the other script.

from pathlib import Path
from rich.console import Console
from pg2_dataset.dataset import Manifest
from pg2_dataset.dataset import Dataset
from pg2_dataset.splits.abstract_split_strategy import TrainTestValid
from pg2_model_pls.manifest import Manifest as ModelManifest
from pg2_model_pls.manifest import Manifest
from pg2_model_pls.utils import load_x_and_y, train_model, predict_model

import typer

app = typer.Typer(

console = Console()

prefix = Path("/opt/ml")
training_data_path = prefix / "input" / "data" / "training" / "dataset.zip"
manifest_path = prefix / "input" / "data" / "manifest" / "manifest.toml"
params_path = prefix / "input" / "config" / "hyperparameters.json"
output_path = prefix / "model"

model_path = Path("/model.pkl")


@app.command()
def train(
dataset_toml_file: str = typer.Option(help="Path to the dataset TOML file"),
model_toml_file: str = typer.Option(help="Path to the model TOML file"),
dataset_zip_file: str = typer.Option(
default="", help="Path to the dataset ZIP file"
),
model_toml_file: str = typer.Option(default="", help="Path to the model TOML file"),
):
console.print(f"Loading {dataset_toml_file} and {model_toml_file}...")
console.print(f"Loading {dataset_zip_file} and {model_toml_file}...")

dataset_name = Manifest.from_path(dataset_toml_file).name
dataset_zip_file = dataset_zip_file or training_data_path
dataset = Dataset.from_path(dataset_zip_file)
dataset_name = dataset.name

model_path = "/model.pkl"
model_name = ModelManifest.from_path(model_toml_file).name
model_toml_file = model_toml_file or manifest_path
hyper_params = Manifest.from_path(model_toml_file).hyper_params
model_name = hyper_params["name"]

train_X, train_Y = load_x_and_y(
dataset_toml_file=dataset_toml_file,
dataset=dataset,
split=TrainTestValid.train,
)

train_model(
train_X=train_X,
train_Y=train_Y,
model_toml_file=model_toml_file,
model_path=model_path,
hyper_params=hyper_params,
)

console.print("Finished the training...")

valid_X, valid_Y = load_x_and_y(
dataset_toml_file=dataset_toml_file,
dataset=dataset,
split=TrainTestValid.valid,
)


pred_y = predict_model(
test_X=valid_X,
model_toml_file=model_toml_file,
model_path=model_path,
hyper_params=hyper_params,
)

console.print("Finished the scoring...")
}
)

df.write_csv(f"/output/{dataset_name}_{model_name}.csv")
console.print(f"Saved the metrics in CSV in output/{dataset_name}_{model_name}.csv")
df.write_csv(f"{output_path}/{dataset_name}_{model_name}.csv")
console.print(
f"Saved the metrics in CSV in {output_path}/{dataset_name}_{model_name}.csv"
)

console.print("Done.")

5 changes: 3 additions & 2 deletions models/pls/src/pg2_model_pls/manifest.py
@@ -1,11 +1,12 @@
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, ConfigDict
from pathlib import Path
from typing import Self, Any
import toml


class Manifest(BaseModel):
> **JCZuurmond (Contributor):** The manifest should probably go into the pg2-benchmark package.
> **tintinrevient (Author):** I've put them into pg2-benchmark! Good point: in the future we will extend it with model cards, so it is sensible to keep it there.

name: str = ""
model_config = ConfigDict(extra="allow")

hyper_params: dict[str, Any] = Field(default_factory=dict)

@classmethod