Commit b998100

Update source code according to README.md APIs

1 parent 92e18a4

10 files changed: 228 additions, 271 deletions

models/README.md

Lines changed: 72 additions & 84 deletions
````diff
@@ -4,31 +4,40 @@ This README details how to add a model to the benchmark.
 
 ## Entrypoints
 
-A model requires the following entrypoints: `train` and `predict`:
+A model requires only one entrypoint: the `train` method, which you can reference in the two models below:
 
-The `train` entrypoint is only required for **supervised** models.
-The `predict` entrypoint is required for all models.
+* [esm/src/pg2_model_esm/__main__.py](esm/src/pg2_model_esm/__main__.py)
+* [pls/src/pg2_model_pls/__main__.py](pls/src/pg2_model_pls/__main__.py)
 
-Both entrypoints expect a reference to a dataset: `dataset_reference`.
-Additionally, the `train` entrypoint expects a reference to the model card
-and the `predict` entrypoint expects a reference to the peristed model:
-`model_card_reference` and `model_reference`, respectively.
+Both **supervised** and **zero-shot** models call this `train` method: it is the glue that ties the `pg2-dataset` and `pg2-benchmark` packages and the models' original source code together. The method is named `train` because SageMaker looks for a `train` method as its entrypoint, so it serves as the common entrypoint in both environments: local and AWS.
 
-Finally, the `train` entrypoint outputs the model reference, which is the input
-for the `predict` entrypoint next to the dataset. The `predict` entrypoints
-outputs the inferred predictions:
+This entrypoint expects a reference to a dataset, e.g., loaded by `pg2-dataset`:
 
-From the commandline these entrypoints interact as follows:
+```python
+from pg2_dataset.dataset import Dataset
+dataset = Dataset.from_path(dataset_file)
+```
+
+Additionally, this entrypoint expects a reference to a model card, e.g., loaded by `pg2-benchmark`:
 
-``` bash
-$ train ./path/to/dataset_train.pgdata ./path/to/model_card.md
-./path/to/model.pickle
-$ predict ./path/to/dataset_validate.pgdata ./path/to/model.pickle
-[0.8, 0.5, ..., .04]
+```python
+from pg2_benchmark.manifest import Manifest
+manifest = Manifest.from_path(model_toml_file)
 ```
 
-For reference, below an example Python implementation with `typer`:
+Finally, inside this `train` method:
 
+* For a **supervised** model, like [esm](esm/), it calls `load_model` and `predict_model` in order:
+  * `load_model` takes `manifest` as input, and returns a model object.
+  * `predict_model` takes `dataset`, `manifest` and the model object as input, and returns the inferred predictions in a data frame.
+
+* For a **zero-shot** model, like [pls](pls/), it calls `train_model` and `predict_model` in order:
+  * `train_model` takes `dataset` and `manifest` as input, and returns a model object.
+  * `predict_model` takes `dataset`, `manifest` and the model object as input, and returns the inferred predictions in a data frame.
+
+The result data frame is saved on disk in the local environment and stored in AWS S3 in the cloud environment, so it persists after the container is destroyed for the later metric calculation.
+
+For reference, below is an example Python implementation with `typer`:
 
 ``` python
 # In `__main__.py`
````
````diff
@@ -58,52 +67,19 @@ def train(
         ),
     ],
 ) -> Path:
-    """Train the model on the dataset.
-
-    Args:
-        dataset_reference (Path) : Path to the archived dataset.
-        model_reference (Path) : Path to the model card file.
-
-    Returns:
-        Path : The trained and persisted model.
-    """
+
     dataset = Dataset.from_path(dataset_path)
     manifest = Manifest.from_path(model_card_path)
 
-    # Train the model below
-    model_reference = ...
-    return model_reference
-
-
-def predict(
-    dataset_reference: Annotated[
-        Path,
-        typer.Option(
-            help="Path to the archived dataset",
-        ),
-    ],
-    model_reference: Annotated[
-        Path,
-        typer.Option(
-            help="Path to the model file",
-        ),
-    ],
-) -> Iterable[float]:
-    """Predict (aka infer) given the dataset and the model.
-
-    Args:
-        dataset_reference (Path) : Path to the archived dataset.
-        model_reference (Path) : Path to the persisted and trained model file.
-
-    Returns:
-        Iterable[float] : The predictions.
-    """
-    dataset = Dataset.from_path(dataset_path)
-    model = pickle.load(model_reference)
+    # For a supervised model
+    model = load(manifest)
+    df = predict(dataset, manifest, model)
+    df.to_csv(...)
 
-    # Predict the model below
-    predictions = ...
-    return predictions
+    # For a zero-shot model
+    model = train(dataset, manifest)
+    df = predict(dataset, manifest, model)
+    df.to_csv(...)
 
 
 if __name__ == "__main__":
````
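Pieced together from the hunks above, the control flow of the single `train` entrypoint can be sketched with stand-in stubs. `Dataset`, `Manifest`, and the model functions here are placeholders for the real pg2-dataset, pg2-benchmark, and model-package objects, and the `zero_shot` flag is an assumption about how a model declares its kind, not the real API:

```python
# Sketch of the single `train` entrypoint's control flow.
# All names here are stand-ins for the real pg2-dataset / pg2-benchmark /
# model-package objects, which are not fully shown in the diff.
from dataclasses import dataclass


@dataclass
class Manifest:
    name: str
    zero_shot: bool  # hypothetical flag; how the real Manifest marks this is not shown


def load_model(manifest: Manifest) -> dict:
    """Supervised case: restore a pretrained model (stub)."""
    return {"model": manifest.name}


def train_model(dataset: str, manifest: Manifest) -> dict:
    """Zero-shot case: fit a model on the dataset (stub)."""
    return {"model": manifest.name, "fitted_on": dataset}


def predict_model(dataset: str, manifest: Manifest, model: dict) -> list[float]:
    """Stub for predict_model(dataset, manifest, model) -> data frame."""
    return [0.8, 0.5, 0.4]


def train(dataset: str, manifest: Manifest) -> list[float]:
    """The glue entrypoint: obtain a model one way or the other, then predict."""
    if manifest.zero_shot:
        model = train_model(dataset, manifest)
    else:
        model = load_model(manifest)
    return predict_model(dataset, manifest, model)
```

In the real entrypoint the resulting data frame is written out with `df.to_csv(...)` rather than returned.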
````diff
@@ -121,18 +97,18 @@ following code structure:
 
 ``` tree
 ├── __main__.py
-├── predict.py  # For supervised models only
+├── model.py
 ├── preprocess.py
-└── train.py
+└── utils.py
 ```
 
 ### `__main__.py`
 
-The `__main__.py` contains the `train` and `predict` entrypoints as shown above.
-The code loads the dataset and model (card) before passing it to the `train_model`
-or `predict_model` methods after preprocessing.
+The `__main__.py` contains the `train` entrypoint as shown above.
+The code loads the dataset and model (card) before passing them to the `load_model`, `train_model`
+or `predict_model` methods.
 
-### `preprocess.py
+### `preprocess.py`
 
 `preprocess.py` contains the data preprocessing code, functions like:
 
````
````diff
@@ -148,35 +124,50 @@ def train_test_split(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
     return train_data, test_data
 ```
 
-### `train.py`
+### `model.py`
 
-`train.py` contains the training code, functions like:
+`model.py` contains the model-related code:
 
 ``` python
-def train(model, Any, X: np.ndarray, y: np.array) -> Path
+def train(dataset: Dataset, manifest: Manifest) -> Any:
     """Train the model."""
+    X, y = load_x_and_y(
+        dataset=dataset,
+        split="train",
+    )
+
+    model = Model(manifest)
     model.fit(X, y)
-    model_path = model.persist()
-    return model_path
+
+    return model
 ```
 
 ``` python
-def load(model_card_reference: Path) -> Any:
+def load(manifest: Manifest) -> Any:
     """Load the model."""
-    model_config = ModelCard.from_path(model_card_reference)
-    model = Model.from_config(model_config)
+    model = Model.from_manifest(manifest)
     return model
 ```
 
-### `predict.py`
-
 ``` python
-def predict(model: Any, X: np.ndarray) -> np.array:
+def predict(dataset: Dataset, manifest: Manifest, model: Any) -> DataFrame:
     """Infer predictions on the data."""
-    predictions = model.predict(X)
-    return predictions
+    X, y = load_x_and_y(
+        dataset=dataset,
+        split="test",
+    )
+
+    predictions = model.predict(manifest, X)
+
+    df = DataFrame(predictions)
+
+    return df
 ```
 
````
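The `load_x_and_y` helper used in `train` and `predict` above is never defined in the diff. A minimal sketch, assuming the dataset maps split names to feature/label pairs (an assumption, not the real pg2-dataset accessor):

```python
import numpy as np


def load_x_and_y(dataset: dict, split: str) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: return features X and labels y for one split.

    Here `dataset` is assumed to map split names ("train"/"test") to
    (features, labels) pairs; the real pg2-dataset API is not shown.
    """
    features, labels = dataset[split]
    return np.asarray(features), np.asarray(labels)
```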
````diff
+### `utils.py`
+
+It contains helper methods from the original models' code that are used by `model.py`.
+
 ## Backends
 
 This section details common logic per backend.
````
````diff
@@ -197,14 +188,11 @@ class SageMakerPathLayout:
     TRAINING_JOB_PATH: Path = PREFIX / "input" / "data" / "training" / "dataset.zip"
     """Path to training data."""
 
-    MODEL_CARD_PATH: PAth = PREFIX / "input" / "config" / "model_card.md"
-    """Path to the model card."""
-
-    MODEL_PATH: Path = Path("/model.pkl")
-    """Model path."""
+    MANIFEST_PATH: Path = PREFIX / "input" / "data" / "manifest" / "manifest.toml"
+    """Path to the model manifest."""
 
     OUTPUT_PATH = PREFIX / "output"
-    """Output path"""
+    """Path to the output, such as the result data frames."""
 ```
 
 For example, to persist the score for a given dataset and model as csv:
````
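The example that follows this sentence is cut off in this excerpt. A sketch of what such persistence might look like, assuming pandas and the `<dataset>_<model>.csv` naming used by the esm entrypoint; `persist_scores` is a hypothetical helper, not part of the real API:

```python
# Sketch: persist the score data frame as CSV under the backend output path.
from pathlib import Path

import pandas as pd

OUTPUT_PATH = Path("/opt/ml/output")  # mirrors SageMakerPathLayout.OUTPUT_PATH


def persist_scores(df: pd.DataFrame, dataset_name: str, model_name: str,
                   output_path: Path = OUTPUT_PATH) -> Path:
    """Write the predictions data frame to `<output>/<dataset>_<model>.csv`."""
    target = output_path / f"{dataset_name}_{model_name}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return target
```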

models/esm/src/pg2_model_esm/__main__.py

Lines changed: 3 additions & 9 deletions
````diff
@@ -3,7 +3,7 @@
 import typer
 from rich.console import Console
 from pg2_dataset.dataset import Dataset
-from pg2_model_esm.predict import load_model, predict_model
+from pg2_model_esm.model import load, predict
 from pg2_benchmark.manifest import Manifest
 
 
@@ -12,19 +12,15 @@
     add_completion=True,
 )
 
-err_console = Console(stderr=True)
 console = Console()
 
 
 class SageMakerTrainingJobPath:
     PREFIX = Path("/opt/ml")
     TRAINING_JOB_PATH = PREFIX / "input" / "data" / "training" / "dataset.zip"
     MANIFEST_PATH = PREFIX / "input" / "data" / "manifest" / "manifest.toml"
-    PARAMS_PATH = PREFIX / "input" / "config" / "hyperparameters.json"
     OUTPUT_PATH = PREFIX / "model"
 
-    MODEL_PATH = Path("/model.pkl")
-
 
 @app.command()
 def train(
@@ -44,12 +40,11 @@ def train(
     console.print(f"Loading {dataset_file} and {model_toml_file}...")
 
     dataset = Dataset.from_path(dataset_file)
-
     manifest = Manifest.from_path(model_toml_file)
 
-    model, alphabet = load_model(manifest)
+    model, alphabet = load(manifest)
 
-    df = predict_model(
+    df = predict(
         dataset=dataset,
         manifest=manifest,
         model=model,
@@ -64,7 +59,6 @@ def train(
     console.print(
         f"Saved the metrics in CSV in {SageMakerTrainingJobPath.OUTPUT_PATH}/{dataset.name}_{manifest.name}.csv"
     )
-    console.print("Done.")
 
 
 @app.command()
````
Lines changed: 35 additions & 4 deletions
````diff
@@ -3,16 +3,28 @@
 from tqdm import tqdm
 import pandas as pd
 from esm import pretrained
-from pg2_model_esm.utils import compute_pppl, label_row
-from pg2_benchmark.manifest import Manifest
 from pg2_dataset.dataset import Dataset
+from pg2_benchmark.manifest import Manifest
 from pg2_model_esm.preprocess import encode
+from pg2_model_esm.utils import compute_pppl, label_row
 import logging
 
 logger = logging.getLogger(__name__)
 
 
-def load_model(manifest: Manifest) -> tuple[torch.nn.Module, Alphabet]:
+def load(manifest: Manifest) -> tuple[torch.nn.Module, Alphabet]:
+    """Load and configure an ESM model and its alphabet.
+
+    Loads a pretrained ESM model from the location specified in the manifest,
+    sets it to evaluation mode, and optionally transfers it to GPU if available
+    and not disabled.
+
+    Args:
+        manifest: Configuration object containing model location and GPU settings
+
+    Returns:
+        tuple: The loaded ESM model and its corresponding alphabet
+    """
     model, alphabet = pretrained.load_model_and_alphabet(
         manifest.hyper_params["location"]
     )
@@ -25,12 +37,31 @@ def load_model(manifest: Manifest) -> tuple[torch.nn.Module, Alphabet]:
     return model, alphabet
 
 
-def predict_model(
+def predict(
     dataset: Dataset,
     manifest: Manifest,
     model: torch.nn.Module,
     alphabet: Alphabet,
 ) -> pd.DataFrame:
+    """Generate predictions for protein mutations using an ESM model.
+
+    Computes fitness scores for protein mutations using one of three scoring
+    strategies: wild-type marginals, masked marginals, or pseudo-perplexity.
+    The scoring strategy is determined by the manifest configuration.
+
+    Args:
+        dataset: Dataset containing assay data with mutations to score
+        manifest: Configuration object specifying scoring strategy and parameters
+        model: The loaded ESM model for computing predictions
+        alphabet: ESM alphabet for token encoding/decoding
+
+    Returns:
+        pd.DataFrame: DataFrame with predictions added in 'pred' column and
+            target column renamed to 'test'
+
+    Raises:
+        ValueError: If an unrecognized scoring strategy is specified
+    """
     assays = dataset.assays.meta.assays
     targets = list(dataset.assays.meta.assays.keys())
````
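The new docstring names three scoring strategies selected via the manifest. The dispatch inside `predict` might be structured like this sketch; the strategy keys are assumptions and the scorers are stubs standing in for the real `label_row` / `compute_pppl` helpers:

```python
def score_mutations(strategy: str, rows: list[str]) -> list[float]:
    """Dispatch on an assumed strategy key; the scorers are stand-in stubs."""
    scorers = {
        "wt-marginals": lambda row: 0.1,      # stand-in for wild-type marginals
        "masked-marginals": lambda row: 0.2,  # stand-in for masked marginals
        "pseudo-ppl": lambda row: 0.3,        # stand-in for pseudo-perplexity
    }
    try:
        scorer = scorers[strategy]
    except KeyError:
        raise ValueError(f"Unknown scoring strategy: {strategy!r}")
    return [scorer(row) for row in rows]
```

An unrecognized key raises `ValueError`, matching the `Raises` section of the docstring.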

models/esm/src/pg2_model_esm/preprocess.py

Lines changed: 9 additions & 0 deletions
````diff
@@ -3,6 +3,15 @@
 
 
 def encode(sequence: str, alphabet: Alphabet) -> torch.Tensor:
+    """Encode a protein sequence into tokens using the ESM alphabet.
+
+    Args:
+        sequence: Protein sequence to encode
+        alphabet: ESM alphabet for tokenization
+
+    Returns:
+        Batch tokens tensor for the sequence
+    """
     data = [
         ("protein1", sequence),
     ]
````
