Regression models trained on Stabl-selected features from MOE (Molecular Operating Environment) molecular descriptors.
This baseline implements the Stabl feature selection framework (Discovery of sparse, reliable omic biomarkers with Stabl, J. Hédou et al., Nature Biotechnology, 2024) for sparse and interpretable feature selection. A set of regressors are then fitted to predict the antibody properties.
Stabl is first called over all target properties to produce target-level lists of MOE features that should be later used for prediction. We loop through cross-validation folds and target properties to generate a stabl_feature_selection_results.pkl data structure that contains:
- For every GDPa1 fold and target property, a list of Stabl-selected features
- Derivation of the optimal threshold from the FDP plot (see
build_stabl_features.ipynband Methods section) - Feature-level Stability paths (see
build_stabl_features.ipynband Methods section)
At training time, we access the precomputed lists of features and fit a set of regressors with a randomized CV search to search for the best hyperparameters. Models, preprocessors and features selected are then stored in some artifact.pkl file.
At inference time, we recover the trained models from the artifact file and make predictions.
The following model have displayed highest performance:
- HIC: Ridge
- AC-SINS_pH7.4: LGBMRegressor
- PR_CHO: XGBRegressor
- Titer: LGBMRegressor
- Tm2: MLP
The results reported for every target were obtained with the specified method which achieved optimal performance (in terms of average test spearman on test fold):
| Target | AC-SINS_pH7.4 | HIC | PR_CHO | Titer | Tm2 |
|---|---|---|---|---|---|
| Models | LGBM | Ridge | XGB | LGBM | MLP |
| Spearman | 0.395 | 0.645 | 0.453 | 0.189 | 0.132 |
| 11 | 17 | 16 | 13 | 3 |
-
Pre-computed MOE features in
../../data/features/processed_features/GDPa1/MOE_properties.csv(training features)heldout_test/MOE_properties.csv(test features)
-
Stabl-selected features for every fold in
stabl_feature_selection_results.pkl.
MOE molecular descriptors computed from predicted antibody structures by Nels Thorsteinsen.
The baseline implements a standardized CLI interface. MOE features are loaded automatically from the centralized feature store.
pixi run python -m moe_stabl_baseline train \
--data ../../data/GDPa1_v1.2_20250814.csv \
--run-dir ./runs/my_run \
[--seed 42]Trains 5 optimized models (one per property) and saves to run-dir/model_artifacts.pkl.
# Training data
pixi run python -m moe_stabl_baseline predict \
--data ../../data/GDPa1_v1.2_20250814.csv \
--run-dir ./runs/my_run \
--out-dir ./outputs/train
# Heldout test set
pixi run python -m moe_stabl_baseline predict \
--data ../../data/heldout-set-sequences.csv \
--run-dir ./runs/my_run \
--out-dir ./outputs/heldoutGenerates predictions for all samples and writes to out-dir/predictions.csv.
From repository root:
pixi run allAutomatically runs all models with 5-fold cross-validation and evaluation.
MOE molecular descriptors capture structural, electrostatic, hydrophobic, geometric, and secondary structure properties computed from predicted antibody structures. The descriptor set includes ~246 features covering:
- Structural: radius of gyration, packing scores, surface areas
- Electrostatic: charge distribution, dipole moments, multipole moments
- Hydrophobic: patch hydrophobicity, hydrophobic moments
- Secondary structure: helicity, strand content
Stabl was first developped in the context of single cell mass cytometry to allow for reliable and interpretable feature selection in high-dimensional datasets (where
Steps: Consider a trainset (or a fold) with
- Generate
$n_{bootstraps}$ bootstraps of the train fold - Then generate
$p$ artificial (uninformative) features from actual permutated features - On every bootstrap with
$2p$ features, run LASSO with hyperparameter$\lambda$ . - Report every selected features frequency across boostraps for this choice of hyperparmeter
$\lambda$ :$(f_i^{\lambda})_i$ - Repeat previous steps for every
$\lambda \in [\lambda_{min}, \lambda_{max}]$ .
For every
For any given threshold
More specifically, applying the previous selection rule, gives us a set of selected features
The Stabl paper gives nice theoretical guarantees that this quantity can serve as an estimator of the false discovery rate in this set of selected features. Moreover, we choose
For each property, Ridge, XGBoost, LightGBM, and MLP models were compared across multiple feature sets. Best configurations were selected based on 5-fold cross-validation performance.
Features are standardized using training set statistics. Ridge models apply linear regression; MLP models use a single hidden layer with early stopping for Tm2.
This baseline implements the BaseModel interface from abdev_core:
from abdev_core import BaseModel, load_features
class MoeStablBaselineModel(BaseModel):
def train(self, df: pd.DataFrame, run_dir: Path, *, seed: int = 42) -> None:
# Load MOE features from centralized store
moe_features = load_features("MOE_properties")
# Train 5 separate models with optimized configs
# ...
def predict(self, df: pd.DataFrame, run_dir: Path) -> pd.DataFrame:
# Load models and MOE features
# Generate predictions for all 5 properties
# ...Features are managed centrally by abdev_core. See the abdev_core documentation for details.
Predictions are written to <out-dir>/predictions.csv with columns:
antibody_namevh_protein_sequence,vl_protein_sequence- Predicted values for:
HIC,Tm2,Titer,PR_CHO,AC-SINS_pH7.4
- MOE descriptors: Nels Thorsteinsen
- Stabl selection: Hédou et al. (2024), "Discovery of sparse, reliable omic biomarkers with Stabl", Nature Biotechnology
- GDPa1 dataset: ginkgo-datapoints/GDPa1