Add dataset and model permutation selection feature#106

Closed
tintinrevient wants to merge 3 commits into main from feat/dataset-model-permutation

Conversation

@tintinrevient
Contributor

@tintinrevient tintinrevient commented Sep 17, 2025

Changes

Resolves #92 and #93

The user can interactively select the model and dataset permutation with the command below:

$ uv run pg2-benchmark select models datasets -g supervised -e local

🔍 Dataset and Model Selection Tool
========================================

Available ItemType.DATASETS:
  1. charge_ladder
  2. neime
  3. ranganathan

Select ItemType.DATASETS (comma-separated numbers, e.g., 1,3,5 or 'all' for all):
Selection: 1,2

Available ItemType.MODELS:
  1. esm
  2. pls

Select ItemType.MODELS (comma-separated numbers, e.g., 1,3,5 or 'all' for all):
Selection: 2

✅ Selected datasets: charge_ladder, neime
✅ Selected models: pls

Update benchmark/supervised/local/dvc.yaml and benchmark/supervised/local/params.yaml? [y/N]: y
✅ Updated benchmark/supervised/local/dvc.yaml with selected models and datasets
2025-09-17 16:58:26,720 - pg2_benchmark - INFO - Successfully updated benchmark/supervised/local/dvc.yaml with 2 datasets and 1 models
✅ Updated benchmark/supervised/local/params.yaml source paths
2025-09-17 16:58:26,723 - pg2_benchmark - INFO - Successfully updated benchmark/supervised/local/params.yaml source section
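The interactive prompt above presumably reduces to parsing logic along these lines. This is a minimal sketch; `parse_selection` is a hypothetical helper mirroring the transcript's behavior, not the actual implementation in `__main__.py`:

```python
# Hypothetical sketch of the "1,3,5 or 'all'" selection parsing shown
# in the transcript above; names and behavior are assumptions.
def parse_selection(raw: str, items: list[str]) -> list[str]:
    """Turn '1,3,5' or 'all' into the chosen item names (1-based indices)."""
    raw = raw.strip().lower()
    if raw == "all":
        return list(items)
    indices = [int(part) for part in raw.split(",") if part.strip()]
    return [items[i - 1] for i in indices]

datasets = ["charge_ladder", "neime", "ranganathan"]
print(parse_selection("1,2", datasets))  # ['charge_ladder', 'neime']
```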

Afterwards, the user can run benchmarking as usual:

$ uv run dvc repro benchmark/supervised/local/dvc.yaml

The major file changes are all in __main__.py; the previously hard-coded dvc.yaml and params.yaml are now generated from configurable permutations, powered by Jinja templates.
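The template-driven generation can be sketched roughly as follows. The template text and stage layout here are assumptions for illustration, not the actual templates in this PR:

```python
# Rough sketch of rendering dvc.yaml stages from a dataset/model
# permutation with Jinja; the stage naming and cmd are assumptions.
from itertools import product

from jinja2 import Template

TEMPLATE = Template("""\
stages:
{%- for dataset, model in permutations %}
  benchmark_{{ model }}_{{ dataset }}:
    cmd: uv run pg2-benchmark run --model {{ model }} --dataset {{ dataset }}
{%- endfor %}
""")

datasets = ["charge_ladder", "neime"]
models = ["pls"]
print(TEMPLATE.render(permutations=list(product(datasets, models))))
```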

Checklist

  • I broke the PR down so that it contains a reasonable amount of changes for an effective review
  • I performed a self-review of my code. Amongst other things, I have commented my code in hard-to-understand areas.
  • I made corresponding changes to the documentation
  • I added tests that prove my fix is effective or that my feature works
  • I accounted for dependent changes to be merged and published in downstream modules

@tintinrevient
Contributor Author

tintinrevient commented Sep 17, 2025

@JCZuurmond it is ready for your review of the CLI. I will add the tests tomorrow.

The only major change is in __main__.py's select command.

@tintinrevient
Contributor Author

✅ Supervised models have all passed validation.

Metric,Value
Overall ACC,0.0
Overall RACCU,0.005050505050505051
Overall RACC,0.0
Kappa,0.0
Gwet AC1,-0.005076142131979714
Bennett S,-0.005076142131979696
Kappa Standard Error,0.0
Kappa Unbiased,-0.005076142131979696
Scott PI,-0.005076142131979696
Kappa No Prevalence,-1.0
Kappa 95% CI,"(0.0, 0.0)"
Standard Error,0.0
95% CI,"(0.0, 0.0)"
Chi-Squared,None
Phi-Squared,None
Cramer V,None
Response Entropy,6.62935662007962
Reference Entropy,6.62935662007962
Cross Entropy,0
Joint Entropy,6.62935662007962
Conditional Entropy,-0.0
Mutual Information,6.62935662007962
KL Divergence,None
Lambda B,1.0
Lambda A,1.0
Chi-Squared DF,38809
Overall J,"(0.0, 0.0)"
Hamming Loss,1.0
Zero-one Loss,99
NIR,0.010101010101010102
P-Value,1
Overall CEN,0.0
Overall MCEN,0.0
Overall MCC,0.0
RR,0.5
CBA,0.0
AUNU,None
AUNP,None
RCI,1.0
Pearson C,None
TPR Micro,0.0
TPR Macro,None
CSI,None
ARI,None
TNR Micro,0.9949238578680203
TNR Macro,0.9949494949494949
Bangdiwala B,None
Krippendorff Alpha,0.0
SOA1(Landis & Koch),Slight
SOA2(Fleiss),Poor
SOA3(Altman),Poor
SOA4(Cicchetti),Poor
SOA5(Cramer),None
SOA6(Matthews),Negligible
SOA7(Lambda A),Perfect
SOA8(Lambda B),Perfect
SOA9(Krippendorff Alpha),Low
SOA10(Pearson C),None
FPR Macro,0.005050505050505083
FNR Macro,None
PPV Macro,None
NPV Macro,0.9949494949494949
ACC Macro,0.98989898989899
F1 Macro,0.0
FPR Micro,0.005076142131979711
FNR Micro,1.0
PPV Micro,0.0
F1 Micro,0.0
NPV Micro,0.9949238578680203
Spearman,0.667965367965368

✅ Zero-shot models have all passed validation.

Metric,Value
Overall ACC,0.0
Overall RACCU,0.00010007998789141575
Overall RACC,0.0
Kappa,0.0
Gwet AC1,-0.00010009004298543876
Bennett S,-0.00010009008107296567
Kappa Standard Error,0.0
Kappa Unbiased,-0.00010009000489789399
Scott PI,-0.00010009000489789399
Kappa No Prevalence,-1.0
Kappa 95% CI,"(0.0, 0.0)"
Standard Error,0.0
95% CI,"(0.0, 0.0)"
Chi-Squared,None
Phi-Squared,None
Cramer V,None
Response Entropy,12.286557761608659
Reference Entropy,12.286549508613042
Cross Entropy,0
Joint Entropy,12.286549508613042
Conditional Entropy,-0.0
Mutual Information,12.286557761608659
KL Divergence,None
Lambda B,1.0
Lambda A,1.0
Chi-Squared DF,99820081
Overall J,"(0.0, 0.0)"
Hamming Loss,0.9999999999999999
Zero-one Loss,4996
NIR,0.00020016012810248197
P-Value,1
Overall CEN,0.0
Overall MCEN,0.0
Overall MCC,0.0
RR,0.5
CBA,0.0
AUNU,None
AUNP,None
RCI,1.0000006717097922
Pearson C,None
TPR Micro,0.0
TPR Macro,None
CSI,None
ARI,None
TNR Micro,0.999899909957026
TNR Macro,0.9998999199359487
Bangdiwala B,None
Krippendorff Alpha,7.616744806910965e-11
SOA1(Landis & Koch),Slight
SOA2(Fleiss),Poor
SOA3(Altman),Poor
SOA4(Cicchetti),Poor
SOA5(Cramer),None
SOA6(Matthews),Negligible
SOA7(Lambda A),Perfect
SOA8(Lambda B),Perfect
SOA9(Krippendorff Alpha),Low
SOA10(Pearson C),None
FPR Macro,0.00010008006405126668
FNR Macro,None
PPV Macro,None
NPV Macro,0.9998999200121163
ACC Macro,0.999799839948065
F1 Macro,0.0
FPR Micro,0.00010009004297395485
FNR Micro,1.0
PPV Micro,0.0
F1 Micro,0.0
NPV Micro,0.999899909957026
Spearman,

Contributor

@JCZuurmond JCZuurmond left a comment


It's going in the right direction. As discussed, we prefer to pull the listing capabilities out of pg2-benchmark (following the Unix philosophy). See below for example user stories:

# User story 1
# pipe datasets into a benchmark with a given set of models

pg2-dataset list --query ".kind = 'awesome'" | uv run dvc ../path/to/public_models.local.yml

# User story 2
# pipe models into a benchmark with a given set of datasets

yq ./models/**/*.md --query ".type = 'one_shot'" | uv run dvc ../path/to/most_popular_models.local.yml

# User story 3
# dynamically create benchmark

pg2-dataset list --query ".kind = 'awesome'" --format json > datasets.json
yq ./models/**/*.md --query ".type = 'one_shot'" --format json > models.json
uv run dvc ../path/to/benchmark.local.yml  # benchmarks points to datasets.json and models.json
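For user story 3, the benchmark side would only need to read the two JSON files and form the permutation itself. A hedged sketch, where the file names and shapes (flat JSON lists of names) are assumptions taken from the commands above:

```python
# Hypothetical glue for user story 3: read datasets.json and models.json
# (assumed to be flat JSON arrays of names) and build the permutation.
import json
from itertools import product
from pathlib import Path

def load_permutations(datasets_path: str, models_path: str) -> list[tuple[str, str]]:
    """Return all (dataset, model) pairs from the two JSON files."""
    datasets = json.loads(Path(datasets_path).read_text())
    models = json.loads(Path(models_path).read_text())
    return list(product(datasets, models))
```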

Contributor

@hredestig hredestig left a comment


Nice improvement!

"""Parameters file for benchmark configuration."""


class GameType(str, Enum):
Contributor


Why not StrEnum?

Contributor Author


Nice point, I'm going to change it.

"""Default location for model card files relative to model root directory."""


class DatasetPath:
Contributor


Looks like this should be configurable; use pydantic_settings?

Contributor Author


The DatasetsPath will be changed or removed in the future, since the uniform archive format is .pgdata.

I'm breaking this down into 3 PRs; one is ready: ProteinGym/proteingym-base#306. The purpose of this PR is to first list datasets in a local path (datasets might also exist in S3, GCP, a DVC registry, or elsewhere; we can extend it later).

So currently the goal is to list datasets in a local path in the .pgdata archive format, then export them in JSON or YAML format to prefill dvc.yaml, so that the local DVC benchmark can work.
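That listing-then-export flow could be sketched as below. The function names and directory layout are assumptions; only the .pgdata extension comes from the discussion above:

```python
# Sketch of listing local .pgdata archives and exporting JSON suitable
# for prefilling dvc.yaml; names and layout are assumptions.
import json
from pathlib import Path

def list_local_datasets(root: str) -> list[str]:
    """Return the names of all .pgdata archives under root (recursive)."""
    return sorted(p.stem for p in Path(root).glob("**/*.pgdata"))

def export_datasets_json(root: str, out_path: str) -> None:
    """Write the dataset names as a JSON array, ready to prefill dvc.yaml."""
    Path(out_path).write_text(json.dumps(list_local_datasets(root), indent=2))
```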

@tintinrevient
Contributor Author

> pipe datasets into a benchmark with a given set of models

The first one is here: ProteinGym/proteingym-base#306

@tintinrevient tintinrevient deleted the feat/dataset-model-permutation branch October 6, 2025 09:50

Development

Successfully merging this pull request may close these issues.

Generate dataset-model permutations
