Compare model performance across benchmarks using HELM-style win rate calculations. Win rates measure how often one model outperforms another on the same questions.
# Compute win rates from processed results
medarc-eval winrate
# See which models are available
medarc-eval winrate --list-models
# Specify directories
medarc-eval winrate --processed-dir runs/processed --output-dir runs/processed/winrate

Win rate computation requires processed parquet files with an env_index.json:
# If you haven't processed yet:
medarc-eval process

For each pair of models (A, B) on each benchmark:
- Average rollouts per `(example_id, model_id)`
- Compare questions where at least one model has a reward
- If one side is missing, fill it according to `--missing-policy` (`neg-inf` or `zero`)
- Count: A wins, B wins, ties
- Win rate = (A wins + 0.5 × ties) / total used questions
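The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names, the tuple-based row format, and the in-memory dictionaries are all hypothetical, and only the fill values, the tie tolerance, and the win rate formula come from this page.

```python
import math
from collections import defaultdict


def average_rollouts(rows):
    """Average rewards per (example_id, model_id).

    rows: iterable of (example_id, model_id, reward), where reward may be None.
    Pairs whose rollouts are all None are left out, i.e. treated as missing.
    """
    rewards = defaultdict(list)
    for example_id, model_id, reward in rows:
        if reward is not None:
            rewards[(example_id, model_id)].append(reward)
    return {key: sum(vals) / len(vals) for key, vals in rewards.items()}


def pairwise_winrate(scores, model_a, model_b, missing_policy="neg-inf", epsilon=1e-9):
    """Win rate of model_a against model_b on one benchmark."""
    fill = -math.inf if missing_policy == "neg-inf" else 0.0
    # Only questions where at least one of the two models has a reward are used.
    example_ids = {ex for ex, model in scores if model in (model_a, model_b)}
    a_wins = b_wins = ties = 0
    for ex in example_ids:
        a = scores.get((ex, model_a), fill)
        b = scores.get((ex, model_b), fill)
        if a == b or abs(a - b) <= epsilon:
            ties += 1
        elif a > b:
            a_wins += 1
        else:
            b_wins += 1
    total = a_wins + b_wins + ties
    return (a_wins + 0.5 * ties) / total if total else None


# Made-up rollouts: q1 is a tie after averaging, q2 is a win for A,
# and A has no reward at all on q3.
rows = [
    ("q1", "A", 1.0), ("q1", "A", 0.0), ("q1", "B", 0.5),
    ("q2", "A", 1.0), ("q2", "B", 0.0),
    ("q3", "B", 0.0),
]
scores = average_rollouts(rows)
winrate_neg_inf = pairwise_winrate(scores, "A", "B", missing_policy="neg-inf")
winrate_zero = pairwise_winrate(scores, "A", "B", missing_policy="zero")
```

With these rows, `neg-inf` turns the missing q3 answer into a loss for A, while `zero` turns it into a tie because B also scored 0 there, so the choice of policy changes the result.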
The final win rate aggregates across all benchmarks using configurable weighting.
Winrate also emits a missingness summary so partial dataset coverage is visible. The report counts missing
(dataset, model) pairs after rollout averaging, including both absent rows and null reward values.
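One possible reading of that summary, sketched in Python. The data layout, the function name, and the choice to count missing examples per `(dataset, model)` pair are assumptions made for illustration; the actual report format may differ.

```python
def missingness_summary(averaged, dataset_examples, models):
    """Count examples without a usable reward for each (dataset, model) pair.

    averaged: {(dataset, example_id, model_id): float or None}, after rollout averaging.
    dataset_examples: {dataset: set of example_ids}.
    An example counts as missing when its row is absent or its reward is None.
    """
    summary = {}
    for dataset, example_ids in dataset_examples.items():
        for model in models:
            missing = sum(
                1 for ex in example_ids
                if averaged.get((dataset, ex, model)) is None
            )
            summary[(dataset, model)] = {"missing": missing, "total": len(example_ids)}
    return summary


# Made-up data: model A has a null reward on q2, model B has no q2 row at all.
averaged = {
    ("medqa", "q1", "A"): 1.0,
    ("medqa", "q2", "A"): None,
    ("medqa", "q1", "B"): 0.5,
}
summary = missingness_summary(averaged, {"medqa": {"q1", "q2"}}, ["A", "B"])
```

Both models end up with one missing example out of two, which shows that null rewards and absent rows are counted the same way.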
runs/processed/winrate/
├── winrates-20260114T120000Z.json # Timestamped results
├── winrates-20260114T120000Z.csv # Spreadsheet-friendly
├── latest.json # Always points to newest
└── latest.csv
If you pass --output /path/to/file.json, winrate writes only that JSON file and skips latest.json plus all CSV outputs.
The JSON output includes:
- Per-model aggregate win rates
- Per-opponent `vs` breakdowns
- Per-dataset average rewards and question counts
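A small Python sketch for reading these fields back. It assumes the key names used in the example JSON output in this document (`models`, `mean_winrate`, `weighted_mean`); the helper names and the second model in the sample are hypothetical.

```python
import json
from pathlib import Path


def load_results(path):
    """Load a winrate JSON file, for example runs/processed/winrate/latest.json."""
    with Path(path).open() as f:
        return json.load(f)


def rank_models(results):
    """Return (model, weighted mean win rate) pairs, best first."""
    rows = [
        (name, entry["mean_winrate"]["weighted_mean"])
        for name, entry in results["models"].items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# In-memory sample so the sketch runs without any files on disk.
# "model-b" and its numbers are made up for illustration.
sample = {
    "models": {
        "model-b": {"mean_winrate": {"simple_mean": 0.28, "weighted_mean": 0.26, "n_datasets": 2}},
        "gpt-4o": {"mean_winrate": {"simple_mean": 0.72, "weighted_mean": 0.74, "n_datasets": 2}},
    }
}
ranking = rank_models(sample)
```

To use it on real output, replace `sample` with `load_results("runs/processed/winrate/latest.json")`.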
| Flag | Description |
|---|---|
| `--list-models` | Show available models and exit |
| `--include-model MODEL` | Only include specified models (repeatable) |
| `--exclude-model MODEL` | Exclude specified models (repeatable) |
| `--exclude-dataset DATASET` | Exclude specified datasets/env ids (repeatable) |

| Flag | Description | Default |
|---|---|---|
| `--missing-policy` | How to handle missing scores: `zero` or `neg-inf` | `neg-inf` |
| `--epsilon` | Tie tolerance (scores within epsilon are ties) | `1e-9` |
| `--min-common` | Minimum shared examples for valid comparison | `0` |

| Flag | Description | Default |
|---|---|---|
| `--weight-policy` | How to weight benchmarks: `equal`, `ln`, `sqrt`, `cap` | `ln` |
| `--weight-cap` | Maximum weight per benchmark (for `cap` policy) | `0` |
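The weighting policies can be pictured with a short Python sketch. The exact formulas are assumptions: this page only says that `ln` weights benchmarks by the log of dataset size, so the `sqrt` and `cap` branches, the function names, and the benchmark names and sizes below are hypothetical. How the tool treats the default `--weight-cap` of 0 is not stated here, so the sketch simply requires a positive cap.

```python
import math


def benchmark_weight(n_questions, policy="ln", cap=0):
    """Weight of one benchmark given its question count (assumed formulas)."""
    if policy == "equal":
        return 1.0
    if policy == "ln":
        return math.log(n_questions)
    if policy == "sqrt":
        return math.sqrt(n_questions)
    if policy == "cap":
        if cap <= 0:
            raise ValueError("this sketch needs a positive cap for the cap policy")
        return float(min(n_questions, cap))
    raise ValueError(f"unknown weight policy: {policy}")


def weighted_mean(winrates, sizes, policy="ln", cap=0):
    """Aggregate per-benchmark win rates into one number."""
    weights = {name: benchmark_weight(sizes[name], policy, cap) for name in winrates}
    total = sum(weights.values())
    if total == 0:
        return None
    return sum(winrates[name] * weights[name] for name in winrates) / total


# Hypothetical benchmarks: a large one with a high win rate, a small one with a low one.
winrates = {"bench_a": 0.9, "bench_b": 0.5}
sizes = {"bench_a": 1000, "bench_b": 100}
equal_mean = weighted_mean(winrates, sizes, policy="equal")
ln_mean = weighted_mean(winrates, sizes, policy="ln")
cap_mean = weighted_mean(winrates, sizes, policy="cap", cap=500)
```

Under these assumptions the larger benchmark pulls the aggregate toward its own win rate, mildly with `ln` and more strongly with `cap`.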

| Flag | Description | Default |
|---|---|---|
| `--dataset-coverage all-models` | Enforce intersection of datasets across the compared models | `all-models` |
| `--dataset-coverage per-model` | Legacy behavior (each model may be averaged over different datasets) | |

| Flag | Description |
|---|---|
| `--partial-datasets strict` | When `--include-model` is set, drop datasets missing any included model |
| `--partial-datasets include` | When `--include-model` is set, keep datasets and treat missing models as all-missing |

--partial-datasets include is usually paired with --dataset-coverage per-model. With the default all-models coverage, datasets missing any required model are still dropped later.
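The difference between the two coverage modes comes down to a set intersection. The sketch below is illustrative only: the function name and the `coverage` mapping are hypothetical, and it models just the dataset selection, not the rest of the pipeline.

```python
def select_datasets(coverage, models, policy="all-models"):
    """Pick the datasets each model is averaged over.

    coverage: {model: set of datasets that model has results for}.
    """
    if policy == "all-models":
        # Every model is restricted to the datasets that all models share.
        common = set.intersection(*(coverage[m] for m in models)) if models else set()
        return {m: set(common) for m in models}
    if policy == "per-model":
        # Each model keeps its own datasets, so averages may not be comparable.
        return {m: set(coverage[m]) for m in models}
    raise ValueError(f"unknown coverage policy: {policy}")


coverage = {
    "gpt-4o": {"medqa", "pubmedqa"},
    "gpt-4o-mini": {"medqa"},
}
models = ["gpt-4o", "gpt-4o-mini"]
strict = select_datasets(coverage, models, policy="all-models")
legacy = select_datasets(coverage, models, policy="per-model")
```

Here `all-models` leaves both models with `medqa` only, which is why a dataset kept by `--partial-datasets include` can still disappear under the default coverage.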
# process-config.yaml
runs_dir: runs/raw
process:
  dir: processed
winrate:
  dir: winrate
  missing_policy: neg-inf
  epsilon: 1.0e-9
  min_common: 10
  weight_policy: ln
  exclude_model:
    - baseline-model
    - deprecated-v1
  exclude_datasets:
    - med_dialog

medarc-eval winrate --config process-config.yaml

Supported config schema for medarc-eval winrate:
- Top-level `process:` can provide `dir` or `output_dir`; this becomes the default `processed_dir`.
- Top-level `winrate:` provides winrate-specific defaults.
- Top-level `hf:` provides shared HF settings. Use `hf.winrate_dir` to control where winrate artifacts upload inside the repo.

# Only compare these two models
medarc-eval winrate \
  --include-model gpt-4o \
  --include-model claude-3-5-sonnet

medarc-eval winrate --exclude-model random-baseline

Only use datasets where all compared models have results:
medarc-eval winrate \
  --include-model gpt-4o \
  --include-model gpt-4o-mini \
  --dataset-coverage all-models

Weight benchmarks by log of dataset size (larger benchmarks count more):
medarc-eval winrate --weight-policy ln

medarc-eval winrate \
  --hf-repo your-org/processed-benchmarks \
  --hf-processed-pull \
  --hf-token $HF_TOKEN

medarc-eval winrate \
  --hf-repo your-org/processed-benchmarks \
  --hf-winrate-dir winrate \
  --hf-token $HF_TOKEN \
  --hf-private

# process-config.yaml
runs_dir: runs/raw
process:
  dir: processed
winrate:
  dir: winrate
  missing_policy: neg-inf
  weight_policy: ln
hf:
  repo: your-org/processed-data  # Pull processed from here; upload winrate here
  winrate_dir: winrate  # Subdirectory in repo for winrate artifacts (default: winrate)
  branch: main
  token: ${HF_TOKEN}
  private: true

hf.token accepts either a literal token string or an environment reference like $HF_TOKEN / ${HF_TOKEN}.
hf.winrate_dir and --hf-winrate-dir both set the path inside the HF repo where latest.json, latest.csv, and timestamped winrate outputs are uploaded.
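The two accepted forms of `hf.token` can be told apart with a few string checks. The sketch below is a guess at the behavior described above, not the tool's own resolver: the function name is hypothetical, and returning `None` for an unset variable is an assumption.

```python
import os


def resolve_token(value, env=None):
    """Return the token for a literal value or a $NAME / ${NAME} reference."""
    env = os.environ if env is None else env
    if value.startswith("${") and value.endswith("}"):
        name = value[2:-1]
    elif value.startswith("$"):
        name = value[1:]
    else:
        # Anything else is treated as a literal token string.
        return value
    # Assumption: an unset variable resolves to None rather than raising.
    return env.get(name)


# A fake environment keeps the example independent of the real one.
fake_env = {"HF_TOKEN": "example-token"}
braced = resolve_token("${HF_TOKEN}", fake_env)
plain = resolve_token("$HF_TOKEN", fake_env)
literal = resolve_token("literal-token-string", fake_env)
unset = resolve_token("$NOT_SET", fake_env)
```

Keeping the token as an environment reference in the config file avoids committing the secret itself.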
| model | weighted_winrate | simple_winrate | medqa | pubmedqa | num_datasets |
|---|---|---|---|---|---|
| gpt-4o | 0.72 | 0.70 | 0.84 | 0.77 | 2 |
| gpt-4o-mini | 0.45 | 0.43 | 0.61 | 0.39 | 2 |
- `weighted_winrate` / `simple_winrate`: Aggregate mean winrate across retained datasets
- Dataset columns: Average reward on that dataset, not pairwise winrate columns
- `num_datasets`: Number of datasets retained for that model after filtering/coverage rules

{
  "models": {
    "gpt-4o": {
      "mean_winrate": {
        "simple_mean": 0.72,
        "weighted_mean": 0.74,
        "n_datasets": 2
      },
      "vs": {
        "gpt-4o-mini": {
          "mean_winrate": {
            "simple_mean": 0.85,
            "weighted_mean": 0.84
          },
          "per_dataset": {
            "medqa": 0.90,
            "pubmedqa": 0.80
          },
          "n_datasets": 2
        }
      },
      "avg_reward_per_dataset": {
        "medqa": 0.84,
        "pubmedqa": 0.77
      }
    }
  },
  "datasets": {
    "medqa": {
      "avg_reward_per_model": {
        "gpt-4o": 0.84,
        "gpt-4o-mini": 0.61
      },
      "n_questions": 1273
    }
  }
}

- Ensure `medarc-eval process` has been run
- Check that `env_index.json` exists in `--processed-dir`

- Check `--min-common` isn't filtering out comparisons
- Review `--missing-policy` (use `neg-inf` to penalize missing answers)
- Verify models were evaluated on the same benchmark variants
- If using `--partial-datasets include`, also consider `--dataset-coverage per-model`