Convert raw benchmark outputs into analysis-ready parquet files. This step prepares data for win rate computation and other analyses.
# Process all completed jobs (uses defaults)
medarc-eval process
# Specify directories explicitly
medarc-eval process --runs-dir runs/raw --output-dir runs/processed
# Preview what would be processed
medarc-eval process --dry-run

- Discovers jobs in runs/raw/ and filters by manifest status (default: completed)
- Extracts results from each job's output files
- Normalizes data into a fixed output schema
- Writes parquet files organized by model and environment
- Creates an index (env_index.json) for downstream tools
runs/processed/
├── env_index.json          # Dataset inventory for winrate/analysis
├── gpt-4o/
│   ├── medqa.parquet
│   └── pubmedqa.parquet
├── gpt-4o-mini/
│   ├── medqa.parquet
│   └── pubmedqa.parquet
└── ...
On-disk model and env path components are slugified, so filenames may not exactly match raw ids.
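As a rough illustration of the layout above, the following sketch walks a processed directory and lists the (model, environment) slugs it finds. The helper name `list_processed_outputs` is hypothetical and not part of medarc-eval; it assumes only the `<output_dir>/<model>/<env>.parquet` layout shown in the tree.

```python
from pathlib import Path


def list_processed_outputs(output_dir):
    """Return sorted (model_slug, env_slug) pairs found under output_dir.

    Assumes the layout <output_dir>/<model>/<env>.parquet; this helper is
    illustrative and not part of medarc-eval.
    """
    root = Path(output_dir)
    pairs = []
    for parquet_path in root.glob("*/*.parquet"):
        # The directory name is the model slug; the file stem is the env slug.
        pairs.append((parquet_path.parent.name, parquet_path.stem))
    return sorted(pairs)
```

Because the path components are slugified, the names returned here are on-disk slugs and may differ from the raw model and environment ids.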
| Flag | Description | Default |
|---|---|---|
| --runs-dir PATH | Directory containing raw runs | runs/raw |
| --output-dir PATH | Where to write processed files | runs/processed |
| --max-workers N | Parallel worker processes | 4 |
| --dry-run | Show what would be processed | - |
| --yes | Skip confirmation prompts | - |
| --exclude-dataset NAME | Skip processing specific datasets/env ids (repeatable) | - |
| --exclude-model MODEL | Skip processing specific model ids (repeatable) | - |
By default, medarc-eval process only selects jobs whose manifest status is completed.
Note: successful jobs are written to run_manifest.json with status: completed.
To override that default, pass one or more explicit status filters:
medarc-eval process --status completed --status failed

You can also gate partially complete outputs by missing results.jsonl rows:
# Default tolerance is 2.5 percent missing
medarc-eval process --max-results-missing-pct 2.5
# Effectively disable the gate
medarc-eval process --max-results-missing-pct 100

This gate uses manifest job metadata only:
- expected_rows = num_examples * rollouts_per_example
- observed_rows = row_count
It is computed per selected job record and enforced only on the latest selected run for each processed model/environment output. It does not use manifest summary.completed / summary.total, and it does not fall back to older runs if the latest one is too incomplete.
Selected records with missing results.jsonl fail processing immediately.
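The gate arithmetic can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, the field names come from the manifest fields named in this document, and treating a non-positive expected row count as "unknown" is an assumption.

```python
def missing_pct(num_examples, rollouts_per_example, row_count):
    """Percent of expected results.jsonl rows that are missing, or None if unknown."""
    if num_examples is None or rollouts_per_example is None or row_count is None:
        return None  # unknown expected or observed rows
    expected_rows = num_examples * rollouts_per_example
    if expected_rows <= 0:
        return None  # assumption: nothing expected means nothing to measure
    observed_rows = row_count
    missing = max(expected_rows - observed_rows, 0)
    return 100.0 * missing / expected_rows


def passes_gate(job, max_results_missing_pct=2.5):
    """Return True when a job record is within the missing-rows tolerance."""
    pct = missing_pct(
        job.get("num_examples"),
        job.get("rollouts_per_example"),
        job.get("row_count"),
    )
    # Records whose percentage cannot be computed are not gated.
    return pct is None or pct <= max_results_missing_pct
```

For example, a job with 1000 examples and 2 rollouts per example expects 2000 rows; with 1960 rows observed, 2.0 percent is missing and the job passes the default 2.5 percent tolerance.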
When multiple runs exist for the same (model, environment) pair, processing uses the latest by default.
Delete all processed outputs and rebuild from scratch:
# Interactive confirmation
medarc-eval process --clean
# Non-interactive (for scripts)
medarc-eval process --clean --yes

Store common options in a YAML file:
# process-config.yaml
runs_dir: runs/raw
process:
  dir: processed
  max_workers: 8
  max_results_missing_pct: 2.5
  exclude_datasets:
    - med_dialog
  exclude_models:
    - deprecated-v1
winrate:
  enabled: true
  dir: winrate

medarc-eval process --config process-config.yaml

CLI flags override config values.
Supported config schema for medarc-eval process:
- Top-level runs_dir: raw run root.
- Top-level process: process-specific defaults.
- Optional top-level winrate: embedded post-process winrate step.
- Optional top-level hf: shared HF settings. For embedded winrate uploads, use hf.winrate_dir.
Path shortcuts:
- process.dir is shorthand for process.output_dir, resolved relative to the parent of runs_dir.
- winrate.dir is shorthand for the embedded winrate output directory, resolved under the processed output dir.
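The two path shortcuts can be illustrated with a small sketch. The function `resolve_dirs` is hypothetical and only mirrors the resolution rules stated above; how medarc-eval handles absolute paths or other edge cases is not covered here.

```python
from pathlib import Path


def resolve_dirs(runs_dir, process_dir, winrate_dir=None):
    """Resolve the process.dir and winrate.dir shortcuts (illustrative only)."""
    runs_path = Path(runs_dir)
    # process.dir is resolved relative to the parent of runs_dir.
    output_dir = runs_path.parent / process_dir
    # winrate.dir is resolved under the processed output dir.
    winrate_out = output_dir / winrate_dir if winrate_dir else None
    return output_dir, winrate_out
```

With runs_dir set to runs/raw, process.dir set to processed, and winrate.dir set to scorecards, this yields runs/processed and runs/processed/scorecards.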
Example:
runs_dir: runs/raw
process:
  dir: processed
  max_workers: 8
winrate:
  dir: scorecards
hf:
  repo: your-org/medical-benchmarks
  winrate_dir: scorecards/latest

Sync processed datasets to/from the Hugging Face Hub:
# process-config.yaml
runs_dir: runs/raw
process:
  dir: processed
hf:
  repo: your-org/medical-benchmarks
  branch: main
  token: ${HF_TOKEN}
  private: true

hf.token accepts either a literal token string or an environment reference like $HF_TOKEN / ${HF_TOKEN}.
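One way such a token value could be interpreted is sketched below. This is an assumption about the behavior, not the tool's implementation: `resolve_token` is a hypothetical helper, and returning None for an unset variable is a choice made for the sketch.

```python
import os
import re

# Matches a value that is entirely $NAME or ${NAME}.
_ENV_REF = re.compile(r"^\$(?:\{(\w+)\}|(\w+))$")


def resolve_token(value, env=None):
    """Return the token for a literal string or a $VAR / ${VAR} reference."""
    env = os.environ if env is None else env
    match = _ENV_REF.match(value)
    if match is None:
        return value  # treated as a literal token string
    name = match.group(1) or match.group(2)
    # Assumption: an unset variable resolves to None rather than raising.
    return env.get(name)
```

Keeping the token as an environment reference avoids committing a secret to the config file.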
# Prompt before pulling
medarc-eval process --hf-repo your-org/data --hf-pull-policy prompt
# Always pull existing data first
medarc-eval process --hf-repo your-org/data --hf-pull-policy pull
# Start fresh (ignore remote)
medarc-eval process --hf-repo your-org/data --hf-pull-policy clean
# Resume a previously failed HF upload without pulling or cleaning
medarc-eval process --hf-repo your-org/data --hf-pull-policy continue-upload

prompt only prompts when the local processed dir is already non-empty. If the output dir is empty, process pulls the HF baseline immediately.
When prompt is used with a non-empty local processed dir, the menu may show:
- pull: download missing baseline data without deleting local files
- clean: redownload everything after deleting local files
- upload: keep local processed outputs and resume/upload pending HF artifacts
upload is shown only when local parquet files appear to be missing remotely or have a different remote lfs.sha256. Recovery uploads the union of:
- parquet files that were already pending before the current run started
- files touched by the current process run, including env_index.json and dataset_infos.json when rewritten
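The pending-file check and the recovery union described above might look roughly like this. All names are hypothetical, and `remote_sha256` stands in for whatever mapping of repo-relative path to remote lfs.sha256 the tool obtains from the Hub; how that mapping is fetched is outside this sketch.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pending_uploads(local_root, remote_sha256):
    """Local parquet files that are missing remotely or differ by hash.

    remote_sha256 is a hypothetical dict: relative path -> remote lfs.sha256.
    """
    root = Path(local_root)
    pending = set()
    for path in root.rglob("*.parquet"):
        rel = path.relative_to(root).as_posix()
        if remote_sha256.get(rel) != sha256_of(path):
            pending.add(rel)
    return pending


def recovery_upload_set(pending_before_run, touched_this_run):
    """Union of files pending before the run and files touched by the run."""
    return set(pending_before_run) | set(touched_this_run)
```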
When --hf-repo is set, processed files are automatically uploaded after completion.
Process and compute win rates in one step:
medarc-eval process --config process-config.yaml

This runs medarc-eval winrate automatically after processing completes when the config contains a winrate: section.
# 1. Run benchmarks
medarc-eval bench --config my-eval.yaml
# 2. Process results
medarc-eval process
# 3. Compute win rates
medarc-eval winrate

# Non-interactive processing with cleanup
medarc-eval process \
  --runs-dir ./benchmark-outputs \
  --output-dir ./processed \
  --clean \
  --yes \
  --max-workers 16

# Process only new runs (default behavior)
medarc-eval process
# env_index.json tracks what's already processed

Incremental skipping only reuses an existing parquet when its footer metadata source_runs still matches the newly selected run ids and the existing row count still matches env_index.json.
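The reuse condition can be expressed as a small predicate. This is a sketch with hypothetical names; comparing run ids as unordered sets is an assumption, since the document does not say whether order matters.

```python
def can_reuse_existing(existing_source_runs, selected_run_ids,
                       parquet_row_count, index_row_count):
    """Return True when an existing parquet may be skipped instead of rebuilt.

    existing_source_runs: run ids from the parquet footer metadata (source_runs).
    selected_run_ids: run ids selected by the current process invocation.
    parquet_row_count / index_row_count: rows in the file vs. env_index.json.
    """
    # Assumption: run ids are compared without regard to order.
    same_runs = set(existing_source_runs) == set(selected_run_ids)
    same_rows = parquet_row_count == index_row_count
    return same_runs and same_rows
```

Both conditions must hold; a matching set of run ids alone is not enough if the row counts disagree.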
Rebuild existing outputs for specific models or datasets without using --clean:
# Rebuild every processed dataset for one model
medarc-eval process --replace-model gpt-4o
# Rebuild every model for one dataset
medarc-eval process --replace-env medqa
# Rebuild only the intersection
medarc-eval process --replace-model gpt-4o --replace-env medqa

When both flags are present, processing only rebuilds outputs that match both filters.
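The matching rule for the replace flags can be sketched as below. The function is hypothetical, and accepting several models or environments per call is an assumption made for the sketch.

```python
def should_rebuild(model, env, replace_models=(), replace_envs=()):
    """Decide whether a (model, env) output is rebuilt by the replace filters."""
    if not replace_models and not replace_envs:
        return False  # no replace filter given: nothing is force-rebuilt
    # A filter that is not supplied matches everything.
    model_ok = not replace_models or model in replace_models
    env_ok = not replace_envs or env in replace_envs
    # When both filters are supplied, only the intersection is rebuilt.
    return model_ok and env_ok
```

With both filters set to gpt-4o and medqa, only the gpt-4o/medqa output matches; gpt-4o/pubmedqa and gpt-4o-mini/medqa are left alone.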
If no jobs are selected for processing, check the following:
- --runs-dir points to the correct location.
- Runs have completed (check jobs[*].status in run_manifest.json).
- To include non-completed jobs, pass --status pending or --status running.
By default, only jobs with completed status are included. In addition, the --max-results-missing-pct gate fails processing if a selected latest job record is missing more than 2.5% of its expected results.jsonl rows. The gate uses these manifest job fields:
- row_count
- num_examples
- rollouts_per_example
The gate is per selected record, not per whole run manifest. If the latest selected run for a model/dataset is too incomplete, processing fails fast instead of silently falling back to an older run. Records with unknown expected rows or unknown row_count are not gated.
Use --max-results-missing-pct 100 to disable the gate, or pass explicit --status values to include other statuses.
If processing stops with an error like:
Existing processed output ... has N parquet rows but env_index.json records M.
the local processed snapshot is inconsistent. Fix it by rebuilding the affected output:
medarc-eval process --replace-model gpt-4o --replace-env medqa

Or rebuild everything:
medarc-eval process --clean --yes

After processing, compute win rates to compare model performance.