LEVANTE comparison (R)

Statistical comparison of model outputs to human response data using IRT-derived ability bins (not raw age bins). The model runs once per item_uid; human response proportions are pre-aggregated by item_uid and by 1-logit ability bins derived from fitted IRT models.

Dependencies

R with: tidyverse, philentropy, nloptr, reticulate (for .npy), jsonlite (for asset index)
Install: install.packages(c("tidyverse", "philentropy", "nloptr", "reticulate", "jsonlite"))

Scripts

stats-helper.R – Softmax, KL, beta optimization, RSA (adapted from DevBench).
compare_levante.R – Reads data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv (item_uid, ability_bin, image1..image4) and results/<version>/<model>/<task>.npy (one row per item_uid). Joins IRT item d parameters from data/responses/<version>/irt_models/<task>_item_params.csv. Writes D_KL (per item_uid × ability_bin) and accuracy (per item_uid, with the IRT d column named difficulty) to separate CSVs.

IRT model mapping

The file src/levante_bench/config/irt_model_mapping.csv maps each task to its IRT model .rds file in the Redivis model registry. Columns: task_id, model_file. The download script reads this to know which .rds to fetch. Add rows manually for new tasks.
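
Before running the download, it can help to confirm that every task you plan to compare has a mapping row. The following is a minimal Python sketch, not part of the repository: the helper name `tasks_missing_from_mapping` and the example file names are hypothetical, and only the two column names (task_id, model_file) come from the description above.

```python
import csv
import io


def tasks_missing_from_mapping(mapping_csv_text, tasks):
    """Return the tasks that have no usable row in the mapping CSV text.

    A row counts as usable only if its model_file cell is non-empty.
    """
    reader = csv.DictReader(io.StringIO(mapping_csv_text))
    mapped = {
        (row.get("task_id") or "").strip()
        for row in reader
        if (row.get("model_file") or "").strip()
    }
    return [task for task in tasks if task not in mapped]


# Illustrative content only; these .rds names are made up.
example = "task_id,model_file\ntrog,trog_example.rds\nvocab,\n"
print(tasks_missing_from_mapping(example, ["trog", "vocab", "matrix"]))
```

To check the real file, pass the text of src/levante_bench/config/irt_model_mapping.csv (for example via `pathlib.Path(...).read_text()`) instead of the inline example.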

Usage

Preprocess human data (once): Run the R download script to fetch trials, download IRT models, extract item difficulties and ability scores, and write ability-binned human proportions:
```
Rscript scripts/download_levante_data.R [--version 2026-02-22] \
  [--irt-dataset levante_metadata_scoring:e97h:v1_11] \
  [--irt-table model_registry:rqwv]
```
This produces:
- data/responses/<version>/irt_models/<task>.rds – downloaded IRT model
- data/responses/<version>/irt_models/<task>_item_params.csv – item d parameters (item_uid, difficulty; higher values are empirically easier in the current exports)
- data/responses/<version>/irt_models/<task>_ability_scores.csv – person abilities (run_id, ability, se)
- data/responses/<version>/responses_by_ability/<task>_proportions.csv – overall response proportions (item_uid, image1..image4)
- data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv – ability-binned response proportions (item_uid, ability_bin, image1..image4), with 1-logit bins
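
To make the ability-binned aggregation concrete, here is a small Python sketch of the idea. It is an illustration under assumptions, not the R download script: the function name, the tuple layout of the trials, and the use of the bin's lower edge as its label are all hypothetical. Only the 1-logit bin width and the image1..image4 option columns come from the description above.

```python
import math
from collections import Counter, defaultdict

OPTIONS = ["image1", "image2", "image3", "image4"]


def proportions_by_ability(trials, abilities, bin_width=1.0):
    """Aggregate response proportions per (item_uid, ability bin).

    trials: iterable of (run_id, item_uid, chosen_option) tuples.
    abilities: dict mapping run_id to an IRT ability estimate in logits.
    Returns {(item_uid, bin_lower_edge): {option: proportion}}.
    """
    counts = defaultdict(Counter)
    for run_id, item_uid, choice in trials:
        if run_id not in abilities:
            continue  # runs without an ability score are skipped in this sketch
        lower_edge = math.floor(abilities[run_id] / bin_width) * bin_width
        counts[(item_uid, lower_edge)][choice] += 1

    result = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        result[key] = {opt: counter[opt] / total for opt in OPTIONS}
    return result


# Made-up data: three runs answering one item.
abilities = {"r1": 0.2, "r2": 0.9, "r3": -0.4}
trials = [
    ("r1", "item_a", "image1"),
    ("r2", "item_a", "image2"),
    ("r3", "item_a", "image1"),
]
binned = proportions_by_ability(trials, abilities)
for key in sorted(binned):
    print(key, binned[key])
```

Runs r1 and r2 fall in the bin starting at 0 and r3 in the bin starting at -1, so the same item gets a separate proportion vector for each bin.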

Run evaluation (Python), which produces one row per item_uid:
```
levante-bench run-eval --task trog --model clip_base --version 2026-02-22
```

Run comparison:
```
levante-bench run-comparison --task trog --model clip_base --version 2026-02-22 [--output-dir results/comparison]
```
Or directly: Rscript comparison/compare_levante.R --task trog --model clip_base --version 2026-02-22 --project-root .

Outputs (disaggregated):
- D_KL: results/comparison/<task>_<model>_d_kl.csv — columns: task, model, item_uid, ability_bin, D_KL.
- Accuracy: results/comparison/<task>_<model>_accuracy.csv — columns: task, model, item_uid, correct (0/1), difficulty.
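
The two CSVs are easy to summarize with a few lines of Python. The sketch below assumes only the column names listed above; the helper names and the inline example rows are hypothetical, and ability_bin is treated as an opaque string label because its exact format is not specified here.

```python
import csv
import io
from collections import defaultdict


def mean_dkl_by_bin(d_kl_csv_text):
    """Mean D_KL per ability_bin from the text of a *_d_kl.csv file."""
    sums = defaultdict(lambda: [0.0, 0])  # bin label -> [running sum, count]
    for row in csv.DictReader(io.StringIO(d_kl_csv_text)):
        acc = sums[row["ability_bin"]]
        acc[0] += float(row["D_KL"])
        acc[1] += 1
    return {label: total / n for label, (total, n) in sums.items()}


def overall_accuracy(accuracy_csv_text):
    """Fraction of items with correct == 1 in a *_accuracy.csv file."""
    rows = list(csv.DictReader(io.StringIO(accuracy_csv_text)))
    return sum(int(r["correct"]) for r in rows) / len(rows)


# Made-up rows that follow the documented column layout.
d_kl_text = (
    "task,model,item_uid,ability_bin,D_KL\n"
    "trog,clip_base,a,0,0.5\n"
    "trog,clip_base,b,0,1.5\n"
    "trog,clip_base,a,1,0.25\n"
)
accuracy_text = (
    "task,model,item_uid,correct,difficulty\n"
    "trog,clip_base,a,1,0.8\n"
    "trog,clip_base,b,0,-1.2\n"
    "trog,clip_base,c,1,0.1\n"
    "trog,clip_base,d,1,1.4\n"
)
print(mean_dkl_by_bin(d_kl_text))
print(overall_accuracy(accuracy_text))
```

For real runs, read the files under results/comparison/ (or your --output-dir) and pass their text to the same functions.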

Optional: run full validation and benchmark smoke checks before comparison:
```
scripts/validate_all.sh
```

Optional: validate R/Redivis dependencies and comparison flow:
```
scripts/validate_r.sh --check-packages-only
scripts/validate_r.sh --run-comparison-smoke --task trog --model clip_base --version 2026-03-24
```

Optional: inspect benchmark/prompt run history and metric deltas
```
python3 scripts/list_benchmark_results.py --limit 20
```

Debugging the comparison flow

- Use one version everywhere: pass the same --version to the R download (trials + IRT → responses_by_ability), Python run-eval, and run-comparison.
- Responses-by-ability must exist: run Rscript scripts/download_levante_data.R [--version VERSION] so that data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv exists. The download script joins trials with IRT ability scores (from @scores), bins them into 1-logit-wide ability bins, and aggregates response proportions by item_uid and ability_bin.
- IRT model mapping must be populated: ensure src/levante_bench/config/irt_model_mapping.csv has a row for each task you want to compare. Without a row, the task's IRT model is not downloaded and ability binning falls back to an "all" aggregate.
- Run evaluation (Python): levante-bench run-eval --task <TASK> --model <MODEL> --version <VERSION>. The loader deduplicates by item_uid, so the .npy has one row per item_uid.
- Run comparison (R): levante-bench run-comparison --task <TASK> --model <MODEL> --version <VERSION>. Writes the D_KL and accuracy CSVs to --output-dir (default: results/comparison/).
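
Most failures in this flow come down to a missing input file or a version mismatch, so a quick existence check can save a run. The sketch below is a hypothetical helper, not a script in the repository; it only assembles the three input paths named in this README and reports which are absent.

```python
from pathlib import Path


def expected_inputs(project_root, version, task, model):
    """Paths the comparison step reads, following the layout in this README."""
    root = Path(project_root)
    responses = root / "data" / "responses" / version
    return {
        "human proportions by ability": responses
        / "responses_by_ability"
        / f"{task}_proportions_by_ability.csv",
        "IRT item parameters": responses / "irt_models" / f"{task}_item_params.csv",
        "model outputs": root / "results" / version / model / f"{task}.npy",
    }


def missing_inputs(project_root, version, task, model):
    """Names of the expected inputs that do not exist on disk."""
    paths = expected_inputs(project_root, version, task, model)
    return [name for name, path in paths.items() if not path.exists()]


if __name__ == "__main__":
    for name in missing_inputs(".", "2026-02-22", "trog", "clip_base"):
        print("missing:", name)
```

If the model outputs are missing but the human data is present, the usual cause is that run-eval was run with a different --version than the download.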

Sanity-checking the comparison

- Item_uid alignment: the loader deduplicates by item_uid, so the model runs once per item and the .npy has one row per item_uid. The comparison aligns by item_uid (the row order in the .npy follows the item order from the trials).
- Accuracy: one row per item_uid. correct = 1 if the model argmax (after softmax with the fitted beta) equals the correct option, else 0. The difficulty column comes from the IRT model's d parameter. In the current exports this parameter is easiness-oriented: higher values correlate with higher human item accuracy. For 4 options, chance = 0.25. A positive correlation between difficulty and correct indicates that the model is more accurate on empirically easier items.
- D_KL: one row per (item_uid, ability_bin), holding KL(human proportions || model softmax) for that ability bin and item. Beta is fitted once to minimize mean D_KL across all (item_uid, ability_bin) pairs. Use the disaggregated D_KL CSV for per-ability or per-item analysis.
- IRT d correlation: the comparison script reports the correlation between correct (0/1) and difficulty (the IRT d parameter). Positive values mean the model is more likely to answer items with higher d correctly; validate the sign convention with python scripts/analysis/validate_sign_conventions.py.
- Spot-check: inspect a few item_uids. In the accuracy CSV, check that correct matches your expectation; in the D_KL CSV, compare D_KL across ability bins or items.
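
For spot-checking individual values by hand, the quantities above can be reproduced in a few lines of Python. This is a rough sketch under assumptions, not the code in stats-helper.R: the function names are hypothetical, KL is computed in nats, and the beta fit uses a plain grid search as a stand-in for whatever optimizer the R code uses.

```python
import math


def softmax(scores, beta):
    """Softmax over beta-scaled scores, shifted by the max for stability."""
    scaled = [beta * s for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_dkl(beta, pairs):
    """pairs: list of (human_proportions, model_scores), one per (item_uid, ability_bin)."""
    return sum(kl_divergence(h, softmax(z, beta)) for h, z in pairs) / len(pairs)


def fit_beta(pairs, grid=None):
    """Grid-search stand-in: the single beta that minimizes mean D_KL."""
    grid = grid or [0.1 * k for k in range(1, 201)]  # 0.1, 0.2, ..., 20.0
    return min(grid, key=lambda b: mean_dkl(b, pairs))


def is_correct(scores, beta, correct_index):
    """1 if the most probable option is the correct one, else 0."""
    probs = softmax(scores, beta)
    return int(max(range(len(probs)), key=probs.__getitem__) == correct_index)


# Made-up example: one item, one ability bin, four options.
pairs = [([0.7, 0.1, 0.1, 0.1], [2.0, 0.0, 0.0, 0.0])]
beta = fit_beta(pairs)
print(beta, mean_dkl(beta, pairs), is_correct(pairs[0][1], beta, 0))
```

Note that for any positive beta the argmax of the softmax equals the argmax of the raw scores, so beta affects D_KL but not the correct column.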

Vocab Graphics Bundle

Vocab quadrant graphics live in the shared additional images bundle, downloaded by scripts/data_prep/download_levante_assets.py.

Graphics directory: data/assets/additional_images/vocab_graphics/images/
Placement manifest: data/assets/additional_images/vocab_graphics/vocab-quadrants-manifest.csv
Summary stats: data/assets/additional_images/vocab_graphics/vocab-quadrants-summary.json

Regenerate with:

```
python3 scripts/build_vocab_quadrant_graphics.py \
  --corpus-csv data/assets/2026-03-24/corpus/vocab/vocab-item-bank.csv \
  --visual-dir data/assets/2026-03-24/visual/vocab \
  --out-dir data/assets/additional_images/vocab_graphics/images \
  --manifest-csv data/assets/additional_images/vocab_graphics/vocab-quadrants-manifest.csv \
  --summary-json data/assets/additional_images/vocab_graphics/vocab-quadrants-summary.json \
  --seed 11
```
