Statistical comparison of model outputs to human response data using IRT-derived ability bins (not raw age bins). Model runs once per item_uid; human response proportions are pre-aggregated by item_uid and 1-logit ability bins from fitted IRT models.
- R with: tidyverse, philentropy, nloptr, reticulate (for .npy), jsonlite (for asset index)
- Install: install.packages(c("tidyverse", "philentropy", "nloptr", "reticulate", "jsonlite"))
- stats-helper.R – Softmax, KL, beta optimization, RSA (adapted from DevBench).
- compare_levante.R – Reads data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv (item_uid, ability_bin, image1..image4) and results/<version>/<model>/<task>.npy (one row per item_uid). Joins IRT item d parameters from data/responses/<version>/irt_models/<task>_item_params.csv. Writes D_KL (per item_uid × ability_bin) and accuracy (per item_uid, with the IRT d column named difficulty) to separate CSVs.
The file src/levante_bench/config/irt_model_mapping.csv maps each task to its IRT model .rds file in the Redivis model registry. Columns: task_id, model_file. The download script reads this to know which .rds to fetch. Add rows manually for new tasks.
- Preprocess human data (once): Run the R download script to fetch trials, download IRT models, extract item difficulties and ability scores, and write ability-binned human proportions:
Rscript scripts/download_levante_data.R [--version 2026-02-22] \
  [--irt-dataset levante_metadata_scoring:e97h:v1_11] \
  [--irt-table model_registry:rqwv]
This produces:
  - data/responses/<version>/irt_models/<task>.rds – downloaded IRT model
  - data/responses/<version>/irt_models/<task>_item_params.csv – item d parameters (item_uid, difficulty; higher values are empirically easier in the current exports)
  - data/responses/<version>/irt_models/<task>_ability_scores.csv – person abilities (run_id, ability, se)
  - data/responses/<version>/responses_by_ability/<task>_proportions.csv – overall response proportions (item_uid, image1..image4)
  - data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv – ability-binned response proportions (item_uid, ability_bin, image1..image4), with 1-logit bins
- Run evaluation (Python), which writes one row per item_uid:
levante-bench run-eval --task trog --model clip_base --version 2026-02-22
- Run comparison:
levante-bench run-comparison --task trog --model clip_base --version 2026-02-22 [--output-dir results/comparison]
Or directly:
Rscript comparison/compare_levante.R --task trog --model clip_base --version 2026-02-22 --project-root .
Outputs (disaggregated):
  - D_KL: results/comparison/<task>_<model>_d_kl.csv, with columns task, model, item_uid, ability_bin, D_KL.
  - Accuracy: results/comparison/<task>_<model>_accuracy.csv, with columns task, model, item_uid, correct (0/1), difficulty.
- Optional: run full validation + benchmark smoke checks before comparison:
scripts/validate_all.sh
- Optional: validate R/Redivis dependencies and comparison flow:
scripts/validate_r.sh --check-packages-only
scripts/validate_r.sh --run-comparison-smoke --task trog --model clip_base --version 2026-03-24
- Optional: inspect benchmark/prompt run history and metric deltas:
python3 scripts/list_benchmark_results.py --limit 20
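Once the comparison has run, the disaggregated D_KL CSV can be summarized per ability bin. The following is a minimal sketch, not part of the repository: it assumes only the column names listed under Outputs (ability_bin, D_KL), treats ability_bin as an opaque string label, and uses a hypothetical helper name (mean_dkl_by_bin). The toy rows are made-up placeholders, not real results.

```python
import csv
import io
from collections import defaultdict


def mean_dkl_by_bin(csv_file):
    """Average D_KL over all rows (items) that share an ability_bin label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        label = row["ability_bin"]  # kept as a string; may be "all" in fallback mode
        totals[label] += float(row["D_KL"])
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy input with invented values; the bin labels "-1" and "0" are assumptions.
toy_csv = io.StringIO(
    "task,model,item_uid,ability_bin,D_KL\n"
    "trog,clip_base,item_a,-1,0.50\n"
    "trog,clip_base,item_b,-1,0.30\n"
    "trog,clip_base,item_a,0,0.20\n"
)
summary = mean_dkl_by_bin(toy_csv)
for label, value in sorted(summary.items()):
    print(label, round(value, 3))
```

For a real run, the same function could be applied to a file handle opened on results/comparison/<task>_<model>_d_kl.csv (with newline="" as the csv module recommends).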
- Use one version everywhere: Use the same --version for the R download (trials + IRT → human_by_ability), Python run-eval, and run-comparison.
- Responses-by-ability must exist: Run Rscript scripts/download_levante_data.R [--version VERSION] so that data/responses/<version>/responses_by_ability/<task>_proportions_by_ability.csv exists. The download script joins trials with IRT ability scores (from @scores), bins by 1-logit ability width, and aggregates response proportions by item_uid and ability_bin.
- IRT model mapping must be populated: Ensure src/levante_bench/config/irt_model_mapping.csv has a row for each task you want to compare. Without it, IRT models won't be downloaded and ability binning falls back to an "all" aggregate.
- Run evaluation (Python): levante-bench run-eval --task <TASK> --model <MODEL> --version <VERSION>. The loader deduplicates by item_uid, so the .npy has one row per item_uid.
- Run comparison (R): levante-bench run-comparison --task <TASK> --model <MODEL> --version <VERSION>. Writes D_KL and accuracy CSVs to --output-dir (default: results/comparison/).
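The ability-binning step in the checklist above (join trials with ability scores, bin by 1-logit width, aggregate response proportions per item_uid and ability_bin) can be illustrated in a few lines. This is a sketch under stated assumptions, not the logic of download_levante_data.R: the bin edges are assumed to sit on integer logits (floor), each bin is labeled by its lower edge, and the trial tuple layout and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ability_bin(theta, width=1.0):
    """Lower edge of the bin containing theta (assumed integer-aligned edges)."""
    return math.floor(theta / width) * width


def proportions_by_ability(trials, abilities, options):
    """trials: iterable of (run_id, item_uid, response); abilities: run_id -> theta."""
    counts = defaultdict(Counter)
    for run_id, item_uid, response in trials:
        if run_id not in abilities:
            continue  # skip runs without an IRT ability score
        counts[(item_uid, ability_bin(abilities[run_id]))][response] += 1
    result = {}
    for key, counter in counts.items():
        n = sum(counter.values())
        result[key] = [counter[o] / n for o in options]
    return result


# Toy data with invented values.
options = ["image1", "image2", "image3", "image4"]
abilities = {"r1": 0.4, "r2": 0.9, "r3": -0.2}
trials = [
    ("r1", "item_a", "image1"),
    ("r2", "item_a", "image2"),
    ("r3", "item_a", "image1"),
    ("r4", "item_a", "image3"),  # no ability score, so it is dropped
]
props = proportions_by_ability(trials, abilities, options)
```

Each value in props is a list of proportions ordered like options, mirroring the image1..image4 columns of the proportions_by_ability CSV.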
- Item_uid alignment: The loader deduplicates by item_uid, so the model runs once per item and the .npy has one row per item_uid. The comparison aligns by item_uid (order from trials = order in .npy).
- Accuracy: One row per item_uid: correct = 1 if model argmax (after softmax with fitted beta) equals the correct option, else 0. The difficulty column comes from the IRT model's d parameter. In the current exports this parameter is easiness-oriented: higher values correlate with higher human item accuracy. For 4 options, chance = 0.25. A positive correlation between difficulty and correct indicates the model is more accurate on empirically easier items.
- D_KL: One row per (item_uid, ability_bin): KL(human proportions || model softmax) for that ability bin and item. Beta is fitted once to minimize mean D_KL across all (item_uid, ability_bin) pairs. Use the disaggregated D_KL CSV for per-ability or per-item analysis.
- IRT d correlation: The comparison script reports the correlation between correct (0/1) and difficulty (IRT d parameter). Positive values mean the model is more likely to answer items with higher d correctly; validate the sign convention with python scripts/analysis/validate_sign_conventions.py.
- Spot-check: Inspect a few item_uids: in the accuracy CSV check that correct matches your expectation; in the D_KL CSV compare D_KL across ability bins or items.
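The D_KL and accuracy definitions above can be sketched in plain Python. This is an illustration, not the code in stats-helper.R: the R scripts use philentropy and nloptr, whereas this sketch swaps in a golden-section search for the single beta parameter (valid here because mean KL is convex in beta). All function names, the search bounds, and the synthetic data are assumptions.

```python
import math


def softmax(logits, beta):
    """Softmax of beta * logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(beta * (x - m)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_dkl(beta, pairs):
    """Mean KL(human || softmax(beta * logits)) over (human, logits) pairs."""
    return sum(kl(h, softmax(l, beta)) for h, l in pairs) / len(pairs)


def fit_beta(pairs, lo=0.01, hi=50.0, tol=1e-6):
    """Golden-section search for the beta that minimizes mean D_KL (bounds assumed)."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if mean_dkl(c, pairs) < mean_dkl(d, pairs):
            b = d
        else:
            a = c
    return (a + b) / 2


def is_correct(logits, correct_index):
    """1 if the model's top-scoring option is the correct one, else 0."""
    return int(max(range(len(logits)), key=lambda i: logits[i]) == correct_index)


# Synthetic check: "human" proportions generated from the logits with beta = 2,
# so the fit should recover a value close to 2. Not real data.
logit_rows = [[1.0, 0.0, -1.0, 0.5], [0.2, 1.5, 0.0, -0.3]]
pairs = [(softmax(row, 2.0), row) for row in logit_rows]
beta_hat = fit_beta(pairs)
```

Because softmax is monotone for positive beta, the argmax used for accuracy does not depend on the fitted value; beta only affects D_KL.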
Vocab quadrant graphics live in the shared additional images bundle, downloaded
by scripts/data_prep/download_levante_assets.py.
- Graphics directory: data/assets/additional_images/vocab_graphics/images/
- Placement manifest: data/assets/additional_images/vocab_graphics/vocab-quadrants-manifest.csv
- Summary stats: data/assets/additional_images/vocab_graphics/vocab-quadrants-summary.json
Regenerate with:
python3 scripts/build_vocab_quadrant_graphics.py \
--corpus-csv data/assets/2026-03-24/corpus/vocab/vocab-item-bank.csv \
--visual-dir data/assets/2026-03-24/visual/vocab \
--out-dir data/assets/additional_images/vocab_graphics/images \
--manifest-csv data/assets/additional_images/vocab_graphics/vocab-quadrants-manifest.csv \
--summary-json data/assets/additional_images/vocab_graphics/vocab-quadrants-summary.json \
--seed 11