0.2.13 (2026-02-26)
- add Global MMLU task (#174) (0d0b227)
- add GoldenSwag task (#175) (a05e032)
- add tasks from the OLMES evaluation suite (#180) (54f295d)
- add aggregated results with errors if the error-free ratio is < 1.0 (#181) (6f3e639)
- BalancedCOPA dataset (#177) (25161aa)
- Change to more complete revision of ZeroScrolls dataset (#171) (a4e117e)
- COPA uses appropriate dataset splits (#176) (55ebe44)
- Change to more complete revision of zeroscrolls (#173) (a84286e)
- Flores200 data reading issue (#179) (9bf3155)
0.2.12 (2026-02-04)
- add "top_p" param to AlephAlphaAPIModel (#168) (e52c927)
- Bump datasets to >=4.0.0 and remove all `trust_remote_code` references. (#158) (c383806)
0.2.11 (2026-01-30)
- Downloaded W&B artifacts are deleted too early (#163) (157d757)
- use aleph-alpha-client concurrency limit and allow >100 concurrent requests (#166) (73b7d97)
- VLLM tokenizer lazy initialization didn't work with W&B (#165) (f38de79)
0.2.10 (2026-01-27)
0.2.9 (2026-01-15)
0.2.8 (2026-01-09)
- normalize math reasoning (#148) (73a8843)
- removed github token from release-please and update image links (#147) (74d59ea)
0.2.7 (2026-01-08)
- add position randomization for LLM pairwise judges (#135) (e4ed3ec)
- added automated documentation through CI and Sphinx (#127) (46ef6b3)
- added badges to github readme to link pypi and docs pages (#139) (778bad2)
- pass AA_TOKEN and AA_INFERENCE_ENDPOINT in the AA model constructor (#134) (93267b6)
- docs: resolve broken source links (#132) (c0e37b2)
- release-please pushes docker to registry and triggers tests (#138) (d291bb4)
- For math reasoning completion, added a `finally` block ensuring the timeout signal cannot fire outside that block, which previously crashed the process.
- Move `aleph_alpha.py` to use the `/completions` endpoint instead of `/evaluate`. `/evaluate` was only available for models deployed in the Luminous workers and is not supported in vLLM.
- Added 11 "I don't know" (IDK) task variants: `ARC_IDK`, `COPA_IDK`, `GPQA_IDK`, `HELLASWAG_IDK`, `MMLU_IDK`, `MMLU_PRO_IDK`, `PIQA_IDK`, `OPENBOOKQA_IDK`, `TRUTHFULQA_IDK`, `WINOGENDER_IDK`, and `WINOGRANDE_IDK`. Call for automated hashing.
- Corrected a typo in the prompt template key for an MTBench LLM-as-a-judge, and implemented tests to ensure these are always what we expect (no typos)
- Updated image URLs to be absolute so the PyPI page can display them correctly
- Added `llm_judge_prompt` and `llm_judge_response` to MTBENCH metric results
- Cleaned up the `OpenAIModel` class. Those models can now also be evaluated, not only used as judges. Loglikelihood evaluation requests are now implemented (although only supported by a limited number of OpenAI models). Implemented tests for `OpenAIModel` calls. Added concurrency to completion calls.
- Added access to the DeepSeek model API
- Added AidanBench benchmark (measures creative divergent thinking by counting unique, coherent responses to open-ended questions) as well as AidanBenchOriginal (the same, but preserving a typo found in the original implementation).
- Added documentation on the `SQUAD` and `SQUAD2` benchmark classes
- Updated documentation on lists of available tasks
- Added `.vscode/launch.json`
- Added verbosity levels (0 is critical, 1 is info, 2 is debug) for minimal output
- Modified the Hendrycks Math task to use the same query template as MATH500 to encourage boxed answer formatting.
- Added a `post_process_completion` method to the `BaseLLM` class to enable model-specific post-processing of completions before task-specific post-processing is applied.
- The `BaseLLM` class is equipped with a `__del__` call to clean up resources. The VLLM and HF APIs offload the respective models off the GPUs. The OpenAI class disconnects the client.
- Refactored the `VLLM` and `HFLLM` interfaces in a backwards-compatible way so that there are identical (and flexible!) checkpoint and formatter specification options across `VLLM` and `HFLLM`. `VLLMRegistryModel`, `HFLLMRegistryModel`, and `HFLLM_from_name` are now deprecated.
- Added a `generate_from_samples` method in `BaseLLM` which takes precedence over `generate_from_messages` if implemented.
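The interplay of these two `BaseLLM` hooks can be sketched with stand-in classes (only the method names `post_process_completion`, `generate_from_samples`, and `generate_from_messages` come from the entries above; everything else is illustrative, not the framework's actual API):

```python
# Illustrative stand-in classes, not the framework's real implementation.

class SketchBaseLLM:
    def generate_from_messages(self, messages: list[dict]) -> str:
        raise NotImplementedError

    def generate_from_samples(self, samples: list[dict]) -> list[str]:
        # Default behavior: fall back to message-based generation per sample.
        return [self.generate_from_messages(s["messages"]) for s in samples]

    def post_process_completion(self, completion: str) -> str:
        # Model-specific hook, run before any task-specific post-processing.
        return completion  # no-op by default


class EchoLLM(SketchBaseLLM):
    def generate_from_messages(self, messages: list[dict]) -> str:
        # "Generate" by echoing the last message, then apply the model hook.
        return self.post_process_completion(messages[-1]["content"])

    def post_process_completion(self, completion: str) -> str:
        return completion.strip()


out = EchoLLM().generate_from_samples([{"messages": [{"content": "  42  "}]}])
print(out)  # ['42']
```

A subclass that overrides `generate_from_samples` directly would bypass the message path entirely, which is the precedence the entry describes.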
- SciQ: Previously, the benchmark included instructions with context passages that revealed the answer. A new version has been created that removes this context while keeping the original as `SCIQEvalHarness`.
- TruthfulQA: Fixed an indexing error that caused the benchmark to return the first correct item instead of the last. Corrected the ground truth for Accuracy to include all label-1 items, rather than only a single item.
- GSM8K: In line with the convention of naming the recommended default version as the primary benchmark, `GSM8KLlamaVersion` has been renamed to `GSM8K`, and the original `GSM8K` has been renamed to `GSM8KEvalHarness`.
- `MTBenchJudgePair` and `MTBenchJudgeSingle`: The expected error (`KeyError`) wouldn't be thrown, resulting in uncaught errors. We now use the same error handling that we do in other tasks.
- Added `ConfidenceWeightedAccuracy`, i.e., the score = probability of the correctly-chosen answer (when it is also the argmax)
- Added `DistributionalCorrectnessScore`, based on Burns (2025), "Measuring Language Model Hallucinations Through Distributional Correctness".
- Added `TernaryScore`, based on Kalai et al. (2025), "Why Language Models Hallucinate", arXiv:2509.04664.
- JsonFormat: added an optional `exact_match` score based on whether the generated JSON object equals an expected ground-truth object.
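A minimal sketch of the `ConfidenceWeightedAccuracy` definition given above (score = probability of the correct answer, counted only when it is also the argmax); the helper name and signature are illustrative, not the framework's metric API:

```python
# Illustrative helper, not the framework's actual metric class.

def confidence_weighted_accuracy(probs: list[float], correct_idx: int) -> float:
    """Return the probability of the correct choice if it is the argmax, else 0."""
    argmax_idx = max(range(len(probs)), key=probs.__getitem__)
    return probs[correct_idx] if argmax_idx == correct_idx else 0.0


print(confidence_weighted_accuracy([0.7, 0.2, 0.1], 0))  # 0.7 (correct and argmax)
print(confidence_weighted_accuracy([0.7, 0.2, 0.1], 1))  # 0.0 (correct is not argmax)
```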
- Added a `WANDB_ADDITIONAL_ARTIFACT_REFERENCES` environment variable to reference custom artifacts in W&B.
- Added a `resource-cleanup` argument to `run.py`, enabling a smooth transition in GPU workflows between response generation and evaluation.
- Added `WandbUploader` (for uploading results as W&B artifacts) and refactored `HFUploader` (no change in functionality).
- Config hashes in output directories no longer consider config elements which are irrelevant to actual results.
- Fix: WandB initialization does not crash on overly long model names anymore.
- Fix: "Object of type Role is not JSON serializable" type of errors were fixed.
- Updated examples in the docs to use the updated args and switched default tests to MMLU for more insightful metrics.
- Fix: W&B integration respects WANDB_ARTIFACT_DIR. In addition, new env var WANDB_CACHE_SKIP controls cache use.
- Dropped support for S3 storages without proper SSL certificates.
- Added support for W&B artifacts on local storage, which don't need to be downloaded and may be available earlier.
- Fix: `pip install eval_framework[all]` uses uv to fix `ResolveTooDeep` dependency resolver errors.
- Added a CI workflow to test uv and pip installs (CPU-only, and GPU for VLLM) and avoid triggering on .md-only changes.
- Updated the CI workflow graph to decouple the CPU-only tests from the full GPU test suite: CPU tests don't wait for the Docker build.
- Changed the implementation of OpenBookQA to be open-book (gives facts in the prompt). The old version is available as the `OPENBOOKQA_EVAL_HARNESS` task.
- Added a `BYTES_PER_TOKEN` class variable that controls token fertility, allowing a dataset's max_tokens to be model-specific.
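The token-fertility idea behind `BYTES_PER_TOKEN` can be sketched as follows (class and method names other than `BYTES_PER_TOKEN` are hypothetical, not the framework's actual API):

```python
# Illustrative sketch: a task stores its length budget in bytes, and each
# model converts it to max_tokens using its own bytes-per-token ratio.

class SketchLLM:
    # Rough bytes-per-token ratio; tokenizer-dependent in practice.
    BYTES_PER_TOKEN = 4.0

    def max_tokens_for(self, byte_budget: int) -> int:
        return int(byte_budget / self.BYTES_PER_TOKEN)


class ByteLevelLLM(SketchLLM):
    BYTES_PER_TOKEN = 1.0  # e.g. a byte-level tokenizer needs more tokens


print(SketchLLM().max_tokens_for(2048))    # 512
print(ByteLevelLLM().max_tokens_for(2048)) # 2048
```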
- Added automated Docker image versioning in the release workflow. Docker images are now tagged with `v{major}.{minor}.{patch}`, `v{major}.{minor}`, and `latest` on each release for reproducible deployments.
- Added a Docker guide (`docs/docker_guide.md`) for both AA users and external contributors.
- Added template formatting tests to be run by CI.
- Restructured tests to "test_eval_framework" and "tests_template_formatting".
- Fix LLM judge not being available via CLI in Determined context
- The `--llm-name` (and `--judge-model-name`) argument can now also be a module path like `eval_framework.llm.huggingface.HFLLM`. Combining this with `--llm-args` (`--judge-model-args`) should cover many use-cases without having to provide a `models.py` file.
- Added `eval_framework.llm.huggingface.HFLLMRegistryModel` and `eval_framework.llm.vllm.VLLMRegistryModel` to conveniently load models from `wandb`.
- Fix for empty `stop_sequences` in `eval_framework.llm.huggingface.StopSequenceCriteria`.
- Fixed dataset loading issues for SQUAD, SQUAD2, FLORES-200, and SPHYR that were causing formatter test failures.
- Pinned `HF_REVISION` for StructEval to `b5512175`, since the train split was renamed to test upstream.
- Renamed the `_get_eval_kwargs` method to `_get_context` in the StructEval task.
- Removed `torch` as a main dependency of `eval_framework`
- Added wandb logging
- Documentation improvements
- Reduced redundant string/path casting
- Import paths in `llm` and `metrics` no longer have a `_llm` and `_metrics` suffix, e.g., `llm/huggingface.py` instead of `llm/huggingface_llm.py`.
- We've also removed all models except those used for testing (they were largely old). The recommended way going forward is to provide your own model implementations to the framework.
- `DEFAULT_FORMATTER` in our models is now a callable, to avoid instantiating formatters at import time.
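The pattern behind making `DEFAULT_FORMATTER` a callable can be illustrated with stand-in classes (everything except the `DEFAULT_FORMATTER` name is hypothetical): storing a factory instead of an instance defers construction until a model actually needs the formatter.

```python
# Illustrative stand-ins, not the framework's real classes.

class SketchFormatter:
    def format(self, text: str) -> str:
        return f"<user>{text}</user>"


class SketchModel:
    # A callable/factory rather than SketchFormatter(): nothing is
    # instantiated when this module is merely imported.
    DEFAULT_FORMATTER = SketchFormatter

    def __init__(self):
        self.formatter = self.DEFAULT_FORMATTER()


print(SketchModel().formatter.format("hi"))  # <user>hi</user>
```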
- Our benchmark tasks are now registered lazily, which reduces the amount of code imported at startup time. Task look-ups are now insensitive to case, hyphens, underscores, and whitespace.
- Task names in the registry are now enforced to be equal to the class names.
- Added `subjects` and `hf_revision` to `BaseTask` arguments to replace global task re-definition when running with non-default values.
- Generate task documentation in `docs/tasks`. Moved the `generate_task_docs` utility inside the package and added a test that the documentation is up-to-date.
- Renamed `ChemBenchMultipleChoice` to `ChemBench` for consistency.
- Fixed `ZERO_SCROLLS_QMSUM` missing from task_names.py
- Fix inconsistent language code for Croatian/Serbian in the INCLUDE task
- Fixed BLEU/CHRF/TER min/max scoring when all completions are empty.
- Special tokens are now ignored when computing compression ratios
- Fixed loading of extra task modules (skip non-evaluation BaseTasks with no NAME attribute) and added a test that no two tasks with the same name get registered
- Packages are now released to PyPI
- Removed and relaxed several main dependencies
- Added support for Weights & Biases + Determined pre-emption
- Added missing `DOCKER_CODE_EXECUTION` variable to `.env.example`
- Added accelerate import as default for [transformers] and boto3 in pyproject.toml
- Initial release of `eval-framework`.