
Changelog

Main/Unreleased

Models

Tasks

Metrics

General

Bug Fixes

0.2.13 (2026-02-26)

Features

  • add Global MMLU task (#174) (0d0b227)
  • add GoldenSwag task (#175) (a05e032)
  • add tasks from the OLMES evaluation suite (#180) (54f295d)
  • add aggregated results with errors if the error-free ratio is < 1.0 (#181) (6f3e639)
  • BalancedCOPA dataset (#177) (25161aa)
  • Change to more complete revision of ZeroScrolls dataset (#171) (a4e117e)
  • COPA uses appropriate dataset splits (#176) (55ebe44)

Bug Fixes

  • Change to more complete revision of zeroscrolls (#173) (a84286e)
  • Flores200 data reading issue (#179) (9bf3155)

Documentation

  • updated with info for release-please (#162) (cf38766)

0.2.12 (2026-02-04)

Features

  • add "top_p" param to AlephAlphaAPIModel (#168) (e52c927)
  • Bump datasets to >=4.0.0 and remove all trust_remote_code references. (#158) (c383806)

0.2.11 (2026-01-30)

Bug Fixes

  • Downloaded w&b artifacts are deleted too early (#163) (157d757)
  • use aleph-alpha-client concurrency limit and allow >100 concurrent requests (#166) (73b7d97)
  • VLLM tokenizer lazy initialization didn't work with W&B (#165) (f38de79)

0.2.10 (2026-01-27)

Bug Fixes

  • prefix dataset paths with hf user id for all tasks that did not have it before (#160) (d5dc178)

0.2.9 (2026-01-15)

Features

Bug Fixes

  • docker push on release has one too many 'v's in the tag name (#153) (99e6096)

0.2.8 (2026-01-09)

Bug Fixes

  • normalize math reasoning (#148) (73a8843)
  • removed github token from release-please and update image links (#147) (74d59ea)

0.2.7 (2026-01-08)

Features

  • add position randomization for LLM pairwise judges (#135) (e4ed3ec)
  • added automated documentation through CI and Sphinx (#127) (46ef6b3)
  • added badges to github readme to link pypi and docs pages (#139) (778bad2)
  • pass AA_TOKEN and AA_INFERENCE_ENDPOINT in the AA model constructor (#134) (93267b6)

Bug Fixes

  • docs: resolve broken source links (#132) (c0e37b2)
  • release-please pushes docker to registry and triggers tests (#138) (d291bb4)

Documentation

  • added documentation for running tests and expected runtimes (#133) (77fd1d3)

0.2.6

Models

Tasks

Metrics

General

  • For math reasoning completions, added a finally block that guarantees the timeout signal can no longer fire outside of it; previously a stray signal crashed the process. A minimal sketch of the pattern follows.
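
A minimal sketch of the pattern, assuming a POSIX SIGALRM-based timeout; the helper and names below are illustrative, not the framework's actual code:

```python
import signal

class MathTimeout(Exception):
    pass

def _on_timeout(signum, frame):
    raise MathTimeout()

def run_with_timeout(fn, seconds: int = 5):
    # Illustrative helper: the alarm is always disarmed in the finally block,
    # so the signal cannot fire after this function returns and crash the
    # surrounding process.
    previous = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(seconds)
    try:
        return fn()
    except MathTimeout:
        return None
    finally:
        signal.alarm(0)                          # disarm any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the previous handler
```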

0.2.5

Models

  • Moved aleph_alpha.py to use the /completions endpoint instead of /evaluate; /evaluate was only available for models deployed on the Luminous workers and is not supported in vLLM.

Tasks

  • Added 11 "I don't know" (IDK) task variants: ARC_IDK, COPA_IDK, GPQA_IDK, HELLASWAG_IDK, MMLU_IDK, MMLU_PRO_IDK, PIQA_IDK, OPENBOOKQA_IDK, TRUTHFULQA_IDK, WINOGENDER_IDK, and WINOGRANDE_IDK. Call for automated hashing.
  • Corrected a typo in the prompt template key for an MTBench LLM-as-a-judge, and implemented tests to ensure these keys are always what we expect (no typos).

Metrics

General

  • Updated image URLs to be absolute so the PyPI page can display them correctly
  • Added llm_judge_prompt and llm_judge_response to MTBENCH metric results

0.2.4

Models

  • Cleaned up the OpenAIModel class. These models can now also be evaluated rather than only used as judges. Loglikelihood evaluation requests are now implemented (although only supported by a limited number of OpenAI models). Implemented tests for OpenAIModel calls and added concurrency to completion calls
  • Added access to Deepseek model API

Tasks

  • Added AidanBench benchmark (measures creative divergent thinking by counting unique, coherent responses to open-ended questions) as well as AidanBenchOriginal (the same, but preserving a typo found in the original implementation).

Metrics

General

  • Added documentation on SQUAD and SQUAD2 benchmark classes
  • Updated documentation on lists of available tasks
  • Added .vscode/launch.json
  • Added verbosity levels (0: critical, 1: info, 2: debug) so output can be kept minimal; see the sketch after this list
  • Modified the Hendrycks Math task to use the same query template as MATH500 to encourage boxed answer formatting.
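
A minimal sketch of how verbosity levels like these typically map onto Python's logging module; the mapping and helper below are assumptions for illustration, not the framework's actual implementation:

```python
import logging

# Assumed mapping for illustration only (0: critical, 1: info, 2: debug).
VERBOSITY_TO_LEVEL = {0: logging.CRITICAL, 1: logging.INFO, 2: logging.DEBUG}

def configure_logging(verbosity: int) -> None:
    # Unknown verbosity values fall back to INFO.
    logging.basicConfig(level=VERBOSITY_TO_LEVEL.get(verbosity, logging.INFO))
```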

0.2.3

Models

  • Added a post_process_completion method to the BaseLLM class to enable model-specific post-processing of completions before task-specific post-processing is applied.
  • The BaseLLM class is equipped with a __del__ method to clean up resources: the VLLM and HF APIs offload their models from the GPUs, and the OpenAI class disconnects the client.
  • Refactored the VLLM and HFLLM interfaces in a backwards-compatible way so that there are identical (and flexible!) checkpoint and formatter specification options across VLLM and HFLLM. VLLMRegistryModel, HFLLMRegistryModel and HFLLM_from_name are now deprecated.
  • Added a generate_from_samples method to BaseLLM which takes precedence over generate_from_messages if implemented; see the sketch after this list.
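
An illustrative sketch of this precedence; the import path, signatures and attribute names are assumptions, not the framework's documented API:

```python
from eval_framework.llm.base import BaseLLM  # import path is an assumption

class MyLLM(BaseLLM):
    def generate_from_messages(self, messages, max_tokens):
        ...  # fallback path used when generate_from_samples is not implemented

    def generate_from_samples(self, samples, max_tokens):
        # When implemented, this method is preferred over generate_from_messages
        # and receives the full sample objects rather than only their messages.
        return [self.generate_from_messages(s.messages, max_tokens) for s in samples]
```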

Tasks

  • SciQ: Previously, the benchmark included instructions with context passages that revealed the answer. A new version has been created that removes this context while keeping the original as SCIQEvalHarness.
  • TruthfulQA: Fixed an indexing error that caused the benchmark to return the first correct item instead of the last. Corrected the ground truth for Accuracy to include all label-1 items, rather than only a single item.
  • GSM8K: In line with the convention of naming the recommended default version as the primary benchmark, GSM8KLlamaVersion has been renamed to GSM8K, and the original GSM8K has been renamed to GSM8KEvalHarness.

Metrics

  • MTBenchJudgePair and MTBenchJudgeSingle: The expected error (KeyError) wouldn't be thrown, resulting in uncaught errors. We now use the same error handling that we do in other tasks.
  • Added ConfidenceWeightedAccuracy, i.e., the score is the probability of the correctly chosen answer (when it is also the argmax); see the sketch after this list.
  • Added DistributionalCorrectnessScore, based on Burns (2025) Measuring Language Model Hallucinations Through Distributional Correctness.
  • Added TernaryScore, based on Kalai et al. (2025) Why language models hallucinate. arXiv:2509.04664.
  • JsonFormat: added optional exact_match score based on whether the generated JSON object equals an expected ground-truth object.
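
A toy illustration of the idea behind ConfidenceWeightedAccuracy; the function below is a sketch, not the framework's implementation:

```python
def confidence_weighted_accuracy(probs: dict[str, float], correct: str) -> float:
    # The score is the probability assigned to the correct answer, but only
    # when that answer is also the model's argmax choice; otherwise it is 0.
    predicted = max(probs, key=probs.get)
    return probs[correct] if predicted == correct else 0.0

# Example: the correct answer "B" is also the argmax, so the score is 0.7.
assert confidence_weighted_accuracy({"A": 0.2, "B": 0.7, "C": 0.1}, "B") == 0.7
```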

General

  • Added WANDB_ADDITIONAL_ARTIFACT_REFERENCES environment variable to reference custom artifacts in W&B.
  • Added a resource-cleanup argument to run.py, enabling a smooth transition in GPU workflows between response generation and evaluation.
  • Added WandbUploader (for uploading results as W&B artifacts) and refactored HFUploader (no change in functionality).
  • Config hashes in output directories now ignore config elements that are irrelevant to the actual results.
  • Fix: WandB initialization does not crash on overly long model names anymore.
  • Fix: "Object of type Role is not JSON serializable" type of errors were fixed.
  • Updated examples in the docs to use the updated args and switched default tests to MMLU for more insightful metrics.
  • Fix: W&B integration respects WANDB_ARTIFACT_DIR. In addition, new env var WANDB_CACHE_SKIP controls cache use.
  • Dropped support for S3 storages without proper SSL certificates.
  • Added support for W&B artifacts on local storage, which don't need to be downloaded and may be available earlier.
  • Fix: pip install eval_framework[all] uses uv to fix ResolveTooDeep dependency resolver errors.
  • Added a CI workflow to test uv and pip installs (CPU only, and GPU for VLLM) and to avoid triggering on .md changes.
  • Updated the CI workflow graph to decouple the CPU-only tests from the full test suite with GPU: CPU tests don't wait for the Docker build.
  • Changed the implementation of OpenBookQA to be open-book (facts are given in the prompt). The old version is available as the OPENBOOKQA_EVAL_HARNESS task.
  • Added a class variable "BYTES_PER_TOKEN" that controls token fertility, allowing max_tokens in a dataset to be model-specific; see the sketch after this list.
  • Added automated Docker image versioning in release workflow. Docker images are now tagged with v{major}.{minor}.{patch}, v{major}.{minor}, and latest on each release for reproducible deployments.
  • Added Docker guide (docs/docker_guide.md) for both AA users and external contributors.
  • Added template formatting tests to be run by CI.
  • Restructured tests to "test_eval_framework" and "tests_template_formatting".
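
A heavily simplified sketch of how a per-model bytes-per-token value could turn a dataset's byte budget into a model-specific max_tokens; everything except the BYTES_PER_TOKEN name is hypothetical:

```python
class MyTask:
    MAX_COMPLETION_BYTES = 4096  # hypothetical dataset-level byte budget

class MyLLM:
    BYTES_PER_TOKEN = 3.2  # rough token fertility of this model's tokenizer

def max_tokens_for(task, llm) -> int:
    # Convert the byte budget into a token budget for this particular model.
    return int(task.MAX_COMPLETION_BYTES / llm.BYTES_PER_TOKEN)

print(max_tokens_for(MyTask, MyLLM))  # 1280
```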

0.2.2

General

  • Fix LLM judge not being available via CLI in Determined context

0.2.1

Models

  • The --llm-name (and --judge-model-name) argument can now also be a module path like eval_framework.llm.huggingface.HFLLM. Combining this with --llm-args (--judge-model-args) should cover many use cases without having to provide a models.py file.
  • Added eval_framework.llm.huggingface.HFLLMRegistryModel and eval_framework.llm.vllm.VLLMRegistryModel to conveniently load models from wandb.

Tasks

  • Fix for empty stop_sequences in eval_framework.llm.huggingface.StopSequenceCriteria.
  • Fixed dataset loading issues for SQUAD, SQUAD2, FLORES-200, and SPHYR that were causing formatter test failures.
  • Pinned HF_REVISION for StructEval to b5512175, since the train split was renamed to test upstream.
  • Renamed _get_eval_kwargs method to _get_context in the StructEval task.

General

  • Removed torch as a main dependency of eval_framework
  • Added wandb logging
  • Documentation improvements
  • Reduced redundant string/path casting

0.2.0

Models

  • Import paths in llm and metrics no longer have a _llm and _metrics suffix. E.g., llm/huggingface.py instead of llm/huggingface_llm.py.
  • We've also removed all models except those used for testing (they were largely old). The recommended way going forward is to provide your own models implementation to the framework.
  • DEFAULT_FORMATTER in our models is now a callable, to avoid instantiating formatters at import time.

Tasks

  • Our benchmark tasks are now registered lazily, which reduces the amount of code imported at startup time. Task look-ups are now insensitive to case, hyphens, underscores and whitespace (see the sketch after this list).
  • Task names in the registry are now enforced to be equal to the class names.
  • Added subjects and hf_revision to the BaseTask arguments to replace global task re-definition when running with non-default values.
  • Generated task documentation in docs/tasks, moved the generate_task_docs utility inside the package, and added a test that the documentation is up-to-date.
  • Renamed ChemBenchMultipleChoice to ChemBench for consistency.
  • Fixed ZERO_SCROLLS_QMSUM missing from task_names.py
  • Fix inconsistent language code for Croatian/Serbian in INCLUDE task
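
A toy illustration of case-, hyphen-, underscore- and whitespace-insensitive task look-up; the actual normalization used by the registry may differ:

```python
import re

def normalize_task_name(name: str) -> str:
    # Strip hyphens, underscores and whitespace, then lowercase, so that
    # "Zero-Scrolls QMSum" and "ZERO_SCROLLS_QMSUM" resolve to the same key.
    return re.sub(r"[-_\s]", "", name).lower()

assert normalize_task_name("Zero-Scrolls QMSum") == normalize_task_name("ZERO_SCROLLS_QMSUM")
```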

Metrics

  • Fixed BLEU/CHRF/TER min/max scoring when all completions are empty.

General

  • Special tokens are now ignored when computing compression ratios
  • Fixed loading of extra task modules (non-evaluation BaseTasks with no NAME attribute are now skipped) and added a test that no two tasks with the same name get registered
  • Packages are now released to PyPI
  • Removed and relaxed several main dependencies
  • Added support for Weights & Biases and Determined pre-emption
  • Added missing DOCKER_CODE_EXECUTION variable to .env.example
  • Added accelerate import as default for [transformers] and boto3 in pyproject.toml

0.1.0

  • Initial release of eval-framework.