Release v0.4.12 · EleutherAI/lm-evaluation-harness

New release with four new model backends, tensor-parallel support for transformers-based models (hf), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.

Highlights

New Model Backends

TensorRT-LLM (trt-llm) — NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
Megatron-LM (megatron-lm) — Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
Intel Gaudi — Gaudi support via optimum-habana by @12010486 in #3550
LiteLLM AI gateway (litellm) — Use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
Native Tensor Parallelism for HF backend — multi-GPU TP for transformers models via tp_plan by @YangKai0616 in #3692

`TaskManager` Refactor (#3549)

TaskManager.load(...) returns a flat {tasks, groups} dict instead of the legacy nested {ConfigurableGroup: {name: Task}}. evaluate() accepts both shapes; load_task_or_group(...) and get_task_dict(...) are deprecated shims that return the old shape.
New Group class directly holds its child tasks; ConfigurableGroup is now a deprecated wrapper around it.
Duplicate task/group configs within the same root are skipped with a log message instead of silently overwritten. (Custom include_path entries still override defaults.)
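To make the shape change concrete, here is a minimal sketch using plain dictionaries. It does not import lm_eval: the task and group names, the string placeholders, and the helper flatten_legacy are hypothetical stand-ins, and the real values are Task and Group objects rather than strings. What the groups entry actually holds is an assumption here, and the sketch only handles one level of nesting.

```python
# Illustrative only: plain-dict stand-ins, not the lm_eval API.

# Legacy nested shape: {group: {name: task}}, with ungrouped tasks at the top level.
legacy = {
    "my_group": {"task_a": "<Task a>", "task_b": "<Task b>"},
    "task_c": "<Task c>",
}


def flatten_legacy(nested):
    """Convert the nested shape into a flat {"tasks": ..., "groups": ...} dict."""
    flat = {"tasks": {}, "groups": {}}
    for name, value in nested.items():
        if isinstance(value, dict):
            # A group: record its child task names and hoist the children.
            flat["groups"][name] = sorted(value)
            flat["tasks"].update(value)
        else:
            # An ungrouped task stays a plain entry under "tasks".
            flat["tasks"][name] = value
    return flat


flat = flatten_legacy(legacy)
print(sorted(flat["tasks"]))
print(flat["groups"])
```

Code that walked the nested dictionary to find tasks can read the flat tasks mapping directly; code that still needs the old shape can keep calling the deprecated shims named above.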

Breaking Changes

SteeredHF renamed to SteeredModel — update imports if you're using the steering backend by @adrian-sauter in #3592
vLLM minimum bumped to >=0.18 as part of the data-parallel-with-Ray fixes by @baberabb in #3725
enable_thinking is now disallowed for multiple_choice / loglikelihood tasks, and think_end_token is now required when enable_thinking=True (configurations that combined these previously failed silently) by @fxmarty-amd in #3675
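The enable_thinking rules can be read as a small validation step. The sketch below is a hypothetical validator written for illustration, not the harness's code: the function name validate_thinking_config, the use of ValueError, the error messages, and the "</think>" token are all assumptions.

```python
# Illustrative only: a hypothetical validator, not the harness's actual code.
LOGLIKELIHOOD_TYPES = {"multiple_choice", "loglikelihood"}


def validate_thinking_config(output_type, enable_thinking=False, think_end_token=None):
    """Raise ValueError for the combinations the release notes say are rejected."""
    if not enable_thinking:
        return  # Nothing to check when thinking is off.
    if output_type in LOGLIKELIHOOD_TYPES:
        raise ValueError(
            f"enable_thinking is not supported for output_type={output_type!r}"
        )
    if think_end_token is None:
        raise ValueError("think_end_token is required when enable_thinking=True")


# A generative task with an explicit end-of-thinking token passes.
validate_thinking_config(
    "generate_until", enable_thinking=True, think_end_token="</think>"
)
```

If an existing configuration starts raising after the upgrade, the likely remedy is to turn enable_thinking off for the affected loglikelihood-style tasks or to supply think_end_token.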

New Logger

Trackio logger with per-sample Trace logging by @abidlabs in #3733

New Benchmarks & Tasks

InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570

Fixes & Improvements

Task Fixes

Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693 and @baberabb in #3747
Fixed RACE doc_to_text keeping a blank marker and dropping the question body by @Chessing234 in #3716
Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
Fixed HeadQA doc_to_decontamination_query pointing at a nonexistent query field by @Chessing234 in #3718
Fixed french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent texte field by @Chessing234 in #3719
Fixed TruthfulQA-gen dataset_path by @zhngstl in #3723
Fixed NorEval/NorIdiom !function imports to use absolute module paths by @Anai-Guo in #3731
Fixed IFEval RephraseChecker.strip_changes greedy-regex bug by @Chessing234 in #3737
Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
Updated BLiMP dataset path by @jmichaelov in #3596
Replaced all references to the CohereForAI org with CohereLabs by @juliafalcao in #3631
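The IFEval fix above concerns greedy versus non-greedy matching. The following sketch shows the general failure mode on a made-up string; the patterns and the sample text are assumptions chosen for illustration and are not copied from RephraseChecker.strip_changes.

```python
import re

# Hypothetical input with two separately marked spans.
text = "a *b* c *d* e"

# Greedy: ".*" runs from the first asterisk to the last one,
# so the unmarked text between the two spans is removed as well.
greedy = re.sub(r"\*.*\*", "", text)

# Non-greedy: ".*?" stops at the nearest closing asterisk,
# so each marked span is removed on its own.
non_greedy = re.sub(r"\*.*?\*", "", text)

print(repr(greedy))
print(repr(non_greedy))
```

With the greedy pattern the " c " between the marked spans is lost; with the non-greedy pattern it survives.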

What's Changed

refactor(Taskmanager)! by @baberabb in #3549
fix(cli): --cache_requests always fails due to argparse type/choices conflict by @maxidl in #3588
feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521
Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in #3601
[fix] Add missing tokenization progress bar by @fxmarty-amd in #3605
fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in #3608
fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in #3607
Update vLLM import of resolve_hf_chat_template by @DarkLight1337 in #3595
Add docstring for HFLM init keyword arguments by @joshuaswanson in #3630
Update all mentions of the CohereForAI organization to CohereLabs by @juliafalcao in #3631
Skip caching None responses in async generation path by @joshuaswanson in #3633
Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
fix(evaluate tests) by @baberabb in #3634
fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in #3626
chore(ci-updates) by @baberabb in #3635
Update BLiMP dataset path by @jmichaelov in #3596
Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in #3570
Rename SteeredHF to SteeredModel in lm_eval/models/__init__.py by @adrian-sauter in #3592
fix: Update WatsonxLLM class mapping and errors by @Rafal-Chrzanowski-IBM in #3591
Add Intel Gaudi support by @12010486 in #3550
[fix] Disallow enable_thinking with output_type: multiple_choice tasks / loglikelihood tasks; raise error in case think_end_token is not provided with enable_thinking=True by @fxmarty-amd in #3675
fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in #3725
refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in #3728
fix(vllm): fix vllm tokenizer for Mistral; rm default gpu_memory_utilization=0.9 by @baberabb in #3732
Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in #3735
Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in #3724
fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in #3737
fix: bound request cache filename length by @princepal9120 in #3729
fix codeowners by @baberabb in #3738
Fix dataset_path for truthfulqa_gen by @zhngstl in #3723
fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in #3734
Add Trackio logger with per-sample Trace logging by @abidlabs in #3733
Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in #3718
Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in #3719
fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in #3731
Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in #3711
Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in #3708
fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in #3742
Fix zeno_visualize discarding tasks intersection result by @Chessing234 in #3739
fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in #3700
feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in #3721
Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in #3716
Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in #3702
Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in #3691
Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in #3693
fix(mmlu_pro_plus): sync fixes from mmlu_pro by @baberabb in #3747
chore: clean up deps; fix ci lint by @baberabb in #3748
Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in #3714
Fix median aggregation returning arbitrary element instead of median by @Chessing234 in #3696
fix(api): chat payload leaking top-level text type by @felixmr1 in #3745
[BUGFIX] Consistent handling of None answers and cache by @RawthiL in #3656
Adding Cruxeval by @ThomasHeap in #3699
[Task] NEREL-bench by @bond005 in #3650
Added Toksuite Benchmark by @gsaltintas in #3669
Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in #3662
fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in #3654
feat: add TRT-LLM backend. by @Tracin in #3628
[Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in #3692
feat(release): 0.4.12 by @baberabb in #3763
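The median aggregation fix listed above (#3696) is easy to picture with a small example. The sketch below shows one way an "arbitrary element" result can arise, namely indexing the middle of an unsorted list; whether this matches the previous implementation is an assumption, and the scores and the helper middle_element are invented for illustration.

```python
import statistics

# Hypothetical per-sample scores in arrival order, not sorted.
scores = [0.9, 0.1, 0.2, 0.8, 0.5]


def middle_element(values):
    """Return the item at the middle index; a median only if values are sorted."""
    return values[len(values) // 2]


wrong = middle_element(scores)      # depends on the order of the inputs
right = statistics.median(scores)   # sorts internally before picking the middle

print(wrong, right)
```

Runs that report a median-aggregated metric may therefore show different numbers after upgrading, even with identical model outputs.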

New Contributors

@maxidl made their first contribution in #3588
@shangxiaokang made their first contribution in #3521
@ManasVardhan made their first contribution in #3608
@joshuaswanson made their first contribution in #3630
@RinZ27 made their first contribution in #3589
@s-zx made their first contribution in #3626
@ajtgjmdjp made their first contribution in #3570
@adrian-sauter made their first contribution in #3592
@Rafal-Chrzanowski-IBM made their first contribution in #3591
@12010486 made their first contribution in #3550
@Chessing234 made their first contribution in #3735
@princepal9120 made their first contribution in #3729
@zhngstl made their first contribution in #3723
@FazeelUsmani made their first contribution in #3734
@abidlabs made their first contribution in #3733
@Anai-Guo made their first contribution in #3731
@jwmacd made their first contribution in #3700
@RheagalFire made their first contribution in #3721
@Robby955 made their first contribution in #3691
@kiwaku made their first contribution in #3693
@felixmr1 made their first contribution in #3745
@ThomasHeap made their first contribution in #3699
@siddhant-rajhans made their first contribution in #3662
@nevertmr made their first contribution in #3654
@Tracin made their first contribution in #3628
@YangKai0616 made their first contribution in #3692

Full Changelog: v0.4.11...v0.4.12
