New release with four new model backends, tensor-parallel support for transformers-based models (hf), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.
Highlights
New Model Backends
- TensorRT-LLM (`trt-llm`): NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
- Megatron-LM (`megatron-lm`): Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
- Intel Gaudi: Gaudi support via `optimum-habana` by @12010486 in #3550
- LiteLLM AI gateway (`litellm`): Use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
- Native Tensor Parallelism for HF backend: multi-GPU TP for `transformers` models via `tp_plan` by @YangKai0616 in #3692
TaskManager Refactor (#3549)
- `TaskManager.load(...)` returns a flat `{tasks, groups}` dict instead of the legacy nested `{ConfigurableGroup: {name: Task}}`. `evaluate()` accepts both shapes; `load_task_or_group(...)` and `get_task_dict(...)` are deprecated shims that return the old shape.
- New `Group` class directly holds its child tasks; `ConfigurableGroup` is now a deprecated wrapper around it.
- Duplicate task/group configs within the same root are skipped with a log message instead of silently overwritten. (Custom `include_path` entries still override defaults.)
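
For callers that still hold a legacy nested task dict, the difference between the two shapes can be sketched roughly as follows. This is an illustration only, not the library's code: `StubGroup`, `StubTask`, and `flatten_legacy` are hypothetical stand-ins, and the exact layout of the flat dict (name-to-task mapping under `"tasks"`, name-to-children mapping under `"groups"`) is an assumption inferred from the `{tasks, groups}` description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubGroup:
    """Hypothetical stand-in for a legacy group key (not the lm_eval class)."""
    name: str


@dataclass(frozen=True)
class StubTask:
    """Hypothetical stand-in for a task object (not the lm_eval class)."""
    name: str


def flatten_legacy(nested):
    """Convert a nested {group: {name: task}} mapping into a flat dict.

    Assumed flat layout: {"tasks": {name: task}, "groups": {name: [children]}}.
    """
    flat = {"tasks": {}, "groups": {}}

    def walk(mapping, parent=None):
        for key, value in mapping.items():
            if isinstance(value, dict):
                # A group: register it, link it to its parent, then descend.
                flat["groups"].setdefault(key.name, [])
                if parent is not None:
                    flat["groups"][parent].append(key.name)
                walk(value, parent=key.name)
            else:
                # A leaf task keyed by its name.
                flat["tasks"][key] = value
                if parent is not None:
                    flat["groups"][parent].append(key)

    walk(nested)
    return flat


# Illustrative names only; these are not real task configs.
legacy = {
    StubGroup("suite"): {
        "suite_a": StubTask("suite_a"),
        "suite_b": StubTask("suite_b"),
    },
    "standalone": StubTask("standalone"),
}
flat = flatten_legacy(legacy)
print(sorted(flat["tasks"]))
print(flat["groups"])
```

Since `evaluate()` is described as accepting both shapes, a conversion like this should only matter for downstream code that walks the task dict itself.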
Breaking Changes
- `SteeredHF` renamed to `SteeredModel`; update imports if you're using the steering backend by @adrian-sauter in #3592
- vLLM minimum bumped to `>=0.18` as part of the data-parallel-with-Ray fixes by @baberabb in #3725
- `enable_thinking` is now disallowed for `multiple_choice` / loglikelihood tasks, and `think_end_token` is now required when `enable_thinking=True`. Configurations that combined these previously failed silently (by @fxmarty-amd in #3675)
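
The two `enable_thinking` rules can be summarized as a small validation check. The sketch below is hypothetical: `validate_thinking_config`, `is_valid`, and their parameters are illustrative names, not the harness's API, and the `"</think>"` token is just an example value. It encodes only the two rules stated above.

```python
def validate_thinking_config(output_type, enable_thinking=False, think_end_token=None):
    """Raise ValueError for the combinations described as disallowed (sketch)."""
    if not enable_thinking:
        return
    # Rule 1: thinking is rejected for multiple_choice / loglikelihood tasks.
    if output_type in {"multiple_choice", "loglikelihood"}:
        raise ValueError(
            f"enable_thinking is not allowed with output_type={output_type!r}"
        )
    # Rule 2: a think_end_token must accompany enable_thinking=True.
    if not think_end_token:
        raise ValueError("think_end_token is required when enable_thinking=True")


def is_valid(**config):
    """Return True if the config passes the sketch validation above."""
    try:
        validate_thinking_config(**config)
    except ValueError:
        return False
    return True


ok = is_valid(
    output_type="generate_until", enable_thinking=True, think_end_token="</think>"
)
missing_token = is_valid(output_type="generate_until", enable_thinking=True)
wrong_type = is_valid(
    output_type="multiple_choice", enable_thinking=True, think_end_token="</think>"
)
print(ok, missing_token, wrong_type)
```

Configurations in the second and third categories are the ones to review when upgrading, since they are described as having failed silently before this release.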
New Logger
- Trackio: logger with per-sample Trace logging by @abidlabs in #3733
New Benchmarks & Tasks
- InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
- CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
- Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
- NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
- JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570
Fixes & Improvements
Task Fixes
- Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
- Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
- Fixed RACE `doc_to_text` keeping a blank marker and dropping the question body by @Chessing234 in #3716
- Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
- Fixed HeadQA `doc_to_decontamination_query` pointing at a nonexistent `query` field by @Chessing234 in #3718
- Fixed french_bench_topic_based_nli `doc_to_decontamination_query` pointing at a nonexistent `texte` field by @Chessing234 in #3719
- Fixed TruthfulQA-gen `dataset_path` by @zhngstl in #3723
- Fixed NorEval/NorIdiom `!function` imports to use absolute module paths by @Anai-Guo in #3731
- Fixed IFEval `RephraseChecker.strip_changes` greedy-regex bug by @Chessing234 in #3737
- Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
- Updated BLiMP dataset path by @jmichaelov in #3596
- Replaced all references to the `CohereForAI` org with `CohereLabs` by @juliafalcao in #3631
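
Several of the fixes above come down to how much text a regex consumes. As an illustration of the greedy-regex class of bug, the sketch below compares greedy and non-greedy matching. The asterisk-delimited pattern is an assumption chosen for the example and is not necessarily the exact expression used in `RephraseChecker.strip_changes`.

```python
import re

text = "Keep *old* this part and *new* that part."

# Greedy: `.*` runs to the LAST asterisk, swallowing the text between spans.
greedy = re.sub(r"\*.*\*", "", text)

# Non-greedy: `.*?` stops at the NEXT asterisk, so each span is removed alone.
non_greedy = re.sub(r"\*.*?\*", "", text)

print(repr(greedy))
print(repr(non_greedy))
```

The greedy version deletes the unmarked words between the two marked spans, which is the kind of silent corruption that only shows up on inputs with more than one match.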
What's Changed
- refactor(Taskmanager)! by @baberabb in #3549
- fix(cli): `--cache_requests` always fails due to argparse `type`/`choices` conflict by @maxidl in #3588
- feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521
- Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in #3601
- [fix] Add missing tokenization progress bar by @fxmarty-amd in #3605
- fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in #3608
- fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in #3607
- Update vLLM import of `resolve_hf_chat_template` by @DarkLight1337 in #3595
- Add docstring for HFLM init keyword arguments by @joshuaswanson in #3630
- Update all mentions of the `CohereForAI` organization to `CohereLabs` by @juliafalcao in #3631
- Skip caching None responses in async generation path by @joshuaswanson in #3633
- Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
- fix(evaluate tests) by @baberabb in #3634
- fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in #3626
- chore(ci-updates) by @baberabb in #3635
- Update BLiMP dataset path by @jmichaelov in #3596
- Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in #3570
- Rename SteeredHF to SteeredModel in `lm_eval/models/__init__.py` by @adrian-sauter in #3592
- fix: Update `WatsonxLLM` class mapping and errors by @Rafal-Chrzanowski-IBM in #3591
- Add Intel Gaudi support by @12010486 in #3550
- [fix] Disallow `enable_thinking` with `output_type: multiple_choice` tasks / loglikelihood tasks; raise error in case `think_end_token` is not provided with `enable_thinking=True` by @fxmarty-amd in #3675
- fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in #3725
- refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in #3728
- fix(vllm): fix vllm tokenizer for Mistral; rm default `gpu_memory_utilization=0.9` by @baberabb in #3732
- Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in #3735
- Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in #3724
- fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in #3737
- fix: bound request cache filename length by @princepal9120 in #3729
- fix codeowners by @baberabb in #3738
- Fix dataset_path for truthfulqa_gen by @zhngstl in #3723
- fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in #3734
- Add Trackio logger with per-sample Trace logging by @abidlabs in #3733
- Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in #3718
- Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in #3719
- fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in #3731
- Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in #3711
- Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in #3708
- fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in #3742
- Fix zeno_visualize discarding tasks intersection result by @Chessing234 in #3739
- fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in #3700
- feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in #3721
- Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in #3716
- Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in #3702
- Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in #3691
- Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in #3693
- fix(mmlu_pro_plus): sync fixes from `mmlu_pro` by @baberabb in #3747
- chore: clean up deps; fix ci lint by @baberabb in #3748
- Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in #3714
- Fix median aggregation returning arbitrary element instead of median by @Chessing234 in #3696
- fix(api): chat payload leaking top-level text type by @felixmr1 in #3745
- [BUGFIX] Consistent handling of None answers and cache by @RawthiL in #3656
- Adding Cruxeval by @ThomasHeap in #3699
- [Task] NEREL-bench by @bond005 in #3650
- Added Toksuite Benchmark by @gsaltintas in #3669
- Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in #3662
- fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in #3654
- feat: add TRT-LLM backend. by @Tracin in #3628
- [Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in #3692
- feat(release): 0.4.12 by @baberabb in #3763
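
One entry in the list above fixes median aggregation returning an arbitrary element. The harness's actual code is not reproduced here. As a hedged sketch of the kind of bug described, indexing the middle of an unsorted list returns whatever element happens to sit there, while a true median depends on sorted order. Both helper names are hypothetical.

```python
import statistics


def middle_element(values):
    """Buggy pattern: takes the middle position without sorting first."""
    return values[len(values) // 2]


def true_median(values):
    """Order-independent median from the standard library."""
    return statistics.median(values)


scores = [0.9, 0.1, 0.2, 0.8, 0.5]
print(middle_element(scores))  # depends on input order
print(true_median(scores))     # same result for any ordering of the inputs
```

Results aggregated with a median before this release may be worth recomputing if the per-sample values were not already sorted.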
New Contributors
- @maxidl made their first contribution in #3588
- @shangxiaokang made their first contribution in #3521
- @ManasVardhan made their first contribution in #3608
- @joshuaswanson made their first contribution in #3630
- @RinZ27 made their first contribution in #3589
- @s-zx made their first contribution in #3626
- @ajtgjmdjp made their first contribution in #3570
- @adrian-sauter made their first contribution in #3592
- @Rafal-Chrzanowski-IBM made their first contribution in #3591
- @12010486 made their first contribution in #3550
- @Chessing234 made their first contribution in #3735
- @princepal9120 made their first contribution in #3729
- @zhngstl made their first contribution in #3723
- @FazeelUsmani made their first contribution in #3734
- @abidlabs made their first contribution in #3733
- @Anai-Guo made their first contribution in #3731
- @jwmacd made their first contribution in #3700
- @RheagalFire made their first contribution in #3721
- @Robby955 made their first contribution in #3691
- @kiwaku made their first contribution in #3693
- @felixmr1 made their first contribution in #3745
- @ThomasHeap made their first contribution in #3699
- @siddhant-rajhans made their first contribution in #3662
- @nevertmr made their first contribution in #3654
- @Tracin made their first contribution in #3628
- @YangKai0616 made their first contribution in #3692
Full Changelog: v0.4.11...v0.4.12