v0.4.12

@baberabb released this 11 May 13:04
6d64254

This release adds four new model backends, tensor-parallel support for transformers-based models (hf), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.

Highlights

New Model Backends

  • TensorRT-LLM (trt-llm) — NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
  • Megatron-LM (megatron-lm) — Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
  • Intel Gaudi — Gaudi support via optimum-habana by @12010486 in #3550
  • LiteLLM AI gateway (litellm) — Use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
  • Native Tensor Parallelism for HF backend — multi-GPU TP for transformers models via tp_plan by @YangKai0616 in #3692

TaskManager Refactor (#3549)

  • TaskManager.load(...) returns a flat {tasks, groups} dict instead of the legacy nested {ConfigurableGroup: {name: Task}}. evaluate() accepts both shapes; load_task_or_group(...) and get_task_dict(...) are deprecated shims that return the old shape.
  • New Group class directly holds its child tasks; ConfigurableGroup is now a deprecated wrapper around it.
  • Duplicate task/group configs within the same root are now skipped with a log message instead of being silently overwritten. (Custom include_path entries still override defaults.)
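
The shape change described above can be pictured with a small sketch. It deliberately avoids importing lm_eval: `Task`, `LegacyGroup`, and `flatten_legacy` are hypothetical stand-ins, and the layout of the flat dict (names mapped to objects under `tasks`, group names mapped to child names under `groups`) is an assumption for illustration. Check the actual return value of TaskManager.load(...) before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a task object."""
    name: str


@dataclass(frozen=True)
class LegacyGroup:
    """Hypothetical stand-in for the deprecated ConfigurableGroup dict key."""
    name: str


def flatten_legacy(nested):
    """Convert a nested {group: {name: task}} mapping into a flat dict.

    The output layout is an assumption made for this sketch:
    {"tasks": {name: task}, "groups": {group_name: [child names]}}.
    """
    flat = {"tasks": {}, "groups": {}}

    def walk(mapping):
        names = []
        for key, value in mapping.items():
            if isinstance(key, LegacyGroup):
                # A group key holds another mapping; recurse into it.
                flat["groups"][key.name] = walk(value)
                names.append(key.name)
            else:
                # A plain string key maps directly to a task.
                flat["tasks"][key] = value
                names.append(key)
        return names

    walk(nested)
    return flat


legacy = {
    LegacyGroup("mmlu"): {
        "mmlu_anatomy": Task("mmlu_anatomy"),
        "mmlu_virology": Task("mmlu_virology"),
    }
}
flat = flatten_legacy(legacy)
print(sorted(flat["tasks"]))
print(flat["groups"])
```

In the nested shape, finding a task means walking through group keys; in the flat shape, tasks and groups can each be looked up by name in one step.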

Breaking Changes

  • SteeredHF renamed to SteeredModel — update imports if you're using the steering backend by @adrian-sauter in #3592
  • vLLM minimum bumped to >=0.18 as part of the data-parallel-with-Ray fixes by @baberabb in #3725
  • enable_thinking is now disallowed for multiple_choice / loglikelihood tasks, and think_end_token is now required when enable_thinking=True. Configurations that combined these previously failed silently. By @fxmarty-amd in #3675
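
The enable_thinking rules above amount to two checks, sketched here in plain Python. The function `validate_thinking_config`, its signature, and the error messages are hypothetical and are not the harness's implementation; only the two rules come from the notes above. The `"</think>"` token in the usage lines is likewise just an example value.

```python
# Output types for which thinking is disallowed, per the notes above.
LOGLIKELIHOOD_OUTPUT_TYPES = {"multiple_choice", "loglikelihood"}


def validate_thinking_config(output_type, enable_thinking=False, think_end_token=None):
    """Hypothetical validator: raise ValueError for disallowed combinations."""
    if not enable_thinking:
        # Nothing to check when thinking is off.
        return
    if output_type in LOGLIKELIHOOD_OUTPUT_TYPES:
        raise ValueError(
            f"enable_thinking is not allowed for output_type={output_type!r}"
        )
    if not think_end_token:
        raise ValueError("think_end_token is required when enable_thinking=True")


# A generation task with an end token passes without raising.
validate_thinking_config(
    "generate_until", enable_thinking=True, think_end_token="</think>"
)
```

Raising early turns what was previously a silent failure into an explicit configuration error.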

New Logger

New Benchmarks & Tasks

  • InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
  • CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
  • Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
  • NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
  • JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570

Fixes & Improvements

Task Fixes

  • Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
  • Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
  • Fixed RACE doc_to_text keeping a blank marker and dropping the question body by @Chessing234 in #3716
  • Fixed BigBench multiple-choice tasks crashing on mixed-format examples (free-form examples are now filtered out) by @Chessing234 in #3702
  • Fixed HeadQA doc_to_decontamination_query pointing at a nonexistent query field by @Chessing234 in #3718
  • Fixed french_bench_topic_based_nli doc_to_decontamination_query pointing at a nonexistent texte field by @Chessing234 in #3719
  • Fixed TruthfulQA-gen dataset_path by @zhngstl in #3723
  • Fixed NorEval/NorIdiom !function imports to use absolute module paths by @Anai-Guo in #3731
  • Fixed IFEval RephraseChecker.strip_changes greedy-regex bug by @Chessing234 in #3737
  • Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
  • Updated BLiMP dataset path by @jmichaelov in #3596
  • Replaced all references to the CohereForAI org with CohereLabs by @juliafalcao in #3631
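
Two of the fixes above (GPQA preprocessing and IFEval strip_changes) concern regex pitfalls. The sketch below reproduces both classes of bug in isolation. The patterns and sample strings are illustrative assumptions, not the harness's actual code or the patches from the linked PRs.

```python
import re

# Pitfall 1: a cleanup pattern that deletes any bracketed span also deletes
# brackets that are part of the answer, such as a chemical formula.
BRACKET_SPAN = re.compile(r"\[.*?\]")
answer = "K4[Fe(CN)6] is a coordination complex"
corrupted = BRACKET_SPAN.sub("", answer)

# Pitfall 2: a greedy ".*" between two markers matches from the first marker
# to the LAST one, swallowing everything in between.
GREEDY = re.compile(r"\*.*\*")
LAZY = re.compile(r"\*.*?\*")  # non-greedy: stops at the nearest closing marker
response = "keep *old* this *new* too"
greedy_result = GREEDY.sub("", response)
lazy_result = LAZY.sub("", response)

print(repr(corrupted))
print(repr(greedy_result))
print(repr(lazy_result))
```

The greedy pattern removes the unmarked text sitting between the two starred spans, while the non-greedy pattern removes only the starred spans themselves.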

What's Changed

New Contributors

Full Changelog: v0.4.11...v0.4.12