
Releases: groq/openbench

v0.5.3

09 Dec 00:50
54fa998


0.5.3 (2025-12-08)

Features

  • add --max-tasks option for concurrent task execution in eval command (#279) (241e653)
  • add bbq benchmark (#255) (46f4744)
  • add ChartQAPro (#289) (677f7c7)
  • add configurable HuggingFace Hub config naming (#261) (8abe2ae)
  • add DocVQA benchmark (#297) (0dd0edf)
  • add fuzzy match suggestion for misspelled evals (#303) (625a7b3)
  • add ifbench benchmark (#326) (bd730c2)
  • add math EvalGroup (#263) (e0f4a9b)
  • add MathVista benchmark (#298) (5c50a8f)
  • add MMLU-Redux benchmark from lighteval (#321) (d22a587)
  • add MMVet V2 benchmark (#296) (66689de)
  • add OCRBench V2 benchmark (#295) (71f3589)
  • add optional extras for simpleqa and toxicity (#266) (2450ddf)
  • add sealqa benchmark (#283) (06b39e4)
  • add SMT 2024 benchmarks (#239) (5d9b475)
  • add tau bench, pass^k metric (#294) (2bb1242)
  • agentdojo: port agentdojo benchmark (#223) (1cf174c)
  • cli: add export command to export specific logs to Hugging Face (#265) (62e8d8c)
  • cvebench: add automatic environment setup for cvebench (#259) (db238a3)
  • deepresearch-bench: add deepresearch bench (#288) (d2b4622)
  • docs: docs for unsupported providers (#312) (3a3d4b8)
  • docs: search capability benchmarks feature page (#287) (9dd27c1)
  • evals: add GSM8K benchmark with shared grade school math scorer (#322) (4559a67)
  • evals: add QA benchmarks and shared scorer (#323) (0ea3733)
  • factscore: added support for factscore (#258) (13aafd7)
  • gpt_oss: add GPT-OSS AIME benchmark, make --epochs optional and stop forcing a default of 1 (#284) (815f51b)
  • groq: implement configurable timeout for GroqAPI client (#271) (be492b6)
  • groq: streaming support (#313) (c1a20be)
  • m2s: add support for single-turn conversion of 3 multi-turn jailbreak datasets (MHJ, SafeMT, CoSafe) (#222) (6b8f2b1)
  • PolygloToxicityPrompts: add multilingual toxicity evaluation (#262) (46de7ee)
  • provider: add helicone support (#275) (de6ab04)
  • provider: add SiliconFlow provider support (#269) (ce14070)
  • providers: add W&B Inference model provider (#264) (a02c34f)
  • rocketscience: add rocketscience benchmark support (#277) (73bcfc2)
  • simpleqa_verified: add SimpleQA Verified benchmark (#249) (8a512c4)
  • vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272) (d0eff6f)
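To illustrate how the new --max-tasks option fits into an eval invocation, here is a hedged sketch. The bench entry point and the groq/openai/gpt-oss-20b model name follow the project's documented defaults; the benchmark names are examples drawn from this release, and exact flag behavior may differ in your installed version:

```shell
# Illustrative invocation only -- requires openbench to be installed
# (pip install openbench) and a GROQ_API_KEY in the environment.
# --max-tasks (new in this release) caps how many tasks run concurrently.
bench eval gsm8k ifbench \
  --model groq/openai/gpt-oss-20b \
  --max-tasks 4
```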

Bug Fixes

…

v0.5.2

16 Oct 05:22
f34ba88


0.5.2 (2025-10-16)

Chores

  • require manual install for cyber plugin (#252) (090f801)

v0.5.1

16 Oct 01:32
8bda67c


0.5.1 (2025-10-16)

Bug Fixes

Chores

Refactor

  • extract cybersecurity benchmarks to plugin (#251) (df829e2)

v0.5.0

10 Oct 18:18
035b238


0.5.0 (2025-10-10)

⚠ BREAKING CHANGES

  • added more groupings under benchmarks catalog (#244)

Features

  • add clockbench evaluation framework and script for synthesizing the public dataset (#159) (3ba9836)
  • add IFEval (#182) (8d1b939)
  • add local openbench implementation of groq provider in inspect (#131) (52aea35)
  • add mmmlu eval (#193) (a42c2d5)
  • add mmstar benchmark (#174) (5d085ab)
  • add new openbench documentation (#169) (f3e6a37)
  • add overarching bbh command to run all 18 BBH tasks (463a25f)
  • add preset eval group infrastructure (#215) (d9ea03a)
  • added more groupings under benchmarks catalog (#244) (d932cb0)
  • ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219) (006e248)
  • benchmark: add support for arc-agi (#158) (3f32253)
  • benchmark: add support for detailbench (#154) (23fbca5)
  • benchmark: add support for TUMLU (#160) (#161) (885be75)
  • benchmark: multichallenge implementation (#170) (cf2ab4f)
  • change default model to groq/openai/gpt-oss-20b (#138) (8f7f42f)
  • components: export the run_eval entrypoint method (#157) (acbe7f4)
  • configure release-please for pre-v1.0 version bumping (#133) (c432934)
  • cybench: ported over code for cybench (#207) (7949425)
  • cybersecurity, changelog, more docs (39f123c)
  • display results patch to include task duration stats (#167) (e4e480c)
  • docs: add changelog page (#225) (7db9135)
  • docs: add release notes section and update index with new features for v0.5 (#245) (09ab78e)
  • docs: added feature card and docs page for exercism (#243) (2b38147)
  • docs: Added feature eval docs pages and cache command docs (#191) (50501f1)
  • eval: add support for json output (#14) (f335418)
  • exercism: added support for exercism tasks w/ agent support for aider, roo, claude, opencode (#151) (d86f0da)
  • graphwalks token filter (#115) (e38658c)
  • groq reasoning effort + bugfix to override inspect's "groq" (#142) (b919cc7)
  • lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197) (7792c45)
  • lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d)
  • lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a)
  • lighteval: add Global-MMLU eval (42 languages) (3542213)
  • lighteval: register BigBench benchmarks in config and registry (77018e9)
  • lighteval: register Global-MMLU benchmarks in config and registry (156f509)
  • link to subscription form on main page (#240) (988d08c)
  • livemcpbench: add support for LiveMCPBench (#127) (222f678)
  • make evals dash/underscore insensitive (#185) (5ec5177)
  • mbpp (#117) (93ad88b)
  • mcq_eval: enable abstraction of MCQ eval (#181) (2f53db2)
  • mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134) (a875378)
  • openrouter: add OpenRouter provider support (#145) (47b579e)
  • openrouter: add provider routing args support (#180) (12e1d81)
  • otis-mock-aime: added support for otis mock aime 2024-2025 (#218) (1b9fd5c)
  • plugins: add entry point system for external benchmarks (#216) (71e7257)
  • return eval logs from run_eval function (#173) (ee459d9)
  • rootly_terraform: add initial implementation of Rootly Terraform evals (#195) (cd3acae)

Bug Fixes

  • allow for more python versions (#164) (e6682fe)
  • close headqa metadata entry (947522d)
  • cybench: moved cybench dependency into dependency group (#237) (8d30715)
  • handle missing SciCode dependency lazily in solver (#186) (fed4e88)
  • improve BBH target extraction to handle multi-cha...

v0.4.1

29 Aug 01:11
5ec0eda


0.4.1 (2025-08-29)

Bug Fixes

  • rootly_gmcq: handle both string and list content types in scorer (#129) (376624d)

v0.4.0

29 Aug 00:21
93e2a8c


0.4.0 (2025-08-28)

Features

  • add boolq (#70) (edbd1cc)
  • add BrowseComp (#118) (498c706)
  • add CITATION.cff for software citation (#102) (16960de)
  • add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
  • add GitHub issue and PR templates (#103) (68f0ef0)
  • add gmcq (#114) (bb3c89d)
  • add MuSR variants and grouped metrics (#107) (10ae935)
  • add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
  • add Vercel AI Gateway inference provider (#98) (38e211a)
  • jsonschemabench (#95) (e3d842d)
  • mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)

Bug Fixes

  • format mmlu_pro.py dataset file (2a9ee65)
  • handle skipped integration tests in CI (#120) (dae9378)
  • hle: added multimodal support for hle (#128) (8c3f212)
  • jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
  • make claude-code-review job optional to prevent PR blocking (#100) (6aad080)

Documentation

  • emphasize pre-commit hooks installation requirement (#106) (e765464)
  • refresh CONTRIBUTING.md and update README references (#105) (bf66747)
  • update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
  • update README citation to match CITATION.cff (#104) (6219e8c)

Chores

CI

  • add automated PyPI publishing to release workflow (#99) (eddbf70)

v0.3.0

14 Aug 21:06
84c2406


0.3.0 (2025-08-14)

Features

  • add --debug flag to eval-retry command (b26afaa)
  • add -M and -T flags for model and task arguments (#75) (46a6ba6)
  • add 'openbench' as alternative CLI entry point (#48) (68b3c5b)
  • add AI21 Labs inference provider (#86) (db7bde7)
  • add Baseten inference provider (#79) (696e2aa)
  • add Cerebras and SambaNova model providers (1c61f59)
  • add Cohere inference provider (#90) (8e6e838)
  • add Crusoe inference provider (#84) (3d0c794)
  • add DeepInfra inference provider (#85) (6fedf53)
  • add Friendli inference provider (#88) (7e2b258)
  • Add huggingface inference provider (#54) (f479703)
  • add Hyperbolic inference provider (#80) (4ebf723)
  • add initial GraphWalks benchmark implementation (#58) (1aefd07)
  • add Lambda AI inference provider (#81) (b78c346)
  • add MiniMax inference provider (#87) (09fd27b)
  • add Moonshot inference provider (#91) (e5743cb)
  • add Nebius model provider (#47) (ba2ec19)
  • add Nous Research model provider (#49) (32dd815)
  • add Novita AI inference provider (#82) (6f5874a)
  • add Parasail inference provider (#83) (973c7b3)
  • add Reka inference provider (#89) (1ab9c53)
  • add SciCode (#63) (3650bfa)
  • add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
  • push eval data to huggingface repo (#65) (acc600f)
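The -M/-T flags and the new openbench alias added in this release can be combined as below. This is a sketch: the specific argument names passed through -M and -T are hypothetical illustrations, not values taken from the release notes:

```shell
# 'openbench' is the alternative entry point added in #48;
# -M forwards an argument to the model, -T to the task (#75).
# The argument names shown here are hypothetical examples.
openbench eval graphwalks \
  -M temperature=0.6 \
  -T limit=50
```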

Bug Fixes

  • add missing newline at end of novita.py (ef0fa4b)
  • remove default sampling parameters from CLI (#72) (978638a)

Documentation

  • docs for 0.3.0 (#93) (fe358bb)
  • fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)

Chores

  • fix GraphWalks: Split into three separate benchmarks (#76) (d1ed96e)
  • update version (8b7bbe7)

Refactor

  • move task loading from registry to config and update imports (de6eea2)

CI

  • Enhance Claude code review workflow with updated prompts and model specification (#71) (b605ed2)

v0.2.0

11 Aug 20:14
1bf97a2


0.2.0 (2025-08-11)

Features

Documentation

  • update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)

Chores

  • GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)

v0.1.1

31 Jul 22:43
ab052ee


0.1.1 (2025-07-31)

Bug Fixes

  • add missing __init__.py files and fix package discovery for PyPI (#10) (29fcdf6)

Documentation

  • update README to streamline setup instructions for OpenBench, use pypi (16e08a0)

v0.1.0

31 Jul 09:53
d722a1d


0.1.0 (2025-07-31)

Features

Chores

  • ci: update release-please workflow to allow label management (b70db16)
  • drop versions for release (58ce995)
  • GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (555658a)
  • update project metadata for version 0.1.0, add license, readme, and repository links (9ea2102)