Releases · groq/openbench
v0.5.3
0.5.3 (2025-12-08)
Features
- add --max-tasks option for concurrent task execution in eval command (#279) (241e653); usage sketch after this list
- add bbq benchmark (#255) (46f4744)
- add ChartQAPro (#289) (677f7c7)
- add configurable HuggingFace Hub config naming (#261) (8abe2ae)
- add DocVQA benchmark (#297) (0dd0edf)
- add fuzzy match suggestion for misspelled evals (#303) (625a7b3)
- add ifbench benchmark (#326) (bd730c2)
- add math EvalGroup (#263) (e0f4a9b)
- add MathVista benchmark (#298) (5c50a8f)
- add MMLU-Redux benchmark from lighteval (#321) (d22a587)
- add MMVet V2 benchmark (#296) (66689de)
- add OCRBench V2 benchmark (#295) (71f3589)
- add optional extras for simpleqa and toxicity (#266) (2450ddf)
- add sealqa benchmark (#283) (06b39e4)
- add SMT 2024 benchmarks (#239) (5d9b475)
- add tau-bench and the pass^k metric (#294) (2bb1242)
- agentdojo: port agentdojo benchmark (#223) (1cf174c)
- cli: added export command to export specific logs to HF (#265) (62e8d8c)
- cvebench: added automatic environment setup for cvebench (#259) (db238a3)
- deepresearch-bench: add deepresearch bench (#288) (d2b4622)
- docs: docs for unsupported providers (#312) (3a3d4b8)
- docs: search capability benchmarks feature page (#287) (9dd27c1)
- evals: add GSM8K benchmark with shared grade school math scorer (#322) (4559a67)
- evals: add QA benchmarks and shared scorer (#323) (0ea3733)
- factscore: added support for factscore (#258) (13aafd7)
- gpt_oss: add GPT-OSS AIME benchmark; make --epochs optional and stop forcing a default of 1 (#284) (815f51b)
- groq: implement configurable timeout for GroqAPI client (#271) (be492b6)
- groq: streaming support (#313) (c1a20be)
- m2s: added support for single-turn conversion of 3 multi-turn jailbreak datasets (mhj, safeMT, cosafe) (#222) (6b8f2b1)
- PolygloToxicityPrompts: add multilingual toxicity evaluation (#262) (46de7ee)
- provider: add helicone support (#275) (de6ab04)
- provider: add SiliconFlow provider support (#269) (ce14070)
- providers: add W&B Inference model provider (#264) (a02c34f)
- rocketscience: add rocketscience benchmark support (#277) (73bcfc2)
- simpleqa_verified: add SimpleQA Verified benchmark (#249) (8a512c4)
- vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272) (d0eff6f)
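Many of these features land directly in the CLI. Below is a minimal sketch of the new `--max-tasks` concurrency option from #279, assuming the `bench` entry point (`openbench` was added as an alternative in #48), the default model set in #138, and that `eval` accepts multiple benchmark names; all names here are illustrative:

```bash
# Run two benchmarks with up to 2 tasks executing concurrently (#279).
# Entry point, model, and benchmark names are illustrative assumptions.
bench eval gpqa_diamond mmlu --model groq/openai/gpt-oss-20b --max-tasks 2
```

If a benchmark name is misspelled, the CLI now suggests a fuzzy match instead of failing outright (#303).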
Bug Fixes
- add args to eval command (#276) (0e06988)
- allow subtasks into eval group summary (#306) (ae82757)
- deps: catch import warnings from optional deps (#327) (434fe88)
- docs: markdown formatting issue (#314) (24af36f)
- docs: reasoning-effort docs clarity (#278) (2644619)
- docvqa: remove docvqa from config and dep group (#328) (162b8b5)
- factscore import issues, vLLM timeout bug (#273) (1674528)
- factscore: fix module level import error for optional dep (#274) (99594ff)
- fix global import warning for optional dep (#307) (c44c8de)
- friendliai token env name (#286) (a197828)
- livemcpbench: catch errors on call_tool and route (#260) (0ab746d)
- math: shorten math group (#268) (19cc66b)
- refactor factscore (#300) (ab3e84e)
- remove nonexistent docvqa import (#318) (90a15a2)
- rename gpt_oss_aime to gpt_oss_aime25 (b378715)
v0.5.2
v0.5.1
v0.5.0
0.5.0 (2025-10-10)
⚠ BREAKING CHANGES
- added more groupings under benchmarks catalog (#244)
Features
- add clockbench evaluation framework and a script for synthesizing the public dataset (#159) (3ba9836)
- add IFEval (#182) (8d1b939)
- add local openbench implementation of groq provider in inspect (#131) (52aea35)
- add mmmlu eval (#193) (a42c2d5)
- add mmstar benchmark (#174) (5d085ab)
- add new openbench documentation (#169) (f3e6a37)
- add overarching bbh command to run all 18 BBH tasks (463a25f)
- add preset eval group infrastructure (#215) (d9ea03a)
- added more groupings under benchmarks catalog (#244) (d932cb0)
- ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219) (006e248)
- benchmark: add support for arc-agi (#158) (3f32253)
- benchmark: add support for detailbench (#154) (23fbca5)
- benchmark: add support for TUMLU (#160) (#161) (885be75)
- benchmark: multichallenge implementation (#170) (cf2ab4f)
- change default model to groq/openai/gpt-oss-20b (#138) (8f7f42f)
- components: export the run_eval entrypoint method (#157) (acbe7f4)
- configure release-please for pre-v1.0 version bumping (#133) (c432934)
- cybench: ported over code for cybench (#207) (7949425)
- cybersecurity, changelog, more docs (39f123c)
- display results patch to include task duration stats (#167) (e4e480c)
- docs: add changelog page (#225) (7db9135)
- docs: add release notes section and update index with new features for v0.5 (#245) (09ab78e)
- docs: added feature card and docs page for exercism (#243) (2b38147)
- docs: Added feature eval docs pages and cache command docs (#191) (50501f1)
- eval: add support for json output (#14) (f335418)
- exercism: added support for exercism tasks with agent support for aider, roo, claude, opencode (#151) (d86f0da)
- graphwalks token filter (#115) (e38658c)
- groq reasoning effort + bugfix to override inspect's "groq" (#142) (b919cc7)
- lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197) (7792c45)
- lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d)
- lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a)
- lighteval: add Global-MMLU eval (42 languages) (3542213)
- lighteval: register BigBench benchmarks in config and registry (77018e9)
- lighteval: register Global-MMLU benchmarks in config and registry (156f509)
- link to subscription form on main page (#240) (988d08c)
- livemcpbench: Adding support for liveMCPBench (#127) (222f678)
- make evals dash/underscore insensitive (#185) (5ec5177); example after this list
- mbpp (#117) (93ad88b)
- mcq_eval: enable abstraction of MCQ eval (#181) (2f53db2)
- mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134) (a875378)
- openrouter: add OpenRouter provider support (#145) (47b579e)
- openrouter: add provider routing args support (#180) (12e1d81)
- otis-mock-aime: added support for otis mock aime 2024-2025 (#218) (1b9fd5c)
- plugins: add entry point system for external benchmarks (#216) (71e7257)
- return eval logs from run_eval function (#173) (ee459d9)
- rootly_terraform: add initial implementation of Rootly Terraform evals (#195) (cd3acae)
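As a small illustration of the dash/underscore insensitivity from #185, a sketch assuming the `bench` entry point and the default model set in #138; both invocations should resolve to the same task:

```bash
# Equivalent spellings of the same benchmark name (#185).
# Entry point and model are illustrative assumptions.
bench eval mmlu_pro --model groq/openai/gpt-oss-20b
bench eval mmlu-pro --model groq/openai/gpt-oss-20b
```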
v0.4.1
v0.4.0
0.4.0 (2025-08-28)
Features
- add boolq (#70) (edbd1cc)
- add BrowseComp (#118) (498c706)
- add CITATION.cff for software citation (#102) (16960de)
- add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
- add GitHub issue and PR templates (#103) (68f0ef0)
- add gmcq (#114) (bb3c89d)
- add MuSR variants and grouped metrics (#107) (10ae935); example after this list
- add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
- add Vercel AI Gateway inference provider (#98) (38e211a)
- jsonschemabench (#95) (e3d842d)
- mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)
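To exercise one of the new suites, e.g. the MuSR variants with grouped metrics from #107, a hedged sketch; the task name `musr`, the `bench` entry point, and the model are assumptions for illustration:

```bash
# Run the MuSR suite (#107); its grouped metrics show up in the results.
# Task name `musr`, entry point, and model are illustrative assumptions.
bench eval musr --model groq/openai/gpt-oss-20b
```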
Bug Fixes
- format mmlu_pro.py dataset file (2a9ee65)
- handle skipped integration tests in CI (#120) (dae9378)
- hle: added multimodal support for hle (#128) (8c3f212)
- jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
- make claude-code-review job optional to prevent PR blocking (#100) (6aad080)
Documentation
- emphasize pre-commit hooks installation requirement (#106) (e765464)
- refresh CONTRIBUTING.md and update README references (#105) (bf66747)
- update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
- update README citation to match CITATION.cff (#104) (6219e8c)
Chores
- bump Inspect-AI to 0.3.125 (#124) (d728cbb)
- unpin dependencies except inspect-ai (#108) (50cf90f)
- update uv.lock package version (3583d71)
v0.3.0
0.3.0 (2025-08-14)
Features
- add --debug flag to eval-retry command (b26afaa)
- add -M and -T flags for model and task arguments (#75) (46a6ba6); usage sketch after this list
- add 'openbench' as alternative CLI entry point (#48) (68b3c5b)
- add AI21 Labs inference provider (#86) (db7bde7)
- add Baseten inference provider (#79) (696e2aa)
- add Cerebras and SambaNova model providers (1c61f59)
- add Cohere inference provider (#90) (8e6e838)
- add Crusoe inference provider (#84) (3d0c794)
- add DeepInfra inference provider (#85) (6fedf53)
- add Friendli inference provider (#88) (7e2b258)
- Add huggingface inference provider (#54) (f479703)
- add Hyperbolic inference provider (#80) (4ebf723)
- add initial GraphWalks benchmark implementation (#58) (1aefd07)
- add Lambda AI inference provider (#81) (b78c346)
- add MiniMax inference provider (#87) (09fd27b)
- add Moonshot inference provider (#91) (e5743cb)
- add Nebius model provider (#47) (ba2ec19)
- add Nous Research model provider (#49) (32dd815)
- add Novita AI inference provider (#82) (6f5874a)
- add Parasail inference provider (#83) (973c7b3)
- add Reka inference provider (#89) (1ab9c53)
- add SciCode (#63) (3650bfa)
- add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
- push eval data to huggingface repo (#65) (acc600f)
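The `-M`/`-T` flags from #75 forward model and task arguments as key=value pairs, presumably mirroring Inspect AI's flags of the same names, and `eval-retry` gained a `--debug` flag (b26afaa). A hedged sketch; the task argument `subset=parents` is purely hypothetical:

```bash
# Forward a model argument (-M) and a task argument (-T) as key=value pairs (#75).
# The task argument name and value are hypothetical; check the task's options.
bench eval graphwalks -M temperature=0 -T subset=parents

# Retry a failed run with extra debugging output (b26afaa).
bench eval-retry --debug
```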
Bug Fixes
- add missing newline at end of novita.py (ef0fa4b)
- remove default sampling parameters from CLI (#72) (978638a)
Documentation
- docs for 0.3.0 (#93) (fe358bb)
- fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)
Refactor
- move task loading from registry to config and update imports (de6eea2)
v0.2.0
0.2.0 (2025-08-11)
Features
- add DROP (simple-evals) (#20) (f85bf19)
- add Humanity's Last Exam (HLE) benchmark (#23) (6f10fb7)
- add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
- add MGSM (#18) (bec1a7c)
- add openai MRCR benchmark for long context recall (#24) (1b09ebd)
- HealthBench (#16) (2caa47d)
Documentation
- update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)
Chores
- GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)