Releases · groq/openbench
v0.5.3
0.5.3 (2025-12-08)
Features
- add --max-tasks option for concurrent task execution in eval command (#279) (241e653); usage sketch after this list
- add bbq benchmark (#255) (46f4744)
- add ChartQAPro (#289) (677f7c7)
- add configurable HuggingFace Hub config naming (#261) (8abe2ae)
- add DocVQA benchmark (#297) (0dd0edf)
- add fuzzy match suggestion for misspelled evals (#303) (625a7b3)
- add ifbench benchmark (#326) (bd730c2)
- add math EvalGroup (#263) (e0f4a9b)
- add MathVista benchmark (#298) (5c50a8f)
- add MMLU-Redux benchmark from lighteval (#321) (d22a587)
- add MMVet V2 benchmark (#296) (66689de)
- add OCRBench V2 benchmark (#295) (71f3589)
- add optional extras for simpleqa and toxicity (#266) (2450ddf)
- add sealqa benchmark (#283) (06b39e4)
- add SMT 2024 benchmarks (#239) (5d9b475)
- add tau-bench and the pass^k metric (#294) (2bb1242)
- agentdojo: port agentdojo benchmark (#223) (1cf174c)
- cli: added export command to export specific logs to HF (#265) (62e8d8c)
- cvebench: added automatic environment setup for cvebench (#259) (db238a3)
- deepresearch-bench: add deepresearch bench (#288) (d2b4622)
- docs: docs for unsupported providers (#312) (3a3d4b8)
- docs: search capability benchmarks feature page (#287) (9dd27c1)
- evals: add GSM8K benchmark with shared grade school math scorer (#322) (4559a67)
- evals: add QA benchmarks and shared scorer (#323) (0ea3733)
- factscore: added support for factscore (#258) (13aafd7)
- gpt_oss: add GPT-OSS AIME benchmark; make --epochs optional and stop forcing a default of 1 (#284) (815f51b)
- groq: implement configurable timeout for GroqAPI client (#271) (be492b6)
- groq: streaming support (#313) (c1a20be)
- m2s: added support for single-turn conversion of 3 multi-turn jailbreak datasets (mhj, safeMT, cosafe) (#222) (6b8f2b1)
- PolygloToxicityPrompts: add multilingual toxicity evaluation (#262) (46de7ee)
- provider: add helicone support (#275) (de6ab04)
- provider: add SiliconFlow provider support (#269) (ce14070)
- providers: add W&B Inference model provider (#264) (a02c34f)
- rocketscience: add rocketscience benchmark support (#277) (73bcfc2)
- simpleqa_verified: add SimpleQA Verified benchmark (#249) (8a512c4)
- vllm: add openbench override for Inspect AI's built-in vllm provider that doesn't start a server (#272) (d0eff6f)
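Many of these features land directly in the CLI. Below is a minimal sketch of the new `--max-tasks` concurrency option from #279, assuming the `bench` entry point (`openbench` was added as an alternative in #48), the default model set in #138, and that `eval` accepts multiple benchmark names; all names here are illustrative:

```bash
# Run two benchmarks with up to 2 tasks executing concurrently (#279).
# Entry point, model, and benchmark names are illustrative assumptions.
bench eval gpqa_diamond mmlu --model groq/openai/gpt-oss-20b --max-tasks 2
```

If a benchmark name is misspelled, the CLI now suggests a fuzzy match instead of failing outright (#303).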
Bug Fixes
- add args to eval command (#276) (0e06988)
- allow subtasks into eval group summary (#306) (ae82757)
- deps: catch import warnings from optional deps (#327) (434fe88)
- docs: markdown formatting issue (#314) (24af36f)
- docs: reasoning-effort docs clarity (#278) (2644619)
- docvqa: remove docvqa from config and dep group (#328) (162b8b5)
- factscore import issues, vLLM timeout bug (#273) (1674528)
- factscore: fix module level import error for optional dep (#274) (99594ff)
- fix global import warning for optional dep (#307) (c44c8de)
- friendliai token env name (#286) (a197828)
- livemcpbench: catch errors on call_tool and route (#260) (0ab746d)
- math: shorten math group (#268) (19cc66b)
- refactor factscore (#300) (ab3e84e)
- remove nonexistent docvqa import (#318) (90a15a2)
- rename gpt_oss_aime to gpt_oss_aime25 (b378715)
v0.5.2
v0.5.1
v0.5.0
0.5.0 (2025-10-10)
⚠ BREAKING CHANGES
- added more groupings under benchmarks catalog (#244)
Features
- add clockbench evaluation framework and a script for synthesizing the public dataset (#159) (3ba9836)
- add IFEval (#182) (8d1b939)
- add local openbench implementation of groq provider in inspect (#131) (52aea35)
- add mmmlu eval (#193) (a42c2d5)
- add mmstar benchmark (#174) (5d085ab)
- add new openbench documentation (#169) (f3e6a37)
- add overarching bbh command to run all 18 BBH tasks (463a25f)
- add preset eval group infrastructure (#215) (d9ea03a)
- added more groupings under benchmarks catalog (#244) (d932cb0)
- ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219) (006e248)
- benchmark: add support for arc-agi (#158) (3f32253)
- benchmark: add support for detailbench (#154) (23fbca5)
- benchmark: add support for TUMLU (#160) (#161) (885be75)
- benchmark: multichallenge implementation (#170) (cf2ab4f)
- change default model to groq/openai/gpt-oss-20b (#138) (8f7f42f)
- components: export the run_eval entrypoint method (#157) (acbe7f4)
- configure release-please for pre-v1.0 version bumping (#133) (c432934)
- cybench: ported over code for cybench (#207) (7949425)
- cybersecurity, changelog, more docs (39f123c)
- display results patch to include task duration stats (#167) (e4e480c)
- docs: add changelog page (#225) (7db9135)
- docs: add release notes section and update index with new features for v0.5 (#245) (09ab78e)
- docs: added feature card and docs page for exercism (#243) (2b38147)
- docs: Added feature eval docs pages and cache command docs (#191) (50501f1)
- eval: add support for json output (#14) (f335418)
- exercism: added support for exercism tasks with agent support for aider, roo, claude, opencode (#151) (d86f0da)
- graphwalks token filter (#115) (e38658c)
- groq reasoning effort + bugfix to override inspect's "groq" (#142) (b919cc7)
- lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197) (7792c45)
- lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d)
- lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a)
- lighteval: add Global-MMLU eval (42 languages) (3542213)
- lighteval: register BigBench benchmarks in config and registry (77018e9)
- lighteval: register Global-MMLU benchmarks in config and registry (156f509)
- link to subscription form on main page (#240) (988d08c)
- livemcpbench: Adding support for liveMCPBench (#127) (222f678)
- make evals dash/underscore insensitive (#185) (5ec5177); example after this list
- mbpp (#117) (93ad88b)
- mcq_eval: enable abstraction of MCQ eval (#181) (2f53db2)
- mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134) (a875378)
- openrouter: add OpenRouter provider support (#145) (47b579e)
- openrouter: add provider routing args support (#180) (12e1d81)
- otis-mock-aime: added support for otis mock aime 2024-2025 (#218) (1b9fd5c)
- plugins: add entry point system for external benchmarks (#216) (71e7257)
- return eval logs from run_eval function (#173) (ee459d9)
- rootly_terraform: add initial implementation of Rootly Terraform evals (#195) (cd3acae)
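As a small illustration of the dash/underscore insensitivity from #185, a sketch assuming the `bench` entry point and the default model set in #138; both invocations should resolve to the same task:

```bash
# Equivalent spellings of the same benchmark name (#185).
# Entry point and model are illustrative assumptions.
bench eval mmlu_pro --model groq/openai/gpt-oss-20b
bench eval mmlu-pro --model groq/openai/gpt-oss-20b
```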
v0.4.1
v0.4.0
0.4.0 (2025-08-28)
Features
- add boolq (#70) (edbd1cc)
- add BrowseComp (#118) (498c706)
- add CITATION.cff for software citation (#102) (16960de)
- add CTI-Bench cybersecurity benchmark suite (#96) (8465075)
- add GitHub issue and PR templates (#103) (68f0ef0)
- add gmcq (#114) (bb3c89d)
- add MuSR variants and grouped metrics (#107) (10ae935); example after this list
- add robust answer extraction scorers from gpt-oss to MathArena benchmarks and gpqa_diamond (#97) (251ba66)
- add Vercel AI Gateway inference provider (#98) (38e211a)
- jsonschemabench (#95) (e3d842d)
- mmmu: added support for mmmu benchmark and all of its subdomains (#121) (801bceb)
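To exercise one of the new suites, e.g. the MuSR variants with grouped metrics from #107, a hedged sketch; the task name `musr`, the `bench` entry point, and the model are assumptions for illustration:

```bash
# Run the MuSR suite (#107); its grouped metrics show up in the results.
# Task name `musr`, entry point, and model are illustrative assumptions.
bench eval musr --model groq/openai/gpt-oss-20b
```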
Bug Fixes
- format mmlu_pro.py dataset file (2a9ee65)
- handle skipped integration tests in CI (#120) (dae9378)
- hle: added multimodal support for hle (#128) (8c3f212)
- jsonschemaeval: match paper methodology and add openai subset (#113) (1b6470b)
- make claude-code-review job optional to prevent PR blocking (#100) (6aad080)
Documentation
- emphasize pre-commit hooks installation requirement (#106) (e765464)
- refresh CONTRIBUTING.md and update README references (#105) (bf66747)
- update installation instructions and clarify dependency architecture in CLAUDE.md and CONTRIBUTING.md (#126) (cd962fd)
- update README citation to match CITATION.cff (#104) (6219e8c)
Chores
- bump Inspect-AI to 0.3.125 (#124) (d728cbb)
- unpin dependencies except inspect-ai (#108) (50cf90f)
- update uv.lock package version (3583d71)
v0.3.0
0.3.0 (2025-08-14)
Features
- add --debug flag to eval-retry command (b26afaa)
- add -M and -T flags for model and task arguments (#75) (46a6ba6); usage sketch after this list
- add 'openbench' as alternative CLI entry point (#48) (68b3c5b)
- add AI21 Labs inference provider (#86) (db7bde7)
- add Baseten inference provider (#79) (696e2aa)
- add Cerebras and SambaNova model providers (1c61f59)
- add Cohere inference provider (#90) (8e6e838)
- add Crusoe inference provider (#84) (3d0c794)
- add DeepInfra inference provider (#85) (6fedf53)
- add Friendli inference provider (#88) (7e2b258)
- Add huggingface inference provider (#54) (f479703)
- add Hyperbolic inference provider (#80) (4ebf723)
- add initial GraphWalks benchmark implementation (#58) (1aefd07)
- add Lambda AI inference provider (#81) (b78c346)
- add MiniMax inference provider (#87) (09fd27b)
- add Moonshot inference provider (#91) (e5743cb)
- add Nebius model provider (#47) (ba2ec19)
- add Nous Research model provider (#49) (32dd815)
- add Novita AI inference provider (#82) (6f5874a)
- add Parasail inference provider (#83) (973c7b3)
- add Reka inference provider (#89) (1ab9c53)
- add SciCode (#63) (3650bfa)
- add support for alpha benchmarks in evaluation commands (#92) (e2ccfaa)
- push eval data to huggingface repo (#65) (acc600f)
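The `-M`/`-T` flags from #75 forward model and task arguments as key=value pairs, presumably mirroring Inspect AI's flags of the same names, and `eval-retry` gained a `--debug` flag (b26afaa). A hedged sketch; the task argument `subset=parents` is purely hypothetical:

```bash
# Forward a model argument (-M) and a task argument (-T) as key=value pairs (#75).
# The task argument name and value are hypothetical; check the task's options.
bench eval graphwalks -M temperature=0 -T subset=parents

# Retry a failed run with extra debugging output (b26afaa).
bench eval-retry --debug
```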
Bug Fixes
- add missing newline at end of novita.py (ef0fa4b)
- remove default sampling parameters from CLI (#72) (978638a)
Documentation
- docs for 0.3.0 (#93) (fe358bb)
- fix directory structure documentation in CONTRIBUTING.md (#78) (41f8ed9)
Refactor
- move task loading from registry to config and update imports (de6eea2)
v0.2.0
0.2.0 (2025-08-11)
Features
- add DROP (simple-evals) (#20) (f85bf19)
- add Humanity's Last Exam (HLE) benchmark (#23) (6f10fb7)
- add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
- add MGSM (#18) (bec1a7c)
- add openai MRCR benchmark for long context recall (#24) (1b09ebd)
- HealthBench (#16) (2caa47d)
Documentation
- update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)
Chores
- GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)