Commit 035b238

chore: release 0.5.0 (#132)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 3c88a42 commit 035b238

3 files changed: +112 -2 lines changed

.release-please-manifest.json

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 {
-  ".": "0.4.1"
+  ".": "0.5.0"
 }

CHANGELOG.md

Lines changed: 110 additions & 0 deletions
@@ -1,5 +1,115 @@
 # Changelog
 
+## [0.5.0](https://github.com/groq/openbench/compare/v0.4.1...v0.5.0) (2025-10-10)
+
+
+### ⚠ BREAKING CHANGES
+
+* added more groupings under benchmarks catalog ([#244](https://github.com/groq/openbench/issues/244))
+
+### Features
+
+* add clockbench evaluation framework and script for synthesizing public dataset. ([#159](https://github.com/groq/openbench/issues/159)) ([3ba9836](https://github.com/groq/openbench/commit/3ba98367252e4c0938d2841ff673662108cebe07))
+* add IFEval ([#182](https://github.com/groq/openbench/issues/182)) ([8d1b939](https://github.com/groq/openbench/commit/8d1b939477a5a2202df4a874cb0fa0586fe2d9e3))
+* add local openbench implementation of groq provider in inspect ([#131](https://github.com/groq/openbench/issues/131)) ([52aea35](https://github.com/groq/openbench/commit/52aea3510f03cd49d0fb66c3e6789f61939a7ee0))
+* add mmmlu eval ([#193](https://github.com/groq/openbench/issues/193)) ([a42c2d5](https://github.com/groq/openbench/commit/a42c2d5499a87366cfbb62cc254d47021565a628))
+* add mmstar benchmark ([#174](https://github.com/groq/openbench/issues/174)) ([5d085ab](https://github.com/groq/openbench/commit/5d085ab4b14a0a176953bfb3134e89e8eb36cb85))
+* add new openbench documentation ([#169](https://github.com/groq/openbench/issues/169)) ([f3e6a37](https://github.com/groq/openbench/commit/f3e6a373c3195f90ed3017d81202ded81cde54d2))
+* add overarching bbh command to run all 18 BBH tasks ([463a25f](https://github.com/groq/openbench/commit/463a25f308f8613f742f6bf79aba16d3998a30f3))
+* add preset eval group infrastructure ([#215](https://github.com/groq/openbench/issues/215)) ([d9ea03a](https://github.com/groq/openbench/commit/d9ea03a76f1a580037cd9760aa5ee3505104e341))
+* added more groupings under benchmarks catalog ([#244](https://github.com/groq/openbench/issues/244)) ([d932cb0](https://github.com/groq/openbench/commit/d932cb0b4aa8752a2075706f63485b3fb4dbdec0))
+* **ArabicMMLU:** add remaining 32 Arabic exam subsets, total 41 subsets ([#219](https://github.com/groq/openbench/issues/219)) ([006e248](https://github.com/groq/openbench/commit/006e2480fd336642f5bbe0fc3a08b521fbbe7dc9))
+* **benchmark:** add support for arc-agi ([#158](https://github.com/groq/openbench/issues/158)) ([3f32253](https://github.com/groq/openbench/commit/3f3225324e7578b7e6e1cbf3261fa73107b75745))
+* **benchmark:** add support for detailbench ([#154](https://github.com/groq/openbench/issues/154)) ([23fbca5](https://github.com/groq/openbench/commit/23fbca5b09b4636b9c07886ce20e10ed1d386eb8))
+* **benchmark:** add support for TUMLU ([#160](https://github.com/groq/openbench/issues/160)) ([#161](https://github.com/groq/openbench/issues/161)) ([885be75](https://github.com/groq/openbench/commit/885be75b68529e1b81aa964f6fbd0e0ea1de0ccd))
+* **benchmark:** multichallenge implementation ([#170](https://github.com/groq/openbench/issues/170)) ([cf2ab4f](https://github.com/groq/openbench/commit/cf2ab4fcf77195a66360408f9b230cceae3732e8))
+* change default model to groq/openai/gpt-oss-20b ([#138](https://github.com/groq/openbench/issues/138)) ([8f7f42f](https://github.com/groq/openbench/commit/8f7f42f1a3faba9dafa535f135a6a1638592ed4f))
+* **components:** export the run_eval entrypoint method ([#157](https://github.com/groq/openbench/issues/157)) ([acbe7f4](https://github.com/groq/openbench/commit/acbe7f45c3174951c9f6232bd2a624ff2682121f))
+* configure release-please for pre-v1.0 version bumping ([#133](https://github.com/groq/openbench/issues/133)) ([c432934](https://github.com/groq/openbench/commit/c4329343e059f6dd8c9a664a67431aba8d8fb853))
+* **cybench:** ported over code for cybench ([#207](https://github.com/groq/openbench/issues/207)) ([7949425](https://github.com/groq/openbench/commit/794942597a248a454455acd33e56565807b77f48))
+* cybersecurity, changelog, more docs ([39f123c](https://github.com/groq/openbench/commit/39f123cd4c0c98192a965827ab07256a8403d421))
+* display results patch to include task duration stats ([#167](https://github.com/groq/openbench/issues/167)) ([e4e480c](https://github.com/groq/openbench/commit/e4e480cc9c73329e6fb9eaecdaa59b304a5db887))
+* **docs:** add changelog page ([#225](https://github.com/groq/openbench/issues/225)) ([7db9135](https://github.com/groq/openbench/commit/7db913513cd6b2cefba948cafb11b65588b2eb30))
+* **docs:** add release notes section and update index with new features for v0.5 ([#245](https://github.com/groq/openbench/issues/245)) ([09ab78e](https://github.com/groq/openbench/commit/09ab78ece5c6b85516503c0aaa682dff54eb260b))
+* **docs:** added feature card and docs page for exercism ([#243](https://github.com/groq/openbench/issues/243)) ([2b38147](https://github.com/groq/openbench/commit/2b38147397814d6d27f51854784630f3e4ef94a8))
+* **docs:** Added feature eval docs pages and cache command docs ([#191](https://github.com/groq/openbench/issues/191)) ([50501f1](https://github.com/groq/openbench/commit/50501f13d429ba52d82b55b21916aeac0fb2c972))
+* **eval:** add support for json output ([#14](https://github.com/groq/openbench/issues/14)) ([f335418](https://github.com/groq/openbench/commit/f3354188fcbd35067072da8840a444188a3a0221))
+* **exercism:** added support for exercism tasks w/ agent support for aider, roo, claude, opencode ([#151](https://github.com/groq/openbench/issues/151)) ([d86f0da](https://github.com/groq/openbench/commit/d86f0da6d98580a2ca6d4d1b63f5b89fd0a86175))
+* graphwalks token filter ([#115](https://github.com/groq/openbench/issues/115)) ([e38658c](https://github.com/groq/openbench/commit/e38658c916d987bf8074910dfdefefb88e97faf6))
+* groq reasoning effort + bugfix to override inspect's "groq" ([#142](https://github.com/groq/openbench/issues/142)) ([b919cc7](https://github.com/groq/openbench/commit/b919cc7e6bc748c92393a9a9bbf69db560313495))
+* **lighteval:** Add 7 core commonsense reasoning benchmarks from LightEval ([#197](https://github.com/groq/openbench/issues/197)) ([7792c45](https://github.com/groq/openbench/commit/7792c45a2bb0aad825403fe7ccb14e0e8a537a9d))
+* **lighteval:** add BigBench eval (122 MCQ tasks) ([9f35b1d](https://github.com/groq/openbench/commit/9f35b1daebf87d80666a4a291b12f18b616746be))
+* **lighteval:** Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) ([917667a](https://github.com/groq/openbench/commit/917667ad8238a1319d64c0e9e0babf337c30f57d))
+* **lighteval:** add Global-MMLU eval (42 languages) ([3542213](https://github.com/groq/openbench/commit/354221387cf5e3cf8754636810a0570ca64fe196))
+* **lighteval:** register BigBench benchmarks in config and registry ([77018e9](https://github.com/groq/openbench/commit/77018e9c0b187b0d4a553f91054e9f1bc7089831))
+* **lighteval:** register Global-MMLU benchmarks in config and registry ([156f509](https://github.com/groq/openbench/commit/156f509416d5a8f38214c9a20cf414cb4a6fc933))
+* link to subscription form on main page ([#240](https://github.com/groq/openbench/issues/240)) ([988d08c](https://github.com/groq/openbench/commit/988d08c0d63a7c4246bb0123a45145c968ebeb44))
+* **livemcpbench:** Adding support for liveMCPBench ([#127](https://github.com/groq/openbench/issues/127)) ([222f678](https://github.com/groq/openbench/commit/222f678369030a34144f8ac4fefba48835d7206f))
+* make evals dash/underscore insensitive ([#185](https://github.com/groq/openbench/issues/185)) ([5ec5177](https://github.com/groq/openbench/commit/5ec5177ffa1c42d6f62a02f88226395914d4eba5))
+* mbpp ([#117](https://github.com/groq/openbench/issues/117)) ([93ad88b](https://github.com/groq/openbench/commit/93ad88b56b6586405efef87a6a4a4e6c396c988c))
+* **mcq_eval:** enable abstraction of MCQ eval ([#181](https://github.com/groq/openbench/issues/181)) ([2f53db2](https://github.com/groq/openbench/commit/2f53db20bb8d1474e0d49610c8e41c469dd07719))
+* **mmmu-pro:** added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision ([#134](https://github.com/groq/openbench/issues/134)) ([a875378](https://github.com/groq/openbench/commit/a875378e62e3bbd3f8050d1a9685d3e05a47a53d))
+* **openrouter:** add OpenRouter provider support ([#145](https://github.com/groq/openbench/issues/145)) ([47b579e](https://github.com/groq/openbench/commit/47b579e65c8eeb520023bc1e077ff50641f43bb6))
+* **openrouter:** add provider routing args support ([#180](https://github.com/groq/openbench/issues/180)) ([12e1d81](https://github.com/groq/openbench/commit/12e1d818f9cecf2e74b930591981026d288557e2))
+* **otis-mock-aime:** added support for otis mock aime 2024-2025 ([#218](https://github.com/groq/openbench/issues/218)) ([1b9fd5c](https://github.com/groq/openbench/commit/1b9fd5c8a65d8132fe1faf0e48d90d8b5f55bbfe))
+* **plugins:** add entry point system for external benchmarks ([#216](https://github.com/groq/openbench/issues/216)) ([71e7257](https://github.com/groq/openbench/commit/71e7257298787905b05c0048ed9fbd6e3ff14169))
+* return eval logs from run_eval function ([#173](https://github.com/groq/openbench/issues/173)) ([ee459d9](https://github.com/groq/openbench/commit/ee459d97bcb68039509b39b38783360e5c03c823))
+* **rootly_terraform:** add initial implementation of Rootly Terraform evals ([#195](https://github.com/groq/openbench/issues/195)) ([cd3acae](https://github.com/groq/openbench/commit/cd3acaebf2a736942456f815f82c3bf54e2ac751))
+
+
+### Bug Fixes
+
+* allow for more python versions ([#164](https://github.com/groq/openbench/issues/164)) ([e6682fe](https://github.com/groq/openbench/commit/e6682fe60f2b69c5def4f0085f74006367fb6911))
+* close headqa metadata entry ([947522d](https://github.com/groq/openbench/commit/947522dd2ef341c861cf1290687594b0765e2929))
+* **cybench:** moved cybench dependency into dependency group ([#237](https://github.com/groq/openbench/issues/237)) ([8d30715](https://github.com/groq/openbench/commit/8d30715103dfe2476c8ebee0cb6b7158d8f0ad63))
+* handle missing SciCode dependency lazily in solver ([#186](https://github.com/groq/openbench/issues/186)) ([fed4e88](https://github.com/groq/openbench/commit/fed4e888b7673a5e0d76d3c854ec1fa342fa040a))
+* improve BBH target extraction to handle multi-char answers ([147e3e0](https://github.com/groq/openbench/commit/147e3e0696ad918d787ec82287cdd2c4b91498d2))
+* load Neue Regrade font in Mintlify docs ([#177](https://github.com/groq/openbench/issues/177)) ([550c7f5](https://github.com/groq/openbench/commit/550c7f575c715a5c47aaf8609325070e0d788788))
+* make core package actually install ([#235](https://github.com/groq/openbench/issues/235)) ([edeb4b8](https://github.com/groq/openbench/commit/edeb4b81134d9d138b8e9149f99f031288b05297))
+* normalize benchmark keys during entry point merge ([#217](https://github.com/groq/openbench/issues/217)) ([d285664](https://github.com/groq/openbench/commit/d285664b40e893f792413810d67f550ea377907b))
+* register headqa_en and headqa_es variants ([6a19aa1](https://github.com/groq/openbench/commit/6a19aa15b78570b5fb3bac715546e71d59daf1fe))
+* render inspect error correctly ([#241](https://github.com/groq/openbench/issues/241)) ([97ccd10](https://github.com/groq/openbench/commit/97ccd106de80f5a6555d130c3bb7515c39de5fb3))
+* resolve registry import conflict ([98a1c79](https://github.com/groq/openbench/commit/98a1c794436f06fc4297696530802125192ae27d))
+* **scicode:** add support for test split, fix test_data.h5 import error ([#149](https://github.com/groq/openbench/issues/149)) ([23fa8cb](https://github.com/groq/openbench/commit/23fa8cbf4cc431ce4afdbe87b268afc7310e1b3f))
+* update bbh function for programmatic access only ([811ce9e](https://github.com/groq/openbench/commit/811ce9ecb4f1ef0bb3cacce7fae8f298d0edfc61))
+* use generic type ignore for bbh task decorator ([fd67171](https://github.com/groq/openbench/commit/fd67171e55a4901d523c517906ff06941410ce30))
+
+
+### Documentation
+
+* **readme:** clarify benchmark case-sensitivity and grader requirements ([#135](https://github.com/groq/openbench/issues/135)) ([c34a5a3](https://github.com/groq/openbench/commit/c34a5a362ec6764909258350f383ec561274b5bb))
+
+
+### Chores
+
+* add [@nmayorga7](https://github.com/nmayorga7) to CODEOWNERS ([abad7bf](https://github.com/groq/openbench/commit/abad7bf76e86319e98921b73455e2c06e72d2940))
+* alphabetize available benchmarks error ([#214](https://github.com/groq/openbench/issues/214)) ([68f46e9](https://github.com/groq/openbench/commit/68f46e9304994d67b720e6dae40e328292f8ed57))
+* **benchmark:** removed combined cti-bench eval ([#183](https://github.com/groq/openbench/issues/183)) ([a77852c](https://github.com/groq/openbench/commit/a77852c69c210ed10ba410bd769c21a94789f5f7))
+* bugbot fixes for MCQ ([#190](https://github.com/groq/openbench/issues/190)) ([6ecaefc](https://github.com/groq/openbench/commit/6ecaefcafdd471ec0b9e2949d1ec3ee448da0447))
+* **docs:** add docs for openrouter and MCQEval ([#188](https://github.com/groq/openbench/issues/188)) ([7f8cd83](https://github.com/groq/openbench/commit/7f8cd83909f13b260078501fbac72cbeb60e0f97))
+* **docs:** alphabetize benchmarks metadata ([#187](https://github.com/groq/openbench/issues/187)) ([ce77812](https://github.com/groq/openbench/commit/ce778121a700f423132809dbf371468b45f150b2))
+* **docs:** benchmarks each on new line ([#184](https://github.com/groq/openbench/issues/184)) ([b3c40f8](https://github.com/groq/openbench/commit/b3c40f82fb5b26c0aba2ff0a17c9af45222f177d))
+* **docs:** minor cleanup ([#179](https://github.com/groq/openbench/issues/179)) ([80c9e09](https://github.com/groq/openbench/commit/80c9e096885817b7a3dc62220f365760347aa462))
+* fixed fonts in openbench docs ([#178](https://github.com/groq/openbench/issues/178)) ([6e3c2a5](https://github.com/groq/openbench/commit/6e3c2a5405e54b274656be34a9385d9b307fcd38))
+* GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] ([d0018d1](https://github.com/groq/openbench/commit/d0018d1cac7b024844b530117b31a09088882728))
+* **mcq-eval:** accept more dataset types ([#194](https://github.com/groq/openbench/issues/194)) ([cb5e038](https://github.com/groq/openbench/commit/cb5e038578ded316582f82c8d65503e7146e7781))
+* move all metrics to discrete files in /metrics ([#168](https://github.com/groq/openbench/issues/168)) ([7caa042](https://github.com/groq/openbench/commit/7caa042318a8ac27b496357c64b2489bd0da6d43))
+* release-please pre-1.0: treat BREAKING as minor ([3c88a42](https://github.com/groq/openbench/commit/3c88a429573ef81889751e2b3013a70e07b594f1))
+* remove pre-commit benchmark checks for easier CI ([#213](https://github.com/groq/openbench/issues/213)) ([67b07a7](https://github.com/groq/openbench/commit/67b07a7954e6bf6b87966f7347b65323aa159f3e))
+* rename OpenBench to openbench ([#196](https://github.com/groq/openbench/issues/196)) ([0621b46](https://github.com/groq/openbench/commit/0621b4691eb1f08c3110fada527752fc95e72add))
+* rename task to sample in time metrics ([#172](https://github.com/groq/openbench/issues/172)) ([96e1817](https://github.com/groq/openbench/commit/96e1817d79ab24a01c5680dc31a7dcab01e11d85))
+* sync packaging pyproject ([#234](https://github.com/groq/openbench/issues/234)) ([940a879](https://github.com/groq/openbench/commit/940a879f9bba7939c1920065a4593cb899b93999))
+* update Claude workflows to enhance permissions and streamline triggers ([#136](https://github.com/groq/openbench/issues/136)) ([effb7da](https://github.com/groq/openbench/commit/effb7da7ce12ba11da311bfb52493fb8b40dac24))
+* update readme and contributing ([#176](https://github.com/groq/openbench/issues/176)) ([ea606ba](https://github.com/groq/openbench/commit/ea606ba82ce0a385063288c6b06616e9d1e9f84c))
+* update release-please configuration and add lockfile update workflow ([#146](https://github.com/groq/openbench/issues/146)) ([2d6ad9b](https://github.com/groq/openbench/commit/2d6ad9b529e7092c554dabee9f479d6555a698c7))
+* user agent ([#163](https://github.com/groq/openbench/issues/163)) ([e20f3c1](https://github.com/groq/openbench/commit/e20f3c1bf9af74c14cee26f3dc92d0e85cbfed0a))
+
+
+### CI
+
+* add benchmarks validation pre commit hook ([#171](https://github.com/groq/openbench/issues/171)) ([3725638](https://github.com/groq/openbench/commit/37256380e95dce060b8a8ceb48d719356d7b95b3))
+* remove PR trigger from release-please ([#166](https://github.com/groq/openbench/issues/166)) ([6440d44](https://github.com/groq/openbench/commit/6440d44ba0a13834c6d77d913861c132de4482d1))
+
 ## [0.4.1](https://github.com/groq/openbench/compare/v0.4.0...v0.4.1) (2025-08-29)
 
 
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "openbench"
-version = "0.4.1"
+version = "0.5.0"
 requires-python = ">=3.10"
 description = "openbench - open source, replicable, and standardized evaluation infrastructure"
 readme = "README.md"