OpenBench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, and graph reasoning, with first-class support for your own local evals to preserve privacy. **Works with any model provider** - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.
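A typical run installs the CLI and points a benchmark at a `provider/model` identifier. A minimal sketch, assuming the package is published on PyPI as `openbench`; the model name below is illustrative, not a recommendation:

```bash
# Install the CLI (assumes the PyPI package is named `openbench`)
pip install openbench

# Run a benchmark against any supported provider/model pair;
# the model identifier here is illustrative.
bench eval mmlu --model groq/llama-3.3-70b-versatile
```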
## 🚧 Alpha Release
We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.
## 🎉 What's New in v0.3.0
- **📡 18 More Model Providers**: Added support for AI21, Baseten, Cerebras, Cohere, Crusoe, DeepInfra, Friendli, Hugging Face, Hyperbolic, Lambda, MiniMax, Moonshot, Nebius, Nous, Novita, Parasail, Reka, SambaNova, and more
- **🧪 New Benchmarks**: DROP (reading comprehension), plus experimental benchmarks available with the `--alpha` flag
- **⚡ CLI Enhancements**: `openbench` alias, `-M`/`-T` flags for model/task args, `--debug` mode for eval-retry (see the sketch after this list)
- **🔧 Developer Tools**: GitHub Actions integration, Inspect AI extension support
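A quick sketch of the new CLI surface in action. The benchmark name, the specific model/task argument names, and the log path below are illustrative assumptions, not documented values:

```bash
# `openbench` now works as an alias for the `bench` command.
# -M forwards arguments to the model; -T forwards arguments to the task.
# Argument names here are hypothetical examples.
openbench eval simpleqa --model openai/gpt-4o -M temperature=0.7 -T shuffle=true

# Retry a previously failed eval with extra debugging output
# (the log path is a placeholder).
bench eval-retry --debug logs/my-failed-run.eval
```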
## Features
- **🎯 30+ Benchmarks**: MMLU, GPQA, HumanEval, SimpleQA, and competition math (AIME, HMMT)