Commit fe358bb: docs: docs for 0.3.0 (#93)
1 parent e2ccfaa

README.md: 59 additions, 8 deletions

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

OpenBench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, and graph reasoning, with first-class support for your own local evals to preserve privacy. **Works with any model provider** - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.

## 🚧 Alpha Release

We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.

## 🎉 What's New in v0.3.0

- **📡 18 More Model Providers**: Added support for AI21, Baseten, Cerebras, Cohere, Crusoe, DeepInfra, Friendli, Hugging Face, Hyperbolic, Lambda, MiniMax, Moonshot, Nebius, Nous, Novita, Parasail, Reka, SambaNova, and more
- **🧪 New Benchmarks**: DROP (reading comprehension), plus experimental benchmarks available with the `--alpha` flag
- **⚡ CLI Enhancements**: `openbench` alias, `-M`/`-T` flags for model/task args, `--debug` mode for `eval-retry`
- **🔧 Developer Tools**: GitHub Actions integration, Inspect AI extension support
## Features

- **🎯 35+ Benchmarks**: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more
- **🔧 Simple CLI**: `bench list`, `bench describe`, `bench eval` (also available as `openbench`)
- **🏗️ Built on inspect-ai**: Industry-standard evaluation framework
- **📊 Extensible**: Easy to add new benchmarks and metrics
- **🤖 Provider-agnostic**: Works with 30+ model providers out of the box
- **🛠️ Local Eval Support**: Run your private benchmarks with `bench eval <path>`
- **📤 Hugging Face Integration**: Push evaluation results directly to Hugging Face datasets

## 🏃 Speedrun: Evaluate a Model in 60 Seconds

```
bench eval musr --model ollama/llama3.1:70b

# Hugging Face Inference Providers
bench eval mmlu --model huggingface/gpt-oss-120b:groq

# 30+ providers supported - see full list below
```

## Supported Providers

OpenBench supports 30+ model providers through Inspect AI. Set the appropriate API key environment variable and you're ready to go:

| Provider | Environment Variable | Example Model String |
|----------|---------------------|---------------------|
| **AI21 Labs** | `AI21_API_KEY` | `ai21/model-name` |
| **Anthropic** | `ANTHROPIC_API_KEY` | `anthropic/model-name` |
| **AWS Bedrock** | AWS credentials | `bedrock/model-name` |
| **Azure** | `AZURE_OPENAI_API_KEY` | `azure/<deployment-name>` |
| **Baseten** | `BASETEN_API_KEY` | `baseten/model-name` |
| **Cerebras** | `CEREBRAS_API_KEY` | `cerebras/model-name` |
| **Cohere** | `COHERE_API_KEY` | `cohere/model-name` |
| **Crusoe** | `CRUSOE_API_KEY` | `crusoe/model-name` |
| **DeepInfra** | `DEEPINFRA_API_KEY` | `deepinfra/model-name` |
| **Friendli** | `FRIENDLI_TOKEN` | `friendli/model-name` |
| **Google** | `GOOGLE_API_KEY` | `google/model-name` |
| **Groq** | `GROQ_API_KEY` | `groq/model-name` |
| **Hugging Face** | `HF_TOKEN` | `huggingface/model-name` |
| **Hyperbolic** | `HYPERBOLIC_API_KEY` | `hyperbolic/model-name` |
| **Lambda** | `LAMBDA_API_KEY` | `lambda/model-name` |
| **MiniMax** | `MINIMAX_API_KEY` | `minimax/model-name` |
| **Mistral** | `MISTRAL_API_KEY` | `mistral/model-name` |
| **Moonshot** | `MOONSHOT_API_KEY` | `moonshot/model-name` |
| **Nebius** | `NEBIUS_API_KEY` | `nebius/model-name` |
| **Nous Research** | `NOUS_API_KEY` | `nous/model-name` |
| **Novita AI** | `NOVITA_API_KEY` | `novita/model-name` |
| **Ollama** | None (local) | `ollama/model-name` |
| **OpenAI** | `OPENAI_API_KEY` | `openai/model-name` |
| **OpenRouter** | `OPENROUTER_API_KEY` | `openrouter/model-name` |
| **Parasail** | `PARASAIL_API_KEY` | `parasail/model-name` |
| **Perplexity** | `PERPLEXITY_API_KEY` | `perplexity/model-name` |
| **Reka** | `REKA_API_KEY` | `reka/model-name` |
| **SambaNova** | `SAMBANOVA_API_KEY` | `sambanova/model-name` |
| **Together AI** | `TOGETHER_API_KEY` | `together/model-name` |
| **vLLM** | None (local) | `vllm/model-name` |
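Putting the table to work: export the key for your chosen provider, then pass the provider-prefixed model string to `bench eval`. A minimal sketch, using Groq and a model string taken from this README (substitute your own key and model):

```bash
# Set the provider's API key (any provider from the table works the same way)
export GROQ_API_KEY="your-api-key"

# The prefix before the first slash selects the provider backend
bench eval mmlu --model groq/meta-llama/llama-4-scout-17b-16e-instruct
```

Local backends such as Ollama and vLLM need no key at all; only the model string changes.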

## Available Benchmarks

| Category | Benchmarks |
|----------|------------|
| **Knowledge** | MMLU (57 subjects), GPQA (graduate-level), SuperGPQA (285 disciplines), OpenBookQA, HLE (Humanity's Last Exam - 2,500 questions from 1,000+ experts), HLE_text (text-only version) |
| **Coding** | HumanEval (164 problems) |
| **Math** | AIME 2023-2025, HMMT Feb 2023-2025, BRUMO 2025, MATH (competition-level problems), MATH-500 (challenging subset), MGSM (multilingual grade school math), MGSM_en (English), MGSM_latin (5 languages), MGSM_non_latin (6 languages) |
| **Reasoning** | SimpleQA (factuality), MuSR (multi-step reasoning), DROP (discrete reasoning over paragraphs) |
| **Long Context** | OpenAI MRCR (multiple needle retrieval), OpenAI MRCR_2n (2 needle), OpenAI MRCR_4 (4 needle), OpenAI MRCR_8n (8 needle) |
| **Healthcare** | HealthBench (open-ended healthcare eval), HealthBench_hard (challenging variant), HealthBench_consensus (consensus variant) |

For a complete list of all commands and options, run: `bench --help`

| Command | Description |
|--------------------------|----------------------------------------------------|
| `bench` or `openbench` | Show main menu with available commands |
| `bench list` | List available evaluations, models, and flags |
| `bench eval <benchmark>` | Run benchmark evaluation on a model |
| `bench eval-retry` | Retry a failed evaluation |
| `bench view` | View logs from previous benchmark runs |
| `bench eval <path>` | Run your local/private evals built with Inspect AI |
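The commands above compose into a natural workflow. A hedged sketch (the benchmark and model string are illustrative placeholders):

```bash
bench list                               # discover available benchmarks and flags
bench eval gpqa --model groq/model-name  # run an evaluation against a model
bench view                               # browse the logs from previous runs
bench eval-retry                         # retry if the evaluation failed partway
```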

102151
### Key `eval` Command Options
103152

104153
| Option | Environment Variable | Default | Description |
105154
|----------------------|--------------------------|--------------------------------------------------|--------------------------------------------------|
106-
| `--model` | `BENCH_MODEL` | `groq/meta-llama/llama-4-scout-17b-16e-instruct` | Model(s) to evaluate |
155+
| `-M <args>` | None | None | Pass model-specific arguments (e.g., `-M reasoning_effort=high`) |
156+
| `-T <args>` | None | None | Pass task-specific arguments to the benchmark |
157+
| `--model` | `BENCH_MODEL` | None (required) | Model(s) to evaluate |
107158
| `--epochs` | `BENCH_EPOCHS` | `1` | Number of epochs to run each evaluation |
108159
| `--max-connections` | `BENCH_MAX_CONNECTIONS` | `10` | Maximum parallel requests to model |
109160
| `--temperature` | `BENCH_TEMPERATURE` | `0.6` | Model temperature |
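
These options can be combined in a single invocation. A sketch, assuming a hypothetical OpenAI model string (only `-M reasoning_effort=high` is taken verbatim from the table above):

```bash
bench eval mmlu \
  --model openai/model-name \        # required: provider-prefixed model string
  --epochs 2 \                       # run the evaluation twice
  --max-connections 20 \             # raise request parallelism
  --temperature 0.0 \                # deterministic sampling
  -M reasoning_effort=high           # forwarded to the model provider
```

Each flag also has an environment-variable form (e.g., `BENCH_MODEL`), which is convenient for CI runs.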
