Name
HyperliquidBench
OSWorld
abliteration-robustness
action-calling
adhdbench
agentbench
agentbench_matrix
app-eval
app_eval
bfcl
claw-eval
claw_eval_matrix
clawbench
clawbench_matrix
compactbench
configbench
context-bench
eliza-1
eliza-adapter
elizaos_mmau
evm
experience
framework
gauntlet
hermes-adapter
interrupt-bench
lib
lifeops-bench
loca-bench
mind2web
mint
mmau-audio
mmau
nl2repo
openclaw-adapter
openclaw-benchmark
openclaw_benchmark
orchestrator
orchestrator_lifecycle
personality-bench
qwen-claw-bench
qwen-web-bench
qwen_claw_bench_matrix
realm
registry
rlm-bench
scambench
scripts
skillsbench
social-alpha
solana
standard
swe-bench-multilingual
swe-bench-pro
swe_bench
swe_bench_pro_matrix
tau-bench
terminal-bench
tests
three-agent-dialogue
trust
vending-bench
viewer
vision-language
visualwebbench
voice-emotion
voice-speaker-validation
voiceagentbench
voicebench-quality
voicebench
webshop
woobench
.gitignore
ORCHESTRATOR_SUBAGENT_BENCHMARK_RUNBOOK.md
README.md
__init__.py
bench_cli_types.py
compare.py
run.py

`packages/benchmarks`

The elizaOS evaluation suite. 70+ benchmark harnesses spanning agent autonomy, tool-call correctness, long-horizon reasoning, voice and vision multimodal tasks, embodied control, and adversarial robustness.

Python-based. Lives outside the TypeScript workspace; not an npm package.

Categories

| Subdir | What it evaluates |
| --- | --- |
| `eliza-1/` | Native tool-calling and agentic loop quality on Eliza-1 models. |
| `swe_bench/`, `agentbench/` | Software engineering tasks. |
| `lifeops-bench/` | Long-running personal-assistant scenarios. |
| `orchestrator/` | Multi-agent coordination harness. |
| `HyperliquidBench/` | Onchain trading agent evaluation. |
| `OSWorld/` | Desktop OS automation tasks. |
| `vision-language/` | Multimodal grounding. |
| `voice-emotion/` | Speech recognition + affect under noise. |
| `clawbench/`, `claw-eval/` | Claude-specific evaluations. |
| `compactbench/` | Context compaction quality. |
| `configbench/`, `context-bench/` | Runtime config and context handling. |
| `adhdbench/` | Long-horizon focus / distraction resilience. |
| `abliteration-robustness/` | Behavioral robustness post-abliteration. |
| `app-eval/` | End-to-end app interaction. |

Running

The orchestrator at `orchestrator/` accepts a config and runs the named harness, collecting traces and metrics. See `ORCHESTRATOR_SUBAGENT_BENCHMARK_RUNBOOK.md` for the full runbook, including remote GPU usage.

```bash
cd packages/benchmarks
python3 -m orchestrator.runner --config <path>
```
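
For scripted runs, the same invocation can be wrapped in Python. The following is a minimal sketch, not part of the suite: the helper names `build_runner_command` and `run_harness` are hypothetical, and the only thing taken from this README is the `python3 -m orchestrator.runner --config <path>` command and the `packages/benchmarks` working directory.

```python
import subprocess
import sys
from pathlib import Path


def build_runner_command(config_path: str) -> list[str]:
    """Build the argv for the orchestrator invocation documented above.

    Uses the current interpreter instead of a bare `python3` so the run
    stays inside the active virtual environment (an assumption about setup).
    """
    return [sys.executable, "-m", "orchestrator.runner", "--config", str(config_path)]


def run_harness(config_path: str, benchmarks_dir: str = "packages/benchmarks") -> int:
    """Run the orchestrator from the benchmarks directory; return its exit code.

    Because the working directory changes to `benchmarks_dir`, a relative
    `config_path` is resolved against that directory, as with the shell form.
    """
    cwd = Path(benchmarks_dir)
    if not cwd.is_dir():
        raise FileNotFoundError(f"benchmarks directory not found: {cwd}")
    completed = subprocess.run(build_runner_command(config_path), cwd=cwd, check=False)
    return completed.returncode
```

Calling `run_harness("<path>")` from the repository root mirrors the two shell commands; the exit code is returned rather than raised so a calling script can decide how to handle a failed harness.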

Reports

Reference runs are checked into `benchmark_results/`. Reports include calibration data, scorecards, and per-task traces.

Docs

User-facing summary: Benchmarks track page.
