A faithful re-implementation of Terminal-Bench
(Laude Institute, Apache-2.0) wired into the ElizaOS Python harness. The
real upstream task corpus is vendored under tasks/ and run inside
per-task Docker images driven by tmux, matching upstream semantics.
- Vendored task corpus at packages/benchmarks/terminal-bench/tasks/ (241 tasks, snapshot of upstream original-tasks/, Apache-2.0).
- Tmux-backed Docker environment (TmuxDockerEnvironment): builds each task's Dockerfile, launches a persistent tmux session, and routes every agent command through tmux send-keys + tmux wait. Required for tasks that use interactive tools (vim, python -i, less, ...).
- One-shot Docker fallback (TerminalEnvironment, --one-shot) for images where tmux is unavailable.
- Local-temp-workspace path (LocalTerminalEnvironment, --local-sandbox) for smoke runs without Docker.
- Gated mock environment (MockTerminalEnvironment, --mock): always reports success, only legal for unit tests.
- Fail-loud dataset loader: TerminalBenchDataset raises TerminalBenchDatasetMissingError if the corpus is missing rather than quietly falling back to SAMPLE_TASKS.
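The fail-loud loader behavior can be illustrated with a small sketch. This is not the actual TerminalBenchDataset code: `DatasetMissingError` and `list_task_dirs` are hypothetical names, and the rule that a task is any direct subdirectory containing a task.yaml is an assumption made for illustration.

```python
from pathlib import Path


class DatasetMissingError(FileNotFoundError):
    """Hypothetical stand-in for TerminalBenchDatasetMissingError."""


def list_task_dirs(tasks_root: Path) -> list[Path]:
    """Return task directories under tasks_root, or raise if none exist.

    Assumption: a task is any direct subdirectory that holds a task.yaml.
    There is deliberately no fallback to built-in sample tasks.
    """
    if not tasks_root.is_dir():
        raise DatasetMissingError(f"task corpus not found: {tasks_root}")
    task_dirs = sorted(p.parent for p in tasks_root.glob("*/task.yaml"))
    if not task_dirs:
        raise DatasetMissingError(f"no task.yaml files under: {tasks_root}")
    return task_dirs
```

Raising instead of falling back keeps a misconfigured path from silently producing scores for the wrong task set.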
Categories present in the vendored corpus: algorithms, audio-processing, computer-vision, data-processing, data-querying, data-science, debugging, file-operations, file-system, game(s), machine-learning, math(ematics), model-training, optimization, personal-assistant, protocol-analysis, reproducible-builds, research, scientific-computing, security, software-engineering, system-administration, video-processing.
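For the tmux-backed environment listed above, routing a command through tmux send-keys and tmux wait can be sketched roughly as follows. The helper name `build_tmux_argv` and the channel naming are hypothetical, and TmuxDockerEnvironment may be implemented differently; `wait-for` is the full name of the tmux command whose alias is `wait`.

```python
import shlex


def build_tmux_argv(
    session: str, command: str, channel: str
) -> tuple[list[str], list[str]]:
    """Build argv lists that type `command` into a tmux session and wait for it.

    The typed line appends `tmux wait-for -S <channel>`, so the shell inside
    the session signals the channel once the command has returned, whatever
    its exit status. The second argv blocks until that signal arrives.
    """
    typed = f"{command}; tmux wait-for -S {shlex.quote(channel)}"
    send = ["tmux", "send-keys", "-t", session, typed, "Enter"]
    wait = ["tmux", "wait-for", channel]
    return send, wait
```

Each list can be passed to `subprocess.run` or prefixed with a `docker exec <container>` argv. Because keystrokes go into a live session, this approach also works when the foreground program is an interactive tool rather than a shell.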
# From the terminal-bench directory
pip install -e ".[dev]"

Docker is required for the default tmux backend. tmux is installed automatically inside containers that don't already ship it.
# Tiny built-in SAMPLE_TASKS — CI / wiring check ONLY, not Terminal-Bench.
terminal-bench --use-sample-tasks --local-sandbox

# Default backend = tmux inside per-task Docker images.
terminal-bench --task-ids hello-world
# All 241 tasks (slow).
terminal-bench
# Force the legacy one-shot exec_run path (no tmux).
terminal-bench --task-ids hello-world --one-shot
# Eliza bridge task-agent selection defaults to opencode. If
# ANTHROPIC_API_KEY/CLAUDE_API_KEY is present it resolves to claude; if
# CODEX_API_KEY/OPENAI_API_KEY is present it resolves to codex. Override it:
terminal-bench --task-ids hello-world --task-agent opencode
# Cerebras via the eliza bridge, preserving the configured model name.
terminal-bench --model-provider cerebras --model gpt-oss-120b
# Cerebras gpt-oss-120b through the hermes harness.
terminal-bench --agent-harness hermes --model-provider cerebras --model gpt-oss-120b --task-ids hello-world
# Local baselines for harness sanity checks.
terminal-bench --use-sample-tasks --local-sandbox --agent-harness always-right
terminal-bench --use-sample-tasks --local-sandbox --agent-harness always-wrong
terminal-bench --use-sample-tasks --local-sandbox --agent-harness random --baseline-random-seed 1
# Fail-loud check: missing corpus raises rather than running SAMPLE_TASKS.
terminal-bench --data-path /no/such/path # -> TerminalBenchDatasetMissingError

Network is disabled by default (network_mode="none"). Some upstream
tasks install uv from astral.sh inside run-tests.sh and so need a
bridge network at grading time. Pass --network ... per task or rely
on per-task network_enabled=True in task.yaml to flip this. To
enforce hermetic runs, set network_mode="none" explicitly in
TerminalBenchConfig.
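One way to read the rules above is as a precedence order: an explicitly set mode wins, then the per-task flag, then the default. The sketch below encodes that reading. `resolve_network_mode` and its parameter names are hypothetical, and the precedence itself is an assumption rather than confirmed harness behavior.

```python
from typing import Optional


def resolve_network_mode(
    task_network_enabled: bool,
    explicit_mode: Optional[str] = None,
    default_mode: str = "none",
) -> str:
    """Pick the Docker network mode for one task.

    Assumed precedence (not confirmed by the harness source):
    1. an explicitly configured mode, e.g. "none" for hermetic runs
    2. the task's own network_enabled flag, mapped to "bridge"
    3. the default, which disables networking
    """
    if explicit_mode is not None:
        return explicit_mode
    if task_network_enabled:
        return "bridge"
    return default_mode
```

Under this reading, passing `explicit_mode="none"` keeps a run hermetic even for tasks that request network access, at the cost of failing tasks whose graders download tools.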
git clone --depth 1 https://github.com/laude-institute/terminal-bench.git /tmp/tb
cp -r /tmp/tb/original-tasks/* packages/benchmarks/terminal-bench/tasks/
cp /tmp/tb/LICENSE packages/benchmarks/terminal-bench/tasks/LICENSE.upstream

LEADERBOARD_SCORES ships empty because leaderboard numbers move too fast
to embed. Compare your score against the live leaderboard at
https://www.tbench.ai/leaderboard.
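If you copy a few reference scores from the live leaderboard by hand, a comparison can be computed locally. The helper below is a hypothetical sketch, not part of the package; it returns no rank when the reference table is empty, mirroring the empty default.

```python
from typing import Mapping, Optional


def rank_against(
    accuracy: float, reference_scores: Mapping[str, float]
) -> Optional[int]:
    """Return the 1-based rank of `accuracy` among the reference scores.

    Returns None when no reference scores are supplied. Ties are resolved
    in favor of the score being ranked.
    """
    if not reference_scores:
        return None
    better = sum(1 for score in reference_scores.values() if score > accuracy)
    return better + 1
```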
# Run sample tasks to verify installation
terminal-bench --use-sample-tasks --local-sandbox
# Verbose output
terminal-bench --use-sample-tasks --local-sandbox --verbose

# Download and run full Terminal-Bench 2.0 dataset
terminal-bench --data-path ./terminal-bench-data
# Filter by category
terminal-bench --categories scripting code_compilation
# Filter by difficulty
terminal-bench --difficulties easy medium
# Limit number of tasks
terminal-bench --max-tasks 20

import asyncio

from elizaos_terminal_bench import (
    TerminalBenchRunner,
    TerminalBenchConfig,
    TaskCategory,
    TaskDifficulty,
)


async def main():
    # Configure the benchmark
    config = TerminalBenchConfig(
        output_dir="./results",
        max_iterations=20,
        model_name="gpt-4",
        verbose=True,
    )

    # Create and setup runner
    runner = TerminalBenchRunner(config=config)
    await runner.setup(use_sample_tasks=True)

    # Run benchmark
    report = await runner.run(
        categories=[TaskCategory.SCRIPTING],
        max_tasks=10,
    )

    # Print results
    print(f"Accuracy: {report.accuracy:.1%}")
    print(f"Passed: {report.passed_tasks}/{report.total_tasks}")
    if report.leaderboard_comparison:
        print(f"Rank: #{report.leaderboard_comparison.rank}")


asyncio.run(main())

By default runs are routed through the elizaOS TypeScript benchmark
bridge (packages/app-core/src/benchmark/server.ts). The CLI spawns the
bridge automatically when ELIZA_BENCH_URL is unset, and
TerminalBenchRunner delegates per-task decision-making to
ElizaBridgeTerminalAgent (in eliza_adapter.terminal_bench). The
Python AgentRuntime path has been removed.
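The spawn-or-reuse decision described above can be sketched as a small pure function. `bridge_plan` is a hypothetical name, and treating an empty ELIZA_BENCH_URL the same as an unset one is an assumption.

```python
from typing import Mapping, Optional


def bridge_plan(env: Mapping[str, str]) -> tuple[Optional[str], bool]:
    """Return (bridge_url, should_spawn) from the environment.

    When ELIZA_BENCH_URL is set, reuse that bridge and spawn nothing.
    When it is unset (or empty, by assumption), the caller should start
    its own bridge process and discover the URL from it.
    """
    url = env.get("ELIZA_BENCH_URL")
    if url:
        return url, False
    return None, True
```

Call it with `os.environ` in real use. Setting ELIZA_BENCH_URL yourself is useful when several benchmark runs should share one long-lived bridge.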
Alternative harnesses are selected with --agent-harness. hermes
constructs hermes_adapter.client.HermesClient from the configured
provider/model, so --model-provider cerebras --model gpt-oss-120b
uses the Cerebras OpenAI-compatible endpoint without printing or
embedding an API key. openclaw follows the same provider/model
resolution. The local baselines are always-right, always-wrong, and
random.
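To show why the local baselines are useful for sanity checks, here is a simplified sketch of what they amount to. It assumes each baseline just emits a pass/fail verdict per task; the real harnesses may instead replay a reference solution or issue commands. `make_baseline` is a hypothetical helper.

```python
import random
from typing import Callable


def make_baseline(name: str, seed: int = 0) -> Callable[[], bool]:
    """Return a zero-argument callable yielding one verdict per call.

    Simplifying assumption: a baseline is reduced to its verdict stream.
    """
    if name == "always-right":
        return lambda: True
    if name == "always-wrong":
        return lambda: False
    if name == "random":
        # A private, seeded generator makes runs reproducible.
        rng = random.Random(seed)
        return lambda: rng.random() < 0.5
    raise ValueError(f"unknown baseline: {name}")
```

A healthy scoring pipeline should report 100% for always-right and 0% for always-wrong; anything else points at the harness rather than the agent.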
| Option | Default | Description |
|---|---|---|
| data_path | ./terminal-bench-data | Path to dataset |
| output_dir | ./benchmark_results/terminal-bench | Output directory |
| version | 2.0 | Terminal-Bench version |
| max_iterations | 20 | Max agent iterations per task |
| timeout_per_task_seconds | 300 | Task timeout |
| model_name | gpt-4 | LLM model to use |
| temperature | 0.0 | Generation temperature |
| agent_harness | eliza | Decision harness: eliza, hermes, openclaw, always-right, always-wrong, random |
| task_agent | opencode | Eliza bridge task agent; auto-resolves to claude/codex when corresponding key env vars are present |
| docker_image | ubuntu:22.04 | Default Docker image |
| memory_limit | 2g | Container memory limit |
| verbose | False | Enable verbose logging |
| dry_run | False | Run without execution |
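The task_agent auto-resolution can be sketched as follows. `resolve_task_agent` is a hypothetical function; the environment variable names come from this README, but the precedence when both key families are present (claude before codex) is an assumption based on the order in which they are listed.

```python
from typing import Mapping, Optional


def resolve_task_agent(
    env: Mapping[str, str], override: Optional[str] = None
) -> str:
    """Pick the Eliza bridge task agent.

    An explicit override (as with --task-agent) always wins. Otherwise the
    presence of provider keys selects claude or codex, and opencode is the
    fallback. Checking claude first is an assumption.
    """
    if override:
        return override
    if env.get("ANTHROPIC_API_KEY") or env.get("CLAUDE_API_KEY"):
        return "claude"
    if env.get("CODEX_API_KEY") or env.get("OPENAI_API_KEY"):
        return "codex"
    return "opencode"
```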
| Rank | Agent | Model | Accuracy |
|---|---|---|---|
| 1 | Droid (Factory) | GPT-5.2 | 64.9% |
| 2 | Ante (Antigma Labs) | Gemini 3 Pro | 64.7% |
| 3 | Junie CLI (JetBrains) | Gemini 3 Flash | 64.3% |
| 4 | Claude Code | Claude Sonnet 4.6 | 58.2% |
| 5 | OpenHands | GPT-4o | 52.8% |
| 6 | Aider | Claude Sonnet 4.6 | 47.5% |
| ... | GPT-4 (baseline) | - | 28.3% |
| - | Human Expert | - | 92.5% |
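Note: No agent in this table exceeds 65% accuracy, which shows how challenging the benchmark is. These figures are a static snapshot and may be out of date; the live leaderboard is authoritative.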
After running the benchmark, you'll find:
benchmark_results/terminal-bench/
├── terminal-bench-20251211_143052.json # Detailed JSON report
├── terminal-bench-20251211_143052.md # Markdown summary
└── sessions-20251211_143052/ # Session logs (optional)
├── task_001.json
└── task_002.json
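Because report names embed a YYYYMMDD_HHMMSS timestamp, the newest report can be found by sorting file names. The helper below is a hypothetical sketch that relies only on the naming pattern shown above and makes no assumption about the JSON schema.

```python
from pathlib import Path
from typing import Optional


def latest_report(results_dir: Path) -> Optional[Path]:
    """Return the newest terminal-bench-*.json report, or None if absent.

    Timestamps in the YYYYMMDD_HHMMSS form sort chronologically when
    compared as plain strings, so a lexicographic sort is sufficient.
    """
    reports = sorted(results_dir.glob("terminal-bench-*.json"))
    return reports[-1] if reports else None
```

The returned path can be read with `json.loads(path.read_text())` to inspect the detailed report.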
# Run all tests
pytest
# Run with coverage
pytest --cov=elizaos_terminal_bench
# Run specific test file
pytest tests/test_types.py
# Skip Docker tests
pytest -m "not docker"

Terminal-Bench requires Docker for sandboxed execution:
# Verify Docker is running
docker info
# Pull required images
docker pull ubuntu:22.04
docker pull gcc:latest
docker pull python:3.11

- Create files and directories
- Execute simple commands
- Write basic scripts
- Compile single-file programs
- Parse and transform text
- Configure environment variables
- Multi-step build processes
- Complex system administration
- ML model setup and training
TerminalEnvironmentError: Failed to connect to Docker
Ensure Docker daemon is running: sudo systemctl start docker
Task timed out after 300 seconds
Increase timeout: --timeout 600
OPENAI_API_KEY environment variable required
Set your API key: export OPENAI_API_KEY=sk-...
MIT License - See LICENSE file for details.