Single-source reference for what this benchmark measures, how the router decides local-vs-cloud, which coding agents we run, and why the numbers in
docs/release-notes/ are defensible. If you're new, read top-to-bottom; if you're benchmarking a new model, jump to §9.
Given one developer-class laptop (M-series Mac, 64 GB), one local LLM, and access to a frontier cloud LLM — for which coding task shapes is it worth routing some calls locally instead of sending everything to the cloud?
We answer it empirically. For every cell (local-model, agent, task-class, strategy, seed) we run the same set of tasks and report:
- Pass-rate — fraction of tasks that pass functional tests, with a non-parametric bootstrap 95% CI.
- Cost — USD spent on the cloud LLM, derived from token counts and a versioned pricing table.
- Cloud fraction — Σ cloud tokens ÷ Σ all tokens. Token-based, never call-count-based (call-count is gamed by cascade/llm-classifier meta-calls).
- Wall time — median wall-clock ms per task.
Every number in docs/release-notes/*.md is a slice of those four numbers
across 1,644 rows in v1.4.
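
To make the token-based definition of cloud fraction concrete, here is a minimal sketch that contrasts it with a call-count share. The `CallUsage` shape and both function names are hypothetical and are not taken from the harness; only the formula (Σ cloud tokens ÷ Σ all tokens) comes from the definition above.

```python
from dataclasses import dataclass


@dataclass
class CallUsage:
    """Token usage for one LLM call (hypothetical shape, for illustration)."""
    backend: str  # "local" or "cloud"
    tokens: int   # prompt + completion tokens for this call


def cloud_fraction(calls: list[CallUsage]) -> float:
    """Token-based cloud fraction: sum of cloud tokens / sum of all tokens."""
    total = sum(c.tokens for c in calls)
    if total == 0:
        return 0.0
    cloud = sum(c.tokens for c in calls if c.backend == "cloud")
    return cloud / total


def cloud_call_share(calls: list[CallUsage]) -> float:
    """Call-count share, shown only for contrast with the token-based metric."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c.backend == "cloud") / len(calls)


calls = [
    CallUsage("local", 40),     # tiny classifier meta-call
    CallUsage("local", 60),     # another small local call
    CallUsage("cloud", 9_900),  # the call that does the real work
]
print(cloud_fraction(calls))    # 0.99
print(cloud_call_share(calls))  # about 0.33
```

Two cheap local meta-calls pull the call-count share down to a third even though 99% of the tokens went to the cloud, which is why the benchmark reports the token-based figure.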
```text
┌──────────────────────────────────────────────────────────────────────────┐
│ ./arena sweep --config X.yaml                                            │
│                                                                          │
│ ┌─────────────────────┐                                                  │
│ │ cli/bench.py        │  spawns one Node router proxy with               │
│ │ (orchestrator)      │  LOCAL_MODEL + CLOUD_MODEL injected              │
│ └─────────┬───────────┘                                                  │
│           │                                                              │
│           ▼                                                              │
│ ┌─────────────────────┐          ┌──────────────────────────┐            │
│ │ core/experiment.py  │ for each │ agents/<agent>.py        │            │
│ │ build_task_plan +   │ (task,   │ (one runner per agent:   │ subprocess │
│ │ run_pair() loop     │ agent)   │ aider / opencode / mini- │            │
│ └─────────┬───────────┘─────────►│ swe-agent / cline)       │            │
│           │                      └─────────────┬────────────┘            │
│           │                                    │                         │
│           │    OpenAI-compatible HTTP          ▼                         │
│           │                      ┌─────────────────────────┐             │
│           │                      │ router/server.mjs       │             │
│           │                      │ + router/strategies.mjs │             │
│           │                      │ (decides local|cloud)   │             │
│           │                      └─────┬─────────────┬─────┘             │
│           │                            │             │                   │
│           │                            ▼             ▼                   │
│           │                         Ollama        OpenAI                 │
│           │                      (local model) (cloud model)             │
│           │                                                              │
│           ▼                                                              │
│ ┌─────────────────────┐                                                  │
│ │ scorers/functional_ │  Docker sandbox runs the agent's diff            │
│ │ python (sandbox)    │  against the fixture's pytest suite              │
│ └─────────┬───────────┘                                                  │
│           ▼                                                              │
│ results/runs/<sweep>/raw.jsonl  (one ResultRow per task × agent ×        │
│                                  strategy × seed; tokens never cost,     │
│                                  cost is derived at analyse-time)        │
└──────────────────────────────────────────────────────────────────────────┘
```
The Python orchestrator owns the experiment loop. The Node router owns the routing decision. The two communicate via OpenAI-compatible HTTP, which lets every off-the-shelf coding agent talk to the router as if it were OpenAI.
We measure agent-loop routes only — agents that own their own multi-turn tool use. The repo doesn't try to be one. Each agent is a battle-tested, externally-maintained tool we wrap thinly.
| Agent | Loop style | Strengths | Caveats |
|---|---|---|---|
| `aider` | architect → editor | Tight diffs, parsable patches | Markdown-fence parse bug on some local models |
| `opencode` | free-form tool calls | High ceiling when tool calls are clean | gemma4-specific in v1.4; qwen variants fail parsing |
| `mini-swe-agent` | minimalist bash-only ReAct | Closest to SWE-bench reference; small surface area | Needs Docker for SWE-bench Verified |
| `cline` | Plan / Act with 8-14 turns | Iteration wins puzzles + refactors at 30B local | Highest token cost per cell |
Each wrapper lives in src/hybrid_arena/agents/<agent>.py and exposes
one run(task, *, proxy_url, ...) -> ResultRow function. The orchestrator
calls them through core/experiment.py:_runner_for(agent).
Why not Cursor / Continue / Cody? They were considered and dropped because either (a) they don't expose a headless CLI, or (b) they require a managed account. The four agents above can all be driven from a script with no UI.
The router proxy reads each request, picks a backend (local or cloud), and
forwards the body. Strategies live in router/strategies.mjs. The Python
side never knows which backend served a call — it only sees the token
counts in the OpenAI usage object.
| Strategy | Decision rule | Use when |
|---|---|---|
| `always-cloud` | Send every request to the cloud model. Control. | Baseline; sets the ceiling |
| `always-local` | Send every request to the local model. Control. | Baseline; sets the floor |
| `rules` | Keyword + regex rules (e.g. "refactor" → local) | Demos / debugging |
| `heuristic` | Weighted scoring on prompt length, code-block count, agent state | First strategy to actually try in prod |
| `llm-classifier` | One `qwen3:0.6b` call returns `SIMPLE` or `COMPLEX` | When you trust a tiny model's taste |
| `embedding-knn` | `nomic-embed-text` query → kNN vote against a 50-example labelled corpus | Stable, no-LLM-overhead routing |
| `cascade` | Heuristic decides first; on borderline confidence, `llm-classifier` tie-breaks | Highest pass-rate hybrid in v1.4 |
| `phase-aware` | Like `heuristic`, but with an aider-architect-step bonus for aider | Aider-specific tuning |

cascade-tuned is the same as cascade with the ROUTER_CASCADE_THRESHOLD
env var injected per pass — used for the v1.3 threshold sweep.
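
The cascade rule (heuristic first, classifier only on borderline scores) can be sketched as follows. This is an illustration in Python, not the code in router/strategies.mjs: the scoring weights, the `threshold` and `band` defaults, and the function names are all assumptions, and how the real router interprets ROUTER_CASCADE_THRESHOLD is not shown here.

```python
from typing import Callable


def heuristic_score(prompt: str) -> float:
    """Toy complexity score in [0, 1]; the weights are made up for illustration."""
    length_part = min(len(prompt) / 4000, 1.0)       # longer prompt -> more complex
    code_part = min(prompt.count("def ") / 4, 1.0)   # crude Python-marker count
    return 0.6 * length_part + 0.4 * code_part


def cascade(
    prompt: str,
    classify: Callable[[str], str],  # returns "SIMPLE" or "COMPLEX"
    threshold: float = 0.5,          # assumed default, not the router's
    band: float = 0.15,              # assumed width of the borderline zone
) -> str:
    """Return "local" or "cloud"; call the classifier only when the score is borderline."""
    score = heuristic_score(prompt)
    if abs(score - threshold) > band:
        # Confident heuristic decision: no meta-call needed.
        return "cloud" if score > threshold else "local"
    # Borderline: let the small classifier break the tie.
    return "cloud" if classify(prompt) == "COMPLEX" else "local"


print(cascade("fix typo", lambda p: "COMPLEX"))  # local (heuristic is confident)
print(cascade("x" * 3000, lambda p: "COMPLEX"))  # cloud (classifier tie-break)
```

The point of the two-stage shape is that the classifier meta-call is only paid for on the prompts where the cheap score is least reliable.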
Every routing decision is appended to router/logs/decisions.jsonl with the
score, reason, and the picked backend. That file is the single source of
truth for routing audits.
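
A routing audit over that log might start with a simple tally. The sketch below assumes each line is a JSON object with a `backend` key holding the picked backend; the text only says the score, reason, and picked backend are logged, so the exact key names are an assumption.

```python
import json
from collections import Counter
from pathlib import Path


def tally_backends(log_path: Path) -> Counter:
    """Count routing decisions per backend in a JSONL decision log."""
    counts: Counter = Counter()
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # "backend" is an assumed key name; adjust to the real schema.
            counts[record.get("backend", "unknown")] += 1
    return counts
```

Usage would look like `tally_backends(Path("router/logs/decisions.jsonl"))`, which returns a `Counter` keyed by backend name.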
Any agent can override the strategy on a single call by appending !local
or !cloud to the model field. We use this in tests; agents in normal
sweeps don't.
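
The override convention can be illustrated with a small parser. This is a sketch of the idea in Python rather than the router's own code; the function name is hypothetical, and it assumes the suffix is attached directly to the model name (for example `gpt-5.5!local`).

```python
from typing import Optional


def split_override(model_field: str) -> tuple[str, Optional[str]]:
    """Split a model field into (model name, forced backend or None)."""
    for suffix, backend in (("!local", "local"), ("!cloud", "cloud")):
        if model_field.endswith(suffix):
            return model_field[: -len(suffix)], backend
    return model_field, None  # no override: the strategy decides


print(split_override("gpt-5.5!local"))  # ('gpt-5.5', 'local')
print(split_override("gemma4:31b"))     # ('gemma4:31b', None)
```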
| Class | Source | Shape | Count in v1.4 |
|---|---|---|---|
| `puzzles` | Exercism Python (via Aider polyglot benchmark) | Single-function, single-file tasks with hidden tests | 5 |
| `refactors` | Hand-written, real-PR-shaped tasks in single-file fixture repos | Historical class name. The functional canonical cell is feature-adds (D1) + one-shot scripts (D5); v1.5 adds 4 D6 hard single-file builds. True refactor (D3) and code-review (D4) shapes exist in the source but were LLM-judged and are not in the functional cell. | 12 (8 D1/D5 + 4 D6) |
| `real-prs` | SWE-bench Verified subset | Repo-level patches against Docker testbeds | (v1.6+ work; adapter shipped) |

Each task adapter lives in src/hybrid_arena/tasks/<class>/. A task is
a small dataclass: id, fixture_path, prompt, run_cmd. Scoring is per
class — puzzles uses pytest in a Docker sandbox; refactors uses a
dispatcher in tasks/refactors/scorers.py that picks the right per-task
checker (e.g. "does the diff add the rate-limit guard?").
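
As a rough illustration of the task shape and the per-task dispatch, consider the sketch below. Only the four field names come from the text; the field types, the task id, the checker, and the `CHECKERS` registry are hypothetical and do not reflect the actual contents of tasks/refactors/scorers.py.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Task:
    id: str
    fixture_path: Path
    prompt: str
    run_cmd: str  # assumed to be a shell command string


def adds_rate_limit_guard(diff: str) -> bool:
    """Toy checker: does any added line mention a rate-limit guard?"""
    return any(
        line.startswith("+") and not line.startswith("+++") and "rate_limit" in line
        for line in diff.splitlines()
    )


# Hypothetical registry mapping task ids to their checker.
CHECKERS: dict[str, Callable[[str], bool]] = {
    "refactors/add-rate-limit": adds_rate_limit_guard,
}


def score(task: Task, diff: str) -> bool:
    """Dispatch to the checker registered for this task."""
    checker = CHECKERS.get(task.id)
    if checker is None:
        raise KeyError(f"no checker registered for task {task.id!r}")
    return checker(diff)


task = Task(
    id="refactors/add-rate-limit",
    fixture_path=Path("fixtures/rate-limit"),  # hypothetical path
    prompt="Add a rate-limit guard to the handler.",
    run_cmd="pytest -q",
)
print(score(task, "+    if rate_limit.exceeded(user):\n+        return 429\n"))  # True
```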
Every row in results/runs/<sweep>/raw.jsonl is a ResultRow (see
src/hybrid_arena/core/metrics.py):

```python
from __future__ import annotations  # nested types below are resolved lazily

from dataclasses import dataclass


# TokenUsage, Latency, Quality, and Routing are defined elsewhere in the
# harness and are not shown here.
@dataclass
class ResultRow:
    task_id: str                 # 'exercism-python/grep'
    category: str                # 'puzzles' | 'refactors' | 'real-prs'
    route: str                   # 'aider' | 'opencode' | 'mini-swe-agent' | 'cline'
    router_strategy: str | None  # 'heuristic' | 'cascade' | ...
    seed: int | None             # deterministic seed stamped by the orchestrator
    hardware_profile_ref: str    # 'Apple M4 Max|64GB|git<sha>|mh<self-hash>'
    tokens: TokenUsage           # prompt / completion / cached / local_* / cloud_*
    latency: Latency             # wall_ms + per_call_ms[]
    quality: Quality             # functional_pass, tests_passed/total, composite
    routing: Routing             # total_calls, local_calls, cloud_calls
    cloud_model_id: str | None   # 'gpt-5.5', stamped from BenchConfig.models.cloud
    local_model_id: str | None   # 'gemma4:31b'
    config_sha: str | None       # SHA256 of the BenchConfig that produced the row
    error: str | None            # set when the runner failed (timeout, parse, etc.)
```

Three invariants:
- Tokens are persisted, cost is derived. No row contains a `cost_usd` field; cost is computed on read against `configs/pricing/pricing_tables.json`. That lets you re-price an old dataset against new pricing scenarios.
- `always-local` rows always have `cloud_* = 0`. Non-zero cloud tokens in an `always-local` row is a routing bug.
- `error` is set ↔ tokens are zero ↔ quality is all-None. Error rows are excluded from bootstrap CIs by `analysis.bootstrap`.
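
These invariants lend themselves to a mechanical check. The sketch below works on a row loaded as a plain dict and assumes that `tokens` and `quality` are nested objects with numeric and nullable values respectively; the key layout and the function name are assumptions, and for the third invariant it only checks the direction "error set implies zero tokens and all-None quality".

```python
def check_invariants(row: dict) -> list[str]:
    """Return a list of invariant violations for one result row (empty if clean)."""
    problems: list[str] = []
    tokens = row.get("tokens") or {}
    quality = row.get("quality") or {}

    # 1. Cost is derived on read, never stored.
    if "cost_usd" in row:
        problems.append("row stores cost_usd; cost must be derived on read")

    # 2. always-local rows must not carry cloud tokens.
    cloud_tokens = sum(v for k, v in tokens.items() if k.startswith("cloud_"))
    if row.get("router_strategy") == "always-local" and cloud_tokens > 0:
        problems.append("always-local row has non-zero cloud tokens")

    # 3. Error rows have zero tokens and all-None quality.
    if row.get("error") is not None:
        if any(tokens.values()):
            problems.append("error row has non-zero tokens")
        if any(v is not None for v in quality.values()):
            problems.append("error row has non-None quality")

    return problems
```

Running a check like this over every line of raw.jsonl before analysis would surface routing bugs early, before they skew a cell's cost or cloud fraction.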
`./arena analyze <sweep_dir>` runs these in order:

1. `aggregate.py`: per-`(category, route, strategy)` medians and totals.
2. `bootstrap.py`: 95% percentile CIs per cell (non-parametric, 1k resamples). Computes `pass_rate`, `cost_usd`, `cloud_fraction`, and `wall_ms`. Stratifies by `(category, route, router_strategy)` by default; pools across seeds.
3. `decision_matrix.py`: renders the cell × strategy table as Markdown, with a "recommended" column = highest pass-rate at the lowest tying cost.
4. `cost_scenarios.py`: re-prices every row under five pricing scenarios so callers can answer "what would this have cost on gpt-5-mini?".
5. `viz/cost_quality_pareto.py` + `viz/decision_heatmap.py`: PNG charts.

Every step is idempotent and re-readable. Re-running on the same raw.jsonl
produces byte-identical JSON outputs.
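
For readers unfamiliar with percentile bootstrap intervals, here is a minimal standard-library sketch of the technique applied to a pass rate. It is not the code in bootstrap.py; the function name, the fixed default seed, and the index-based percentile rule are assumptions chosen to keep the example deterministic.

```python
import random


def bootstrap_ci(
    values: list[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of `values`."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # fixed seed -> same interval on every run
    n = len(values)
    # Resample with replacement, same size as the original sample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi


# 23 passes out of 24 rows, encoded as 1 = pass, 0 = fail.
outcomes = [1.0] * 23 + [0.0]
low, high = bootstrap_ci(outcomes)
print(f"pass rate {sum(outcomes) / len(outcomes):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Seeding the resampler is one way to get the byte-identical re-runs described above; an unseeded bootstrap would produce slightly different bounds each time.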
Cloud spend is always computed from tokens × pricing rates. The rates
live in configs/pricing/pricing_tables.json (USD per 1,000,000 tokens):

```json
{
  "rates_per_m": {
    "gpt-5.5": { "input": 5.0, "output": 30.0, "cache_read": 0.5 },
    "anthropic-claude-opus-4.7": { "input": 15.0, "output": 75.0, "cache_read": 1.5 },
    "__local__": { "input": 0.0, "output": 0.0, "cache_read": 0.0 }
  }
}
```

`router/pricing.mjs` reads the same JSON file. Parity is verified by `tests/test_pricing_parity.py`: the router and the Python harness compute identical costs for any usage payload.
Cost formula:
usd = (prompt − cached) × input/1e6 + cached × cache_read/1e6 + completion × output/1e6
completion_tokens already includes reasoning_tokens — they're surfaced
in the row for transparency only, not added again.
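
The formula translates directly into a few lines of Python. The sketch below inlines two of the rates from the pricing table instead of reading the JSON file; the function name and argument names are illustrative, not the harness API.

```python
# Rates in USD per 1,000,000 tokens, copied from the pricing table above.
RATES_PER_M = {
    "gpt-5.5": {"input": 5.0, "output": 30.0, "cache_read": 0.5},
    "__local__": {"input": 0.0, "output": 0.0, "cache_read": 0.0},
}


def cost_usd(model: str, prompt: int, cached: int, completion: int) -> float:
    """Cost of one usage payload; `cached` is the cached part of `prompt`."""
    rates = RATES_PER_M[model]
    uncached = prompt - cached
    total_per_m = (
        uncached * rates["input"]
        + cached * rates["cache_read"]
        + completion * rates["output"]  # completion already includes reasoning tokens
    )
    return total_per_m / 1e6


# 10,000 prompt tokens (4,000 of them cached) and 2,000 completion tokens:
# 6,000 x 5.0 + 4,000 x 0.5 + 2,000 x 30.0 = 92,000 -> 0.092 USD
print(cost_usd("gpt-5.5", prompt=10_000, cached=4_000, completion=2_000))  # 0.092
```

Because rows store only tokens, re-pricing a dataset amounts to calling the same function with a different rate table.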
Drop-in recipe for benchmarking a new local model against the v1.4 canonical matrix:

```shell
ollama pull <new-model>
./arena setup   # idempotent; first run only
./arena sweep \
  --config configs/v1.4-canonical-gemma4.yaml \
  --set models.local=<new-model> \
  --set out_dir=results/runs/v1.4-<new-model> \
  --strategies always-cloud,always-local,heuristic,cascade \
  --seeds 42,7,13
```

`./arena setup` checks prereqs (Docker, Ollama, Node, API keys) and builds the sandbox image / installs the agent CLIs. Long-form lifecycle commands (`./arena start` / `pause` / `resume` / `stop` / `status`) are documented inline at `./arena --help` and exist so you can detach the sweep and reclaim the laptop.
Expected runtime on an M4 Max 64 GB: 10–15 hours, ≈ $30–50 cloud spend at
gpt-5.5 list price. The router auto-spawns from models.local; you don't
need a separate router terminal.
When the sweep completes:

```shell
./arena analyze results/runs/v1.4-<new-model>
jq '.cells["refactors::cline::heuristic"].pass_rate' \
  results/runs/v1.4-<new-model>/bootstrap_cis.json
```

That cell (cline + your-model + heuristic + refactors) is the headline number. Compare against the v1.4.1 release notes for context.
Cell-key naming. Cell keys use the same human-readable task-class name end-to-end: `puzzles::aider::heuristic`, `refactors::cline::cascade`, `real-prs::mini-swe-agent::always-local`. Pre-v1.4.3 datasets used single-letter codes (A/B/D); those keys are retired. Re-render legacy datasets with the v1.4.3+ harness to migrate.
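
When post-processing the analysis output, it can help to treat a cell key as structured data rather than a raw string. The helper below is a hypothetical convenience, not part of the harness; it assumes only the `task-class::agent::strategy` layout shown in the examples above.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    task_class: str
    agent: str
    strategy: str

    def __str__(self) -> str:
        # Render back to the "a::b::c" form used in the analysis output.
        return "::".join(self)


def parse_cell_key(key: str) -> CellKey:
    """Parse 'task-class::agent::strategy' and reject malformed keys."""
    parts = key.split("::")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed cell key: {key!r}")
    return CellKey(*parts)


key = parse_cell_key("real-prs::mini-swe-agent::always-local")
print(key.agent)  # mini-swe-agent
print(str(key))   # real-prs::mini-swe-agent::always-local
```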
- No LLM-as-judge for prose tasks. Functional-pass only. The v1.0–v1.3 experiments with Opus-as-judge are documented in CHANGELOG but the judge scorer was deleted in v1.4 — it added too much variance for too little signal on coding tasks.
- No Continue / Cursor / Cody. No headless CLI ⇒ no reproducible driver ⇒ no apples-to-apples cell in the matrix.
- No "average across vendors" cost claims. Every cost number in the release notes is gpt-5.5 specifically. The `cost_scenarios` re-pricing exists for what-if analysis only.
- No statistical claims past 95% CI. The dataset is small (24-row cells are typical); we publish CIs and let readers decide.
- No router-side caching. The cloud's prompt-cache hits are visible in `cached_tokens` and charged at `cache_read`, but the router never inserts its own cache layer that could leak across sweeps.
- One laptop, M4 Max 64 GB. Memory bandwidth (546 GB/s) and the 30B-class quantization sweet spot are both Apple-Silicon-specific. A 4090 + Linux will trade more compute for less RAM; numbers may swing.
- 18 tasks per cell. Bootstrap CIs reflect that — 96% on 24 rows has a CI of roughly [88, 100]. Treat single-point claims with appropriate skepticism; cite the CI.
- Python-only fixtures. TypeScript / Go / Java are obvious next steps; they would change the routing-strategy weights (heuristic's "code block count" heuristic is regex-keyed on Python markers).
- Cloud model is gpt-5.5 across all sweeps. Anthropic Opus / Sonnet cells haven't been run end-to-end; they exist in the pricing table for re-pricing the gpt-5.5 datasets, not for direct comparison.
- Aider's markdown-fence parser rejects some valid completions; the v1.4.0 "23/24 = 96%" cell loses 1/24 to this bug, not a model failure.
- Add a new agent → write a `src/hybrid_arena/agents/<name>.py` with a `run(task, *, proxy_url, ...) -> ResultRow` function, register it in `core/experiment.py:_runner_for`, add `<name>` to the `Agent` literal in `core/config/schema.py`. Tests live in `tests/agents/test_<name>.py`.
- Add a new strategy → write a function in `router/strategies.mjs`, register it in the `STRATEGY_REGISTRY` at the bottom of that file, add the name to `RouteStrategy` in `core/config/schema.py`.
- Add a new task class → create `src/hybrid_arena/tasks/<class>/` with `adapter.py` (loads tasks) and `scorers.py` (scores a row); add the class name to `CATEGORY_SOURCES` in `core/experiment.py` and `TaskClass` in `core/config/schema.py`.
- Add a new pricing scenario → append to `configs/pricing/pricing_tables.json` and to `PRICING_SCENARIOS` in `analysis/cost_scenarios.py`. Both Python and Node pick it up automatically.

The full repo map lives in AGENTS.md for AI coding agents reading the
codebase; README.md is the human-facing entry point.