Should this coding task run on my laptop, the cloud, or split between them? Local vs cloud vs hybrid LLM routing for coding agents. Answer it empirically, on your hardware, with reproducible numbers. From RunAnywhere.
A reproducible benchmark harness that measures four coding agents
(aider, opencode, mini-swe-agent, cline) across eight routing
strategies and three local models, against frontier cloud LLMs, on one
M4 Max 64 GB laptop. Every published number traces back to a single row
in results/runs/<sweep>/raw.jsonl, priced by a versioned pricing table.
1,704 rows · 3 local models · 3 coding agents · 8 routing strategies · 17 tasks · 7 releases in 8 days.
| Cell | Pass-rate | Cloud-fraction | Notes |
|---|---|---|---|
cline + qwen3.6 + cascade + refactors (D1/D5) |
24/24 = 100% | 8% | The cleanest hybrid cell in the benchmark, $0.022/task |
cline + qwen3.6 + always-local + refactors (D6 hard tasks) |
**8/12 = 67%**¹ | 0% | 30B local-only ceiling, zero cloud spend |
cline + qwen3.6 + always-local + puzzles |
15/15 = 100% | 0% | Local-only nails Exercism Python |
aider + gemma4 + heuristic + refactors (D1/D5) |
23/24 = 96% [88, 100] | 34% | v1.3 marquee, replicates |
aider + gemma4 + heuristic + refactors (D6 hard tasks) |
7/12 = 58% | 68% | Where heuristic routing breaks on harder tasks |
aider + gemma4 / cline + qwen3.6 + always-cloud (D6 hard tasks) |
12/12 = 100% | 100% | gpt-5.5 ceiling on D6 |
¹ Conservative reading: 3 of the 4 misses are cline session-management bugs (the model never wrote code), not quality failures. The analyzer, which excludes error rows, scores this cell 8/9 = 89%. We quote the stricter 67% in the headline.
Full numbers, confidence intervals, per-task breakdowns, and the
real-world walkthrough of every task we measured live in
docs/release-notes/v1.5.0.md and
docs/release-notes/v1.4.1.md.
LLM coding agents got good, fast. The cost gap between a frontier cloud model and a 30B local model is now ~100×. The useful question is which tasks can stay on your laptop. This repo measures the answer end to end:
- Same agent, same task, different routes → quality and cost are comparable.
- Per-row tokens, not per-row cost → pricing scenarios swap in at analysis time.
- Bootstrap 95% CIs per cell → "X beats Y" claims rest on the intervals.
- One laptop, no cluster → results are reproducible from a clean clone.
Five minutes to a green smoke run. About an hour to a full canonical sweep.
| Tool | Why | Install (macOS) | Install (Linux) |
|---|---|---|---|
| Python 3.11 or 3.12 | Harness + agent runners | brew install python@3.12 |
sudo apt install python3.12 python3.12-venv |
| git | Clone the repo | (built-in) | sudo apt install git |
| Docker | Sandbox for the functional Python scorer | Docker Desktop | sudo apt install docker.io (+ add user to docker group) |
| Node ≥ 18 | Router proxy (router/server.mjs) |
brew install node |
sudo apt install nodejs npm |
| Ollama | Serves the local model on :11434 |
ollama.com/download |
curl -fsSL https://ollama.com/install.sh | sh |
An OPEN_AI_API_KEY |
The cloud half of every hybrid call | https://platform.openai.com/api-keys | same |
git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena
python3.12 -m venv .venv
.venv/bin/pip install -e ".[dev,agents]"
cp .env.example .env
# edit .env and paste your OPEN_AI_API_KEY./arena setupThis builds the Python sandbox Docker image, installs the cline and
opencode CLIs via npm if missing, and runs a quick health check. It
is idempotent, so you can re-run it any time.
The smoke config runs one task, cloud-only, so you don't need a local model pulled yet:
./arena sweep --config configs/v1.4-smoke.yaml --strategies always-cloud --seeds 42
./arena analyze results/runs/v1.4-smokeIf arena analyze produces a bootstrap_cis.json and a chart, the
harness is wired up correctly.
Pull a local model and run the canonical 4-strategy sweep:
ollama pull gemma4:31b # ~18 GB
./arena sweep \
--config configs/v1.4-canonical-gemma4.yaml \
--strategies always-cloud,always-local,heuristic,cascade \
--seeds 42,7,13
./arena analyze results/runs/v1.4-canonical-gemma4Expected wall-time on M4 Max 64 GB: 10 to 15 hours. Expected cloud spend at gpt-5.5 list pricing: $30 to $50. Pause and resume any time:
./arena pause # frees the laptop, keeps Ollama warm
./arena resume # picks up at the next un-written row
./arena status # PID + row count + RUNNING/PAUSEDWhen it finishes, compare your numbers against the canonical dataset
(gh release download v1.6.0 -p results-v1.6.0.tar.gz).
Three commands:
ollama pull <new-model>
./arena sweep --config configs/v1.4-canonical-gemma4.yaml \
--set models.local=<new-model> \
--set out_dir=results/runs/v1.4-<new-model> \
--strategies always-cloud,always-local,heuristic,cascade --seeds 42,7,13
./arena analyze results/runs/v1.4-<new-model>Headline cell to compare:
jq '.cells["refactors::cline::heuristic"].pass_rate' \
results/runs/v1.4-<new-model>/bootstrap_cis.jsonReference points on the same cell from the canonical sweeps: 96% (qwen3.6), 92% (qwen3-coder), 96% (gemma4, error-adjusted; 71% on the conservative n=24 reading that counts cline-session errors as failures).
| Component | What |
|---|---|
| 4 coding agents | aider · opencode · mini-swe-agent · cline |
| 8 routing strategies | always-cloud · always-local · rules · heuristic · llm-classifier · embedding-knn · cascade · phase-aware |
| 3 task classes | puzzles (Exercism Python, 5 tasks) · refactors (historical class name; the canonical cell is feature-adds (D1) + one-shot scripts (D5), 8 tasks, plus 4 D6 hard single-file builds. True refactor/review shapes (D3/D4) exist but were LLM-judged and sit outside the functional cell) · real-prs (SWE-bench Verified, adapter shipped, sweep is v1.6+ work) |
| 6 pricing scenarios | gpt-5.5 · gpt-5 · gpt-5-mini · claude-opus-4-7 · claude-sonnet-4-6 · claude-haiku-4-5 |
| Functional scoring | Sandboxed Python via Docker (--network none, memory caps, 60s timeout) |
| Statistics | Per-cell bootstrap 95% CIs on pass-rate, cost, cloud-fraction, wall-ms |
See docs/HYBRID_ROUTING_DESIGN.md for
the canonical design doc: what each agent does, how each routing strategy
decides, what each task class measures, and the result schema.
Deliberately the everyday tier: single-function puzzles up to senior single-file builds (an LRU+TTL cache, a recursive-descent template engine). The long-horizon frontier sits beyond it. SWE-Marathon runs multi-hour, whole-repo tasks where Opus 4.8 tops out near 26%. We measured real merged PRs (SWE-bench Verified) in the MVP era; the v1.x adapter ships and the agentic sweep is v1.6 work. We scope to where local-vs-cloud cost matters most.
./arena sweep --config configs/v1.4-canonical-gemma4.yaml
│
├── spawns ONE Node router proxy on :8787
│ (LOCAL_MODEL + CLOUD_MODEL injected from config)
│
└── for each (strategy, seed):
for each (task, agent):
agent.run(task, proxy_url=":8787")
│
├── agent makes N LLM calls through the router
├── router picks local-vs-cloud per call (current strategy)
├── router logs the decision to logs/decisions.jsonl
├── tokens come back via OpenAI-shape `usage` object
│
scorer runs the diff in a Docker sandbox
row written to results/runs/<sweep>/<strategy>/seed-<seed>/raw.jsonl
./arena analyze results/runs/<sweep>/
│
├── aggregate.json # per-cell medians + totals
├── bootstrap_cis.json # 95% CIs on pass_rate / cost / cloud_fraction
├── decision_matrix.md # Markdown table with "Recommended" column
└── charts/ # Pareto scatter + quality/cost heatmaps
Each agent is a thin wrapper around an externally-maintained tool. This repo owns the routing, the scoring, the analysis, and the result schema. It does not try to be a coding agent.
Distilled from the v1.5 leaderboard:
| You want… | Use this config | Why |
|---|---|---|
| Best everyday-task quality + lowest cost | cline + qwen3.6 + cascade |
100% on the D1/D5 tasks at 8% cloud, $0.022/task |
| Zero cloud spend, still serious quality | cline + qwen3.6 + always-local |
100% on puzzles, 67% on D6 hard tasks, $0 cloud |
| Maximum quality, cost is no object | Any agent + always-cloud (gpt-5.5) |
100% across every cell we measured |
| You know the task is hard | Force !cloud on the model field |
Cascade's router cannot always tell hard from easy |
What to avoid:
opencode + qwen models: opencode's prompting fits gemma4, not the qwen variants.aider + heuristicon D6-class tasks: 58% pass while the router still sends 68% of tokens to the cloud (scores less, spends more).mini-swe-agent + any local model: it trailsaiderandclineon this benchmark.
For an overnight sweep you can detach + pause + resume:
./arena start --config configs/v1.4-canonical-qwen3.6.yaml \
--strategies always-cloud,always-local,heuristic,cascade \
--seeds 42,7,13 # detaches, returns immediately
./arena status # PID + row count + RUNNING/PAUSED
./arena pause # frees the laptop, keeps Ollama warm
./arena resume # picks up at the next un-written row
./arena stop # also kills Ollama (~19 GB freed)State lives in /tmp/hcev-sweep.json. Resume is row-level (raw.jsonl is
appended to as rows complete) so a crash mid-sweep loses at most one row.
All arena subcommands are documented in arena <cmd> --help.
| Command | Use |
|---|---|
./arena setup |
One-time Docker + npm + venv setup |
./arena sweep |
Run a sweep (auto-spawns the router) |
./arena start / pause / resume / stop / status |
Long-sweep lifecycle |
./arena run |
Single-pass run (no router auto-spawn; advanced) |
./arena analyze |
Per-cell medians + bootstrap CIs + charts |
./arena token-budget |
Token-first matrix re-priced under all scenarios |
./arena rejudge |
Re-run the LLM-judge on completed prose rows |
./arena rescore |
Re-run the functional scorer with a fresh sandbox image |
./arena env-detect |
Capture hardware + software snapshot |
./arena show-config |
Print the merged config + its SHA256 |
./arena schema |
Regenerate configs/schema.json from the Pydantic model |
hybrid-arena/
├── README.md ← you are here
├── AGENTS.md ← canonical guide for AI coding agents reading the codebase
├── CHANGELOG.md ← release history
├── CONTRIBUTING.md ← how to add a model / agent / strategy
├── CODE_OF_CONDUCT.md ← short and direct
├── SECURITY.md ← vuln-report channel
├── LICENSE ← MIT
├── bench ← top-level CLI dispatcher
├── .github/workflows/ci.yml ← pytest + ruff on 3.11 / 3.12
│
├── src/hybrid_arena/
│ ├── core/ ← config, experiment, metrics, pricing, paths, sandbox
│ ├── agents/ ← aider, opencode, mini_swe, cline
│ ├── scorers/ ← functional (Docker), swebench
│ ├── tasks/ ← puzzles, refactors, real_prs
│ ├── analysis/ ← aggregate, bootstrap, decision_matrix, cost_scenarios
│ ├── viz/ ← Pareto + heatmap charts
│ └── cli/ ← bench dispatcher + subcommands
│
├── router/ ← zero-deps Node hybrid proxy on :8787
├── configs/
│ ├── v1.4-canonical-{gemma4,qwen3-coder,qwen3.6}.yaml
│ ├── v1.4-{smoke,strategy-sweep,opencode-fairness,real-prs}.yaml
│ ├── v1.5-hard-{gemma4,qwen3.6,smoke}.yaml
│ ├── pricing/pricing_tables.json ← shared by Python and Node, SHA256-pinned
│ └── router/corpus.json ← embedding-kNN labelled training data
│
├── tests/ ← pytest (CI runs all, 120 fast tests)
├── results/ ← v1.0 to v1.3 datasets tracked; v1.4+ as release tarballs
└── docs/
├── HYBRID_ROUTING_DESIGN.md ← THE design doc (strategies + agents + methodology)
└── release-notes/v1.*.md ← per-release findings
Every row carries task_id, route, router_strategy, seed,
cloud_model_id, local_model_id, config_sha, hardware_profile_ref.
Costs are derived from tokens × pinned pricing at analyze-time, so a pricing
edit flows through ./arena analyze without re-running inference. The harness
logs the pricing table SHA256 with each import.
The Node router and the Python harness both read the same
configs/pricing/pricing_tables.json and compute identical costs
(verified by tests/test_pricing_parity.py).
Code, datasets, charts, and docs prose are all MIT-licensed. See
LICENSE.
If you use this benchmark or its data in your own work, please cite it. A citation is how a small research project gets seen:
@misc{monga2026hybridarena,
author = {Monga, Sanchit and contributors},
title = {Hybrid Coding Arena: reproducible cost/latency/quality
benchmark for local vs cloud vs hybrid LLM routing on
coding tasks},
year = {2026},
howpublished = {\url{https://github.com/RunanywhereAI/hybrid-arena}},
note = {Version 1.6.0}
}Third-party tools driven by this harness:
- aider (Apache 2.0)
- opencode (MIT)
- cline (Apache 2.0)
- mini-swe-agent (MIT)
- Aider polyglot benchmark: source of the 5 puzzle tasks (MIT, derived from Exercism)
docs/REPRODUCING.md: clean-clone to green-charts to compare-against-canonical, step by step.docs/HYBRID_ROUTING_DESIGN.md: the single canonical design doc (strategies, agents, schema, methodology).docs/release-notes/v1.5.0.md: the most recent findings (D6 hard-task stress test).docs/release-notes/v1.4.1.md: the canonical 3-model leaderboard.AGENTS.md: folder-by-folder map for AI coding agents reading the codebase.CONTRIBUTING.md: how to add a model, agent, strategy, or task class.SECURITY.md: the vulnerability-disclosure channel.
Questions, reproduction issues, or new-model requests? File an issue: https://github.com/RunanywhereAI/hybrid-arena/issues



