Hybrid routing design

Single-source reference for what this benchmark measures, how the router decides local-vs-cloud, which coding agents we run, and why the numbers in docs/release-notes/ are defensible. If you're new, read top-to-bottom; if you're benchmarking a new model, jump to §9.

1. The question

Given one developer-class laptop (M-series Mac, 64 GB), one local LLM, and access to a frontier cloud LLM — for which coding task shapes is it worth routing some calls locally instead of sending everything to the cloud?

We answer it empirically. For every cell (local-model, agent, task-class, strategy, seed) we run the same set of tasks and report:

Pass-rate: fraction of tasks that pass functional tests, with a non-parametric bootstrap 95% CI.
Cost: USD spent on the cloud LLM, derived from token counts and a versioned pricing table.
Cloud fraction: Σ cloud tokens ÷ Σ all tokens. Token-based, never call-count-based, because call counts are skewed by the extra meta-calls that cascade and llm-classifier make.
Wall time: median wall-clock ms per task.

Every number in docs/release-notes/*.md is a slice of those four metrics across the 1,644 rows in v1.4.
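To make the token-based definition of cloud fraction concrete, here is a minimal sketch. The dict keys `local_tokens` and `cloud_tokens` and the function name are assumptions for illustration, not the harness's actual field names.

```python
from typing import Iterable, Mapping


def cloud_fraction(rows: Iterable[Mapping[str, int]]) -> float:
    """Sum of cloud tokens divided by sum of all tokens across rows."""
    cloud = 0
    total = 0
    for row in rows:
        cloud += row["cloud_tokens"]
        total += row["cloud_tokens"] + row["local_tokens"]
    # A cell with no tokens at all reports 0.0 instead of dividing by zero.
    return cloud / total if total else 0.0


rows = [
    {"local_tokens": 9_000, "cloud_tokens": 1_000},  # mostly local
    {"local_tokens": 0, "cloud_tokens": 10_000},     # fully cloud
]
print(cloud_fraction(rows))  # 11,000 cloud tokens out of 20,000 total
```

A call-count version of the same two rows would report one number per call regardless of size, which is why the token-weighted ratio is the one reported.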

2. Architecture, one diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                      ./arena sweep --config X.yaml                       │
│                                                                          │
│  ┌─────────────────────┐                                                 │
│  │  cli/bench.py       │  spawns one Node router proxy with              │
│  │  (orchestrator)     │  LOCAL_MODEL + CLOUD_MODEL injected             │
│  └─────────┬───────────┘                                                 │
│            │                                                             │
│            ▼                                                             │
│  ┌─────────────────────┐         ┌──────────────────────────┐            │
│  │  core/experiment.py │ for each│  agents/<agent>.py        │           │
│  │  build_task_plan +  │ (task,  │  (one runner per agent —  │ subprocess│
│  │  run_pair() loop    │  agent) │  aider / opencode / mini- │           │
│  └─────────┬───────────┘─────────►  swe-agent / cline)       │           │
│            │                     └─────────────┬────────────┘            │
│            │                                   │                         │
│            │           OpenAI-compatible HTTP  ▼                         │
│            │                          ┌────────────────────────┐         │
│            │                          │ router/server.mjs      │         │
│            │                          │ + router/strategies.mjs│         │
│            │                          │ (decides local|cloud)  │         │
│            │                          └─────┬──────────┬───────┘         │
│            │                                │          │                 │
│            │                                ▼          ▼                 │
│            │                            Ollama      OpenAI               │
│            │                          (local model) (cloud model)        │
│            │                                                             │
│            ▼                                                             │
│  ┌─────────────────────┐                                                 │
│  │ scorers/functional_ │  Docker sandbox runs the agent's diff           │
│  │ python (sandbox)    │  against the fixture's pytest suite             │
│  └─────────┬───────────┘                                                 │
│            ▼                                                             │
│  results/runs/<sweep>/raw.jsonl   (one ResultRow per task × agent ×      │
│                                    strategy × seed; tokens never cost,   │
│                                    cost is derived at analyse-time)      │
└─────────────────────────────────────────────────────────────────────────┘

The Python orchestrator owns the experiment loop. The Node router owns the routing decision. The two communicate via OpenAI-compatible HTTP, which lets every off-the-shelf coding agent talk to the router as if it were OpenAI.

3. The four coding agents

We measure agent-loop routes only: agents that own their own multi-turn tool use. This repo does not try to be an agent itself. Each agent is a battle-tested, externally maintained tool that we wrap thinly.

Agent	Loop style	Strengths	Caveats
`aider`	architect → editor	Tight diffs, parsable patches	Markdown-fence parse bug on some local models
`opencode`	free-form tool calls	High ceiling when tool calls are clean	gemma4-specific in v1.4 — qwen variants fail parsing
`mini-swe-agent`	minimalist bash-only ReAct	Closest to SWE-bench reference; small surface area	Needs Docker for SWE-bench Verified
`cline`	Plan / Act with 8-14 turns	Iteration wins puzzles + refactors at 30B local	Highest token cost per cell

Each wrapper lives in src/hybrid_arena/agents/<agent>.py and exposes one run(task, *, proxy_url, ...) -> ResultRow function. The orchestrator calls them through core/experiment.py:_runner_for(agent).
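The wrapper contract above can be pictured as a small registry keyed by agent name. This is a sketch only: the stub runners, the dict-based `ResultRow` stand-in, the `runner_for` name, and the proxy URL are all hypothetical, and a real wrapper would spawn the agent CLI as a subprocess.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the real ResultRow class.
ResultRow = Dict[str, Any]
Runner = Callable[..., ResultRow]


def _stub_runner(agent: str) -> Runner:
    """Build a placeholder runner that only echoes its inputs."""
    def run(task: Dict[str, Any], *, proxy_url: str, **kwargs: Any) -> ResultRow:
        return {"task_id": task["id"], "route": agent, "proxy_url": proxy_url}
    return run


RUNNERS: Dict[str, Runner] = {
    name: _stub_runner(name)
    for name in ("aider", "opencode", "mini-swe-agent", "cline")
}


def runner_for(agent: str) -> Runner:
    """Look up the runner for an agent, failing loudly on unknown names."""
    try:
        return RUNNERS[agent]
    except KeyError:
        raise ValueError(f"unknown agent: {agent!r}") from None


# The URL is an illustrative placeholder, not the router's documented address.
row = runner_for("cline")({"id": "exercism-python/grep"}, proxy_url="http://localhost:8080/v1")
print(row["route"])
```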

Why not Cursor / Continue / Cody? They were considered and dropped because either (a) they don't expose a headless CLI, or (b) they require a managed account. The four agents above can all be driven from a script with no UI.

4. The eight routing strategies

The router proxy reads each request, picks a backend (local or cloud), and forwards the body. Strategies live in router/strategies.mjs. The Python side never knows which backend served a call — it only sees the token counts in the OpenAI usage object.

Strategy	Decision rule	Use when
`always-cloud`	Send every request to the cloud model. Control.	Baseline — set the ceiling
`always-local`	Send every request to the local model. Control.	Baseline — set the floor
`rules`	Keyword + regex rules (e.g. "refactor" → local)	Demos / debugging
`heuristic`	Weighted scoring on prompt length, code-block count, agent state	First strategy to actually try in prod
`llm-classifier`	One `qwen3:0.6b` call returns `SIMPLE` or `COMPLEX`	When you trust a tiny model's taste
`embedding-knn`	`nomic-embed-text` query → kNN vote against a 50-example labelled corpus	Stable, no-LLM-overhead routing
`cascade`	Heuristic decides first; on borderline confidence, llm-classifier tie-breaks	Highest pass-rate hybrid in v1.4
`phase-aware`	Like heuristic, but with an `aider-architect-step` bonus for `aider`	Aider-specific tuning

cascade-tuned is the same as cascade with the ROUTER_CASCADE_THRESHOLD env var injected per pass — used for the v1.3 threshold sweep.
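The cascade idea can be sketched in a few lines. This is a Python illustration only; the real strategies live in router/strategies.mjs. The features, weights, the 0.5 midpoint, and the `band` parameter are invented for the example, and how the real router interprets ROUTER_CASCADE_THRESHOLD is not specified here. The classifier is passed in as a plain callable so that the sketch needs no model.

```python
from typing import Callable, Tuple

FENCE = "`" * 3  # markdown code-fence marker


def heuristic_score(prompt: str) -> float:
    """Toy complexity score in [0, 1]; higher means 'more likely to need the cloud'."""
    length_part = min(len(prompt) / 4000, 1.0)   # long prompts look harder
    blocks = prompt.count(FENCE) // 2            # fenced code blocks in the prompt
    block_part = min(blocks / 4, 1.0)
    return 0.6 * length_part + 0.4 * block_part


def cascade(prompt: str, classify: Callable[[str], str], band: float = 0.15) -> Tuple[str, str]:
    """Return (backend, reason); the classifier is consulted only on borderline scores."""
    score = heuristic_score(prompt)
    if abs(score - 0.5) > band:
        return ("cloud" if score > 0.5 else "local", "heuristic")
    label = classify(prompt)  # expected to return 'SIMPLE' or 'COMPLEX'
    return ("cloud" if label == "COMPLEX" else "local", "llm-classifier")


# A short prompt scores far from the midpoint, so the classifier is never called.
print(cascade("rename this variable", classify=lambda p: "SIMPLE"))
```

Widening `band` sends more requests to the classifier; narrowing it trusts the heuristic more often.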

Every routing decision is appended to router/logs/decisions.jsonl with the score, reason, and the picked backend. That file is the single source of truth for routing audits.

Append `!local` / `!cloud` to force

Any agent can override the strategy on a single call by appending !local or !cloud to the model field. We use this in tests; agents in normal sweeps don't.
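A minimal sketch of how such a suffix could be split off the model field follows. The function name and the exact parsing rules are assumptions; the router's real handling is in the Node code and is not shown in this document.

```python
from typing import Optional, Tuple

SUFFIXES = {"!local": "local", "!cloud": "cloud"}


def split_override(model: str) -> Tuple[str, Optional[str]]:
    """Split 'name!local' into ('name', 'local'); return (model, None) when no suffix is present."""
    for suffix, backend in SUFFIXES.items():
        if model.endswith(suffix):
            return model[: -len(suffix)], backend
    return model, None


print(split_override("gemma4:31b!local"))
print(split_override("gpt-5.5"))
```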

5. Task classes

Class	Source	Shape	Count in v1.4
`puzzles`	Exercism Python (via Aider polyglot benchmark)	Single-function, single-file tasks with hidden tests	5
`refactors`	Hand-written, real-PR-shaped tasks in single-file fixture repos	Historical class name. The functional canonical cell is feature-adds (D1) + one-shot scripts (D5); v1.5 adds 4 D6 hard single-file builds. True refactor (D3) and code-review (D4) shapes exist in the source but were LLM-judged and are not in the functional cell.	12 (8 D1/D5 + 4 D6)
`real-prs`	SWE-bench Verified subset	Repo-level patches against Docker testbeds	(v1.6+ work; adapter shipped)

Each task adapter lives in src/hybrid_arena/tasks/<class>/. A task is a small dataclass: id, fixture_path, prompt, run_cmd. Scoring is per class — puzzles uses pytest in a Docker sandbox; refactors uses a dispatcher in tasks/refactors/scorers.py that picks the right per-task checker (e.g. "does the diff add the rate-limit guard?").
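As a rough picture of the task shape and the per-task dispatcher, consider the sketch below. Only the four field names come from the description above; the field types, the task id, the checker, and the `CHECKERS` registry are hypothetical, and the string-matching check is far cruder than a real scorer would be.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


@dataclass(frozen=True)
class Task:
    id: str
    fixture_path: Path
    prompt: str
    run_cmd: str


def adds_rate_limit_guard(diff: str) -> bool:
    """Hypothetical checker: does any added line mention a rate-limit call?"""
    return any(
        line.startswith("+") and not line.startswith("+++") and "rate_limit" in line
        for line in diff.splitlines()
    )


# Hypothetical registry mapping task ids to their checker.
CHECKERS: Dict[str, Callable[[str], bool]] = {
    "refactors/add-rate-limit": adds_rate_limit_guard,
}


def score(task: Task, diff: str) -> bool:
    """Dispatch to the checker registered for this task id."""
    return CHECKERS[task.id](diff)


task = Task("refactors/add-rate-limit", Path("fixtures/api"), "Add a rate-limit guard.", "pytest -q")
diff = "--- a/api.py\n+++ b/api.py\n+    check_rate_limit(user)\n"
print(score(task, diff))
```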

6. The result schema

Every row in results/runs/<sweep>/raw.jsonl is a ResultRow (see src/hybrid_arena/core/metrics.py):

@dataclass
class ResultRow:
    task_id: str                 # 'exercism-python/grep'
    category: str                # 'puzzles' | 'refactors' | 'real-prs'
    route: str                   # 'aider' | 'opencode' | 'mini-swe-agent' | 'cline'
    router_strategy: str | None  # 'heuristic' | 'cascade' | ...
    seed: int | None             # deterministic seed stamped by the orchestrator
    hardware_profile_ref: str    # 'Apple M4 Max|64GB|git<sha>|mh<self-hash>'

    tokens: TokenUsage           # prompt / completion / cached / local_* / cloud_*
    latency: Latency             # wall_ms + per_call_ms[]
    quality: Quality             # functional_pass, tests_passed/total, composite
    routing: Routing             # total_calls, local_calls, cloud_calls

    cloud_model_id: str | None   # 'gpt-5.5' — stamped from BenchConfig.models.cloud
    local_model_id: str | None   # 'gemma4:31b'
    config_sha: str | None       # SHA256 of the BenchConfig that produced the row
    error: str | None            # set when the runner failed (timeout, parse, etc.)

Three invariants:

Tokens are persisted, cost is derived. No row contains a cost_usd field; cost is computed on read against configs/pricing/pricing_tables.json. That lets you re-price an old dataset against new pricing scenarios.
always-local rows always have cloud_* = 0. Non-zero cloud tokens in an always-local row indicate a routing bug.
error is set ↔ tokens are zero ↔ quality is all-None. Error rows are excluded from bootstrap CIs by analysis.bootstrap.
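The three invariants lend themselves to a mechanical check. The sketch below works on plain dicts rather than the real ResultRow, and the nested key names (`local_prompt`, `cloud_prompt`, and so on) are assumptions made for illustration.

```python
from typing import Any, Dict, List


def invariant_violations(row: Dict[str, Any]) -> List[str]:
    """Return human-readable violations; an empty list means the row is consistent."""
    problems: List[str] = []
    tokens: Dict[str, int] = row["tokens"]
    quality: Dict[str, Any] = row["quality"]

    # Invariant 1: cost is derived at read time, never stored.
    if "cost_usd" in row:
        problems.append("row persists cost_usd")

    # Invariant 2: always-local rows carry no cloud tokens.
    cloud_total = sum(v for k, v in tokens.items() if k.startswith("cloud_"))
    if row["router_strategy"] == "always-local" and cloud_total > 0:
        problems.append("always-local row has cloud tokens")

    # Invariant 3: error set <-> all tokens zero <-> all quality fields None.
    has_error = row["error"] is not None
    tokens_zero = all(v == 0 for v in tokens.values())
    quality_none = all(v is None for v in quality.values())
    if not (has_error == tokens_zero == quality_none):
        problems.append("error / tokens / quality disagree")
    return problems


bad_row = {
    "router_strategy": "always-local",
    "error": None,
    "tokens": {"local_prompt": 1200, "local_completion": 300, "cloud_prompt": 50, "cloud_completion": 0},
    "quality": {"functional_pass": True},
}
print(invariant_violations(bad_row))
```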

7. The analysis pipeline

./arena analyze <sweep_dir> runs these in order:

aggregate.py: per-(category, route, strategy) medians and totals.
bootstrap.py: 95% percentile CIs per cell (non-parametric, 1k resamples). Computes pass_rate, cost_usd, cloud_fraction, and wall_ms. Stratifies by (category, route, router_strategy) by default; pools across seeds.
decision_matrix.py: renders the cell × strategy table as Markdown, with a "recommended" column that picks the highest pass-rate and breaks ties by lowest cost.
cost_scenarios.py: re-prices every row under five pricing scenarios so callers can answer "what would this have cost on gpt-5-mini?".
viz/cost_quality_pareto.py + viz/decision_heatmap.py: PNG charts.

Every step is idempotent and re-runnable: re-running on the same raw.jsonl produces byte-identical JSON outputs.
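For readers unfamiliar with percentile bootstrap intervals, here is a minimal sketch of the technique for a pass-rate. It is not the code in bootstrap.py; the function name, the index convention for the percentiles, and the fixed seed are assumptions. A fixed seed is one way to keep repeated runs identical.

```python
import random
from typing import List, Tuple


def bootstrap_ci(passes: List[int], n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile 95% CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # seeded so the same input always gives the same interval
    n = len(passes)
    # Resample n outcomes with replacement, n_resamples times, and record each mean.
    means = sorted(sum(rng.choices(passes, k=n)) / n for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 23 + [0]  # 23 passes out of 24 rows
print(bootstrap_ci(outcomes))
```

With only 24 rows the interval is wide, which is the reason single-point pass-rates are always reported alongside their CI.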

8. The pricing table

Cloud spend is always computed from tokens × pricing rates. The rates live in configs/pricing/pricing_tables.json (USD per 1,000,000 tokens):

{
  "rates_per_m": {
    "gpt-5.5":                    { "input": 5.0,  "output": 30.0, "cache_read": 0.5 },
    "anthropic-claude-opus-4.7":  { "input": 15.0, "output": 75.0, "cache_read": 1.5 },
    "__local__":                  { "input": 0.0,  "output": 0.0,  "cache_read": 0.0 }
  }
}

router/pricing.mjs reads the same JSON file. Parity is verified by tests/test_pricing_parity.py — the router and the Python harness compute identical costs for any usage payload.

Cost formula:

usd = (prompt − cached) × input/1e6 + cached × cache_read/1e6 + completion × output/1e6

completion_tokens already includes reasoning_tokens — they're surfaced in the row for transparency only, not added again.
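A worked example of the formula, using the gpt-5.5 rates from the table above. The function name and the token counts are made up for illustration.

```python
# Rates in USD per 1,000,000 tokens, copied from the pricing table above.
RATES_PER_M = {
    "gpt-5.5": {"input": 5.0, "output": 30.0, "cache_read": 0.5},
    "__local__": {"input": 0.0, "output": 0.0, "cache_read": 0.0},
}


def cost_usd(model: str, prompt: int, cached: int, completion: int) -> float:
    """Apply the cost formula; cached tokens are billed at cache_read, not input."""
    r = RATES_PER_M[model]
    return (
        (prompt - cached) * r["input"]
        + cached * r["cache_read"]
        + completion * r["output"]   # completion already includes reasoning tokens
    ) / 1e6


# 10,000 prompt tokens (4,000 of them cached) and 2,000 completion tokens:
# 6,000 * 5 + 4,000 * 0.5 + 2,000 * 30 = 92,000, divided by 1e6.
print(round(cost_usd("gpt-5.5", prompt=10_000, cached=4_000, completion=2_000), 6))
print(cost_usd("__local__", prompt=10_000, cached=0, completion=2_000))
```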

9. Add a new local model

Drop-in recipe for benchmarking a new local model against the v1.4 canonical matrix:

ollama pull <new-model>
./arena setup                # idempotent; first run only
./arena sweep \
    --config configs/v1.4-canonical-gemma4.yaml \
    --set models.local=<new-model> \
    --set out_dir=results/runs/v1.4-<new-model> \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

./arena setup checks prereqs (Docker, Ollama, Node, API keys) and builds the sandbox image / installs the agent CLIs. Long-form lifecycle commands (./arena start / pause / resume / stop / status) are documented inline at ./arena --help and exist so you can detach the sweep and reclaim the laptop.

Expected runtime on an M4 Max 64 GB: 10–15 hours, ≈ $30–50 cloud spend at gpt-5.5 list price. The router auto-spawns from models.local; you don't need a separate router terminal.

When the sweep completes:

./arena analyze results/runs/v1.4-<new-model>
jq '.cells["refactors::cline::heuristic"].pass_rate' \
   results/runs/v1.4-<new-model>/bootstrap_cis.json

That cell — cline + your-model + heuristic + refactors — is the headline number. Compare against the v1.4.1 release notes for context.

Cell-key naming. Cell keys use the same human-readable task-class name end-to-end: puzzles::aider::heuristic, refactors::cline::cascade, real-prs::mini-swe-agent::always-local. Pre-v1.4.3 datasets used single-letter codes (A/B/D); those keys are retired — re-render legacy datasets with the v1.4.3+ harness to migrate.
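If you script against bootstrap_cis.json, a small helper for building and splitting cell keys can prevent typos. The `CellKey` class below is a hypothetical convenience, not part of the harness; it only encodes the `category::route::strategy` layout shown above.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    category: str
    route: str
    strategy: str

    def __str__(self) -> str:
        # Join the three fields with the '::' separator used in cell keys.
        return "::".join(self)

    @classmethod
    def parse(cls, key: str) -> "CellKey":
        parts = key.split("::")
        if len(parts) != 3:
            raise ValueError(f"expected category::route::strategy, got {key!r}")
        return cls(*parts)


key = CellKey("refactors", "cline", "heuristic")
print(str(key))
print(CellKey.parse("real-prs::mini-swe-agent::always-local").route)
```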

10. What we deliberately do NOT do

No LLM-as-judge for prose tasks. Functional-pass only. The v1.0–v1.3 experiments with Opus-as-judge are documented in CHANGELOG but the judge scorer was deleted in v1.4 — it added too much variance for too little signal on coding tasks.
No Continue / Cursor / Cody. No headless CLI ⇒ no reproducible driver ⇒ no apples-to-apples cell in the matrix.
No "average across vendors" cost claims. Every cost number in the release notes is gpt-5.5 specifically. The cost_scenarios re-pricing exists for what-if analysis only.
No statistical claims past 95% CI. The dataset is small (24-row cells are typical); we publish CIs and let readers decide.
No router-side caching. The cloud's prompt-cache hits are visible in cached_tokens and charged at cache_read, but the router never inserts its own cache layer that could leak across sweeps.

11. Limitations + threats to validity

One laptop, M4 Max 64 GB. Memory bandwidth (546 GB/s) and the 30B-class quantization sweet spot are both Apple-Silicon-specific. A 4090 + Linux will trade more compute for less RAM; numbers may swing.
18 tasks per cell. Bootstrap CIs reflect that: a 96% pass-rate on 24 rows has a 95% CI of roughly [88%, 100%]. Treat single-point claims with appropriate skepticism; cite the CI.
Python-only fixtures. TypeScript / Go / Java are obvious next steps; they would change the routing-strategy weights (the heuristic strategy's "code block count" feature is regex-keyed on Python markers).
Cloud model is gpt-5.5 across all sweeps. Anthropic Opus / Sonnet cells haven't been run end-to-end; they exist in the pricing table for re-pricing the gpt-5.5 datasets, not for direct comparison.
Aider's markdown-fence parser rejects some valid completions; the v1.4.0 "23/24 = 96%" cell loses 1/24 to this bug, not a model failure.

12. Pointers

Add a new agent → write a src/hybrid_arena/agents/<name>.py with a run(task, *, proxy_url, ...) -> ResultRow function, register it in core/experiment.py:_runner_for, add <name> to the Agent literal in core/config/schema.py. Tests live in tests/agents/test_<name>.py.
Add a new strategy → write a function in router/strategies.mjs, register it in the STRATEGY_REGISTRY at the bottom of that file, add the name to RouteStrategy in core/config/schema.py.
Add a new task class → create src/hybrid_arena/tasks/<class>/ with adapter.py (loads tasks) and scorers.py (scores a row); add the class name to CATEGORY_SOURCES in core/experiment.py and TaskClass in core/config/schema.py.
Add a new pricing scenario → append to configs/pricing/pricing_tables.json and to PRICING_SCENARIOS in analysis/cost_scenarios.py. Both Python and Node pick it up automatically.

The full repo map lives in AGENTS.md for AI coding agents reading the codebase; README.md is the human-facing entry point.
