No new benchmark data in this release. v1.4.2 is a code, docs, and reproducibility cleanup pass aimed at making the repository easy for a first-time user to clone, configure, and run.

    git clone https://github.com/RunanywhereAI/hybrid-arena
    cd hybrid-arena
    ./scripts/reproduce.sh --smoke

That single command now checks every prerequisite (Python 3.11+, Docker, Ollama, Node, jq, `.env` with `OPEN_AI_API_KEY`), creates the venv, installs the package editable, runs `./arena setup`, runs the 1-task smoke sweep, and analyses the result. ~30 seconds, ~$0.01 cloud spend.
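
For readers who want to see what such a prerequisite check involves before running the script, here is a minimal Python sketch. It is not the code in `scripts/reproduce.sh`; the function name `check_prerequisites` and its parameters are hypothetical, and the `.env` parsing is a simplified assumption (one `NAME=value` pair per line).

```python
import shutil
import sys
from pathlib import Path

# Command-line tools named in the release notes.
REQUIRED_TOOLS = ("docker", "ollama", "node", "jq")


def check_prerequisites(env_path=".env", key="OPEN_AI_API_KEY",
                        tools=REQUIRED_TOOLS, min_python=(3, 11)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []

    # Interpreter version check.
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")

    # Each tool must be resolvable on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # The .env file must exist and define a non-empty value for the key.
    env_file = Path(env_path)
    if not env_file.is_file():
        problems.append(f"{env_path} is missing")
    else:
        defined = set()
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                name, value = line.split("=", 1)
                if value.strip():
                    defined.add(name.strip())
        if key not in defined:
            problems.append(f"{key} is not set in {env_path}")

    return problems
```

Calling `check_prerequisites()` from the repository root and printing each returned string gives a quick readiness report; collecting every problem in one pass, rather than stopping at the first, spares a first-time user repeated reruns.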
For a full canonical sweep:

    ./scripts/reproduce.sh \
      --config configs/v1.4-canonical-gemma4.yaml \
      --strategies always-cloud,always-local,heuristic,cascade \
      --seeds 42,7,13

The following changes were already merged on `main` but first surface in this release:
- Bootstrap cost CI now reads from `configs/pricing/pricing_tables.json` instead of an empty per-row `cost_usd` field that older datasets never set. Cost CIs were silently zero for some cells in v1.4.0/v1.4.1 analyses; fixed.
- Cloud-fraction is token-based, not call-count-based, and that's the canonical definition everywhere now (router, analysis, release notes).
- Bootstrap `stratify_by` is respected. Pre-v1.4.2 the parameter existed but was unused.
- Seed is stamped on every row via the new `--seed` flag through `run_pair(seed=...)`.
- Refactor scoring now accepts both `refactors` and the legacy `real_dev` source name; this fixes a silent skip bug in `score_row`.
- Router forwards `CLOUD_MODEL` from the config to the spawned proxy. Cloud-model overrides previously needed manual env-var injection.
- `arena analyze` walks subdirectories. Pass it the sweep root and it analyses every `<strategy>/seed-<seed>/raw.jsonl` it finds.
- `arena setup` fails fast (10 s) when the Docker daemon is down, instead of hanging on the image-inspect call.
- Aider pytest summary parser correctly extracts `tests_passed`/`tests_total` from `pytest`'s output (was always reporting 0/1 or 1/1).
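
To show why the token-based definition of cloud-fraction matters, the sketch below compares it with the call-count definition on a small made-up trace. The `Call` record and both function names are hypothetical illustrations, not the router's or the analysis module's actual types.

```python
from dataclasses import dataclass


@dataclass
class Call:
    backend: str  # "cloud" or "local"
    tokens: int   # total tokens (prompt + completion) for this call


def cloud_fraction_by_tokens(calls):
    """Share of all tokens that were served by the cloud backend."""
    total = sum(c.tokens for c in calls)
    if total == 0:
        return 0.0
    return sum(c.tokens for c in calls if c.backend == "cloud") / total


def cloud_fraction_by_calls(calls):
    """Share of calls that went to the cloud backend (the older notion)."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c.backend == "cloud") / len(calls)


# Illustrative trace: three small local calls and one large cloud call.
calls = [
    Call("local", 500),
    Call("local", 500),
    Call("local", 1000),
    Call("cloud", 6000),
]

print(cloud_fraction_by_tokens(calls))  # 0.75
print(cloud_fraction_by_calls(calls))   # 0.25
```

Here a single cloud call carries most of the tokens, so counting calls reports 25% cloud usage while the token-based figure is 75%. Because cloud spend is priced per token, the token-based figure tracks cost more closely.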

Removed or consolidated in this release:
- `analysis/arqgc.py`, `analysis/decision_matrix_v2.py`: unused; the v1.4 decision-matrix rewrite ranks by pass-rate then median cost.
- `agents/claude_code.py`: reserved for a future v1.5; v1.4.2 ships a four-agent surface (`aider`, `opencode`, `mini-swe-agent`, `cline`).
- `scorers/llm_judge.py`: already gone in v1.4.0; release notes updated to reflect that.
- `docs/REPRODUCING.md`, `docs/ARCHITECTURE.md`, `docs/METHODOLOGY.md`, `docs/ROUTING_STRATEGIES.md`, `docs/AGENTIC_ROUTES.md`, `docs/HYBRID_ROUTER_DESIGN.md`, `docs/PRIOR_ART.md`, `docs/BENCHMARK_NEW_MODEL.md`, `docs/audits/`: consolidated into a single canonical design doc.

| File | What it is |
|---|---|
| `docs/HYBRID_ROUTING_DESIGN.md` | The single canonical design doc (strategies + agents + schema + recipe). |
| `README.md` | Rewritten: TL;DR results, 15-min quickstart, repo layout. |
| `AGENTS.md` | Rewritten: folder-by-folder map for AI agents reading the codebase. |
| `CONTRIBUTING.md` | Rewritten add-a-model / agent / strategy / task-class recipes. |
| `SECURITY.md` | New: vulnerability-disclosure channel. |
| `scripts/reproduce.sh` | New: one-command reproducer. |

Other file-level cleanups:
- `LICENSE`, `LICENSE-DATA`, `LICENSE.md`, `NOTICE.md` rewritten: every path in them now actually exists in the repo (no more references to deleted `runners/`, `EXTERNAL/`, `vendor/minions/`, `vendor/lm-eval-harness-judge/`, `bin/`, `benchmark/`, `lib/`).
- `CODE_OF_CONDUCT.md`: private email reporting channel (conduct@runanywhere.ai) instead of public "open an issue with conduct label".
- `CHANGELOG.md`: `[1.4.1]` reference link restored.
- `pyproject.toml`: `ruff.extend-exclude` points at `tasks/` (the v1.4 fixture roots), not the deleted `benchmarks/` paths.
- `requirements.txt`: synced with `pyproject.toml`'s `[project.dependencies]`, grouped Core / Viz / Optional.
- `.env.example`: dropped the deleted-in-v1.4 `llm_judge` reference; added the v1.4.1 `ROUTER_LOCAL_*` guard env vars.
- Two `personal/raw-runs/v4*.yaml` files that were committed despite the `personal/` gitignore are now untracked.

What did not change:
- No new benchmark rows. All four canonical v1.4 datasets (`v1.4-canonical-gemma4`, `v1.4-canonical-qwen3-coder`, `v1.4-canonical-qwen3.6`, `v1.4-real-prs`) ship verbatim from v1.4.1; they're still the 1,644-row record.
- No agent additions. Same four agents: `aider`, `opencode`, `mini-swe-agent`, `cline`.
- No routing-strategy additions. Same eight strategies.

Verification for this release:
- `pytest -m 'not slow'`: 102 passed, 9 skipped (Docker-not-built / no-Ollama). With Docker up, 111 passed, 0 skipped.
- `ruff check src/ tests/`: all clean.
- `./arena --help` / `setup --help` / `show-config --help`: all working post-cleanup.
- End-to-end smoke sweep via `./scripts/reproduce.sh --smoke` on an M4 Max 64 GB with Docker Desktop + Ollama running:

  | Run | Pass? | tests | tokens | wall | cost (gpt-5.5) |
  |---|---|---|---|---|---|
  | 1 | ✓ PASS | 25/25 | 21,961 | 46.4 s | $0.164 |
  | 2 | ✗ FAIL | 0/17 | 17,585 | 39.7 s | $0.098 |
  | 3 | ✓ PASS | 25/25 | 20,860 | 33.6 s | $0.106 |

  Cost matches the hand-computed pricing-table formula exactly to the sixth decimal place. The pass/fail flip across runs is expected: a single-row smoke is not statistically meaningful, and the cloud model is non-deterministic even with `seed=42`. Use 3+ seeds and a full task class for stable measurement.
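
As an illustration of what a hand-computed pricing-table check might look like, the sketch below assumes the common per-million-token pricing scheme with separate input and output rates. The table layout, the key names, the model name `example-cloud-model`, and the prices are all placeholders chosen for the example; they are not the contents of `configs/pricing/pricing_tables.json` and not real rates for any model.

```python
from decimal import Decimal

# Hypothetical pricing table: USD per one million tokens (placeholder values).
PRICING = {
    "example-cloud-model": {"input_per_mtok": "2.00", "output_per_mtok": "8.00"},
}

ONE_MILLION = Decimal(1_000_000)


def cost_usd(model, input_tokens, output_tokens, pricing=PRICING):
    """Compute the cost of one run, rounded to six decimal places."""
    prices = pricing[model]
    per_token_in = Decimal(prices["input_per_mtok"]) / ONE_MILLION
    per_token_out = Decimal(prices["output_per_mtok"]) / ONE_MILLION
    total = input_tokens * per_token_in + output_tokens * per_token_out
    # Quantize so comparisons are made at the sixth decimal place.
    return total.quantize(Decimal("0.000001"))


# 20,000 input tokens and 1,000 output tokens at the placeholder rates.
print(cost_usd("example-cloud-model", 20_000, 1_000))  # 0.048000
```

Using `Decimal` rather than floats keeps the comparison exact at the sixth decimal place, which is the precision the check above refers to.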

Things that were friction during the smoke walkthrough, fixed in this release:
- `arena setup` output had legacy `R6 / R7 / R8` route labels; these are internal to `core/experiment` and confusing for a stranger. It now reads `aider agent`, `mini-swe-agent`, `cline agent`, etc.
- The reproducer was showing a noisy pip dependency-resolver warning about `aider-chat` declaring a stricter `openai` range than ours. Cosmetic only (the harness still works because the surface we use is stable across the SDK versions); the noise is now filtered.
- `arena analyze` previously required a per-`(strategy, seed)` invocation. Now you can point it at the sweep root and it walks one level deep.
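
The directory layout that `arena analyze` now discovers can be pictured with a short sketch. This is not the tool's implementation; `find_runs` is a hypothetical helper that assumes only the `<strategy>/seed-<seed>/raw.jsonl` layout named in these notes.

```python
from pathlib import Path


def find_runs(sweep_root):
    """Yield (strategy, seed, path) for each <strategy>/seed-<seed>/raw.jsonl."""
    root = Path(sweep_root)
    # Sort for a stable, reproducible processing order.
    for raw in sorted(root.glob("*/seed-*/raw.jsonl")):
        seed_dir = raw.parent
        strategy = seed_dir.parent.name
        seed = seed_dir.name.removeprefix("seed-")
        yield strategy, seed, raw
```

Iterating over `find_runs("path/to/sweep_root")` yields one tuple per run, so a per-run analysis step can be applied in a single loop instead of one manual invocation per strategy and seed.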

Known limitations:
- Cell keys in `bootstrap_cis.json` / `aggregate.json` use the back-compat single-letter category codes: `A↔puzzles`, `D↔refactors`, `B↔real-prs`. The high-level task-class name only appears in `BenchmarkConfig.task_classes` and in release-notes prose.
- `cloud_model_id`, `local_model_id`, and `config_sha` are not yet stamped on rows produced by the smoke. They're populated correctly for full sweeps via `--variant-tag` / `config_sha`; the smoke path bypasses some metadata stamping for speed. Tracked for v1.4.3.

If you're coming from v1.4.0 or v1.4.1:
- The new reproducer is `scripts/reproduce.sh`. The `docs/REPRODUCING.md` page has been merged into `docs/HYBRID_ROUTING_DESIGN.md` §9 (Add a new local model).
- `arena analyze <sweep_root>` now walks subdirectories. If you used to call it per `<strategy>/seed-<seed>/`, you can now just point it at the parent.
- Schema cell keys, agent names, and pricing-table keys are unchanged.
- The v1.4.0 and v1.4.1 release-tarball artefacts (`results-v1.4.0.tar.gz`, `results-v1.4.1.tar.gz`) are still the source of truth for the empirical record.

@misc{monga2026hybridcodingeval,
author = {Monga, Sanchit and contributors},
title = {hybrid-arena: reproducible cost/latency/quality
benchmark for local vs cloud vs hybrid LLM routing on
coding tasks},
year = {2026},
howpublished = {\url{https://github.com/RunanywhereAI/hybrid-arena}},
note = {Version 1.4.2}
}