v1.4.2 — OSS readiness cleanup

No new benchmark data in this release. v1.4.2 is a code, docs, and reproducibility cleanup pass aimed at making the repository easy for a first-time user to clone, configure, and run.

What changed

One-command reproducer

git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena
./scripts/reproduce.sh --smoke

The reproduce.sh script now checks every prerequisite (Python 3.11+, Docker, Ollama, Node, jq, .env with OPEN_AI_API_KEY), creates the venv, installs the package in editable mode, runs ./arena setup, runs the 1-task smoke sweep, and analyses the result. Expect about 30 seconds and about $0.01 of cloud spend.

For a full canonical sweep:

./scripts/reproduce.sh \
    --config configs/v1.4-canonical-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

Pipeline correctness fixes (carried over from the v1.4.1 audit)

These fixes were already merged on main; v1.4.2 is the first release to include them:

Bootstrap cost CI now reads from configs/pricing/pricing_tables.json instead of an empty per-row cost_usd field that older datasets never set. This fixes cost CIs that were silently zero for some cells in v1.4.0/v1.4.1 analyses.
Cloud-fraction is token-based, not call-count-based, and that's the canonical definition everywhere now (router, analysis, release notes).
Bootstrap stratify_by is respected. Before v1.4.2 the parameter existed but was ignored.
The seed is stamped on every row: the new --seed flag passes it through run_pair(seed=...).
Refactor scoring now accepts both the refactors source name and the legacy real_dev source name. This fixes a silent-skip bug in score_row.
Router forwards CLOUD_MODEL from the config to the spawned proxy. Cloud-model overrides previously needed manual env-var injection.
arena analyze walks subdirectories. Pass it the sweep root and it analyses every <strategy>/seed-<seed>/raw.jsonl it finds.
arena setup fails fast (10 s) when the Docker daemon is down, instead of hanging on the image-inspect call.
Aider pytest summary parser correctly extracts tests_passed / tests_total from pytest's output (it previously always reported 0/1 or 1/1).
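
To make the token-based cloud-fraction definition from the list above concrete, here is a minimal sketch comparing it with the call-count alternative. The record shape (`target`, `tokens`) and both function names are assumptions for illustration; they are not taken from the hybrid-arena schema.

```python
from typing import Iterable, Mapping


def cloud_fraction_by_tokens(calls: Iterable[Mapping]) -> float:
    """Share of all tokens that were served by the cloud model."""
    calls = list(calls)
    total = sum(c["tokens"] for c in calls)
    cloud = sum(c["tokens"] for c in calls if c["target"] == "cloud")
    return cloud / total if total else 0.0


def cloud_fraction_by_calls(calls: Iterable[Mapping]) -> float:
    """Share of calls routed to the cloud model, ignoring call size."""
    calls = list(calls)
    cloud = sum(1 for c in calls if c["target"] == "cloud")
    return cloud / len(calls) if calls else 0.0


# Hypothetical routing log: one large cloud call, two small local calls.
calls = [
    {"target": "cloud", "tokens": 9000},
    {"target": "local", "tokens": 500},
    {"target": "local", "tokens": 500},
]

print(cloud_fraction_by_tokens(calls))  # 0.9
print(cloud_fraction_by_calls(calls))   # one call in three
```

The two definitions diverge whenever call sizes are uneven. Because cloud spend scales with tokens rather than with the number of calls, the token-based figure tracks cost more closely in this example.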

Deletions

analysis/arqgc.py, analysis/decision_matrix_v2.py — unused; the v1.4 decision-matrix rewrite ranks by pass-rate then median cost.
agents/claude_code.py — reserved for a future v1.5; v1.4.2 ships a four-agent surface (aider, opencode, mini-swe-agent, cline).
scorers/llm_judge.py — already gone in v1.4.0; release notes updated to reflect that.
docs/REPRODUCING.md, docs/ARCHITECTURE.md, docs/METHODOLOGY.md, docs/ROUTING_STRATEGIES.md, docs/AGENTIC_ROUTES.md, docs/HYBRID_ROUTER_DESIGN.md, docs/PRIOR_ART.md, docs/BENCHMARK_NEW_MODEL.md, docs/audits/ — consolidated into a single canonical design doc.

New / consolidated documentation

File	What it is
`docs/HYBRID_ROUTING_DESIGN.md`	The single canonical design doc (strategies + agents + schema + recipe).
`README.md`	Rewritten — TL;DR results, 15-min quickstart, repo layout.
`AGENTS.md`	Rewritten — folder-by-folder map for AI agents reading the codebase.
`CONTRIBUTING.md`	Rewritten add-a-model / agent / strategy / task-class recipes.
`SECURITY.md`	New — vulnerability-disclosure channel.
`scripts/reproduce.sh`	New — one-command reproducer.

Legal / metadata hygiene

LICENSE, LICENSE-DATA, LICENSE.md, NOTICE.md rewritten — every path in them now actually exists in the repo (no more references to deleted runners/, EXTERNAL/, vendor/minions/, vendor/lm-eval-harness-judge/, bin/, benchmark/, lib/).
CODE_OF_CONDUCT.md — private email reporting channel (conduct@runanywhere.ai) instead of public "open an issue with conduct label".
CHANGELOG.md — [1.4.1] reference link restored.
pyproject.toml — ruff.extend-exclude points at tasks/ (the v1.4 fixture roots), not the deleted benchmarks/ paths.
requirements.txt — synced with pyproject.toml's [project.dependencies], grouped Core / Viz / Optional.
.env.example — dropped the deleted-in-v1.4 llm_judge reference; added the v1.4.1 ROUTER_LOCAL_* guard env vars.
Two personal/raw-runs/v4*.yaml files that were committed despite the personal/ gitignore are now untracked.

What did NOT change

No new benchmark rows. All four canonical v1.4 datasets (v1.4-canonical-gemma4, v1.4-canonical-qwen3-coder, v1.4-canonical-qwen3.6, v1.4-real-prs) ship verbatim from v1.4.1 — they're still the 1,644-row record.
No agent additions. Same four agents: aider, opencode, mini-swe-agent, cline.
No routing-strategy additions. Same eight strategies.

Verification

pytest -m 'not slow' — 102 passed, 9 skipped (Docker-not-built / no-Ollama). With Docker up, 111 passed, 0 skipped.
ruff check src/ tests/ — all clean.
./arena --help / setup --help / show-config --help — all working post-cleanup.
End-to-end smoke sweep via ./scripts/reproduce.sh --smoke on an M4 Max 64 GB with Docker Desktop + Ollama running:

Run	Pass?	tests	tokens	wall	cost (gpt-5.5)
1	✓ PASS	25/25	21,961	46.4 s	$0.164
2	✗ FAIL	0/17	17,585	39.7 s	$0.098
3	✓ PASS	25/25	20,860	33.6 s	$0.106
Cost matches the hand-computed pricing-table formula exactly, to the sixth decimal place. The pass/fail flip across runs is expected: a single-row smoke run is not statistically meaningful, and the cloud model is non-deterministic even with seed=42. Use 3+ seeds and a full task class for stable measurement.
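
For readers who want to repeat the hand computation, the sketch below shows one common shape for a pricing-table cost formula: per-million-token input and output prices applied to the token counts. The key names and the prices are placeholders, not the contents of configs/pricing/pricing_tables.json and not real gpt-5.5 prices. It does not reproduce the table's costs, since the input/output token split is not given here.

```python
def call_cost_usd(input_tokens: int, output_tokens: int, price: dict) -> float:
    """Cost of one call given per-million-token prices, rounded to 6 decimals."""
    cost = (
        input_tokens * price["input_per_mtok"]
        + output_tokens * price["output_per_mtok"]
    ) / 1_000_000
    return round(cost, 6)


# Placeholder prices in USD per million tokens (assumed, for illustration only).
example_price = {"input_per_mtok": 2.0, "output_per_mtok": 8.0}

# 20,000 input tokens and 1,000 output tokens at the placeholder prices.
print(call_cost_usd(20_000, 1_000, example_price))  # 0.048
```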

First-time-user observations

Things that were friction during the smoke walkthrough, fixed in this release:

arena setup output used the legacy R6 / R7 / R8 route labels, which are internal to core/experiment and confusing to a newcomer. It now prints aider agent, mini-swe-agent, cline agent, etc.
The reproducer was showing a noisy pip dependency-resolver warning about aider-chat declaring a stricter openai range than ours. Cosmetic only (the harness still works because the surface we use is stable across the SDK versions); the noise is now filtered.
arena analyze previously required one invocation per (strategy, seed) pair. Now you can point it at the sweep root and it walks the <strategy>/seed-<seed>/ subdirectories.
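
The sweep-root walk can be pictured roughly as follows. This is a sketch of the directory convention only, assuming the <strategy>/seed-<seed>/raw.jsonl layout named in these notes; `find_raw_files` and `strategy_and_seed` are hypothetical helpers, not functions from the arena codebase.

```python
from pathlib import Path


def find_raw_files(sweep_root):
    """Return every <strategy>/seed-<seed>/raw.jsonl under sweep_root, sorted."""
    return sorted(Path(sweep_root).glob("*/seed-*/raw.jsonl"))


def strategy_and_seed(raw_path):
    """Recover (strategy, seed) from a raw.jsonl path in that layout."""
    seed_dir = raw_path.parent
    return seed_dir.parent.name, int(seed_dir.name.removeprefix("seed-"))
```

Calling `find_raw_files` on a sweep root that holds `cascade/seed-42/raw.jsonl` and `heuristic/seed-7/raw.jsonl` would return those two paths, and `strategy_and_seed` would map the first to `("cascade", 42)`.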

Known caveats

Cell keys in bootstrap_cis.json / aggregate.json use the back-compat single-letter category codes: A ↔ puzzles, D ↔ refactors, B ↔ real-prs. The high-level task-class name only appears in BenchmarkConfig.task_classes and in release-notes prose.
cloud_model_id, local_model_id, and config_sha are not yet stamped on rows produced by the smoke. They're populated correctly for full sweeps via --variant-tag / config_sha; the smoke path bypasses some metadata stamping for speed. Tracked for v1.4.3.
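
When reading bootstrap_cis.json or aggregate.json, a small lookup can translate the single-letter category codes back to task-class names. The mapping below copies the three pairs listed in the first caveat. The helper name is hypothetical, and the exact format of a full cell key is not assumed here, so the function only handles the bare letter.

```python
# Category code -> task-class name, as listed in the caveat above.
CATEGORY_CODES = {
    "A": "puzzles",
    "D": "refactors",
    "B": "real-prs",
}


def task_class_for(code: str) -> str:
    """Return the task-class name for a category code, or the code itself if unknown."""
    return CATEGORY_CODES.get(code, code)


print(task_class_for("D"))  # refactors
```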

Upgrade notes

If you're coming from v1.4.0 or v1.4.1:

The new reproducer is scripts/reproduce.sh. The docs/REPRODUCING.md page has been merged into docs/HYBRID_ROUTING_DESIGN.md §9 (Add a new local model).
arena analyze <sweep_root> now walks subdirectories. If you used to call it per <strategy>/seed-<seed>/, you can now just point it at the parent.
Schema cell keys, agent names, and pricing-table keys are unchanged.
The v1.4.0 and v1.4.1 release-tarball artefacts (results-v1.4.0.tar.gz, results-v1.4.1.tar.gz) are still the source of truth for the empirical record.

Citation

@misc{monga2026hybridcodingeval,
  author       = {Monga, Sanchit and contributors},
  title        = {hybrid-arena: reproducible cost/latency/quality
                  benchmark for local vs cloud vs hybrid LLM routing on
                  coding tasks},
  year         = {2026},
  howpublished = {\url{https://github.com/RunanywhereAI/hybrid-arena}},
  note         = {Version 1.4.2}
}
