v1.4.2 — OSS readiness cleanup

No new benchmark data in this release. v1.4.2 is a code, docs, and reproducibility cleanup pass aimed at making the repository easy for a first-time user to clone, configure, and run.

What changed

One-command reproducer

git clone https://github.com/RunanywhereAI/hybrid-arena
cd hybrid-arena
./scripts/reproduce.sh --smoke

That single command now checks every prerequisite (Python 3.11+, Docker, Ollama, Node, jq, .env with OPEN_AI_API_KEY), creates the venv, installs the package editable, runs ./arena setup, runs the 1-task smoke sweep, and analyses the result. ~30 seconds, ~$0.01 cloud spend.
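
The prerequisite check can be pictured roughly as follows. This is a Python sketch of the idea only, not the contents of scripts/reproduce.sh: the executable names are assumed from the list above, and `missing_prerequisites` is a hypothetical helper.

```python
import shutil
import sys
from pathlib import Path

# Executable names are assumptions inferred from the prerequisite list;
# the real script may probe different binaries or versions.
REQUIRED_TOOLS = ("docker", "ollama", "node", "jq")
REQUIRED_ENV_KEY = "OPEN_AI_API_KEY"


def missing_prerequisites(env_file, which=shutil.which, python_version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(python_version[:2]) < (3, 11):
        problems.append("Python 3.11+ required")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    env_path = Path(env_file)
    if not env_path.is_file():
        problems.append(f"{env_file} is missing")
    else:
        # Collect KEY names from KEY=value lines, ignoring comment lines.
        keys = {
            line.split("=", 1)[0].strip()
            for line in env_path.read_text().splitlines()
            if "=" in line and not line.lstrip().startswith("#")
        }
        if REQUIRED_ENV_KEY not in keys:
            problems.append(f"{REQUIRED_ENV_KEY} not set in {env_file}")
    return problems
```

Calling `missing_prerequisites(".env")` from the repository root returns an empty list when everything is in place. Collecting all problems before reporting, rather than stopping at the first, lets a new user fix their environment in one pass.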

For a full canonical sweep:

./scripts/reproduce.sh \
    --config configs/v1.4-canonical-gemma4.yaml \
    --strategies always-cloud,always-local,heuristic,cascade \
    --seeds 42,7,13

Pipeline correctness fixes (carried over from the v1.4.1 audit)

These were already merged on main but are first surfaced in a release:

  • Bootstrap cost CI now reads from configs/pricing/pricing_tables.json instead of an empty per-row cost_usd field that older datasets never set. Cost CIs were silently zero for some cells in v1.4.0/v1.4.1 analyses — fixed.
  • Cloud-fraction is token-based, not call-count-based, and that's the canonical definition everywhere now (router, analysis, release notes).
  • Bootstrap stratify_by is respected. Pre-v1.4.2 the parameter existed but was unused.
  • The seed is now stamped on every row: the new --seed flag is passed through to run_pair(seed=...).
  • Refactor scoring now accepts both refactors and the legacy real_dev source name, fixing a silent-skip bug in score_row.
  • Router forwards CLOUD_MODEL from the config to the spawned proxy. Cloud-model overrides previously needed manual env-var injection.
  • arena analyze walks subdirectories. Pass it the sweep root and it analyses every <strategy>/seed-<seed>/raw.jsonl it finds.
  • arena setup fails fast (10 s) when the Docker daemon is down, instead of hanging on the image-inspect call.
  • Aider pytest summary parser correctly extracts tests_passed / tests_total from pytest's output (was always reporting 0/1 or 1/1).
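
To make the token-based definition of cloud fraction concrete, here is a minimal sketch contrasting it with a call-count version. The row fields `backend` and `tokens` are hypothetical names chosen for illustration; the harness's actual row schema may differ.

```python
def cloud_fraction_by_tokens(rows):
    """Share of all tokens that were served by the cloud model."""
    total = sum(r["tokens"] for r in rows)
    if total == 0:
        return 0.0
    cloud = sum(r["tokens"] for r in rows if r["backend"] == "cloud")
    return cloud / total


def cloud_fraction_by_calls(rows):
    """Share of calls routed to the cloud model (the non-canonical variant)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["backend"] == "cloud") / len(rows)


# Hypothetical rows: one large cloud call and two small local calls.
rows = [
    {"backend": "cloud", "tokens": 9000},
    {"backend": "local", "tokens": 500},
    {"backend": "local", "tokens": 500},
]
print(cloud_fraction_by_tokens(rows))  # 0.9
print(cloud_fraction_by_calls(rows))   # 0.3333333333333333
```

The two definitions diverge sharply when call sizes are uneven, which is why a single canonical definition matters when comparing routers.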

Deletions

  • analysis/arqgc.py, analysis/decision_matrix_v2.py — unused; the v1.4 decision-matrix rewrite ranks by pass-rate then median cost.
  • agents/claude_code.py — reserved for a future v1.5; v1.4.2 ships a four-agent surface (aider, opencode, mini-swe-agent, cline).
  • scorers/llm_judge.py — already gone in v1.4.0; release notes updated to reflect that.
  • docs/REPRODUCING.md, docs/ARCHITECTURE.md, docs/METHODOLOGY.md, docs/ROUTING_STRATEGIES.md, docs/AGENTIC_ROUTES.md, docs/HYBRID_ROUTER_DESIGN.md, docs/PRIOR_ART.md, docs/BENCHMARK_NEW_MODEL.md, docs/audits/ — consolidated into a single canonical design doc.

New / consolidated documentation

| File | What it is |
|------|------------|
| docs/HYBRID_ROUTING_DESIGN.md | The single canonical design doc (strategies + agents + schema + recipe). |
| README.md | Rewritten: TL;DR results, 15-min quickstart, repo layout. |
| AGENTS.md | Rewritten: folder-by-folder map for AI agents reading the codebase. |
| CONTRIBUTING.md | Rewritten: add-a-model / agent / strategy / task-class recipes. |
| SECURITY.md | New: vulnerability-disclosure channel. |
| scripts/reproduce.sh | New: one-command reproducer. |

Legal / metadata hygiene

  • LICENSE, LICENSE-DATA, LICENSE.md, NOTICE.md rewritten — every path in them now actually exists in the repo (no more references to deleted runners/, EXTERNAL/, vendor/minions/, vendor/lm-eval-harness-judge/, bin/, benchmark/, lib/).
  • CODE_OF_CONDUCT.md — private email reporting channel (conduct@runanywhere.ai) instead of public "open an issue with conduct label".
  • CHANGELOG.md: [1.4.1] reference link restored.
  • pyproject.toml: ruff.extend-exclude points at tasks/ (the v1.4 fixture roots), not the deleted benchmarks/ paths.
  • requirements.txt — synced with pyproject.toml's [project.dependencies], grouped Core / Viz / Optional.
  • .env.example — dropped the deleted-in-v1.4 llm_judge reference; added the v1.4.1 ROUTER_LOCAL_* guard env vars.
  • Two personal/raw-runs/v4*.yaml files that were committed despite the personal/ gitignore are now untracked.

What did NOT change

  • No new benchmark rows. All four canonical v1.4 datasets (v1.4-canonical-gemma4, v1.4-canonical-qwen3-coder, v1.4-canonical-qwen3.6, v1.4-real-prs) ship verbatim from v1.4.1 — they're still the 1,644-row record.
  • No agent additions. Same four agents: aider, opencode, mini-swe-agent, cline.
  • No routing-strategy additions. Same eight strategies.

Verification

  • pytest -m 'not slow': 102 passed, 9 skipped (Docker-not-built / no-Ollama). With Docker up, 111 passed, 0 skipped.

  • ruff check src/ tests/ — all clean.

  • ./arena --help / setup --help / show-config --help — all working post-cleanup.

  • End-to-end smoke sweep via ./scripts/reproduce.sh --smoke on an M4 Max 64 GB with Docker Desktop + Ollama running:

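    | Run | Pass?  | Tests | Tokens | Wall   | Cost (gpt-5.5) |
    |-----|--------|-------|--------|--------|----------------|
    | 1   | ✓ PASS | 25/25 | 21,961 | 46.4 s | $0.164         |
    | 2   | ✗ FAIL | 0/17  | 17,585 | 39.7 s | $0.098         |
    | 3   | ✓ PASS | 25/25 | 20,860 | 33.6 s | $0.106         |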

    Cost matches the hand-computed pricing-table formula exactly to the sixth decimal place. The pass/fail flip across runs is expected — single-row smoke is not statistically meaningful; the cloud model is non-deterministic even with seed=42. Use 3+ seeds and a full task class for stable measurement.
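
A hand computation of this kind might look like the sketch below. The per-million-token rates, the input/output token split, and the `cost_usd` helper are all hypothetical; the real rates live in configs/pricing/pricing_tables.json, whose schema is not shown in these notes.

```python
from decimal import Decimal

# Hypothetical USD rates per one million tokens, for illustration only.
RATES = {"input": Decimal("2.00"), "output": Decimal("8.00")}


def cost_usd(input_tokens, output_tokens, rates=RATES):
    """Cost of one run, rounded to six decimal places."""
    per_million = Decimal(1_000_000)
    cost = (input_tokens * rates["input"] + output_tokens * rates["output"]) / per_million
    return cost.quantize(Decimal("0.000001"))


print(cost_usd(12_000, 3_500))  # 0.052000
```

Using `Decimal` rather than floats keeps a sixth-decimal comparison free of binary rounding noise.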

First-time-user observations

Things that were friction during the smoke walkthrough, fixed in this release:

  • arena setup output had legacy R6 / R7 / R8 route labels — these are internal to core/experiment and confusing for a stranger. Now reads aider agent, mini-swe-agent, cline agent, etc.
  • The reproducer was showing a noisy pip dependency-resolver warning about aider-chat declaring a stricter openai range than ours. Cosmetic only (the harness still works because the surface we use is stable across the SDK versions); the noise is now filtered.
  • arena analyze previously required a per-(strategy, seed) invocation. Now you can point it at the sweep root and it walks one level deep.
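
The directory walk can be sketched with pathlib. The layout `<strategy>/seed-<seed>/raw.jsonl` comes from these notes; the function name `find_runs` and the returned tuple shape are hypothetical and are not arena's actual API.

```python
from pathlib import Path


def find_runs(sweep_root):
    """Yield (strategy, seed, path) for every <strategy>/seed-<seed>/raw.jsonl."""
    for raw in sorted(Path(sweep_root).glob("*/seed-*/raw.jsonl")):
        strategy = raw.parent.parent.name
        # str.removeprefix needs Python 3.9+, within the 3.11+ requirement.
        seed = raw.parent.name.removeprefix("seed-")
        yield strategy, seed, raw
```

Each yielded path could then be handed to the same per-run analysis that previously required one invocation per (strategy, seed) pair.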

Known caveats

  • Cell keys in bootstrap_cis.json / aggregate.json use the back-compat single-letter category codes: A = puzzles, D = refactors, B = real-prs. The high-level task-class name only appears in BenchmarkConfig.task_classes and in release-notes prose.
  • cloud_model_id, local_model_id, and config_sha are not yet stamped on rows produced by the smoke. They're populated correctly for full sweeps via --variant-tag / config_sha; the smoke path bypasses some metadata stamping for speed. Tracked for v1.4.3.

Upgrade notes

If you're coming from v1.4.0 or v1.4.1:

  • The new reproducer is scripts/reproduce.sh. The docs/REPRODUCING.md page has been merged into docs/HYBRID_ROUTING_DESIGN.md §9 (Add a new local model).
  • arena analyze <sweep_root> now walks subdirectories. If you used to call it per <strategy>/seed-<seed>/, you can now just point it at the parent.
  • Schema cell keys, agent names, and pricing-table keys are unchanged.
  • The v1.4.0 and v1.4.1 release-tarball artefacts (results-v1.4.0.tar.gz, results-v1.4.1.tar.gz) are still the source of truth for the empirical record.

Citation

@misc{monga2026hybridcodingeval,
  author       = {Monga, Sanchit and contributors},
  title        = {hybrid-arena: reproducible cost/latency/quality
                  benchmark for local vs cloud vs hybrid LLM routing on
                  coding tasks},
  year         = {2026},
  howpublished = {\url{https://github.com/RunanywhereAI/hybrid-arena}},
  note         = {Version 1.4.2}
}