v1.4.3 — back-compat-free cleanup

Release date: 2026-05-26

Theme: drop every back-compat surface that was carrying old v1.0–v1.3 names into v1.4, document the fresh-user setup in one place (the prerequisites table in README.md), and make scripts/reproduce.sh print the right brew / apt install hint when something is missing.

This is not a new dataset release — no inference was re-run. The canonical 1,644-row v1.4.1 leaderboard is unchanged.

What changed for users

1. Task-class names are now consistent end-to-end

The single-letter category codes (A, B, C, D, X) that v1.0–v1.3 carried in ResultRow.category and in bootstrap_cis.json cell keys are gone. Every surface uses the human-readable names now:

| What | Before (≤ v1.4.2) | After (v1.4.3) |
|---|---|---|
| ResultRow.category | "A" / "D" / "B" | "puzzles" / "refactors" / "real-prs" |
| bootstrap_cis.json cell key | "A::aider::heuristic" | "puzzles::aider::heuristic" |
| aggregate.json cell key | "D/cline" | "refactors/cline" |
| decision_matrix.md rows | D | refactors |

If you scripted against the old keys, the change is mechanical: A → puzzles, D → refactors, B → real-prs. The pre-v1.4.3 datasets in results/runs/{01..04, 07, 11}/ keep their original keys (they're immutable). The v1.4.0 / v1.4.1 GitHub-release tarballs were generated with the legacy keys and can be re-rendered with the v1.4.3 harness if you want consistent labels.
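For scripts that still read legacy-keyed files, the rename can be applied on load. The following is a minimal sketch, not part of the harness: `remap_key` and `remap_cells` are hypothetical helper names, and only the three mappings stated above are included. Codes C and X are passed through untouched because their new names are not given here.

```python
# Known legacy-to-new mappings from the table above. Codes C and X are
# deliberately left out because their new names are not stated in these notes.
LEGACY_TO_NEW = {"A": "puzzles", "D": "refactors", "B": "real-prs"}


def remap_key(key: str) -> str:
    """Rewrite the leading category code of a cell key, if it is a known one."""
    # bootstrap_cis.json keys use "::" as separator, aggregate.json keys use "/".
    for sep in ("::", "/"):
        head, found, rest = key.partition(sep)
        if found and head in LEGACY_TO_NEW:
            return LEGACY_TO_NEW[head] + sep + rest
    # Bare category values, such as a ResultRow.category of "A".
    return LEGACY_TO_NEW.get(key, key)


def remap_cells(cells: dict) -> dict:
    """Return a copy of a cell-keyed mapping with legacy keys renamed."""
    return {remap_key(k): v for k, v in cells.items()}
```

Keys that already use the new names, or that start with an unmapped code, come back unchanged, so the helper is safe to run over mixed data.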

2. Prerequisites section in README.md

The single most common "I cloned this and it doesn't work" failure was forgetting to install Ollama or opencode before running the sweep. The README.md now leads with a tidy prerequisites table listing every external tool the harness drives, with the exact brew install … / sudo apt install … command on both platforms.

scripts/reproduce.sh echoes the matching install command when it detects a missing prereq, so a first-time user gets:

[reproduce] missing prerequisite: 'ollama'
  → install with: curl -fsSL https://ollama.com/install.sh | sh

…instead of a generic "command not found" failure.
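The detection logic amounts to a PATH lookup plus a hint table. scripts/reproduce.sh is a shell script, so the Python below is only an illustrative sketch of the same idea, not the script's code. `INSTALL_HINTS` and `missing_prereq_messages` are hypothetical names, and the only hint included is the ollama one quoted in the sample output above.

```python
import shutil

# Only the ollama hint is taken from the sample output above; hints for the
# other tools are not reproduced here.
INSTALL_HINTS = {
    "ollama": "curl -fsSL https://ollama.com/install.sh | sh",
}


def missing_prereq_messages(tools, which=shutil.which):
    """Return one message per tool that cannot be found on PATH.

    `which` is injectable so the lookup can be stubbed out in tests.
    """
    messages = []
    for tool in tools:
        if which(tool) is not None:
            continue  # tool is installed, nothing to report
        lines = [f"[reproduce] missing prerequisite: '{tool}'"]
        hint = INSTALL_HINTS.get(tool)
        if hint:
            lines.append(f"  → install with: {hint}")
        messages.append("\n".join(lines))
    return messages
```

A caller would print each message and exit non-zero if the list is non-empty.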

3. Removed: legacy R3 architect pipeline

router/pipelines/architect/ and router/agentic/architect.mjs were the v3 multi-step "plan → execute → synthesise" pipeline that the v1.4 four-agent matrix replaced. No agent in v1.4 calls into it; the special model: "router/architect" pseudo-strategy in server.mjs was unreachable. Both paths, the dispatcher, and the import are gone: ~200 lines of dead code, plus 9 vendored example outputs.

4. pair_already_done is strict now

The orchestrator's resume-skip check used to wildcard-match rows with router_strategy=None against any in-progress strategy (a back-compat hack for v0 rows that predated the strategy axis). v1.4.3 requires an exact (task, route, strategy) triple match. The smoke / canonical configs all carry an explicit strategy, so this is invisible at the user surface — but it stops a foot-gun where a --router-strategy heuristic resume could silently skip a stale null-strategy row.
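The difference between the two behaviours can be sketched as follows. This is an illustration of the rule described above, not the orchestrator's code: the `Row` type and the two function names are hypothetical, and only the router_strategy field name comes from these notes.

```python
from typing import Iterable, NamedTuple, Optional


class Row(NamedTuple):
    # Hypothetical minimal row; the real result rows carry more fields.
    task: str
    route: str
    router_strategy: Optional[str]


def pair_already_done_legacy(rows: Iterable[Row], task: str, route: str,
                             strategy: Optional[str]) -> bool:
    """Pre-v1.4.3 behaviour as described: a null strategy matches any strategy."""
    return any(
        r.task == task
        and r.route == route
        and (r.router_strategy is None or r.router_strategy == strategy)
        for r in rows
    )


def pair_already_done_strict(rows: Iterable[Row], task: str, route: str,
                             strategy: Optional[str]) -> bool:
    """v1.4.3 behaviour as described: exact (task, route, strategy) match."""
    return any(
        (r.task, r.route, r.router_strategy) == (task, route, strategy)
        for r in rows
    )
```

With a single stale row whose strategy is None, the legacy check reports a heuristic-strategy pair as done and the resume skips it. The strict check does not, so the pair is run.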

5. Cleaned up code comments + docstrings

Every "R6 / R7 / R8 / R10" reference in docstrings, comments, and test names was replaced with the agent name. The agent modules read as standalone documents now — no implicit knowledge of the v1.0 route numbering required.

Two follow-up paper cuts were fixed in v1.4.4: per-agent scratch dir names still used the r6_ / r7_ / r8_ / r10_ prefix, and matplotlib / numpy were not declared as runtime deps, so arena analyze failed on a fresh pip install -e ".[dev]".

What did NOT change

  • No data was re-run. The v1.4.1 leaderboard (1,644 rows) stands.
  • No public API change. The arena CLI surface is identical; every flag in v1.4.2 still works in v1.4.3.
  • Pricing tables unchanged. Same pricing_tables.json SHA256 as v1.4.2.
  • Tests + ruff still green. pytest -m 'not slow' → 109 passed.

Fresh-user verification

The README prerequisites table was verified against a clean install flow on macOS (Darwin 25.5.0):

  1. git clone https://github.com/RunanywhereAI/hybrid-arena
  2. Follow the prereqs table → install Python 3.12, Docker Desktop, Node 24, Ollama, jq via Homebrew
  3. cp .env.example .env && $EDITOR .env (add OPEN_AI_API_KEY)
  4. ollama serve & (if Ollama.app not already running)
  5. ollama pull gemma4:31b
  6. ./scripts/reproduce.sh --smoke

The reproducer correctly detected and printed install hints for docker, node, ollama, and jq when they were absent from PATH. ✓ The end-to-end smoke run on this version surfaced the matplotlib + per-agent-dir issues fixed in v1.4.4.

Upgrade notes for v1.4.2 users

git pull
.venv/bin/pip install --upgrade -e ".[dev]"
.venv/bin/pytest tests/ -q -m 'not slow'  # should pass: 109/109

That's it — there is no migration step. The harness now writes the new category names; the legacy datasets in results/runs/ stay at their original keys and are not touched.

Statistics

  • Files changed: 45 (mostly comment/docstring rewrites + tests)
  • Lines added: ~280
  • Lines removed: ~520 (dead architect pipeline + back-compat fallbacks + R-number references)
  • Net delta: −240 lines, despite adding the README prerequisites table and richer reproducer install hints.