Release date: 2026-05-26
Theme: drop every back-compat surface that was carrying old v1.0–v1.3
names into v1.4, document the fresh-user setup in one place (the
prerequisites table in README.md), and make scripts/reproduce.sh
print the right brew / apt install hint when something is missing.
This is not a new dataset release — no inference was re-run. The canonical 1,644-row v1.4.1 leaderboard is unchanged.
The single-letter category codes (A, B, C, D, X) that v1.0–v1.3
carried in ResultRow.category and in bootstrap_cis.json cell keys
are gone. Every surface uses the human-readable names now:

| What | Before (≤ v1.4.2) | After (v1.4.3) |
|---|---|---|
| ResultRow.category | "A" / "D" / "B" | "puzzles" / "refactors" / "real-prs" |
| bootstrap_cis.json cell key | "A::aider::heuristic" | "puzzles::aider::heuristic" |
| aggregate.json cell key | "D/cline" | "refactors/cline" |
| decision_matrix.md rows | D | refactors |

If you scripted against the old keys, the change is mechanical:
A → puzzles, D → refactors, B → real-prs. The pre-v1.4.3
datasets in results/runs/{01..04, 07, 11}/ keep their original keys
(they're immutable); v1.4.0 / v1.4.1 GitHub-release tarballs were
generated with the legacy keys and can be re-rendered with the v1.4.3
harness if you want consistent labels.
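
For scripts that still read the legacy keys, the mechanical rename above can be sketched as follows. This is a minimal illustration, not code from the harness: `LEGACY_TO_NAME`, `migrate_key`, and `migrate_cells` are hypothetical names, and only the three mappings stated in these notes (A, D, B) are covered. Any other code is left untouched.

```python
# Hypothetical helper, not part of the harness: rewrites legacy category
# codes to the v1.4.3 names. Only the mappings stated in the release
# notes are included; unknown codes pass through unchanged.
LEGACY_TO_NAME = {"A": "puzzles", "D": "refactors", "B": "real-prs"}


def migrate_key(key: str) -> str:
    """Rewrite the leading category code of a legacy cell key."""
    # bootstrap_cis.json keys use "::", aggregate.json keys use "/".
    for sep in ("::", "/"):
        head, found, rest = key.partition(sep)
        if found and head in LEGACY_TO_NAME:
            return LEGACY_TO_NAME[head] + sep + rest
    # Bare category value (as in ResultRow.category) or an unknown key.
    return LEGACY_TO_NAME.get(key, key)


def migrate_cells(cells: dict) -> dict:
    """Return a copy of a cell mapping with every key migrated."""
    return {migrate_key(k): v for k, v in cells.items()}
```

Keys that already use the new names are returned as-is, so running the helper twice is harmless.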
The single most common "I cloned this and it doesn't work" failure was
forgetting to install Ollama or opencode before running the
sweep. The README.md now leads with a tidy prerequisites table
listing every external tool the harness drives, with the exact
brew install … / sudo apt install … command on both platforms.
scripts/reproduce.sh echoes the matching install command when it
detects a missing prereq, so a first-time user gets:

    [reproduce] missing prerequisite: 'ollama'
      → install with: curl -fsSL https://ollama.com/install.sh | sh

…instead of a generic "command not found" failure.
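
The detect-and-hint behaviour can be illustrated in Python. Note that scripts/reproduce.sh is a shell script, so this is only a sketch of the logic and not its actual code. `INSTALL_HINTS` and `missing_prereq_messages` are hypothetical names, and the only hint included is the ollama one quoted above; the remaining hints would come from the README prerequisites table.

```python
import shutil

# Hypothetical hint table. Only the ollama entry is taken from the
# release notes; other tools would be filled in from the README.
INSTALL_HINTS = {"ollama": "curl -fsSL https://ollama.com/install.sh | sh"}


def missing_prereq_messages(tools, which=shutil.which):
    """Return one message per tool that cannot be found on PATH."""
    messages = []
    for tool in tools:
        if which(tool) is None:
            msg = f"[reproduce] missing prerequisite: '{tool}'"
            hint = INSTALL_HINTS.get(tool)
            if hint:
                msg += f"\n  → install with: {hint}"
            messages.append(msg)
    return messages


if __name__ == "__main__":
    for line in missing_prereq_messages(["docker", "node", "ollama", "jq"]):
        print(line)
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default it uses `shutil.which`, which searches PATH.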
router/pipelines/architect/ and router/agentic/architect.mjs were
the v3 multi-step "plan → execute → synthesise" pipeline that the
v1.4 four-agent matrix replaced. No agent in v1.4 calls into it; the
special model: "router/architect" pseudo-strategy in server.mjs
was unreachable. Both paths, the dispatcher, and the import are
gone: ~200 lines of dead code, plus 9 vendored example outputs.
The orchestrator's resume-skip check used to wildcard-match rows with
router_strategy=None against any in-progress strategy (a back-compat
hack for v0 rows that predated the strategy axis). v1.4.3 requires an
exact (task, route, strategy) triple match. The smoke / canonical
configs all carry an explicit strategy, so this is invisible at the
user surface — but it stops a foot-gun where a --router-strategy heuristic resume could silently skip a stale null-strategy row.
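
The difference between the two checks can be sketched as below. This is an assumption-laden illustration, not the orchestrator's code: `Row`, `legacy_should_skip`, and `should_skip` are hypothetical names, and the row shape is reduced to the three fields the notes mention.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical, reduced shape of a completed result row."""
    task: str
    route: str
    router_strategy: Optional[str]


def legacy_should_skip(done_rows, task, route, strategy):
    """Old behaviour: a row with router_strategy=None matches any strategy."""
    return any(
        r.task == task
        and r.route == route
        and (r.router_strategy is None or r.router_strategy == strategy)
        for r in done_rows
    )


def should_skip(done_rows, task, route, strategy):
    """v1.4.3-style behaviour: only an exact (task, route, strategy) match."""
    done = {(r.task, r.route, r.router_strategy) for r in done_rows}
    return (task, route, strategy) in done


stale = [Row("task-1", "aider", None)]
print(legacy_should_skip(stale, "task-1", "aider", "heuristic"))  # True
print(should_skip(stale, "task-1", "aider", "heuristic"))         # False
```

With the exact match, the stale null-strategy row no longer causes the heuristic run for the same task and route to be skipped.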
Every "R6 / R7 / R8 / R10" reference in docstrings, comments, and test names was replaced with the agent name. The agent modules read as standalone documents now — no implicit knowledge of the v1.0 route numbering required.
Two follow-up paper cuts (per-agent scratch dir names still using the
r6_/r7_/r8_/r10_ prefix, and matplotlib/numpy not being declared as runtime deps so arena analyze failed on a fresh pip install -e ".[dev]") were shipped in v1.4.4.
- No data was re-run. The v1.4.1 leaderboard (1,644 rows) stands.
- No public API change. The arena CLI surface is identical; every flag in v1.4.2 still works in v1.4.3.
- Pricing tables unchanged. Same pricing_tables.json SHA256 as v1.4.2.
- Tests + ruff still green. pytest -m 'not slow' → 109 passed.

The README prerequisites table was verified against a clean install flow on macOS 25.5.0 (Darwin):
- git clone https://github.com/RunanywhereAI/hybrid-arena
- Follow the prereqs table → install Python 3.12, Docker Desktop, Node 24, Ollama, jq via Homebrew
- cp .env.example .env && $EDITOR .env (add OPEN_AI_API_KEY)
- ollama serve & (if Ollama.app not already running)
- ollama pull gemma4:31b
- ./scripts/reproduce.sh --smoke

The reproducer correctly detected and printed install hints for
docker, node, ollama, and jq when they were absent from
PATH. ✓ The end-to-end smoke run on this version surfaced the
matplotlib + per-agent-dir issues fixed in v1.4.4.

    git pull
    .venv/bin/pip install --upgrade -e ".[dev]"
    .venv/bin/pytest tests/ -q -m 'not slow'   # should pass: 109/109

That's it: there is no migration step. The harness now writes the
new category names; the legacy datasets in results/runs/ stay at
their original keys and are not touched.
- Files changed: 45 (mostly comment/docstring rewrites + tests)
- Lines added: ~280
- Lines removed: ~520 (dead architect pipeline + back-compat fallbacks + R-number references)
- Net delta: −240 lines, despite adding the README prerequisites table and richer reproducer install hints.