All notable changes to this project are documented here. The format follows Keep a Changelog and the project follows Semantic Versioning starting with v1.0.0.
Rebrand to Hybrid Coding Arena. Same benchmark, methodology, and dataset (1,704 rows); new name, from RunAnywhere.
- Project renamed `hybrid-coding-eval` to Hybrid Coding Arena.
- Python package `hybrid_coding_eval` renamed to `hybrid_arena` (all imports updated).
- Distribution and repo slug renamed to `hybrid-arena` (the old GitHub URL redirects).
- CLI command renamed `bench` to `arena` (for example `arena sweep`, `arena analyze`).
- Headline chart redesigned so pass-rate (the bar) and cloud usage (a separate labeled row) read as two distinct metrics.
- Release workflow now skips draft creation when a release for the tag already exists.
- The dataset, scoring, routing strategies, and all published numbers. `results-v1.6.0.tar.gz` is byte-identical to v1.5.0/v1.5.1.
Open-source polish release — addresses every audit finding from the pre-publish review (security, licensing, UX, hygiene). No code-behaviour changes; safe to take.
- `NOTICE.md`, `LICENSE-DATA`, `LICENSE.md`: consolidated to a single MIT `LICENSE` that covers code, data, and docs.
- `scripts/reproduce.sh`: `./arena setup` already does prereq checks and the smoke sweep is a one-liner (`./arena sweep --config configs/v1.4-smoke.yaml`).
- `logs/v3.3/`: historical sweep logs moved to maintainer-private storage. `logs/` is now gitignored.
- `pytest` runtime dependency moved to `[dev]` only (it was previously declared in both).
- `pytest -m slow` filter removed from CI and docs: no test is currently marked `slow`, so the filter was a no-op.
- `README.md` rewritten end-to-end: six-cell headline table, real quickstart with accurate prereq + timing estimates, "picking a config for real work" section distilled from the v1.5 leaderboard, full `arena` CLI table, MIT-only license + citation block.
- `AGENTS.md` refreshed for v1.5.0: D6 task shape documented, v1.5 configs added to the tree, latest-results pointer updated, conventions section reflects that single-letter codes are retired.
- `CODE_OF_CONDUCT.md` simplified to a short, direct version.
- `src/hybrid_arena/__init__.py`: `__version__` is now `"1.5.0"` (was stuck at `"0.1.0"`).
- Source-tree docstrings + READMEs: stale `lib.*` references rewritten to `core.*` / `hybrid_arena.*`; "Category D/B/X" rewritten to `refactors` / `real-prs` / `puzzles` end-to-end.
- Tracked `raw.jsonl` datasets: sanitized 263 home-directory path leaks to repo-relative form. JSON integrity preserved (`json.loads()` re-validated on every row).
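
The sanitization pass described above could look roughly like the sketch below. It assumes leaks are absolute paths under `/home/<user>/` or `/Users/<user>/` that contain a repo directory named `hybrid-arena`; the function name `sanitize_line` and the regex are illustrative and are not taken from the project.

```python
import json
import re

# Hypothetical pattern: a home-directory prefix up to and including the repo directory.
HOME_PREFIX = re.compile(r"(?:/home|/Users)/[^/\s\"]+/(?:[^/\s\"]+/)*?hybrid-arena/")


def sanitize_line(line: str) -> str:
    """Rewrite absolute home-directory paths to repo-relative form, then re-validate the JSON."""
    cleaned = HOME_PREFIX.sub("", line)
    json.loads(cleaned)  # raises ValueError if the rewrite broke the row
    return cleaned


row = json.dumps({"output_ref": "/Users/alice/src/hybrid-arena/results/runs/x/raw.jsonl"})
print(sanitize_line(row))
```

Re-parsing each rewritten row is what guards against a substitution that accidentally corrupts the JSON.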
- `.github/ISSUE_TEMPLATE/new_model.md`: broken `configs/variants/_template.yaml` reference replaced with a real command.
- `docs/HYBRID_ROUTING_DESIGN.md`: `jq` cell-key examples updated from the legacy `D::cline::heuristic` form to the current `refactors::cline::heuristic`. D6 row added to the task-class table.
- `docs/release-notes/v1.4.0.md` and `v1.4.1.md`: same cell-key fix in the reproduction snippets.
- Test aliases `r10_cline` and `r6_mini_swe_agent` renamed to `cline_runner` / `mini_swe_runner` to remove the last legacy R-numbers from the test suite.
Hard-task stress test release. Adds a new D6 task shape with 4 deliberately harder problems (LRU+TTL cache, multi-key token bucket, deterministic toposort + cycle detection, mini templating engine) that stress 30B local models beyond the D1/D5 calibration. Stress-sweeps the v1.4.1 top-3 configs (aider+gemma4+heuristic, cline+qwen3.6+cascade, cline+qwen3.6+always-local) against the new hard tasks with 3 seeds each. Full notes at `docs/release-notes/v1.5.0.md`.
- New D6 task shape under `refactors`: `d6-lru-ttl-cache` (23 acceptance tests), `d6-token-bucket` (14 tests), `d6-toposort` (16 tests), `d6-mini-template` (27 tests) = 80 acceptance assertions total. Each task is a single-file implementation challenge with comprehensive pytest coverage, scored via the existing D1/D5 overlay-and-run pipeline.
- New configs: `configs/v1.5-hard-gemma4.yaml` and `configs/v1.5-hard-qwen3.6.yaml` for the stress-test sweeps.
- New article: `personal/reports/publish-v1.5/article.html` with the §4.5 real-world walkthrough section, the §8.5 permutation matrix, and a new §12 hard-task stress test section.
- New dataset: `results/runs/v1.5-hard-gemma4/` and `results/runs/v1.5-hard-qwen3.6/`.
- `tasks/refactors/adapter.py` accepts `D6` as a valid task shape.
- `tasks/refactors/scorers.py` dispatches `D6` through the existing D1 overlay+pytest path.
- aider pytest-summary parser bug. `_parse_pytest_summary` in `src/hybrid_arena/agents/aider.py` previously read the summary line positionally and missed the "failed" count when it preceded "passed" (e.g. `2 failed, 21 passed` was scored 0/2 instead of 21/23). New implementation uses independent regexes per token and is order-agnostic. Covered by `tests/agents/test_aider_parser.py` (18 parametrized cases).
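
For illustration, an order-agnostic parser along the lines described above might look like this. It is a minimal sketch, not the project's `_parse_pytest_summary`: the function name, the set of tokens, and the choice to count errors toward the total are all assumptions.

```python
import re

# One independent pattern per outcome token, so the order in the summary line does not matter.
_TOKEN_PATTERNS = {
    "passed": re.compile(r"(\d+) passed"),
    "failed": re.compile(r"(\d+) failed"),
    "errors": re.compile(r"(\d+) errors?"),
}


def parse_pytest_summary(summary: str) -> tuple[int, int]:
    """Return (tests_passed, tests_total) from a pytest summary line."""
    counts = {}
    for name, pattern in _TOKEN_PATTERNS.items():
        match = pattern.search(summary)
        counts[name] = int(match.group(1)) if match else 0
    # Assumption: errored tests count toward the total alongside passed and failed.
    total = counts["passed"] + counts["failed"] + counts["errors"]
    return counts["passed"], total


print(parse_pytest_summary("2 failed, 21 passed in 0.41s"))  # (21, 23)
print(parse_pytest_summary("23 passed in 0.38s"))            # (23, 23)
```

Because each token is searched for separately, `2 failed, 21 passed` and `21 passed, 2 failed` produce the same result.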
- cline + qwen3.6:35b + always-local on D6 hits 67% with $0 cloud spend — the new headline. 30B local-only solves token-bucket and toposort 3/3, partial-passes lru-ttl-cache and mini-template.
- aider + gemma4 + heuristic on D6 falls to 58% vs always-cloud 100%. The v1.3 marquee profile breaks on harder calibration.
- cline + qwen3.6 + cascade on D6 drops from v1.4.1's 100%/8% to 75%/13%: the router under-escalates on `d6-mini-template` (recursive parser).
Fresh-user reproducibility patch. Targets the last two paper cuts a brand-new clone hits on the way from `git clone` to a green `arena analyze` chart. Full notes at `docs/release-notes/v1.4.4.md`.
- `scripts/reproduce.sh` explicitly prefers `python3.12` / `python3.11` over generic `python3` and recreates `.venv` if it was pinned to a different (e.g. 3.13/3.14) interpreter. Python 3.13+ breaks several agent installers (notably `aider-chat`) because they depend on a `setuptools` bootstrap that 3.13/3.14 dropped from the stdlib.
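
The interpreter preference is implemented in a shell script; the same idea can be sketched in Python as below. The `pick_interpreter` helper and the `PREFERRED` tuple are hypothetical names used only for illustration.

```python
import shutil
from typing import Callable, Optional

# Preference order taken from the note above: pinned minor versions first, generic last.
PREFERRED = ("python3.12", "python3.11", "python3")


def pick_interpreter(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the path of the first preferred interpreter found on PATH, or None."""
    for name in PREFERRED:
        path = which(name)
        if path:
            return path
    return None


print(pick_interpreter())
```

Passing `which` as a parameter keeps the lookup testable without touching the real `PATH`.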
- `arena analyze` works on a clean `pip install -e ".[dev]"`. `matplotlib` + `numpy` are now declared in `pyproject.toml::[project.dependencies]` (they were previously only in `requirements.txt`, so the canonical install path left them missing and `arena analyze` died with `ModuleNotFoundError: No module named 'matplotlib'`).
- Per-agent scratch directories no longer carry the legacy R-prefix. aider writes to `outputs/aider_<task>_<strategy>/`, cline to `outputs/cline_<task>_<strategy>/`, opencode to `outputs/opencode_<task>_<strategy>/`, mini-swe-agent to `outputs/mini-swe-agent_<task>_<strategy>/`. Default `output_dir` per agent is now `results/<agent-name>/` instead of `results/r6/` … `results/r10/`. The v1.4.3 commit cleaned every other R-number surface but missed these inline path templates.
- Task adapter dataclass defaults align with the v1.4.3 rename. `refactors.Task.category` defaults to `"refactors"` (was `"D"`) and `real_prs.Task.category` defaults to `"real-prs"` (was `"B"`). The parsers already overrode these defaults in v1.4.3, but the dataclass field defaults still leaked the legacy letters if a row was constructed without going through `_parse_task`.
End-to-end fresh-user replay: `rm -rf .venv && ./scripts/reproduce.sh --smoke` → 1m 26s wall, 1/1 PASS, all charts emitted, no missing modules, no R-prefix leaks in `output_ref`.
1.4.3 — 2026-05-26
Back-compat-free cleanup. Drops every v1.0–v1.3 legacy surface from
the v1.4 harness. No new benchmark data; the v1.4.1 leaderboard
(1,644 rows) stands. Full release notes at
`docs/release-notes/v1.4.3.md`.
- Task-class names are consistent end-to-end. `ResultRow.category`, `aggregate.json` cell keys, `bootstrap_cis.json` cell keys, and `decision_matrix.md` rows all use `puzzles` / `refactors` / `real-prs` instead of the legacy single letters (`A` / `D` / `B`). Affects scripts that grep cell keys; the rename is mechanical.
- `README.md`: added explicit Prerequisites section with per-platform install commands (macOS Homebrew + Debian/Ubuntu apt) for Python, Docker, Node, Ollama, jq.
- `scripts/reproduce.sh`: platform-aware install hints. When a prereq is missing it prints the exact `brew install …` or `sudo apt install …` command for the host OS, plus a hint to start the Ollama daemon when port 11434 isn't reachable.
- `pyproject.toml`: `matplotlib` and `numpy` are now first-class runtime dependencies. `arena analyze` needs them for chart generation, but they were previously only in `requirements.txt`, so `pip install -e ".[dev]"` left them missing. Fresh installs now work end-to-end with zero extra steps.
- Per-agent scratch directories drop the R-prefix. aider writes to `outputs/aider_<task>_<strategy>/`, opencode to `outputs/opencode_<task>_<strategy>/`, cline to `outputs/cline_<task>_<strategy>/`, mini-swe-agent to `outputs/mini-swe-agent_<task>_<strategy>/`. Default `output_dir` per agent is `results/<agent-name>/`.
- `core/experiment.pair_already_done` is strict now: it requires an exact `(task, route, strategy)` match instead of treating `router_strategy=None` as a wildcard. Stops a foot-gun where a resume could silently skip a stale legacy row.
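
The strict resume check can be sketched as follows. This is an illustration under assumptions, not the code in `core/experiment`: the row keys `task` and `route` and the dict-based row shape are hypothetical, while `router_strategy` is the field named above.

```python
from typing import Iterable, Mapping, Optional


def pair_already_done(
    rows: Iterable[Mapping[str, Optional[str]]],
    task: str,
    route: str,
    strategy: Optional[str],
) -> bool:
    """True only when a row matches (task, route, strategy) exactly.

    A row whose router_strategy is None matches only a None strategy; it is not a wildcard.
    """
    return any(
        row.get("task") == task
        and row.get("route") == route
        and row.get("router_strategy") == strategy
        for row in rows
    )


legacy_rows = [{"task": "d1-example", "route": "aider", "router_strategy": None}]
print(pair_already_done(legacy_rows, "d1-example", "aider", "heuristic"))  # False
```

Under the old wildcard behaviour the legacy row above would have caused the `heuristic` run to be skipped on resume.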
- `router/pipelines/architect/` + `router/agentic/`: the v3 multi-step "plan → execute → synthesise" pipeline. Not referenced by any v1.4 agent; the `model: "router/architect"` pseudo-strategy dispatcher in `server.mjs` was unreachable. ~200 lines of dead code plus 9 vendored example outputs.
- Single-letter category codes (`A` / `B` / `C` / `D` / `X`) from adapter defaults, agent fallbacks, viz colour/marker fall-backs, and the refactor task JSONLs.
- `R6` / `R7` / `R8` / `R10` references from every docstring, comment, and test name. Agent modules read as standalone documents now.
- `results/raw.jsonl` historical round-trip test: `tests/test_metrics_new_fields.py::test_historical_dataset_still_loads` was pure v1.0 back-compat coverage; dropped.
- `agents/aider.py` + `agents/cline.py` dispatcher tests no longer rely on `mini-swe-agent` to dispatch a `puzzles` task. (Previously worked only because the back-compat wildcard in `pair_already_done` short-circuited the runner.)
1.4.2 — 2026-05-26
OSS readiness cleanup. Code, docs, and reproducibility cleanup pass; no new benchmark data. The full release notes live at `docs/release-notes/v1.4.2.md`.
- `scripts/reproduce.sh`: one-command reproducer that checks every prerequisite, sets up the venv, runs `./arena setup`, and either runs the smoke sweep (`--smoke`) or forwards arbitrary `./arena sweep` arguments. ~30 s end-to-end for the smoke pass.
- `docs/HYBRID_ROUTING_DESIGN.md`: single canonical design doc consolidating the eight previously-separate `docs/*.md` files (routing strategies, agents, methodology, schema, add-a-model recipe).
- `SECURITY.md`: vulnerability-disclosure channel (private email).
- `arena analyze` walks subdirectories. Point it at a sweep root and it analyses every `<strategy>/seed-<seed>/raw.jsonl` it finds.
- `arena setup` fails fast (10 s timeout) when the Docker daemon is down, instead of hanging on the `docker image inspect` call.
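
The subdirectory walk performed by `arena analyze` can be approximated with a glob over the layout named above. The helper name `find_raw_files` is hypothetical, and the real command may discover files differently; the demo builds a throwaway sweep root so the sketch runs on its own.

```python
import tempfile
from pathlib import Path


def find_raw_files(sweep_root: Path) -> list[Path]:
    """Return every <strategy>/seed-<seed>/raw.jsonl under sweep_root, sorted for stable output."""
    return sorted(sweep_root.glob("*/seed-*/raw.jsonl"))


# Build a tiny fake sweep root to show what gets discovered.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for strategy, seed in [("heuristic", 0), ("cascade", 1)]:
        run_dir = root / strategy / f"seed-{seed}"
        run_dir.mkdir(parents=True)
        (run_dir / "raw.jsonl").write_text("{}\n")
    found = [p.relative_to(root).as_posix() for p in find_raw_files(root)]

print(found)  # ['cascade/seed-1/raw.jsonl', 'heuristic/seed-0/raw.jsonl']
```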
- `README.md`, `AGENTS.md`, `CONTRIBUTING.md` rewritten for v1.4.2 reality: TL;DR results table, repo layout, contribution recipes that point at the four-agent surface.
- `LICENSE`, `LICENSE-DATA`, `LICENSE.md`, `NOTICE.md` rewritten: every referenced path now actually exists (no more dangling `runners/`, `EXTERNAL/`, `vendor/minions/`, `vendor/lm-eval-harness-judge/`, `bin/`, `benchmark/`, `lib/`).
- `pyproject.toml`: version bumped to `1.4.2`; `ruff.extend-exclude` repointed to the v1.4 fixture roots under `tasks/`.
- `requirements.txt` synced with `pyproject.toml [project.dependencies]`, grouped Core / Viz / Optional.
- `.env.example`: dropped deleted-in-v1.4 `llm_judge` reference; added v1.4.1 `ROUTER_LOCAL_*` guards.
- `CODE_OF_CONDUCT.md`: private email reporting channel.
- `analysis/bootstrap.py` cost CI now reads from `configs/pricing/pricing_tables.json` instead of an empty per-row `cost_usd` field. (Pre-v1.4.2 analyses showed silently-zero cost CIs for some cells.)
- `analysis/bootstrap.py` `cloud_fraction` is token-based, not call-count-based: the canonical definition now applies everywhere (router, analysis, release notes).
- `analysis/bootstrap.py` `stratify_by` parameter now respected (was silently ignored).
- `core/experiment.score_row` accepts both `refactors` and the legacy `real_dev` source name (fixes a silent-skip bug).
- `core/experiment.run_pair` stamps `seed` onto the `ResultRow` via the new `--seed` flag.
- `cli/bench._cmd_sweep` forwards `CLOUD_MODEL` from the config to the spawned router proxy.
- `agents/aider.py` `_run_tests_local` parses `pytest`'s summary line via regex: `tests_passed` / `tests_total` are now correct (was always 0/1 or 1/1).
- `cli/env_detect.py` uses `core.paths.repo_root` for path resolution (single source of truth).
- `docs/release-notes/v1.4.1.md`: corrected `cline + qwen3.6 + heuristic + refactors` headline from `22/24 = 92%` to `23/24 = 96%` (the prior figure was a transcription error from the raw data).
- `CHANGELOG.md`: restored missing `[1.4.1]` reference link; `[Unreleased]` compares against `v1.4.1` instead of `v1.3.0`.
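
To make the token-based versus call-count-based distinction concrete, here is a small sketch. It assumes the token-based definition is cloud tokens divided by all tokens; the row keys `cloud_tokens` and `local_tokens` are hypothetical, and the numbers are illustrative rather than benchmark data.

```python
from typing import Iterable, Mapping


def cloud_fraction(rows: Iterable[Mapping[str, int]]) -> float:
    """Share of all tokens served by the cloud model (token-based, not call-count-based)."""
    cloud = local = 0
    for row in rows:
        cloud += row.get("cloud_tokens", 0)
        local += row.get("local_tokens", 0)
    total = cloud + local
    return cloud / total if total else 0.0


rows = [
    {"cloud_tokens": 0, "local_tokens": 9000},  # answered locally
    {"cloud_tokens": 1000, "local_tokens": 0},  # escalated to cloud
]
print(cloud_fraction(rows))  # 0.1, whereas a call-count definition would give 0.5
```

The two definitions diverge whenever escalated calls are much shorter or longer than local ones, which is why a single canonical definition matters.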
- `analysis/arqgc.py` + `analysis/decision_matrix_v2.py`: unused. The v1.4 decision matrix ranks cells by pass-rate then median cost; no ARQGC anywhere in the pipeline.
- `agents/claude_code.py`: deferred to v1.5. v1.4.2 surfaces four agents: `aider`, `opencode`, `mini-swe-agent`, `cline`.
- `docs/REPRODUCING.md`, `docs/ARCHITECTURE.md`, `docs/METHODOLOGY.md`, `docs/ROUTING_STRATEGIES.md`, `docs/AGENTIC_ROUTES.md`, `docs/HYBRID_ROUTER_DESIGN.md`, `docs/PRIOR_ART.md`, `docs/BENCHMARK_NEW_MODEL.md`: consolidated into `docs/HYBRID_ROUTING_DESIGN.md`.
- `docs/audits/`: moved to gitignored `personal/audits/` (internal review artefacts, not OSS surface).
- `examples/`: stale; instructions live in `README.md` + `docs/HYBRID_ROUTING_DESIGN.md` now.
- Stale `EXTERNAL/minions/` / `vendor/minions/` entries removed from `.gitignore`.
- Two `personal/raw-runs/v4*.yaml` files untracked (were committed despite the `personal/` gitignore).
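
The ranking rule mentioned in the first item of this list (pass-rate first, then median cost) amounts to a two-part sort key. The sketch below assumes ties on pass-rate are broken by the lower median cost; the cell names, field names, and numbers are invented for illustration and are not benchmark results.

```python
from statistics import median


def rank_cells(cells: dict[str, dict]) -> list[str]:
    """Order cell keys by pass-rate (high first), breaking ties by median cost (low first)."""
    def sort_key(key: str) -> tuple[float, float]:
        cell = cells[key]
        pass_rate = cell["passes"] / cell["runs"]
        return (-pass_rate, median(cell["costs_usd"]))

    return sorted(cells, key=sort_key)


# Illustrative numbers only.
cells = {
    "cell-a": {"passes": 9, "runs": 10, "costs_usd": [0.10, 0.20, 0.30]},
    "cell-b": {"passes": 10, "runs": 10, "costs_usd": [0.50, 0.60, 0.70]},
    "cell-c": {"passes": 10, "runs": 10, "costs_usd": [0.05, 0.06, 0.07]},
}
print(rank_cells(cells))  # ['cell-c', 'cell-b', 'cell-a']
```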
1.4.1 — 2026-05-25
3-model agentic leaderboard. v1.4.1 adds 936 rows across two new canonical sweeps (qwen3-coder:30b + qwen3.6:35b) — completing the 3-model leaderboard envisioned in the original v1.4 plan. Combined v1.4 + v1.4.1: 1,644 rows of agentic-only data across 3 local models × 3 agents × 4-8 strategies × 13 tasks × 3 seeds.
| Cell | Pass-rate | Cloud-fraction |
|---|---|---|
| cline + qwen3.6 + cascade + refactors | 24/24 = 100% [100, 100] | low (~5-10%) |
| cline + qwen3.6 + heuristic + refactors | 22/24 = 92% | ~7% |
| cline + qwen3-coder + heuristic + refactors | 22/24 = 92% | ~7% |
| cline + qwen3.6 + always-local + puzzles | 15/15 = 100% [100, 100] | 0% |
cline + qwen3.6 is the new winner for refactors — matches aider's marquee 96% / equals it under cascade, at a fraction of the cloud spend. Two qwen variants (3-coder, 3.6) both deliver cline+heuristic+refactors at 92%.
- qwen3.6:35b is the unsung champion. cline + qwen3.6 nails everything: 100% on puzzles always-local, 100% on refactors cascade, 92% heuristic. The model that wasn't in v1.4.0 turns out to be the strongest local for cline's protocol.
- opencode is gemma4-specific. opencode + gemma4 + heuristic + refactors = 71% (v1.4.0 resurrection). opencode + qwen3-coder = 21%. opencode + qwen3.6 = 33%. The v1.4.0 fix doesn't transfer to qwen models: opencode's runLoop requires the model to produce clean tool_calls, which gemma4 does and qwen models don't reliably.
- Aider is model-sensitive too. aider + heuristic + refactors = 96% on gemma4, 50% on qwen3.6, 33% on qwen3-coder. Aider's architect/editor protocol works best with gemma4's dense-generalist training.
- `router/server.mjs` local-guard fix (commit `c7392db`): `ROUTER_LOCAL_NUM_PREDICT_CAP=4096`, `ROUTER_LOCAL_REQUEST_TIMEOUT_MS=180000`, `ROUTER_LOCAL_REPEAT_PENALTY=1.1`. Three model-agnostic env-overridable guards in `fetchLocalOllamaAsOpenAI()`. Discovered + fixed during v1.4.1 sweep 4 when qwen3-coder's weak `repeat_penalty=1.05` + unbounded `num_predict` (cline/opencode don't set `max_tokens`) caused a runaway repetition loop that crashed Ollama. Full RCA at `personal/iterations/v1.4.1/qwen3-coder-timeout-rca.md`.
- `configs/v1.4-canonical-qwen3-coder.yaml` and `configs/v1.4-canonical-qwen3.6.yaml`: the two new canonical configs (came in v1.4-rc1 but actually exercised in v1.4.1).
- 2 new release artifacts: `results-v1.4.1.tar.gz` (15 MB, both qwen sweep dirs), v1.4.1 article HTML with code-generated visualizations.
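
The actual guards live in `router/server.mjs` (JavaScript); the Python sketch below only illustrates the general pattern of env-overridable defaults applied to a request. The helper names `local_guards` and `apply_guards` are hypothetical, and treating the repeat penalty as a floor and the `num_predict` cap as a ceiling is an assumption about how the guards behave.

```python
import os
from typing import Mapping

# Defaults as listed in the changelog entry; each can be overridden via the environment.
DEFAULTS = {
    "ROUTER_LOCAL_NUM_PREDICT_CAP": 4096,
    "ROUTER_LOCAL_REQUEST_TIMEOUT_MS": 180000,
    "ROUTER_LOCAL_REPEAT_PENALTY": 1.1,
}


def local_guards(env: Mapping[str, str] = os.environ) -> dict:
    """Read each guard from the environment, falling back to the default above."""
    return {name: type(default)(env.get(name, default)) for name, default in DEFAULTS.items()}


def apply_guards(options: dict, guards: dict) -> dict:
    """Cap num_predict and (by assumption) enforce a repeat_penalty floor on request options."""
    cap = guards["ROUTER_LOCAL_NUM_PREDICT_CAP"]
    requested = options.get("num_predict")
    guarded = dict(options)
    # An unset num_predict is treated as unbounded, so it gets the cap.
    guarded["num_predict"] = cap if requested is None else min(requested, cap)
    guarded["repeat_penalty"] = max(
        options.get("repeat_penalty", 0.0), guards["ROUTER_LOCAL_REPEAT_PENALTY"]
    )
    return guarded


print(apply_guards({"repeat_penalty": 1.05}, local_guards({})))
```

Keeping the guards model-agnostic means any local model that omits `max_tokens` gets the same bounded generation length.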
- The v1.4.0 marquee aider+heuristic+gemma4+refactors = 96% replicates exactly in the v1.4.1 data — refreshed code, same headline.
- The v1.4.0 cline+gemma4+always-local+puzzles = 100% also confirmed in v1.4.1's gemma4 columns (no change).
v1.4.1 sweeps spent $50 incremental cloud (gpt-5.5 list pricing). Total v1.4 line (v1.4.0 + v1.4.1): **$140 list / ~$80 cache-adjusted across 1,644 rows.**
1.4.0 — 2026-05-22
Cleanup + production-pipeline release. v1.4 deletes the legacy non-agentic R1–R5 routes and the experimental Stanford-Minion / DevMinion wrappers, so the harness is now agent-only (aider · opencode · mini-swe-agent · claude-code · cline). Drops the `Rn` prefix; renames `runners/` → `agents/` and `benchmarks/` → `tasks/`. Adds 5 production lifecycle commands (`./arena start|pause|resume|stop|status`). `arena sweep` auto-spawns the router proxy from `models.local`, so the canonical reproducer is now four copy-paste commands.
Sweeps complete: 468 v1.4-canonical-gemma4 + 48 v1.4-opencode-fairness + 192 v1.4-strategy-sweep. qwen3-coder + qwen3.6 canonical sweeps queued for v1.4.1.
- Marquee Pareto win: aider + heuristic on refactors → 23/24 = 96% [88, 100] at 48% cloud-fraction (~52% token spend reduction vs always-cloud 24/24 = 100%). Replicates v1.3.0's headline with refreshed code.
- NEW: cline + always-local on puzzles → 15/15 = 100% with zero cloud — first 30B local-only result that nails Exercism Python puzzles (vs aider always-local 3/15, opencode 0/15).
- NEW: opencode RESURRECTED with gemma4 → 17/24 = 71% heuristic on refactors (vs v1.1.x's 0/15 with qwen3-coder). The fork-audit's "model + NUDGE" hypothesis verified. Puzzles still 0/15 — runLoop hard-exit ceiling.
- NEW: cascade is dead in agentic regime — heuristic ≥ cascade in every (agent, task-class) cell. Strategy-tuning is not the lever; agent + task-class selection is.
- `configs/v1.4-canonical.yaml`: the single canonical v1.4 sweep config covering 5 agentic routes × 8 strategies × 18 tasks × 3 local models. Drop-in surface for new local models.
- `arena sweep` auto-spawns the router proxy: reads `models.local` from the config, spawns `node router/server.mjs` with `LOCAL_MODEL=<model>`, waits for `/healthz`, runs the sweep, tears the router down on completion. Eliminates the manual `(cd router && ./start.sh) &` step from the reproducer.
- `--external-router` flag on `arena sweep`: opt-out for users who want to manage the router proxy themselves.
- `docs/release-notes/v1.4.0.md`: tracked-in-git release notes for v1.4.0 (replaces the GH-release-only `findings.md` from v1.0–v1.3).
- "How to read the results" cell→headline map: maps each headline number to its exact `bootstrap_cis.json` cell key with `jq` examples.
- `pydantic` and `pyyaml` in `requirements.txt` (were pyproject-only, which broke fresh `pip install -r requirements.txt` installs).
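
The spawn, wait-for-health, teardown lifecycle described in this list could be structured roughly as below. This is a sketch under assumptions, not the code behind `arena sweep`: the helper names, timeouts, and poll interval are invented, and the health URL is left as a parameter because the proxy's port is not stated here.

```python
import os
import subprocess
import time
import urllib.request
from contextlib import contextmanager


def wait_for_health(url: str, timeout_s: float = 30.0, probe=None, sleep=time.sleep) -> bool:
    """Poll url until the probe succeeds or timeout_s elapses."""
    def http_probe(target: str) -> bool:
        try:
            with urllib.request.urlopen(target, timeout=2) as resp:
                return resp.status == 200
        except OSError:  # covers connection refused and urllib's URLError
            return False

    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(0.5)
    return False


@contextmanager
def router_proxy(local_model: str, health_url: str):
    """Spawn the router, wait for it to report healthy, and always tear it down."""
    env = {**os.environ, "LOCAL_MODEL": local_model}
    proc = subprocess.Popen(["node", "router/server.mjs"], env=env)
    try:
        if not wait_for_health(health_url):
            raise RuntimeError("router proxy did not become healthy in time")
        yield proc
    finally:
        proc.terminate()
        proc.wait(timeout=10)
```

Wrapping the lifecycle in a context manager means the router is torn down even when the sweep inside the `with` block raises.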
- README rewritten for v1.4: v1.4 hero, 4-command quickstart, 5 agentic routes, 8 strategies, 3 local models, CI badge, v1.3.0 carry-over preview headline.
- `AGENTS.md`: reflects post-cleanup `agents/` directory + auto-spawn-router workflow.
- `pyproject.toml`: version bumped to 1.4.0.
- Task class names: `X` → `puzzles` (Exercism Python) and `D` → `refactors` (real-developer D-tasks). The v1.4 task classes are surfaced in `bootstrap_cis.json` cell keys and headline tables.
- R1 cloud-only, R2 local-only, R3 hybrid-architect runners: legacy non-agentic routes deleted. The canonical v3 dataset (`results/runs/07-v3-devstral-all-routes/`) stays tracked for historical reproducibility.
- R4 Stanford-Minion, R5 Stanford-DevMinion runners: experimental wrappers around `vendor/minions/` deleted. The `mcp` / `rank_bm25` / `requests` transitive deps are no longer needed.
- CI `Clone vendored Minions` + `Install Minions transitive deps` steps: eliminated from `.github/workflows/ci.yml` now that R4/R5 are gone. Fresh `pip install -e ".[dev]"` is enough.
- Legacy `runners/` directory renamed to `agents/` to reflect what it actually contains post-cleanup.
- `requirements.txt` was missing `pydantic` and `pyyaml`: they were declared only in `pyproject.toml`'s `[project.dependencies]`. Fresh-install users hitting `pip install -r requirements.txt` got `ModuleNotFoundError: pydantic` from the config loader. Now mirrored in both files with the same pins.
- `arena sweep` required a manually-started router: the reproducer recipe was incomplete (audit finding #4). Auto-spawn-router fixes the "I followed the GH release recipe and got Connection refused" class of bug.
The v1.0.0 → v1.3.0 release lineage is preserved on the GitHub releases page. Highlights below; see each release's notes for full per-version detail.
- v1.3.0 (2026-05-20): Multi-model + threshold sweep release. 507 rows across 3 sweeps. First hybrid-equals-cloud result with statistical significance: gemma4:31b + heuristic = 96% [88, 100] on `real_dev` D-tasks. (See GH release `v1.3.0` for full notes.)
- v1.2.0 (2026-05-19): Single-agent v1.2 release. Locked in R7 aider as the canonical agentic route. 60-row canonical sweep with qwen3-coder:30b. (See GH release `v1.2.0`.)
- v1.1.3 / v1.1.2 / v1.1.1 / v1.1.0 (2026-05-19): Agentic-routes release lineage. Added R8 opencode, the Exercism Python benchmark (category X), the agent-aware `heuristic` strategy, `./arena sweep`, bootstrap CIs, correlation-id token attribution. (See GH release tags `v1.1.x`.)
- v1.0.0 (2026-05-18): First public OSS release. R1–R5 non-agentic surface, 250-row v3 publication sweep, `./arena setup`. (See GH release `v1.0.0`.)
- Pre-1.0 (v0.x → v3.x): Internal research iterations. The v3.3 sweep (3,581 rows, 33 variants, 6 local models) is the canonical pre-1.0 corpus under `results/runs/`. The 250-row v3 sweep at `results/runs/07-v3-devstral-all-routes/` is preserved bit-identically.