Release date: 2026-05-27
Theme: finish the v1.4.3 cleanup. A fresh-user replay from
git clone to arena analyze on a clean machine surfaced two last
paper cuts that v1.4.3 didn't catch — this release fixes them.
No new benchmark data; the v1.4.1 leaderboard (1,644 rows) stands.
matplotlib and numpy are needed by the visualization pipeline
(viz/cost_quality_pareto.py, viz/decision_heatmap.py), but in
v1.4.3 they were only listed in requirements.txt. The canonical
install path is pip install -e ".[dev]", which reads
pyproject.toml::[project.dependencies] — and didn't pull them in.
A fresh user ran ./scripts/reproduce.sh --smoke and saw:
[reproduce] analyzing smoke results…
Traceback (most recent call last):
...
ModuleNotFoundError: No module named 'matplotlib'

v1.4.4 declares matplotlib>=3.8,<4 and numpy>=1.26,<3 as
first-class runtime dependencies in pyproject.toml. The sweep
finishes and arena analyze emits all three charts on the very
first run, with no extra pip install … step.
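
A quick way to confirm that an environment has picked up the new runtime dependencies is to check that both modules can be located before running the sweep. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, not part of the arena codebase, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


# The two modules the visualization pipeline needs, per the notes above.
print(missing_modules(["matplotlib", "numpy"]))
```

An empty list means both packages are available to the interpreter that ran the check.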
The v1.4.3 cleanup renamed every R-number reference in docstrings, comments, tests, and module names — but missed four inline string templates in the agent runners that built per-task scratch directories. v1.4.4 finishes the job:
| Agent | Before (≤ v1.4.3) | After (v1.4.4) |
|---|---|---|
| aider | outputs/r7_<task>_<strategy>/ | outputs/aider_<task>_<strategy>/ |
| cline | outputs/r10_<task>_<strategy>/ | outputs/cline_<task>_<strategy>/ |
| opencode | outputs/r8_<task>_<strategy>/ | outputs/opencode_<task>_<strategy>/ |
| mini-swe-agent | outputs/r6_<task>_<strategy>/ | outputs/mini-swe-agent_<task>_<strategy>/ |
The default output_dir per agent (when called without --out)
also switches from results/r6/ / results/r7/ / results/r8/ /
results/r10/ to results/<agent-name>/.
The output_ref field in raw.jsonl and aggregate.json reflects
the new paths immediately. Old datasets in
results/runs/{01..04, 07, 11, 17, 18, 26, 27}/ keep their original
r7_* / r6_* paths (they're immutable historical sweeps).
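
The new naming scheme can be sketched as a pair of small path builders. This is a minimal illustration under stated assumptions, not the runners' actual code: `scratch_dir` and `default_output_dir` are hypothetical names, and the replacement of "/" with "__" in the task id is inferred from the sample raw.jsonl row in these notes (exercism-python/grep appears as exercism-python__grep).

```python
def scratch_dir(agent: str, task_id: str, strategy: str) -> str:
    """Build the per-task scratch directory using the agent name as the prefix."""
    # Assumption: "/" in a task id is flattened to "__", as the sample row suggests.
    safe_task = task_id.replace("/", "__")
    return f"outputs/{agent}_{safe_task}_{strategy}/"


def default_output_dir(agent: str) -> str:
    """Default results directory when no --out is given."""
    return f"results/{agent}/"


print(scratch_dir("aider", "exercism-python/grep", "always-cloud"))
print(default_output_dir("mini-swe-agent"))
```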
The first thing the reproducer does on a fresh machine is build a
.venv. On macOS, python3 typically resolves to the latest
installed interpreter (3.13 or 3.14 these days), which breaks
several agent installers — aider-chat's pyproject_hooks
bootstrap expects a setuptools shim that 3.13/3.14 dropped from
the stdlib bootstrap. The result was a confusing
ModuleNotFoundError: No module named 'pyproject_hooks' on a fresh
install for anyone whose python3 happened to be too new.
v1.4.4 makes the reproducer:
- Probe python3.12 first, then python3.11, and exit with a clear "missing prerequisite" message + brew install python@3.12 hint if neither is available.
- Detect a stale .venv pinned to an incompatible interpreter (e.g. one left behind by an earlier python3 invocation) and recreate it automatically.
End result: ./scripts/reproduce.sh --smoke Just Works on a machine
where python3 points to 3.14, as long as the user installed
3.11/3.12 from Homebrew per the README prerequisites table.
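
The probe-and-recreate logic can be illustrated in Python, although the reproducer itself is a shell script and its real implementation may differ. In this sketch `pick_interpreter` and `venv_is_stale` are hypothetical names. The stale check assumes the .venv was created by the standard library's venv module, which records the interpreter as a `version = X.Y.Z` line in pyvenv.cfg.

```python
import shutil

SUPPORTED_MINORS = ((3, 12), (3, 11))


def pick_interpreter(candidates=("python3.12", "python3.11"), which=shutil.which):
    """Return the path of the first candidate found on PATH, in preference order."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    raise RuntimeError(
        "missing prerequisite: python3.12 or python3.11 not found; "
        "try `brew install python@3.12`"
    )


def venv_is_stale(pyvenv_cfg_text, supported=SUPPORTED_MINORS):
    """True if pyvenv.cfg names an unsupported interpreter or no version at all."""
    for line in pyvenv_cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "version":
            parts = value.strip().split(".")
            try:
                return (int(parts[0]), int(parts[1])) not in supported
            except (IndexError, ValueError):
                return True  # unparseable version: safer to recreate
    return True  # no version recorded: treat as stale


print(venv_is_stale("home = /opt/homebrew/bin\nversion = 3.14.0\n"))
```

Passing `which` as a parameter keeps the probe testable without touching the real PATH.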
refactors.Task.category defaulted to "D" and
real_prs.Task.category defaulted to "B" in the dataclass field
definitions, even though the parsers in both adapters override them
to "refactors" / "real-prs". If a downstream caller constructed
a Task directly (no _parse_task), the legacy letter would leak
back in. v1.4.4 sets the dataclass defaults to the modern names.
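
A minimal sketch of the corrected defaults follows. Only the `category` field and its default values come from these notes. The class names `RefactorTask` and `RealPrTask` and the `task_id` field are placeholders standing in for refactors.Task and real_prs.Task, whose full definitions are not shown here.

```python
from dataclasses import dataclass


@dataclass
class RefactorTask:
    """Stand-in for refactors.Task (other fields are hypothetical)."""
    task_id: str
    category: str = "refactors"  # was "D" up to v1.4.3


@dataclass
class RealPrTask:
    """Stand-in for real_prs.Task (other fields are hypothetical)."""
    task_id: str
    category: str = "real-prs"  # was "B" up to v1.4.3


# Constructing a task directly, without going through a parser,
# now yields the modern category name.
print(RefactorTask("example-task").category)
```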
rm -rf .venv
./scripts/reproduce.sh --smoke

Observed timing + output on macOS 25.5.0 (Apple M4 Max, Python 3.12.x via Homebrew):
[reproduce] running SMOKE sweep (configs/v1.4-smoke.yaml)…
[ 1/ 1] puzzles aider exercism-python/grep wall=48827ms tokens=17392 PASS
=== sweep summary ===
total passes : 1 successful : 1 failed : 0
output : results/runs/v1.4-smoke
next step : ./arena analyze results/runs/v1.4-smoke
[reproduce] analyzing smoke results…
pipeline complete — 1 rows
aggregate.json ✓
bootstrap_cis.json ✓
decision_matrix.md ✓
charts/pareto.png ✓
charts/heatmap_quality.png ✓
charts/heatmap_cost.png ✓
./scripts/reproduce.sh --smoke 1m 26s total

raw.jsonl row:
{
"task_id": "exercism-python/grep",
"category": "puzzles",
"route": "aider",
"output_ref": "results/runs/v1.4-smoke/.../outputs/aider_exercism-python__grep_always-cloud/answer.py",
"router_strategy": "always-cloud",
"tokens": {"prompt": 15811, "completion": 2159, "cached": 12544, "reasoning": 1030, ...},
"latency": {"wall_ms": 47761, ...},
"quality": {"functional_pass": true, "tests_passed": 25, "tests_total": 25, ...}
}

No r7_ / r6_ / r8_ / r10_ prefixes anywhere in the surface.
- No data was re-run. v1.4.1 leaderboard (1,644 rows) is the current canonical dataset.
- No public API change. Every flag in ./arena … works identically. output_ref is the only field whose value template changed, and only on newly generated rows.
- Pricing tables unchanged. Same pricing_tables.json SHA256.
- Tests + ruff still green. pytest -m 'not slow' → 109 passed.
git pull
.venv/bin/pip install --upgrade -e ".[dev]"
.venv/bin/pytest tests/ -q -m 'not slow' # 109/109

If you wrote anything that grepped r7_ / r6_ / r8_ / r10_ out
of output_ref, switch to the agent-name prefix (aider_ /
mini-swe-agent_ / opencode_ / cline_). Historical sweeps still
use the legacy paths.
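
For tooling that has to read both historical and new sweeps, one option is to normalize either prefix to the agent name. The sketch below is an assumption-laden illustration: `agent_of` is a hypothetical helper, the legacy-to-agent mapping is taken from the before/after table in these notes, and the pattern assumes the scratch directory always sits directly under an outputs/ component.

```python
import re

# Legacy R-number prefixes and the agents they map to, per the table above.
LEGACY_TO_AGENT = {
    "r6": "mini-swe-agent",
    "r7": "aider",
    "r8": "opencode",
    "r10": "cline",
}

_PREFIX = re.compile(r"outputs/(r6|r7|r8|r10|aider|cline|opencode|mini-swe-agent)_")


def agent_of(output_ref):
    """Return the agent name for an output_ref, or None if no known prefix is found."""
    match = _PREFIX.search(output_ref)
    if match is None:
        return None
    token = match.group(1)
    # Legacy tokens are translated; new-style tokens are already agent names.
    return LEGACY_TO_AGENT.get(token, token)


print(agent_of("outputs/r7_some-task_always-cloud/answer.py"))
```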
- Files changed: 6 (deps + 4 agent runners + 2 adapter defaults)
- Lines added: ~15
- Lines removed: ~15
- Net delta: ~0. A surgical patch, but it's the difference
between "fresh user pip-installs and arena analyze crashes" and
"fresh user pip-installs and arena analyze ships charts on the first try."