v1.4.4 — fresh-user reproducibility patch

Release date: 2026-05-27

Theme: finish the v1.4.3 cleanup. A fresh-user replay, from git clone through arena analyze on a clean machine, surfaced the remaining paper cuts that v1.4.3 didn't catch. This release fixes them.

No new benchmark data; the v1.4.1 leaderboard (1,644 rows) stands.

What changed

1. arena analyze works on a clean pip install -e ".[dev]"

matplotlib and numpy are needed by the visualization pipeline (viz/cost_quality_pareto.py, viz/decision_heatmap.py), but in v1.4.3 they were only listed in requirements.txt. The canonical install path is pip install -e ".[dev]", which resolves runtime dependencies from [project.dependencies] in pyproject.toml and never reads requirements.txt, so neither package was installed.

A fresh user ran ./scripts/reproduce.sh --smoke and saw:

[reproduce] analyzing smoke results…
Traceback (most recent call last):
  ...
ModuleNotFoundError: No module named 'matplotlib'

v1.4.4 declares matplotlib>=3.8,<4 and numpy>=1.26,<3 as first-class runtime dependencies in pyproject.toml. The sweep finishes and arena analyze emits all three charts on the very first run, with no extra pip install … step.
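A quick way to confirm that a fresh environment actually has the visualization dependencies is to probe for them before running the sweep. The following is a minimal sketch, not part of the project: `missing_modules` is a hypothetical helper name, and it only checks that the modules can be located, not that their versions satisfy the declared ranges.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located
    by the current interpreter (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # matplotlib and numpy are the two packages v1.4.4 promotes to
    # runtime dependencies.
    missing = missing_modules(["matplotlib", "numpy"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Run with .venv/bin/python after the editable install; an empty result means the install path pulled both packages in.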

2. Per-agent scratch dirs drop the legacy R-prefix

The v1.4.3 cleanup renamed every R-number reference in docstrings, comments, tests, and module names — but missed four inline string templates in the agent runners that built per-task scratch directories. v1.4.4 finishes the job:

| Agent | Before (≤ v1.4.3) | After (v1.4.4) |
| --- | --- | --- |
| aider | outputs/r7_<task>_<strategy>/ | outputs/aider_<task>_<strategy>/ |
| cline | outputs/r10_<task>_<strategy>/ | outputs/cline_<task>_<strategy>/ |
| opencode | outputs/r8_<task>_<strategy>/ | outputs/opencode_<task>_<strategy>/ |
| mini-swe-agent | outputs/r6_<task>_<strategy>/ | outputs/mini-swe-agent_<task>_<strategy>/ |

The default output_dir per agent (when called without --out) also switches from results/r6/ / results/r7/ / results/r8/ / results/r10/ to results/<agent-name>/.

The output_ref field in raw.jsonl and aggregate.json reflects the new paths immediately. Old datasets in results/runs/{01..04, 07, 11, 17, 18, 26, 27}/ keep their original r7_* / r6_* paths (they're immutable historical sweeps).
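The new naming scheme can be sketched as a single template shared by all runners. This is an illustration under assumptions, not the runners' actual code: `scratch_dir` is a hypothetical function, and the replacement of "/" with "__" in the task id is inferred from the sample output_ref in the verification section (aider_exercism-python__grep_always-cloud).

```python
from pathlib import PurePosixPath


def scratch_dir(agent, task_id, strategy, root="outputs"):
    """Build a per-task scratch directory path keyed by agent name.

    Hypothetical helper. The "/" -> "__" substitution is an assumption
    inferred from the sample output_ref in these notes.
    """
    safe_task = task_id.replace("/", "__")
    return PurePosixPath(root) / f"{agent}_{safe_task}_{strategy}"


if __name__ == "__main__":
    print(scratch_dir("aider", "exercism-python/grep", "always-cloud"))
```

Keeping the agent name in one template, rather than one string literal per runner, is what would prevent a stray legacy prefix from surviving a future rename.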

3. scripts/reproduce.sh picks the right Python automatically

The first thing the reproducer does on a fresh machine is build a .venv. On macOS, python3 typically resolves to the latest installed interpreter (3.13 or 3.14 these days), which breaks several agent installers — aider-chat's pyproject_hooks bootstrap expects a setuptools shim that 3.13/3.14 dropped from the stdlib bootstrap. The result was a confusing ModuleNotFoundError: No module named 'pyproject_hooks' on a fresh install for anyone whose python3 happened to be too new.

v1.4.4 makes the reproducer:

  1. Probe python3.12 first, then python3.11, and exit with a clear "missing prerequisite" message and a brew install python@3.12 hint if neither is available.
  2. Detect a stale .venv pinned to an incompatible interpreter (e.g. one left behind by an earlier python3 invocation) and recreate it automatically.

End result: ./scripts/reproduce.sh --smoke Just Works on a machine where python3 points to 3.14, as long as the user installed 3.11/3.12 from Homebrew per the README prerequisites table.
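The two steps above can be sketched in Python for illustration. The real logic lives in scripts/reproduce.sh and is not shown here; `pick_python` and `venv_is_stale` are hypothetical names, the error text is illustrative, and the staleness check assumes the .venv was created by the standard library venv module, which records a `version = X.Y.Z` line in pyvenv.cfg.

```python
import re
import shutil

# Probe order described in the notes: 3.12 first, then 3.11.
PREFERRED = ("python3.12", "python3.11")


def pick_python(candidates=PREFERRED, which=shutil.which):
    """Return the path of the first preferred interpreter on PATH,
    or exit with a prerequisite hint (illustrative message text)."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    raise SystemExit(
        "missing prerequisite: python3.12 or python3.11 "
        "(hint: brew install python@3.12)"
    )


def venv_is_stale(cfg_text, allowed=("3.11", "3.12")):
    """Decide from pyvenv.cfg contents whether a venv should be rebuilt.

    Assumes the stdlib venv layout; a file without a recognizable
    version line is treated as stale.
    """
    match = re.search(r"^version\s*=\s*(\d+\.\d+)", cfg_text, flags=re.MULTILINE)
    if match is None:
        return True
    return match.group(1) not in allowed
```

A caller would read .venv/pyvenv.cfg if it exists, delete and recreate the directory when `venv_is_stale` returns True, and build the new environment with the interpreter returned by `pick_python`.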

4. Adapter dataclass defaults align with the v1.4.3 rename

refactors.Task.category defaulted to "D" and real_prs.Task.category defaulted to "B" in the dataclass field definitions, even though the parsers in both adapters override them to "refactors" / "real-prs". If a downstream caller constructed a Task directly (no _parse_task), the legacy letter would leak back in. v1.4.4 sets the dataclass defaults to the modern names.
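A minimal sketch of the aligned defaults follows. The class names `RefactorTask` and `RealPRTask` are stand-ins for refactors.Task and real_prs.Task, and `task_id` is an assumed field; only the category defaults come from these notes.

```python
from dataclasses import dataclass


@dataclass
class RefactorTask:
    """Stand-in for refactors.Task (fields other than category are assumed)."""
    task_id: str
    category: str = "refactors"  # was "D" up to v1.4.3


@dataclass
class RealPRTask:
    """Stand-in for real_prs.Task (fields other than category are assumed)."""
    task_id: str
    category: str = "real-prs"  # was "B" up to v1.4.3


if __name__ == "__main__":
    # Direct construction, bypassing any parser, now yields the modern name.
    print(RefactorTask(task_id="example").category)
    print(RealPRTask(task_id="example").category)
```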

Fresh-user verification

rm -rf .venv
./scripts/reproduce.sh --smoke

Observed timing + output on macOS 25.5.0 (Apple M4 Max, Python 3.12.x via Homebrew):

[reproduce] running SMOKE sweep (configs/v1.4-smoke.yaml)…
[  1/  1] puzzles aider exercism-python/grep   wall=48827ms tokens=17392 PASS

=== sweep summary ===
  total passes : 1   successful : 1   failed : 0
  output       : results/runs/v1.4-smoke
  next step    : ./arena analyze results/runs/v1.4-smoke

[reproduce] analyzing smoke results…
pipeline complete — 1 rows
  aggregate.json        ✓
  bootstrap_cis.json    ✓
  decision_matrix.md    ✓
  charts/pareto.png     ✓
  charts/heatmap_quality.png  ✓
  charts/heatmap_cost.png     ✓

./scripts/reproduce.sh --smoke   1m 26s total

raw.jsonl row:

{
  "task_id": "exercism-python/grep",
  "category": "puzzles",
  "route": "aider",
  "output_ref": "results/runs/v1.4-smoke/.../outputs/aider_exercism-python__grep_always-cloud/answer.py",
  "router_strategy": "always-cloud",
  "tokens": {"prompt": 15811, "completion": 2159, "cached": 12544, "reasoning": 1030, ...},
  "latency": {"wall_ms": 47761, ...},
  "quality": {"functional_pass": true, "tests_passed": 25, "tests_total": 25, ...}
}

No r7_ / r6_ / r8_ / r10_ prefixes appear anywhere in the console output or the generated rows.

What did NOT change

  • No data was re-run. The v1.4.1 leaderboard (1,644 rows) remains the canonical dataset.
  • No public API change. Every flag in ./arena … works identically. output_ref is the only field whose value template changed, and only on newly generated rows.
  • Pricing tables unchanged. Same pricing_tables.json SHA256.
  • Tests + ruff still green. pytest -m 'not slow' → 109 passed.

Upgrade notes for v1.4.3 users

git pull
.venv/bin/pip install --upgrade -e ".[dev]"
.venv/bin/pytest tests/ -q -m 'not slow'  # 109/109

If you wrote anything that grepped r7_ / r6_ / r8_ / r10_ out of output_ref, switch to the agent-name prefix (aider_ / mini-swe-agent_ / opencode_ / cline_). Historical sweeps still use the legacy paths.
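For downstream scripts that need to read both new and historical sweeps, one option is to normalize the prefix instead of matching on it directly. The sketch below is an assumption about how such a consumer might look: `agent_from_output_ref` is a hypothetical helper, and the legacy mapping is taken from the table in section 2.

```python
from pathlib import PurePosixPath

# Legacy prefix -> agent name, per the table in section 2.
LEGACY_PREFIXES = {
    "r6": "mini-swe-agent",
    "r7": "aider",
    "r8": "opencode",
    "r10": "cline",
}
KNOWN_AGENTS = set(LEGACY_PREFIXES.values())


def agent_from_output_ref(output_ref):
    """Return the agent name encoded in an output_ref path, accepting
    both v1.4.4 and legacy prefixes; None if it cannot be determined."""
    parts = PurePosixPath(output_ref).parts
    if "outputs" not in parts:
        return None
    idx = parts.index("outputs")
    if idx + 1 >= len(parts):
        return None
    # Agent names contain hyphens but no underscores, so the text before
    # the first underscore is the prefix in both naming schemes.
    prefix = parts[idx + 1].split("_", 1)[0]
    if prefix in KNOWN_AGENTS:
        return prefix
    return LEGACY_PREFIXES.get(prefix)
```

This keeps a single code path for mixed datasets, since the historical runs are immutable and will keep their legacy paths.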

Statistics

  • Files changed: 6 (deps + 4 agent runners + 2 adapter defaults)
  • Lines added: ~15
  • Lines removed: ~15
  • Net delta: ~0 — surgical patch, but it's the difference between "fresh user pip-installs and arena analyze crashes" and "fresh user pip-installs and arena analyze ships charts on the first try."