Date: 2026-03-13
Batch:
- mode: `write_only`
- runner: `treatment` with `yoyo`
- tasks:
  - `ripgrep-global-gitignore-rust-001`
  - `uuid-go-001`
  - `httprouter-go-001`
Results:
- `ripgrep-global-gitignore-rust-001`
  - completed successfully
  - exact regression passed: `cargo test --test integration regression::r3179_global_gitignore_cwd -- --exact`
  - baseline worktree after setup already included `tests/regression.rs`
  - agent delta beyond the injected puncture:
    - `crates/core/flags/hiargs.rs`
    - `crates/ignore/src/dir.rs`
    - `crates/ignore/src/walk.rs`
  - metrics:
    - total tool calls: 40
      - `yoyo` MCP tool calls: 37
      - shell tool calls: 3
    - retries: 33
- `uuid-go-001`
  - completed successfully
  - task verify passed with a temp Go cache: `GOCACHE=/tmp/go-cache-uuid-manual go test ./...`
  - patch stayed scoped to `uuid.go`
  - metrics:
    - total tool calls: 11
      - `yoyo` MCP tool calls: 9
      - shell tool calls: 2
    - retries: 4
- `httprouter-go-001`
  - completed successfully
  - task verify passed with a temp Go cache: `GOCACHE=/tmp/go-cache-httprouter-manual go test ./...`
  - patch stayed scoped to the intended surface and left no tracked diff after restoring punctured files to `HEAD`
  - metrics:
    - total tool calls: 12
      - `yoyo` MCP tool calls: 9
      - shell tool calls: 3
    - retries: 5
Harness design:
- These are directed write-only evals, not autonomous repair runs.
- The engineer first narrows the fix surface in the task definition.
- The model is then asked to:
  - make the minimal patch now
  - use `change` if available
  - do at most 2 confirming reads before the first edit
  - avoid unrelated refactors
  - run the narrowest relevant verification command
- `write_only` is intentionally separated from `read_only` and `read_then_write` so write quality can be measured once the target surface is known.
- Scope quality for puncture-backed write evals should be judged relative to the post-setup baseline, not raw final `git status`, because the correct outcome can restore punctured files back to `HEAD`.
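
The baseline-relative scope check can be sketched as a set comparison. This is a minimal illustration, not the harness's actual code: the `Snapshot` type (path mapped to a content hash), the `changed_paths` helper, and the file names and hashes are all hypothetical and chosen only to show why a raw final status can under-report the agent's work.

```python
from typing import Dict, Set

# Hypothetical representation: path -> content hash for one worktree state.
Snapshot = Dict[str, str]


def changed_paths(before: Snapshot, after: Snapshot) -> Set[str]:
    """Return paths added, removed, or modified between two snapshots."""
    return {p for p in set(before) | set(after) if before.get(p) != after.get(p)}


# Illustrative values only; not taken from any real run.
head = {"lib.go": "aaa", "lib_test.go": "bbb"}
baseline = {"lib.go": "punctured", "lib_test.go": "bbb"}  # setup injected the puncture
final = {"lib.go": "aaa", "lib_test.go": "bbb"}           # agent restored the file to HEAD

raw_status = changed_paths(head, final)       # what a final `git status` would reflect
agent_delta = changed_paths(baseline, final)  # what the agent changed after setup

print(sorted(raw_status))   # []
print(sorted(agent_delta))  # ['lib.go']
```

In this sketch the raw status is empty even though the agent made the correct edit, which is why the delta is measured against the post-setup baseline.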
Notes:
- The Go tasks exposed a sandbox detail rather than a product bug: `go test ./...` needs a writable `GOCACHE` when run outside the model's own temp path.
- `semver-rust-001` is intentionally excluded from this batch and kept in a separate note because its fixture verify command still needs repair.
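
One way a verify step could supply a writable `GOCACHE` is sketched below. It assumes the verifier is driven from Python and that `go` is on `PATH`; the helper names `go_test_env` and `run_go_tests` are hypothetical and are not part of the harness described here.

```python
import os
import subprocess
import tempfile
from typing import Dict, Optional


def go_test_env(cache_dir: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and point GOCACHE at the given writable directory."""
    env = dict(os.environ if base is None else base)
    env["GOCACHE"] = cache_dir
    return env


def run_go_tests(repo_dir: str) -> int:
    """Run `go test ./...` in repo_dir with a fresh temporary GOCACHE.

    Returns the exit code of the `go` process.
    """
    with tempfile.TemporaryDirectory(prefix="go-cache-") as cache_dir:
        result = subprocess.run(
            ["go", "test", "./..."],
            cwd=repo_dir,
            env=go_test_env(cache_dir),
            check=False,  # let the caller interpret a non-zero exit code
        )
        return result.returncode
```

A throwaway cache directory keeps the run independent of whichever cache path the sandbox happens to allow, at the cost of a cold build on every invocation.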
Separate note: