Subject: PR #1057 — fix(host-agent): runtime hygiene
Baseline commit: d5154c37f2f9a4b3eb896b729d989db96ed308f0 (main)
Audit date: 2026-04-27
Auditor: Qwen3.6-27B AWQ-INT4 (Cyankiwi quantization), dense, thinking-mode
vLLM config: --max-model-len 262144, --temperature 0.3, --reasoning-parser qwen3, --tool-call-parser qwen3_xml
Sandbox config: --docker-socket --gpus all --stuck-threshold 500
Wall-clock: 6 minutes 48 seconds, 46 iterations, 15K completion tokens
Run name: n1_27b_v3 (the highest-quality run of three; see "Variance" below)
Verdict: implicit MERGE, expressed only in review.md's "Summary of Findings" table — every change marked ✅ Correct, AMD impact None across the board.
But there's no verdict.md file. The model produced excellent analytical content but failed to ship the spec-shaped deliverable. The task explicitly required verdict.md with a top-level recommendation header; instead the verdict has to be inferred from the table at the end of review.md. This is a real defect for any pipeline that scrapes verdict.md.
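To make the pipeline impact concrete, here is a minimal sketch of the kind of scraper such a pipeline might run against a run directory. The path layout and the "top-level recommendation header" convention are taken from the task description above; the function itself is hypothetical, not the actual harness code.

```python
from pathlib import Path

def read_verdict(run_dir: str) -> str:
    """Hypothetical downstream step: scrape the recommendation from verdict.md.

    Assumes the spec's convention of a top-level header carrying the verdict
    (e.g. "# Verdict: MERGE"); the real pipeline may parse it differently.
    """
    verdict_file = Path(run_dir) / "verdict.md"
    if not verdict_file.exists():
        # Exactly the failure mode described above: the analysis may be
        # excellent, but an automated consumer finds nothing to scrape.
        raise FileNotFoundError(f"{verdict_file} is missing; run is non-compliant")
    lines = verdict_file.read_text(encoding="utf-8").lstrip().splitlines()
    if not lines or not lines[0].startswith("#"):
        raise ValueError("verdict.md lacks a top-level recommendation header")
    return lines[0].lstrip("# ").strip()
```

Against this run's output, the first branch fires: there is no verdict.md to read.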
This entry is substantively excellent, structurally incomplete.
What's real and worth reading:
- research/questions.md — five questions with explicit walked-through resolutions. Q2 is the cleanest articulation of the catalog-handling architecture across all 7 model runs on this PR (3× Coder-Next + 3× 27B + 1× 35B-A3B): it walks through the three branches (file doesn't exist / file exists but malformed / file exists and valid) and arrives at the correct conclusion that the asymmetric handling is intentional. Better framing than the canonical hand-written ground-truth review.
- review.md — line-by-line analysis of all 7 changes, with risk and AMD-impact columns. Ends with a Summary of Findings table that is functionally a MERGE recommendation.
- research/dead-ends.md — honest about what couldn't be tested in the sandbox environment ("attempted to reproduce X but the sandbox doesn't have full DreamServer extension structure; accepted based on code analysis"). The right kind of "dead end" — it admits limits and explains the workaround.
- tests/test_host_agent_suite.py — the model invoked pytest on the actual upstream test suite during the run (38 tests, all passed on both main and the PR branch — confirmed in the run's transcript). A rough sketch of this dual-branch check follows the list.
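For reference, the dual-branch test pass reduces to the following kind of check. This is a sketch under assumptions: the repo path, branch names, and pytest arguments are illustrative and not copied from the run transcript.

```python
import subprocess

def suite_passes(repo: str, ref: str, suite: str = "tests/") -> bool:
    """Illustrative: check out a ref, then run the upstream test suite on it."""
    # Assumes a clean working tree; the run's actual invocation may differ.
    subprocess.run(["git", "-C", repo, "checkout", ref], check=True)
    result = subprocess.run(["python3", "-m", "pytest", suite, "-q"], cwd=repo)
    return result.returncode == 0

# The claim being reproduced: the same 38 tests pass on both the baseline
# and the PR branch. Branch names here are hypothetical.
# for ref in ("main", "pr-1057"):
#     assert suite_passes("/path/to/upstream/repo", ref)
```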
What's missing:
- No verdict.md. The verdict has to be inferred from review.md's table.
- No tag (v1.0). The spec required tagging a release at end-of-run.
- No done() call. The model emitted an empty assistant message at iter 46 and stopped without explicitly signaling completion.
- No decisions/<ADR>.md. The decisions/ directory exists but is empty.
- No sources.md, no tool-log.md — the model didn't write these spec-required artifacts.
- 0 git commits — the model never ran git add + git commit in its workspace. (A minimal check over the file-shaped items in this list is sketched below.)
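A minimal sketch of the compliance check implied by this list, assuming the artifact names map one-to-one onto paths under the run workspace root; the tag and done() signals live in the harness log rather than the workspace, so only the file-shaped items are covered here.

```python
from pathlib import Path

# Spec-required, file-shaped artifacts named in the list above; paths are
# assumed to be relative to the run workspace root.
REQUIRED_FILES = ["verdict.md", "sources.md", "tool-log.md"]

def missing_artifacts(workspace: str) -> list[str]:
    """Return the spec deliverables that a run failed to write."""
    root = Path(workspace)
    missing = [name for name in REQUIRED_FILES if not (root / name).exists()]
    # decisions/ must contain at least one ADR; an empty directory (as in
    # this run) counts as missing.
    if not list((root / "decisions").glob("*.md")):
        missing.append("decisions/<ADR>.md")
    return missing
```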
The model was run three times on the same task. None shipped a spec-compliant deliverable. The analytical content varied:
| run | wall-clock | iterations | files written | git commits | quality of partial work |
|---|---|---|---|---|---|
| n1_27b_v1 | 13 min | 62 | 4 | 1 | high — caught a bug Coder-Next missed (rollback-path silent failure), self-corrected on a false logger.info indentation claim |
| n1_27b_v2 | 5 min | 45 | 2 | 0 | medium — analysis correct on catalog, but the rollback-bug claim from v1 flipped to "this is fine" |
| n1_27b_v3 ← this entry | 7 min | 46 | 7 | 0 | excellent — cleanest catalog walk-through, all 7 changes correctly assessed |
So 27B's three runs span "best analytical content of any local model" (v3) to "minimal partial output" (v2). Within research notes, specific claims (like the rollback-path concern) flipped between runs — meaning any individual claim warrants cross-checking against either the actual diff or another run.
What's consistent across all three: none ship the spec-shaped deliverable. No verdict.md, no tag, no done() in any run.
Daily-driver implication: 27B is meaningfully more trustworthy than Coder-Next on this PR (no incorrect verdicts shipped, since no verdicts were shipped at all), but if your workflow depends on the spec-required file structure, you can't rely on 27B to produce it. The cost: you read review.md plus research/ yourself to extract the answer.
Files written in this run:
- review.md: line-by-line review with the verdict-shaped Summary of Findings table
- summary.md: what the PR claims
- diff-analysis.md: claimed vs. actual changes
- research/notes.md: working notes
- research/questions.md: five questions with resolutions (Q2 is the catalog walk-through)
- research/dead-ends.md: what the sandbox couldn't reproduce, and why
- tests/test_host_agent_suite.py: pytest invocation against the upstream tests (38/38 passed on both branches)
- decisions/: empty (the model didn't write any ADRs in this run)
Compared to the Qwen3-Coder-Next-AWQ/ sibling: Coder-Next has the spec-compliant 13-file structure (verdict.md, tag, done()) but its verdict was wrong 2 of 3 times. 27B has correct analysis but skips the spec scaffolding.
The source of truth is agent-pilot/logs/n1_27b_v3/ in the bench repo. The receipt records the exact vLLM args, harness git SHA, task-file SHA, and GPU snapshot.
To replay:
python3 agent-pilot/harness.py replay_n1_27b_v3 agent-pilot/task_pr_audit_n1.md \
--model qwen3.6-27b-awq --port 8000 \
--temperature 0.3 --stuck-threshold 500 \
    --docker-socket --gpus all

vLLM bf16 paths aren't bitwise-deterministic. Expect divergence between runs.