Run name: p2_ci_coder_v1 (1 of 3 — see task family README for the full N=3 picture)
Wall: 62.7 s
Cost upper: $0.0014
Verdict: PASS
Clean — same. 1.2 min median wall (faster), 2× cheaper than 27B.
grade.json— verdict + per-dimension scores. Hand-graded subjective dimensions (where the grader has them) are filled inhand_rating_placeholderswith_GRADER_: claude-opus-4.7-1m-context.cost.json— wall, tokens, GPU, energy upper boundlabel.json— failure-mode classification pertooling/FAILURE-TAXONOMY.mdsummary.json— finish reason, iteration count, total tokensreceipt.json— vLLM args, harness git SHA, GPU snapshot
This is a lean entry. The transcript and deliverable artifacts for this specific run aren't mirrored in MMBT (lean entry — saves repo space). But the task prompt, input starter, ground truth, and grader script for this task family are all in ../../../../tooling/ — with those + your own GPU you can rerun this task family at N=3 yourself. The original bench-side run name was p2_ci_coder_v1. The 3 highest-signal task families (adversarial-hallucination, market-research, doc-synthesis) have full per-model entries with transcripts + deliverables; this one doesn't, to keep the repo size manageable. See ../README.md for the rationale.