Commit 6702367

Register 3 cubes + importlib fix + entry-auto-merge proposal (doc-truth) (#50)
* feat: register arithmetic, swebench-live, swebench-verified, terminalbench2

Adds entry YAMLs for four cubes that pass quick_check.py with a clean scaffold. All four live in The-AI-Alliance/cube-harness today and are installable via the standard dev_install_url pattern; they were discovered + validated by looping the registry's quick_check.py (--no-install) against every cube directory in cube-harness/cubes/ as a registration-UX probe.

Specifics:
• arithmetic: 4 tasks; math benchmark, smoke-test fixture
• swebench-live: 1895 tasks; continuously-refreshed GitHub-issue resolution, contamination-resistant
• swebench-verified: 500 tasks; Princeton + OpenAI human-validated subset
• terminalbench2: 89 tasks; Laude Institute terminal tasks (16 categories, pytest-validated)

Each entry passed quick_check.py --no-install locally with the cube package installed editable from cube-harness/cubes/<name>; the script's write-back populated task_count, has_debug_task, has_debug_agent, features, action_space, status, resources without further hand-editing.

Author handles are the actual cube wrapper maintainers (per git log on each cubes/<name>/ subtree):
• arithmetic: NicolasAG, recursix
• swebench-live: NicolasAG, recursix, josancamon19
• swebench-verified: NicolasAG, recursix, josancamon19
• terminalbench2: recursix

Other cubes investigated in the same probe but NOT included here:
• browsercomp: module-load fails (BrowseCompBenchmarkConfig requires scorer_model that the default doesn't pass). Filed cube-harness#479 against younik + NicolasAG.
• windows-agent-arena: task_metadata count (152) != declared num_tasks (154). Filed cube-harness#480 against kushasareen + amanjaiswal73892.
• webarena-verified: already registered; module-load now fails after a BgymToolConfig → ToolboxConfig rename that didn't propagate to the cube's configs.py. Will surface on next periodic health-check.
• terminalbench-cube: empty directory in cube-harness (build artefacts only); dev branch doesn't track any source. Apparent leftover from a tb → tb2 rename.

Drive-by fix in scripts/quick_check.py: the script calls importlib.metadata.entry_points(...) at lines 122 + 167 but only does import importlib at line 21. In Python ≥ 3.10 importlib.metadata is a submodule that needs an explicit import; without it the try/except at find_benchmark_class:121 catches an AttributeError ("module 'importlib' has no attribute 'metadata'") and silently degrades to the by-name lookup path. Added one line: import importlib.metadata. Pure correctness fix; the by-name fallback path remains as defense in depth.

Companion issues for the cubes that didn't make this batch:
• The-AI-Alliance/cube-harness#479 (browsercomp)
• The-AI-Alliance/cube-harness#480 (windows-agent-arena)

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* drop arithmetic entry: debug/dev cube, not appropriate for public registry

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* docs: stop promising entry auto-merge until it exists + draft entry-auto-merge proposal

Two problems in registration UX, fixed honestly in this PR + designed for in a follow-up:

1. README and the PR template both claim "CI validates the entry and auto-merges if it passes. No human review needed." That contradicts CI spec invariant #6 ("Entries never auto-merge — a maintainer reviews every PR") and the workflow itself (quick-check.yml labels `ready-for-review` and stops). The docs were aspirational; reality requires a maintainer click. This commit corrects the docs to describe what actually happens today.

2. The reason the spec disabled auto-merge (fear of a buggy/compromised quick-check rubber-stamping anything) is the wrong threat model. The real threats auto-merge exposes are semantic (id-squatting, brand impersonation, author impersonation, description deception, legal misrepresentation), not install-time.
   None of those are catchable by `pip install` + `Benchmark` import; all of them are exactly what an LLM reviewer with structured output is good at.

`openspec/changes/entry-auto-merge/` proposes adding `entry-review.yml` (Claude action returning verdict PASS|CONCERN with per-check booleans) as a CI gate, and enabling auto-merge when ownership + quick + entry-review all pass. Phased: this PR is Phase 0 (doc truth); Phase 1 ships the reviewer + auto-merge as a follow-up registry PR; Phase 2 moves slow-check pre-merge with a fork-PR label-gate. The proposal documents the threat model deltas explicitly so the security trade-off has explicit sign-off rather than being implied by a workflow comment.

Companion cube-standard work (separate PR, out of scope here):
• `cube registry add` rerun-overwrites-edits fix
• Post-submit CLI guidance update to match the new flow

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* feat: Phase-1 entry auto-merge with LLM semantic review

Implements the entry-auto-merge proposal's Phase 1. The submission flow now runs four pre-merge gates and auto-merges on green:

ownership-check → quick-compliance → slow-compliance → entry-review
    ├─ PASS + path-isolated + same-repo → auto-merge
    └─ everything else → ready-for-review (manual)

Pieces:

scripts/entry_review.py + scripts/entry_review_prompt.md
LLM reviewer that consults the entry, the package's PyPI metadata, the linked repo's README, and the existing entries/ + known-authors.yaml. Calls Claude with a forced tool call to return a structured verdict:
    verdict: PASS | CONCERN
    checks: description_matches_package, authors_consistent_with_git, no_id_squat_vs_existing, no_brand_impersonation, wrapper_license_plausible
    notes
Writes verdict.json + a Markdown PR comment. Pure stdlib + pyyaml + anthropic; no responses-mock-style ceremony. 19 unit tests covering context assembly, repo-URL parsing, verdict-comment rendering, and main() file I/O (all mocked, no real API call from CI).

.github/workflows/quick-check.yml (significant rewrite)
Renamed to "Entry Pipeline" (filename kept for branch-protection compatibility). Added four new jobs: classify-paths (path-isolation gate for auto-merge eligibility), slow-compliance (debug task on the runner via provider=local), entry-review (the LLM gate), auto-merge (gh pr merge --auto --squash), request-review (fallback labeling + comment listing the specific reasons auto-merge was blocked). Gracefully degrades when ANTHROPIC_API_KEY is unset: verdict=UNKNOWN routes to ready-for-review, same as today's behavior.

slow-check.yml is unchanged. Its post-merge push-to-main trigger remains as the canonical cloud-VM stress run (writes stress-results/), with the pre-merge slow-compliance acting as the fast local-only gate.

openspec/specs/ci/spec.md
Pipeline-overview diagram updated to four pre-merge gates. New sections for slow-compliance (PR-time, local), entry-review (with the verdict schema inline), auto-merge (with the five required conditions), and request-human-review (fallback). Invariant #6 changes from "Entries never auto-merge" to the iff-condition matching the workflow. Invariant #7 added: prompt is checked in, reviewer runs in its own job so a compromised LLM step alone cannot bypass ownership/schema/install/slow-check.

openspec/specs/entry/spec.md
One added bullet under "Contracts for submitters" describing what the LLM reviewer checks and how CONCERN routes.

README.md + .github/pull_request_template.md
Both now describe the four-gate flow with the auto-merge condition. Replaces the Phase-0 "ready-for-review only" copy from the previous commit. Fork PRs are explicitly noted as falling back to manual merge.

openspec/changes/entry-auto-merge/proposal.md
Phasing table updated: Phase 0 + Phase 1 marked shipped. Phase 1 becomes active once ANTHROPIC_API_KEY is set in repo secrets.

To activate (post-merge): Set ANTHROPIC_API_KEY in repo secrets. Next entry PR will be reviewed + auto-merged on green.
Until the secret is set, the workflow runs ownership + quick + slow gates exactly as today and routes everything to ready-for-review.

187/187 tests pass locally (19 new + 168 pre-existing).

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* fix: soften pre-merge slow-compliance to informational signal

The first live run of Phase 1 exposed a design flaw: making slow-compliance a hard gate excludes almost every real cube from auto-merge. Cubes like SWE-bench, OSWorld, terminal-bench, WebArena need Docker/VM/large-disk environments that don't fit on a GHA runner with provider=local. Concretely, this PR's terminalbench2 entry crashes slow-check at PR-time with ModuleNotFoundError: No module named cube.tools.terminal. That is a real cube-side stale-import bug, but one that would only ever surface once the PR actually ran on real infra anyway.

Fix: continue-on-error: true on the slow-compliance job.
- Job still runs; red check still appears on PR if it fails (useful signal for cubes that DO support local-provider)
- Failure no longer propagates to success() upstream, so auto-merge gate is now (ownership ∧ quick ∧ review=PASS ∧ path-isolated ∧ same-repo), not (ownership ∧ quick ∧ slow ∧ review=PASS ∧ ...)
- Post-merge slow-check.yml on cloud VMs remains the canonical execution gate; weekly health-check catches drift

Updated:
- openspec/specs/ci/spec.md: invariant #6 dropped the slow-compliance conjunct; section labels itself informational; auto-merge condition list revised
- README.md + .github/pull_request_template.md: gate tables now have Hard-gate? column; pre-merge slow-compliance marked informational with a paragraph explaining why

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* fix: address code-review findings on Phase 1

Six fixes from the multi-agent code review of PR #50:

S1 (security, NIT): Path-isolation regex tightened from `^entries/` to `^entries/[^/]*\.yaml$`. The old check accepted `entries/subdir/foo.yaml` and `entries/foo.txt`; the new pattern enforces flat `entries/<id>.yaml` only, matching the auto-merge eligibility contract. (.github/workflows/quick-check.yml ~L94)

S2 (security, BLOCKING): Dropped `--auto` from `gh pr merge`. The flag requires branch-protection rules to be configured on `main`; without them it silently no-ops. The workflow's `if:` already enforces all gates (ownership ∧ quick ∧ review=PASS ∧ path-isolated ∧ same-repo), so an unconditional `--squash` is both correct and direct. Comment added explaining the rationale. (.github/workflows/quick-check.yml ~L351)

C1 (correctness, BLOCKING): Fixed misleading copy in the request-review fallback comment. Old: "ownership-check, quick-compliance, slow-compliance passed." but slow-compliance is informational and may have failed. New: "ownership-check + quick-compliance passed (slow-compliance is informational)." (.github/workflows/quick-check.yml ~L399)

C3 (correctness, BLOCKING): Hardened entry_review.py against transient LLM errors. Old: a network blip / API 5xx / schema-validation 400 would propagate as a Python exception → workflow step exits non-zero → downstream success() is false → BOTH auto-merge AND request-review skip, leaving the PR stuck with a red check and no actionable feedback. Fix: wrap call_claude in try/except at the boundary; on any error emit verdict=UNKNOWN with the error type+message in notes; always exit 0. Behaviorally identical to "API key not set": request-review fires with the appropriate fallback comment. Also distinguished PASS/CONCERN/UNKNOWN in the workflow's per-entry loop. Previously OVERALL collapsed CONCERN+UNKNOWN onto CONCERN, which mislabeled "Claude couldn't run" as "Claude flagged a concern" in the fallback comment. Now: ANY entry CONCERN → OVERALL=CONCERN; otherwise ANY entry UNKNOWN → OVERALL=UNKNOWN; else PASS.
(scripts/entry_review.py ~L260; .github/workflows/quick-check.yml ~L301)

C5 (test coverage): Added 4 new tests for the regex hardening surface + the LLM-error path:
- test_rejects_typosquat_host (github.com.evil.com → no match, no fetch)
- test_rejects_non_https_scheme (https://, git+http://, git+ssh://, file://; all rejected)
- test_captures_owner_repo_only (pathological owner stays in URL, no shell exposure)
- test_llm_error_writes_unknown_verdict (ConnectionError → verdict=UNKNOWN, rc=0)
Tests: 19 → 23 entry_review tests; 187 → 191 suite total.

St5+St6 (spec drift): `openspec/changes/entry-auto-merge/proposal.md` phasing table no longer claims auto-merge gates on `∧ slow ∧`; correctly notes slow-compliance is informational. deltas.md invariant #6 wording aligned with the spec.md version (mentions same-repo and continue-on-error properties).

Not addressed (intentional):
C2: Docs-only PRs (no entries changed) being unlabeled is not a real bug. They aren't auto-merge candidates anyway; GitHub's normal review flow applies.

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* fix: strip GHA multi-line-output backslash continuations in cross-job loops

The previous CI run with ANTHROPIC_API_KEY now configured revealed a silent multi-entry bug: only terminalbench2 got reviewed (and earlier slow-checked); swebench-live and swebench-verified were silently skipped.

Root cause: GitHub Actions encodes multi-line job-output values with backslash-newline line continuations when forwarded across jobs via needs.JOB.outputs.NAME. The tj-actions/changed-files step inside quick-compliance is invoked with separator: newline, which works fine when consumed in the SAME job (step-internal interpolation), but the cross-job hop into entry-review and slow-compliance wraps each intermediate line with a trailing backslash. while IFS= read -r entry_file with -r preserves the backslash. The existence check [ ! -f ] then succeeds for the literal trailing-backslash path (no such file) and the iteration silently continues. Only the LAST entry, which has no newline-and-backslash after it, is read cleanly. That's why three back-to-back live runs produced exactly one review (and exactly one slow-check failure, which we mis-attributed to terminalbench2 having a stale cube.tools.terminal import; that diagnosis stands but masked this loop bug).

Fix: entry_file=${entry_file%\\} to strip the trailing backslash before the existence check. Applied to both jobs with inline comments to prevent recurrence.

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

---------

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
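
The last fix above is a shell one-liner, `entry_file=${entry_file%\\}`. Purely as an illustration, the same normalization can be sketched in Python. `clean_entry_paths` is a hypothetical helper, not code from this repository, and unlike the shell loop it also trims surrounding whitespace and skips blank lines.

```python
def clean_entry_paths(raw: str) -> list[str]:
    """Split a multi-line output value into paths, dropping one trailing
    backslash per line (the Python analogue of ${entry_file%\\})."""
    paths = []
    for line in raw.splitlines():
        line = line.strip()
        if line.endswith("\\"):
            # Remove exactly one continuation backslash, as the shell fix does.
            line = line[:-1].rstrip()
        if line:
            paths.append(line)
    return paths


if __name__ == "__main__":
    # Every line except the last carries a trailing backslash, matching the
    # cross-job behaviour described in the commit message.
    raw = "entries/swebench-live.yaml\\\nentries/swebench-verified.yaml\\\nentries/terminalbench2.yaml"
    for path in clean_entry_paths(raw):
        print(path)
```

Without the strip, only the last path would match an existing file, which is the symptom the commit message reports.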
1 parent 5a18b6d commit 6702367

14 files changed

Lines changed: 1466 additions & 55 deletions

.github/pull_request_template.md

Lines changed: 22 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,10 @@
11
## CUBE Registry Submission
22

33
Thank you for submitting a benchmark to the CUBE Registry!
4-
CI will validate your entry automatically — no human review needed in the happy path.
4+
CI runs three pre-merge hard gates (ownership, quick-compliance, LLM
5+
semantic review) plus an informational slow-compliance signal. On hard
6+
gates green and a path-isolated diff under `entries/<id>.yaml`, the PR
7+
auto-merges.
58

69
---
710

@@ -19,16 +22,24 @@ CI will validate your entry automatically — no human review needed in the happ
1922

2023
### What CI will check
2124

22-
| Check | When | ~Time |
23-
|---|---|---|
24-
| Schema validation | On PR | <1 min |
25-
| PyPI install + API introspection | On PR | ~2 min |
26-
| Full debug episode on real infra | Post-merge (async) | ~5-15 min |
27-
28-
Auto-merge triggers when `ownership-check` and `quick-compliance` both pass.
29-
30-
The slow check runs asynchronously after merge. A failure will open a GitHub issue tagging
31-
your GitHub handle from `authors[].github`.
25+
| Check | When | ~Time | Hard gate? |
26+
|---|---|---|---|
27+
| ownership-check | On PR | <1 min | Yes |
28+
| quick-compliance (schema + install + introspect) | On PR | ~2 min | Yes |
29+
| slow-compliance (debug task on runner) | On PR | ~5 min | No (informational) |
30+
| entry-review (LLM semantic check) | On PR | ~1 min | Yes |
31+
| Full stress run on `supported_infra` cloud VMs | Post-merge (async) | ~5-30 min | Post-merge canonical |
32+
33+
When all hard gates pass AND the diff is strictly under
34+
`entries/<id>.yaml` AND the PR is from this repo (not a fork), the PR
35+
auto-merges. Otherwise it's labeled `ready-for-review` for a maintainer
36+
(the comment will list the specific reasons). slow-compliance failing
37+
shows as a red check but does not block — cubes that need Docker/VM/etc.
38+
naturally can't run with `provider=local`.
39+
40+
The post-merge stress run runs asynchronously after merge. A failure opens
41+
a GitHub issue tagging your `authors[].github` handles; the entry stays in
42+
the registry with `status: degraded` until fixed.
3243

3344
---
3445

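
The "path-isolated diff" condition in the template can be sketched with the pattern quoted in the commit message (`^entries/[^/]*\.yaml$`). This is a minimal illustration, not the workflow's actual classify-paths step; `is_path_isolated` is a hypothetical name, and treating an empty file list as ineligible is an assumption.

```python
import re

# Pattern quoted in the commit message (fix S1): flat entries/<id>.yaml only.
ENTRY_PATH = re.compile(r"^entries/[^/]*\.yaml$")


def is_path_isolated(changed_files: list[str]) -> bool:
    """True when at least one file changed and every changed path is a flat
    entries/<id>.yaml file (no subdirectories, no other extensions)."""
    # fullmatch avoids accepting a path that ends in a stray newline.
    return bool(changed_files) and all(
        ENTRY_PATH.fullmatch(path) is not None for path in changed_files
    )


if __name__ == "__main__":
    print(is_path_isolated(["entries/swebench-live.yaml"]))
    print(is_path_isolated(["entries/subdir/foo.yaml"]))
    print(is_path_isolated(["entries/foo.txt"]))
```

The two rejected inputs are the ones the commit message says the older `^entries/` prefix check let through.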
.github/workflows/quick-check.yml

Lines changed: 264 additions & 25 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 31 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -51,7 +51,16 @@ This generates the entry YAML from your `pyproject.toml`, forks this repo, commi
5151
2. Create `entries/<your-benchmark-id>.yaml` (see template below)
5252
3. Open a pull request
5353

54-
Either way, CI validates the entry and auto-merges if it passes. No human review needed.
54+
Either way, CI runs three hard gates (ownership, quick-compliance, LLM
55+
semantic review) plus an informational slow-compliance signal. If the hard
56+
gates pass and the PR diff is strictly under `entries/<id>.yaml`, the PR
57+
auto-merges. If the LLM flags a `CONCERN`, the PR is labeled
58+
`ready-for-review` and a maintainer finishes the merge.
59+
60+
Fork PRs and PRs that touch paths outside `entries/` always fall back to
61+
maintainer-merge. See
62+
[openspec/specs/ci/spec.md](openspec/specs/ci/spec.md) for the full pipeline
63+
contract.
5564

5665
### Entry template
5766

@@ -89,22 +98,34 @@ Fields populated automatically by CI (do not fill):
8998

9099
### Updating your entry
91100

92-
Open a PR modifying your existing YAML. CI verifies you are a registered author
93-
(via `OWNERS.yaml`) and auto-merges if checks pass.
101+
Open a PR modifying your existing YAML. CI verifies you are a registered
102+
author (via `OWNERS.yaml`) and runs the same four gates as a new submission.
103+
On all gates green, the PR auto-merges.
94104

95105
---
96106

97107
## Compliance checks
98108

99-
Every submission goes through two tiers:
109+
Every submission goes through four pre-merge gates and one post-merge gate:
100110

101-
| Tier | When | What | Cost |
111+
| Gate | When | What | Hard gate? |
102112
|---|---|---|---|
103-
| Quick check | On PR (~2 min) | Schema, PyPI install, API introspection | Free |
104-
| Slow check | Post-merge (async) | Full debug episode on real infra | ~$0.05/VM cube |
105-
106-
A slow check failure opens a GitHub issue tagging the entry authors.
107-
Entries remain in the registry regardless — platforms decide which tier they require.
113+
| ownership-check | On PR (~10s) | Submitter is in `OWNERS.yaml` for the entry (or it's new) | Yes |
114+
| quick-compliance | On PR (~2 min) | Schema, PyPI install, API introspection — hardened Docker sandbox, no credentials | Yes |
115+
| slow-compliance | On PR (~5 min) | Debug task with `provider=local` on the GHA runner | **No** — informational |
116+
| entry-review | On PR (~1 min) | LLM semantic check — verdict `PASS` or `CONCERN` | Yes |
117+
| slow-check (cloud) | Post-merge (async, ~5–30 min) | Full stress run on cloud VMs across `supported_infra`; writes `stress-results/` | Post-merge canonical |
118+
119+
Pre-merge slow-compliance is informational: most real cubes need
120+
Docker/VM/large-disk environments that don't fit on a GHA runner, so
121+
hard-gating there would exclude almost every real benchmark from auto-merge.
122+
Failure surfaces as a red check (useful signal for cubes that *do* support
123+
`local`) but doesn't block. The post-merge slow-check on cloud VMs is the
124+
canonical execution gate.
125+
126+
Hard-gate failures → check shows red on the PR; submitter pushes a fix.
127+
Post-merge slow-check failing → opens a GitHub issue tagging the entry authors;
128+
entry remains in the registry with `status: degraded` until fixed.
108129

109130
---
110131

entries/swebench-live.yaml

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
id: swebench-live
2+
name: "SWE-bench Live"
3+
version: "0.1.0"
4+
description: >
5+
SWE-bench Live ported to the CUBE protocol — 1,895 continuously-updated,
6+
contamination-resistant GitHub issue resolution tasks across many
7+
open-source repositories. Each task pairs a real issue with its merged
8+
fix; the agent receives the problem statement plus a git checkout at the
9+
base commit and must produce a patch that makes the upstream
10+
fail_to_pass tests pass without breaking pass_to_pass. The task pool
11+
is refreshed continuously, making the benchmark useful for testing
12+
contamination resistance.
13+
package: swebench-live-cube
14+
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-live-cube"
15+
16+
authors:
17+
- github: NicolasAG
18+
name: Nicolas Gontier
19+
- github: recursix
20+
name: Alexandre Lacoste
21+
- github: josancamon19
22+
name: Joan Cabezas
23+
24+
legal:
25+
wrapper_license: MIT
26+
benchmark_license:
27+
reported: MIT
28+
source_url: "https://github.com/microsoft/SWE-bench-Live/blob/main/LICENSE"
29+
verified_by_original_authors: false
30+
31+
paper: "https://arxiv.org/abs/2505.23419"
32+
getting_started_url: "https://swe-bench-live.github.io/"
33+
tags:
34+
- coding
35+
- science
36+
status: active
37+
resources: []
38+
task_count: 1895
39+
has_debug_task: true
40+
has_debug_agent: true
41+
action_space: []
42+
features:
43+
async: false
44+
streaming: false
45+
multi_agent: false
46+
multi_dim_reward: false

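
The commit message mentions tests for repo-URL parsing (typosquat hosts and non-HTTPS schemes rejected, only owner and repo captured). The real parser in `scripts/entry_review.py` is not shown on this page, so the following is only a sketch of one way such a check could work on a `dev_install_url` like the one above. The function name, the regex, and the choice to accept only the `git+https://github.com/` form are all assumptions.

```python
import re
from typing import Optional

# Assumed shape: git+https://github.com/<owner>/<repo>[.git][#fragment | @ref | /path]
_GITHUB_URL = re.compile(
    r"git\+https://github\.com/"
    r"(?P<owner>[A-Za-z0-9_.-]+)/"
    r"(?P<repo>[A-Za-z0-9_.-]+?)"
    r"(?:\.git)?(?:[#@/].*)?"
)


def parse_github_repo(url: str) -> Optional[tuple[str, str]]:
    """Return (owner, repo) for an accepted URL, or None for anything else."""
    match = _GITHUB_URL.fullmatch(url)
    if match is None:
        return None
    return match["owner"], match["repo"]


if __name__ == "__main__":
    url = (
        "git+https://github.com/The-AI-Alliance/cube-harness"
        "#subdirectory=cubes/swebench-live-cube"
    )
    print(parse_github_repo(url))
    print(parse_github_repo("git+https://github.com.evil.com/a/b"))
```

Anchoring the host literally means a look-alike host such as `github.com.evil.com` cannot match, and capturing only two path segments keeps the `#subdirectory=` fragment out of the result.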
entries/swebench-verified.yaml

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
id: swebench-verified
2+
name: "SWE-bench Verified"
3+
version: "0.1.0"
4+
description: >
5+
SWE-bench Verified ported to the CUBE protocol — 500 human-validated
6+
GitHub issues with test-based resolution criteria. Princeton + OpenAI's
7+
curated subset of the broader SWE-bench dataset where every task was
8+
manually checked for an unambiguous problem statement and a reliable
9+
test-based reward signal. The agent receives the problem statement +
10+
a git checkout at the base commit and must produce a patch that makes
11+
the upstream fail_to_pass tests pass without breaking pass_to_pass.
12+
package: swebench-verified-cube
13+
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-verified-cube"
14+
15+
authors:
16+
- github: NicolasAG
17+
name: Nicolas Gontier
18+
- github: recursix
19+
name: Alexandre Lacoste
20+
- github: josancamon19
21+
name: Joan Cabezas
22+
23+
legal:
24+
wrapper_license: MIT
25+
benchmark_license:
26+
reported: MIT
27+
source_url: "https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE"
28+
verified_by_original_authors: false
29+
30+
paper: "https://arxiv.org/abs/2310.06770"
31+
getting_started_url: "https://openai.com/index/introducing-swe-bench-verified/"
32+
tags:
33+
- coding
34+
status: active
35+
resources: []
36+
task_count: 500
37+
has_debug_task: true
38+
has_debug_agent: true
39+
action_space: []
40+
features:
41+
async: false
42+
streaming: false
43+
multi_agent: false
44+
multi_dim_reward: false

entries/terminalbench2.yaml

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
id: terminalbench2
2+
name: "Terminal-Bench 2"
3+
version: "0.1.0"
4+
description: >
5+
Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
6+
CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
7+
query, modernize) with pytest-based validation. Each task hands the
8+
agent a Linux shell pre-loaded with a project, asks for a concrete
9+
deliverable (a fixed bug, a passing test, a compiled binary, an
10+
inferred answer), and verifies the result by running an upstream pytest
11+
test suite the agent never sees. Tasks span 16 categories with
12+
difficulty levels easy / medium / hard.
13+
package: terminalbench2-cube
14+
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"
15+
16+
authors:
17+
- github: recursix
18+
name: Alexandre Lacoste
19+
20+
legal:
21+
wrapper_license: MIT
22+
benchmark_license:
23+
reported: Apache-2.0
24+
source_url: "https://github.com/harbor-framework/terminal-bench-2"
25+
verified_by_original_authors: false
26+
27+
getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
28+
tags:
29+
- coding
30+
- os
31+
status: active
32+
resources: []
33+
task_count: 89
34+
has_debug_task: true
35+
has_debug_agent: true
36+
action_space: []
37+
features:
38+
async: false
39+
streaming: false
40+
multi_agent: false
41+
multi_dim_reward: false
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
# Deltas: Auto-merge entries via LLM-reviewed CI
2+
3+
## `openspec/specs/ci/spec.md`
4+
5+
### Pipeline overview
6+
7+
**Before**:
8+
```
9+
PR opened
10+
├─ ownership-check (scripts/ownership_check.py, ~10s) ─┐
11+
└─ quick-check (scripts/quick_check.py, ~2 min) ├─ both pass → ready-for-review label
12+
13+
maintainer reviews + merges
14+
├─ update-owners
15+
├─ generate-site
16+
└─ slow-check (async, post-merge)
17+
```
18+
19+
**After (Phase 1)**:
20+
```
21+
PR opened
22+
├─ ownership-check (~10s) ─┐
23+
├─ quick-check (~2 min) ─┤
24+
└─ entry-review (Claude action, ~1 min) ├─ all pass + verdict PASS → auto-merge
25+
├─ verdict CONCERN → ready-for-review (manual)
26+
27+
post-merge:
28+
├─ update-owners
29+
├─ generate-site
30+
└─ slow-check (async, post-merge)
31+
```
32+
33+
### New section: `## Entry review (every PR touching entries/*.yaml)`
34+
35+
Runs after ownership + quick checks pass. Invokes Claude with a checked-in
36+
prompt (`scripts/entry_review_prompt.md`) plus the entry YAML, the package's
37+
PyPI page + README, the linked repo (if `dev_install_url` is set), and the
38+
existing `entries/` directory + `known-authors.yaml` for cross-reference.
39+
40+
Returns structured verdict:
41+
42+
```yaml
43+
verdict: PASS | CONCERN
44+
checks:
45+
description_matches_package: pass | fail | unverified
46+
authors_consistent_with_git: pass | fail | unverified
47+
no_id_squat_vs_existing: pass | fail | unverified
48+
no_brand_impersonation: pass | fail | unverified
49+
wrapper_license_plausible: pass | fail | unverified
50+
notes: <freeform>
51+
```
52+
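
When one PR touches several entries, the commit message gives the aggregation rule: any CONCERN yields CONCERN, otherwise any UNKNOWN yields UNKNOWN, else PASS. A minimal Python sketch of that rule follows; the workflow implements it in shell, `overall_verdict` is a hypothetical name, and mapping an empty list or an unexpected value to UNKNOWN is an assumption chosen so the default is never merge-permissive.

```python
def overall_verdict(per_entry: list[str]) -> str:
    """Combine per-entry verdicts into one overall verdict."""
    if any(verdict == "CONCERN" for verdict in per_entry):
        return "CONCERN"
    # PASS only when there is at least one entry and every verdict is PASS;
    # UNKNOWN, unexpected strings, and an empty list all fall through.
    if per_entry and all(verdict == "PASS" for verdict in per_entry):
        return "PASS"
    return "UNKNOWN"


if __name__ == "__main__":
    print(overall_verdict(["PASS", "PASS"]))
    print(overall_verdict(["PASS", "UNKNOWN"]))
    print(overall_verdict(["UNKNOWN", "CONCERN"]))
```

Keeping UNKNOWN distinct from CONCERN lets the fallback comment say "the reviewer could not run" instead of "the reviewer flagged a concern".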
53+
**Triggers auto-merge when all of:**
54+
55+
- ownership-check ✅
56+
- quick-compliance ✅
57+
- entry-review verdict = `PASS` (explicit; defaults are NOT merge-permissive)
58+
- PR diff is strictly additions/modifications under `entries/<id>.yaml`
59+
for an id the submitter owns (or a brand-new id)
60+
- PR is from the same repo (fork PRs lack the GITHUB_TOKEN scope to merge)
61+
62+
slow-compliance runs as an informational signal (`continue-on-error: true`)
63+
— its failure does not block auto-merge. Most real cubes need
64+
Docker/VM/large-disk environments that don't fit `provider: local`.
65+
66+
**On `CONCERN`**: post review as PR comment, apply `human-review-needed`,
67+
do NOT merge.
68+
69+
**Security boundary**: review job runs with read-only `pull_request`
70+
permissions. Merge happens in a separate job that consumes the verdict —
71+
compromising the LLM step alone does not bypass ownership / schema / install
72+
checks (separate jobs).
73+
74+
### Invariants
75+
76+
**Invariant #6 changes from**:
77+
78+
> 6. Entries never auto-merge — a maintainer reviews every PR.
79+
80+
**To**:
81+
82+
> 6. Entries auto-merge iff (ownership-check ∧ quick-compliance ∧
83+
> entry-review verdict=PASS) AND the diff is strictly
84+
> additions/modifications under `entries/<id>.yaml` AND the PR is from
85+
> the same repo. slow-compliance is informational
86+
> (`continue-on-error`); its failure does not block. Any deviation
87+
> falls back to `ready-for-review` + manual merge.
88+
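
Invariant #6 can be read as a single boolean predicate. The sketch below restates it in Python for illustration; the actual gate is the workflow's `if:` expression, and `GateResults` and `route` are hypothetical names. The `slow_ok` field is carried only to show that it plays no part in the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResults:
    ownership_ok: bool
    quick_ok: bool
    review_verdict: str      # "PASS", "CONCERN", or "UNKNOWN"
    path_isolated: bool      # diff strictly under entries/<id>.yaml
    same_repo: bool          # False for fork PRs
    slow_ok: bool = False    # informational only; never consulted below


def route(results: GateResults) -> str:
    """Return "auto-merge" iff every hard condition holds, else the fallback."""
    eligible = (
        results.ownership_ok
        and results.quick_ok
        and results.review_verdict == "PASS"
        and results.path_isolated
        and results.same_repo
    )
    return "auto-merge" if eligible else "ready-for-review"


if __name__ == "__main__":
    green = GateResults(True, True, "PASS", True, True, slow_ok=False)
    fork = GateResults(True, True, "PASS", True, False, slow_ok=True)
    print(route(green))
    print(route(fork))
```

Comparing the verdict to the literal string `PASS` means a missing or UNKNOWN verdict routes to manual review, consistent with the "defaults are NOT merge-permissive" note above.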
89+
## `openspec/specs/entry/spec.md`
90+
91+
### `## Contracts for submitters`
92+
93+
**Add bullet**:
94+
95+
> - On submission, an LLM reviewer checks that the entry's description matches
96+
> the package, the GitHub handles in `authors[]` are plausibly tied to the
97+
> linked repo's commit history, the wrapper license is consistent with the
98+
> source, and the `id`/`name` doesn't collide with or impersonate an existing
99+
> entry. PRs that fail any of these are labeled `human-review-needed` and
100+
> held for a maintainer.
101+
102+
## `README.md`
103+
104+
### "Submission steps" section
105+
106+
Replace the auto-merge promise that PR #50 already softened with the actual
107+
Phase-1 behavior:
108+
109+
> Either way, CI validates the entry. On all checks green, including the LLM
110+
> semantic-review verdict, the PR auto-merges. PRs flagged `human-review-needed`
111+
> are held for a maintainer.
112+
113+
## `.github/pull_request_template.md`
114+
115+
### "What CI will check" section
116+
117+
Add a row:
118+
119+
| Check | When | ~Time |
120+
|---|---|---|
121+
| LLM semantic review (Claude) | On PR (after ownership + quick) | ~1 min |
122+
123+
### Below the table
124+
125+
Replace the "ready-for-review" copy with:
126+
127+
> When ownership-check, quick-compliance, and the LLM review verdict all
128+
> pass, the PR auto-merges. A `human-review-needed` label means the LLM
129+
> reviewer surfaced a concern; a maintainer will follow up.

0 commit comments