Skip to content

Commit 0dec327

Browse files
committed
docs: stop promising entry auto-merge until it exists + draft entry-auto-merge proposal
Two problems in registration UX, fixed honestly in this PR + designed for in a follow-up: 1. README and the PR template both claim "CI validates the entry and auto-merges if it passes. No human review needed." That contradicts CI spec invariant #6 ("Entries never auto-merge — a maintainer reviews every PR") and the workflow itself (quick-check.yml labels `ready-for-review` and stops). The docs were aspirational; reality requires a maintainer click. This commit corrects the docs to describe what actually happens today. 2. The reason the spec disabled auto-merge — fear of a buggy/compromised quick-check rubber-stamping anything — is the wrong threat model. The real threats auto-merge exposes are semantic (id-squatting, brand impersonation, author impersonation, description deception, legal misrepresentation), not install-time. None of those are catchable by `pip install` + `Benchmark` import; all of them are exactly what an LLM reviewer with structured output is good at. `openspec/changes/entry-auto-merge/` proposes adding `entry-review.yml` (Claude action returning verdict PASS|CONCERN with per-check booleans) as a CI gate, and enabling auto-merge when ownership + quick + entry-review all pass. Phased: this PR is Phase 0 (doc truth); Phase 1 ships the reviewer + auto-merge as a follow-up registry PR; Phase 2 moves slow-check pre-merge with a fork-PR label-gate. The proposal documents the threat model deltas explicitly so the security trade-off has explicit sign-off rather than being implied by a workflow comment. Companion cube-standard work (separate PR, out of scope here): • `cube registry add` rerun-overwrites-edits fix • Post-submit CLI guidance update to match the new flow Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
1 parent c253bb9 commit 0dec327

4 files changed

Lines changed: 233 additions & 4 deletions

File tree

.github/pull_request_template.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,8 @@
11
## CUBE Registry Submission
22

33
Thank you for submitting a benchmark to the CUBE Registry!
4-
CI will validate your entry automatically — no human review needed in the happy path.
4+
CI will validate your entry automatically. On all checks green, the PR is labeled
5+
`ready-for-review` and a maintainer merges (typically within a business day).
56

67
---
78

@@ -25,7 +26,8 @@ CI will validate your entry automatically — no human review needed in the happ
2526
| PyPI install + API introspection | On PR | ~2 min |
2627
| Full debug episode on real infra | Post-merge (async) | ~5-15 min |
2728

28-
Auto-merge triggers when `ownership-check` and `quick-compliance` both pass.
29+
When `ownership-check` and `quick-compliance` both pass, the PR is labeled
30+
`ready-for-review` and a maintainer completes the merge.
2931

3032
The slow check runs asynchronously after merge. A failure will open a GitHub issue tagging
3133
your GitHub handle from `authors[].github`.

README.md

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -51,7 +51,11 @@ This generates the entry YAML from your `pyproject.toml`, forks this repo, commi
5151
2. Create `entries/<your-benchmark-id>.yaml` (see template below)
5252
3. Open a pull request
5353

54-
Either way, CI validates the entry and auto-merges if it passes. No human review needed.
54+
Either way, CI validates the entry. On all checks green, the PR is labeled
55+
`ready-for-review` and a maintainer one-clicks merge — typically within a
56+
business day. Auto-merge for entries is on the roadmap; see
57+
[openspec/changes/entry-auto-merge/](openspec/changes/entry-auto-merge/proposal.md)
58+
for the design.
5559

5660
### Entry template
5761

@@ -90,7 +94,8 @@ Fields populated automatically by CI (do not fill):
9094
### Updating your entry
9195

9296
Open a PR modifying your existing YAML. CI verifies you are a registered author
93-
(via `OWNERS.yaml`) and auto-merges if checks pass.
97+
(via `OWNERS.yaml`). On checks green, the PR is labeled `ready-for-review` for
98+
a maintainer to merge.
9499

95100
---
96101

Lines changed: 122 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,122 @@
1+
# Deltas: Auto-merge entries via LLM-reviewed CI
2+
3+
## `openspec/specs/ci/spec.md`
4+
5+
### Pipeline overview
6+
7+
**Before**:
8+
```
9+
PR opened
10+
├─ ownership-check (scripts/ownership_check.py, ~10s) ─┐
11+
└─ quick-check (scripts/quick_check.py, ~2 min) ├─ both pass → ready-for-review label
12+
13+
maintainer reviews + merges
14+
├─ update-owners
15+
├─ generate-site
16+
└─ slow-check (async, post-merge)
17+
```
18+
19+
**After (Phase 1)**:
20+
```
21+
PR opened
22+
├─ ownership-check (~10s) ─┐
23+
├─ quick-check (~2 min) ─┤
24+
└─ entry-review (Claude action, ~1 min) ├─ all pass + verdict PASS → auto-merge
25+
├─ verdict CONCERN → ready-for-review (manual)
26+
27+
post-merge:
28+
├─ update-owners
29+
├─ generate-site
30+
└─ slow-check (async, post-merge)
31+
```
32+
33+
### New section: `## Entry review (every PR touching entries/*.yaml)`
34+
35+
Runs after ownership + quick checks pass. Invokes Claude with a checked-in
36+
prompt (`scripts/entry_review_prompt.md`) plus the entry YAML, the package's
37+
PyPI page + README, the linked repo (if `dev_install_url` is set), and the
38+
existing `entries/` directory + `known-authors.yaml` for cross-reference.
39+
40+
Returns structured verdict:
41+
42+
```yaml
43+
verdict: PASS | CONCERN
44+
checks:
45+
description_matches_package: pass | fail | unverified
46+
authors_consistent_with_git: pass | fail | unverified
47+
no_id_squat_vs_existing: pass | fail | unverified
48+
no_brand_impersonation: pass | fail | unverified
49+
wrapper_license_plausible: pass | fail | unverified
50+
notes: <freeform>
51+
```
52+
53+
**Triggers auto-merge when all of:**
54+
55+
- ownership-check ✅
56+
- quick-compliance ✅
57+
- entry-review verdict = `PASS` (explicit; defaults are NOT merge-permissive)
58+
- PR diff is strictly additions/modifications under `entries/<id>.yaml`
59+
for an id the submitter owns (or a brand-new id)
60+
61+
**On `CONCERN`**: post review as PR comment, apply `human-review-needed`,
62+
do NOT merge.
63+
64+
**Security boundary**: review job runs with read-only `pull_request`
65+
permissions. Merge happens in a separate job that consumes the verdict —
66+
compromising the LLM step alone does not bypass ownership / schema / install
67+
checks (separate jobs).
68+
69+
### Invariants
70+
71+
**Invariant #6 changes from**:
72+
73+
> 6. Entries never auto-merge — a maintainer reviews every PR.
74+
75+
**To**:
76+
77+
> 6. Entries auto-merge iff (ownership-check ∧ quick-compliance ∧
78+
> entry-review verdict=PASS) AND the diff is strictly
79+
> additions/modifications under `entries/<id>.yaml`. Any deviation
80+
> falls back to `ready-for-review` + manual merge.
81+
82+
## `openspec/specs/entry/spec.md`
83+
84+
### `## Contracts for submitters`
85+
86+
**Add bullet**:
87+
88+
> - On submission, an LLM reviewer checks that the entry's description matches
89+
> the package, the GitHub handles in `authors[]` are plausibly tied to the
90+
> linked repo's commit history, the wrapper license is consistent with the
91+
> source, and the `id`/`name` doesn't collide with or impersonate an existing
92+
> entry. PRs that fail any of these are labeled `human-review-needed` and
93+
> held for a maintainer.
94+
95+
## `README.md`
96+
97+
### "Submission steps" section
98+
99+
Replace the auto-merge promise that PR #50 already softened with the actual
100+
Phase-1 behavior:
101+
102+
> Either way, CI validates the entry. On all checks green, including the LLM
103+
> semantic-review verdict, the PR auto-merges. PRs flagged `human-review-needed`
104+
> are held for a maintainer.
105+
106+
## `.github/pull_request_template.md`
107+
108+
### "What CI will check" section
109+
110+
Add a row:
111+
112+
| Check | When | ~Time |
113+
|---|---|---|
114+
| LLM semantic review (Claude) | On PR (after ownership + quick) | ~1 min |
115+
116+
### Below the table
117+
118+
Replace the "ready-for-review" copy with:
119+
120+
> When ownership-check, quick-compliance, and the LLM review verdict all
121+
> pass, the PR auto-merges. A `human-review-needed` label means the LLM
122+
> reviewer surfaced a concern; a maintainer will follow up.
Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
# Proposal: Auto-merge entries via LLM-reviewed CI
2+
3+
## Context
4+
5+
README and PR template promise auto-merge for entries. The actual workflow
6+
labels `ready-for-review` and waits for a maintainer click. CI spec invariant
7+
#6 codifies the manual-review rule on the grounds that a buggy or
8+
compromised quick-check could rubber-stamp anything.
9+
10+
The security argument addresses the wrong threat model. The threats auto-merge
11+
exposes are **semantic**, not install-time:
12+
13+
| Threat | Catchable by `pip install` + `Benchmark` import? |
14+
|---|---|
15+
| Malicious code at install | ✅ (hardened sandbox catches outbound calls in import) |
16+
| Privilege escalation via `OWNERS.yaml` | ✅ (ownership-check reads `origin/main`) |
17+
| id-squatting ||
18+
| Brand impersonation ||
19+
| Author-handle impersonation ||
20+
| Description deception (entry text doesn't match package) ||
21+
| Legal misrepresentation (wrong SPDX, stolen content) ||
22+
| Sophisticated supply-chain backdoor in PyPI release | ❌ (same for manual review) |
23+
24+
A maintainer eyeballing a YAML for 30 seconds doesn't catch a supply-chain
25+
backdoor either — so the bar for "what manual review buys" is the semantic
26+
checks. Those are exactly what an LLM reviewer with structured output is
27+
good at. As the registry scales, manual review degrades to rubber-stamping
28+
anyway, and rubber-stamped human review is strictly worse than structured
29+
automated review.
30+
31+
## What
32+
33+
A new workflow `entry-review.yml` runs after ownership + quick-check pass on
34+
any PR touching `entries/*.yaml`. It invokes a Claude action with a checked-in
35+
prompt (`scripts/entry_review_prompt.md`) that returns a structured verdict:
36+
37+
```yaml
38+
verdict: PASS | CONCERN
39+
checks:
40+
description_matches_package: pass | fail | unverified
41+
authors_consistent_with_git: pass | fail | unverified
42+
no_id_squat_vs_existing: pass | fail | unverified
43+
no_brand_impersonation: pass | fail | unverified
44+
wrapper_license_plausible: pass | fail | unverified
45+
notes: <freeform>
46+
```
47+
48+
The reviewer reads: the entry YAML, the package's PyPI page + README, the
49+
linked repo (`dev_install_url` if present) including `git log` of the cube
50+
subdirectory, the existing `entries/` and `known-authors.yaml` for cross-reference.
51+
52+
**Auto-merge fires when all of:**
53+
54+
- `ownership-check`
55+
- `quick-compliance`
56+
- `entry-review` verdict = `PASS` (explicit positive, not absence-of-failure)
57+
- PR diff is strictly additions / modifications under `entries/<id>.yaml`
58+
for an id the submitter owns (or a brand-new id)
59+
60+
On `CONCERN`: post the review as a PR comment, apply `human-review-needed`,
61+
do NOT merge. On any check `unverified`: surfaces in `notes`; verdict is up
62+
to the reviewer.
63+
64+
## What stays manual
65+
66+
- `OWNERS.yaml` edits (already CI-bot-only)
67+
- Workflow / script / spec changes
68+
- Any PR touching files outside `entries/`
69+
- Verdict = `CONCERN` (label-gated)
70+
71+
## Threat model deltas
72+
73+
- Review prompt is checked into the repo — diffable, auditable, blameable.
74+
- Verdict defaults to NOT merge-permissive (`PASS` must be explicit).
75+
- The LLM step runs in a separate job from ownership/schema/install — compromising
76+
one doesn't compromise the others.
77+
- Action runs under read-only `pull_request` permissions; merge happens in a
78+
follow-up job that requires the verdict.
79+
- Cost: ~$0.50–3 per PR with Opus. At low hundreds of submissions / year, modest.
80+
81+
## Phasing
82+
83+
| Phase | Scope | PR scope |
84+
|---|---|---|
85+
| 0 | Doc-truth fix: stop promising auto-merge until it exists | PR #50 (this PR) |
86+
| 1 | `entry-review.yml` + auto-merge gated on `ownership ∧ quick ∧ review=PASS` | Follow-up registry PR |
87+
| 2 | Slow-check moved pre-merge with `/ok-to-test` label-gate for fork PRs | Later registry PR |
88+
| 3 | Per-cube `review_overrides` (e.g. cubes with unusual provenance) | Later, optional |
89+
90+
Phase 1 doesn't lower the bar from today's manual review (which also only
91+
sees quick-check pass before merge). It removes the human-click step and adds
92+
the semantic-checks the human wasn't catching anyway.
93+
94+
Phase 2 is independently valuable but bigger — fork PRs running cloud-infra
95+
jobs is a known abuse vector, so it needs a label-gate.
96+
97+
## Not in scope (handled in cube-standard)
98+
99+
- `cube registry add` rerun-overwrites-edits fix
100+
- Post-submit CLI guidance ("a maintainer will merge…" → "auto-merging on green…")

0 commit comments

Comments
 (0)