The-AI-Alliance
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
Lines changed: 22 additions & 11 deletions
diff --git a/.github/workflows/quick-check.yml b/.github/workflows/quick-check.yml
Lines changed: 264 additions & 25 deletions
diff --git a/README.md b/README.md
Lines changed: 31 additions & 10 deletions
diff --git a/entries/swebench-live.yaml b/entries/swebench-live.yaml
Lines changed: 46 additions & 0 deletions
diff --git a/entries/swebench-verified.yaml b/entries/swebench-verified.yaml
Lines changed: 44 additions & 0 deletions
diff --git a/entries/terminalbench2.yaml b/entries/terminalbench2.yaml
Lines changed: 41 additions & 0 deletions
diff --git a/openspec/changes/entry-auto-merge/deltas.md b/openspec/changes/entry-auto-merge/deltas.md
Lines changed: 129 additions & 0 deletions
@@ -1,7 +1,10 @@
 ## CUBE Registry Submission
 
 Thank you for submitting a benchmark to the CUBE Registry!
-CI will validate your entry automatically — no human review needed in the happy path.
+CI runs three pre-merge hard gates (ownership, quick-compliance, LLM
+semantic review) plus an informational slow-compliance signal. When all
+hard gates pass and the diff touches only `entries/<id>.yaml`, the PR
+auto-merges.
 
 ---
 
@@ -19,16 +22,24 @@ CI will validate your entry automatically — no human review needed in the happ
 
 ### What CI will check
 
-| Check | When | ~Time |
-|---|---|---|
-| Schema validation | On PR | <1 min |
-| PyPI install + API introspection | On PR | ~2 min |
-| Full debug episode on real infra | Post-merge (async) | ~5-15 min |
-
-Auto-merge triggers when `ownership-check` and `quick-compliance` both pass.
-
-The slow check runs asynchronously after merge. A failure will open a GitHub issue tagging
-your GitHub handle from `authors[].github`.
+| Check | When | ~Time | Hard gate? |
+|---|---|---|---|
+| ownership-check | On PR | <1 min | Yes |
+| quick-compliance (schema + install + introspect) | On PR | ~2 min | Yes |
+| slow-compliance (debug task on runner) | On PR | ~5 min | No (informational) |
+| entry-review (LLM semantic check) | On PR | ~1 min | Yes |
+| Full stress run on `supported_infra` cloud VMs | Post-merge (async) | ~5-30 min | Post-merge canonical |
+
+When all hard gates pass AND the diff is strictly under
+`entries/<id>.yaml` AND the PR is from this repo (not a fork), the PR
+auto-merges. Otherwise it is labeled `ready-for-review` for a maintainer,
+and a PR comment lists the specific reasons. A slow-compliance failure
+shows as a red check but does not block: cubes that need Docker, a VM, or
+similar infrastructure cannot run with `provider=local`.
+
+The post-merge stress run is asynchronous. A failure opens a GitHub issue
+tagging your `authors[].github` handles; the entry stays in the registry
+with `status: degraded` until fixed.
 
 ---
 
 
@@ -51,7 +51,16 @@ This generates the entry YAML from your `pyproject.toml`, forks this repo, commi
 2. Create `entries/<your-benchmark-id>.yaml` (see template below)
 3. Open a pull request
 
-Either way, CI validates the entry and auto-merges if it passes. No human review needed.
+Either way, CI runs three hard gates (ownership, quick-compliance, LLM
+semantic review) plus an informational slow-compliance signal. If the hard
+gates pass and the PR diff is strictly under `entries/<id>.yaml`, the PR
+auto-merges. If the LLM flags a `CONCERN`, the PR is labeled
+`ready-for-review` and a maintainer finishes the merge.
+
+Fork PRs and PRs that touch paths outside `entries/` always fall back to
+maintainer-merge. See
+[openspec/specs/ci/spec.md](openspec/specs/ci/spec.md) for the full pipeline
+contract.
 
 ### Entry template
 
@@ -89,22 +98,34 @@ Fields populated automatically by CI (do not fill):
 
 ### Updating your entry
 
-Open a PR modifying your existing YAML. CI verifies you are a registered author
-(via `OWNERS.yaml`) and auto-merges if checks pass.
+Open a PR modifying your existing YAML. CI verifies you are a registered
+author (via `OWNERS.yaml`) and runs the same four pre-merge checks as a new
+submission. When all hard gates pass, the PR auto-merges.
 
 ---
 
 ## Compliance checks
 
-Every submission goes through two tiers:
+Every submission goes through four pre-merge checks (three hard gates, one informational) and one post-merge check:
 
-| Tier | When | What | Cost |
+| Gate | When | What | Hard gate? |
 |---|---|---|---|
-| Quick check | On PR (~2 min) | Schema, PyPI install, API introspection | Free |
-| Slow check | Post-merge (async) | Full debug episode on real infra | ~$0.05/VM cube |
-
-A slow check failure opens a GitHub issue tagging the entry authors.
-Entries remain in the registry regardless — platforms decide which tier they require.
+| ownership-check | On PR (~10s) | Submitter is in `OWNERS.yaml` for the entry (or it's new) | Yes |
+| quick-compliance | On PR (~2 min) | Schema, PyPI install, API introspection — hardened Docker sandbox, no credentials | Yes |
+| slow-compliance | On PR (~5 min) | Debug task with `provider=local` on the GHA runner | **No** — informational |
+| entry-review | On PR (~1 min) | LLM semantic check — verdict `PASS` or `CONCERN` | Yes |
+| slow-check (cloud) | Post-merge (async, ~5–30 min) | Full stress run on cloud VMs across `supported_infra`; writes `stress-results/` | Post-merge canonical |
+
+Pre-merge slow-compliance is informational: most real cubes need
+Docker/VM/large-disk environments that don't fit on a GHA runner, so
+hard-gating there would exclude almost every real benchmark from auto-merge.
+Failure surfaces as a red check (useful signal for cubes that *do* support
+`local`) but doesn't block. The post-merge slow-check on cloud VMs is the
+canonical execution gate.
+
+When a hard gate fails, the check shows red on the PR and the submitter pushes a fix.
+When the post-merge slow-check fails, CI opens a GitHub issue tagging the entry authors;
+the entry remains in the registry with `status: degraded` until fixed.
 
 ---
 
 
@@ -0,0 +1,46 @@
+id: swebench-live
+name: "SWE-bench Live"
+version: "0.1.0"
+description: >
+  SWE-bench Live ported to the CUBE protocol — 1,895 continuously-updated,
+  contamination-resistant GitHub issue resolution tasks across many
+  open-source repositories. Each task pairs a real issue with its merged
+  fix; the agent receives the problem statement plus a git checkout at the
+  base commit and must produce a patch that makes the upstream
+  fail_to_pass tests pass without breaking pass_to_pass. The task pool
+  is refreshed continuously, making the benchmark useful for testing
+  contamination resistance.
+package: swebench-live-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-live-cube"
+
+authors:
+- github: NicolasAG
+  name: Nicolas Gontier
+- github: recursix
+  name: Alexandre Lacoste
+- github: josancamon19
+  name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/microsoft/SWE-bench-Live/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2505.23419"
+getting_started_url: "https://swe-bench-live.github.io/"
+tags:
+- coding
+- science
+status: active
+resources: []
+task_count: 1895
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
@@ -0,0 +1,44 @@
+id: swebench-verified
+name: "SWE-bench Verified"
+version: "0.1.0"
+description: >
+  SWE-bench Verified ported to the CUBE protocol — 500 human-validated
+  GitHub issues with test-based resolution criteria. It is Princeton and
+  OpenAI's curated subset of the broader SWE-bench dataset, in which every
+  task was manually checked for an unambiguous problem statement and a
+  reliable test-based reward signal. The agent receives the problem statement
+  plus a git checkout at the base commit and must produce a patch that makes
+  the upstream fail_to_pass tests pass without breaking pass_to_pass.
+package: swebench-verified-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-verified-cube"
+
+authors:
+- github: NicolasAG
+  name: Nicolas Gontier
+- github: recursix
+  name: Alexandre Lacoste
+- github: josancamon19
+  name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2310.06770"
+getting_started_url: "https://openai.com/index/introducing-swe-bench-verified/"
+tags:
+- coding
+status: active
+resources: []
+task_count: 500
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
@@ -0,0 +1,41 @@
+id: terminalbench2
+name: "Terminal-Bench 2"
+version: "0.1.0"
+description: >
+  Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
+  CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
+  query, modernize) with pytest-based validation. Each task hands the
+  agent a Linux shell pre-loaded with a project, asks for a concrete
+  deliverable (a fixed bug, a passing test, a compiled binary, an
+  inferred answer), and verifies the result by running an upstream pytest
+  test suite the agent never sees. Tasks span 16 categories with
+  difficulty levels easy / medium / hard.
+package: terminalbench2-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"
+
+authors:
+- github: recursix
+  name: Alexandre Lacoste
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: Apache-2.0
+    source_url: "https://github.com/harbor-framework/terminal-bench-2"
+    verified_by_original_authors: false
+
+getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
+tags:
+- coding
+- os
+status: active
+resources: []
+task_count: 89
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
@@ -0,0 +1,129 @@
+# Deltas: Auto-merge entries via LLM-reviewed CI
+
+## `openspec/specs/ci/spec.md`
+
+### Pipeline overview
+
+**Before**:
+```
+PR opened
+ ├─ ownership-check  (scripts/ownership_check.py, ~10s)  ─┐
+ └─ quick-check      (scripts/quick_check.py, ~2 min)     ├─ both pass → ready-for-review label
+
+maintainer reviews + merges
+ ├─ update-owners
+ ├─ generate-site
+ └─ slow-check (async, post-merge)
+```
+
+**After (Phase 1)**:
+```
+PR opened
+ ├─ ownership-check  (~10s)                              ─┐
+ ├─ quick-check      (~2 min)                            ─┤
+ └─ entry-review     (Claude action, ~1 min)              ├─ all pass + verdict PASS → auto-merge
+                                                          ├─ verdict CONCERN → ready-for-review (manual)
+                                                          │
+post-merge:
+ ├─ update-owners
+ ├─ generate-site
+ └─ slow-check (async, post-merge)
+```
+
+### New section: `## Entry review (every PR touching entries/*.yaml)`
+
+Runs after ownership + quick checks pass. Invokes Claude with a checked-in
+prompt (`scripts/entry_review_prompt.md`) plus the entry YAML, the package's
+PyPI page + README, the linked repo (if `dev_install_url` is set), and the
+existing `entries/` directory + `known-authors.yaml` for cross-reference.
+
+Returns structured verdict:
+
+```yaml
+verdict: PASS | CONCERN
+checks:
+  description_matches_package: pass | fail | unverified
+  authors_consistent_with_git:  pass | fail | unverified
+  no_id_squat_vs_existing:      pass | fail | unverified
+  no_brand_impersonation:       pass | fail | unverified
+  wrapper_license_plausible:    pass | fail | unverified
+notes: <freeform>
+```
+
+**Triggers auto-merge when all of:**
+
+- ownership-check ✅
+- quick-compliance ✅
+- entry-review verdict = `PASS` (explicit; defaults are NOT merge-permissive)
+- PR diff is strictly additions/modifications under `entries/<id>.yaml`
+  for an id the submitter owns (or a brand-new id)
+- PR is from the same repo (fork PRs lack the GITHUB_TOKEN scope to merge)
+
+slow-compliance runs as an informational signal (`continue-on-error: true`)
+— its failure does not block auto-merge. Most real cubes need
+Docker/VM/large-disk environments that don't fit `provider: local`.
+
+**On `CONCERN`**: post review as PR comment, apply `human-review-needed`,
+do NOT merge.
+
+**Security boundary**: review job runs with read-only `pull_request`
+permissions. Merge happens in a separate job that consumes the verdict —
+compromising the LLM step alone does not bypass ownership / schema / install
+checks (separate jobs).
+
+### Invariants
+
+**Invariant #6 changes from**:
+
+> 6. Entries never auto-merge — a maintainer reviews every PR.
+
+**To**:
+
+> 6. Entries auto-merge iff (ownership-check ∧ quick-compliance ∧
+>    entry-review verdict=PASS) AND the diff is strictly
+>    additions/modifications under `entries/<id>.yaml` AND the PR is from
+>    the same repo. slow-compliance is informational
+>    (`continue-on-error`); its failure does not block. Any deviation
+>    falls back to `ready-for-review` + manual merge.
+
+## `openspec/specs/entry/spec.md`
+
+### `## Contracts for submitters`
+
+**Add bullet**:
+
+> - On submission, an LLM reviewer checks that the entry's description matches
+>   the package, the GitHub handles in `authors[]` are plausibly tied to the
+>   linked repo's commit history, the wrapper license is consistent with the
+>   source, and the `id`/`name` doesn't collide with or impersonate an existing
+>   entry. PRs that fail any of these are labeled `human-review-needed` and
+>   held for a maintainer.
+
+## `README.md`
+
+### "Submission steps" section
+
+Replace the auto-merge promise that PR #50 already softened with the actual
+Phase-1 behavior:
+
+> Either way, CI validates the entry. On all checks green, including the LLM
+> semantic-review verdict, the PR auto-merges. PRs flagged `human-review-needed`
+> are held for a maintainer.
+
+## `.github/pull_request_template.md`
+
+### "What CI will check" section
+
+Add a row:
+
+| Check | When | ~Time |
+|---|---|---|
+| LLM semantic review (Claude) | On PR (after ownership + quick) | ~1 min |
+
+### Below the table
+
+Replace the "ready-for-review" copy with:
+
+> When ownership-check, quick-compliance, and the LLM review verdict all
+> pass, the PR auto-merges. A `human-review-needed` label means the LLM
+> reviewer surfaced a concern; a maintainer will follow up.
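
Reviewer note: the auto-merge conditions listed in `deltas.md` (Invariant #6) can be read as a single predicate. The following is a minimal Python sketch of that predicate, not the logic in `quick-check.yml`, whose diff is not shown here. The names `GateResults`, `diff_is_path_isolated`, `should_auto_merge`, and the `ENTRY_PATH` pattern (including the characters allowed in an id) are hypothetical and exist only for illustration. Whether the submitter owns the id is assumed to be covered by `ownership-check`, so it is not repeated in the path check.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Assumed shape of "strictly under entries/<id>.yaml": one YAML file directly
# inside entries/, no subdirectories. The id character set is a guess.
ENTRY_PATH = re.compile(r"entries/[A-Za-z0-9._-]+\.yaml")


@dataclass(frozen=True)
class GateResults:
    """Outcome of the pre-merge jobs for one PR (illustrative field names)."""
    ownership_ok: bool
    quick_compliance_ok: bool
    review_verdict: Optional[str]  # "PASS", "CONCERN", or None if no verdict was produced
    slow_compliance_ok: bool       # informational only; never consulted below
    from_fork: bool


def diff_is_path_isolated(changed_files: Sequence[Tuple[str, str]]) -> bool:
    """Check (status, path) pairs, with status letters as in `git diff --name-status`.

    Only additions ("A") and modifications ("M") of entries/<id>.yaml qualify;
    an empty diff does not.
    """
    return bool(changed_files) and all(
        status in {"A", "M"} and ENTRY_PATH.fullmatch(path) is not None
        for status, path in changed_files
    )


def should_auto_merge(gates: GateResults,
                      changed_files: Sequence[Tuple[str, str]]) -> bool:
    """True only when every hard condition holds; anything else falls back to a maintainer."""
    return (
        gates.ownership_ok
        and gates.quick_compliance_ok
        and gates.review_verdict == "PASS"  # explicit PASS; a missing verdict blocks
        and not gates.from_fork
        and diff_is_path_isolated(changed_files)
    )
```

Under these assumptions, `should_auto_merge(GateResults(True, True, "PASS", False, False), [("A", "entries/swebench-live.yaml")])` is true even though slow-compliance failed, while adding `("M", "README.md")` to the changed files, a `None` verdict, or `from_fork=True` each make it false. Comparing the verdict against the literal `"PASS"` rather than testing for the absence of `CONCERN` is one way to keep the default non-permissive.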