Commit 6702367
Register 3 cubes + importlib fix + entry-auto-merge proposal (doc-truth) (#50)
* feat: register arithmetic, swebench-live, swebench-verified, terminalbench2
Adds entry YAMLs for four cubes that pass quick_check.py with a clean
scaffold. All four live in The-AI-Alliance/cube-harness today and are
installable via the standard dev_install_url pattern; they were
discovered + validated by looping the registry's quick_check.py
(--no-install) against every cube directory in cube-harness/cubes/ as a
registration-UX probe.
Specifics:
• arithmetic (4 tasks): math benchmark, smoke-test fixture
• swebench-live (1895 tasks): continuously-refreshed GitHub-issue
  resolution, contamination-resistant
• swebench-verified (500 tasks): Princeton + OpenAI human-validated
  subset
• terminalbench2 (89 tasks): Laude Institute terminal tasks
  (16 categories, pytest-validated)
Each entry passed quick_check.py --no-install locally with the cube
package installed editable from cube-harness/cubes/<name>; the script's
write-back populated task_count, has_debug_task, has_debug_agent,
features, action_space, status, resources without further hand-editing.
Author handles are the actual cube wrapper maintainers (per git log on
each cubes/<name>/ subtree):
• arithmetic: NicolasAG, recursix
• swebench-live: NicolasAG, recursix, josancamon19
• swebench-verified: NicolasAG, recursix, josancamon19
• terminalbench2: recursix
Other cubes investigated in the same probe but NOT included here:
• browsercomp: module-load fails (BrowseCompBenchmarkConfig requires
  scorer_model that the default doesn't pass). Filed cube-harness#479
  against younik + NicolasAG.
• windows-agent-arena: task_metadata count (152) != declared num_tasks
  (154). Filed cube-harness#480 against kushasareen + amanjaiswal73892.
• webarena-verified: already registered; module-load now fails after a
  BgymToolConfig → ToolboxConfig rename that didn't propagate to the
  cube's configs.py. Will surface on next periodic health-check.
• terminalbench-cube: empty directory in cube-harness (build artefacts
  only); dev branch doesn't track any source. Apparent leftover from a
  tb → tb2 rename.
Drive-by fix in scripts/quick_check.py: the script calls
importlib.metadata.entry_points(...) at lines 122 + 167 but only does
import importlib at line 21. In Python ≥ 3.10 importlib.metadata is a
submodule that needs an explicit import; without it the try/except at
find_benchmark_class:121 catches an AttributeError ("module 'importlib'
has no attribute 'metadata'") and silently degrades to the by-name lookup
path. Added one line: import importlib.metadata. Pure correctness fix;
the by-name fallback path remains as defense in depth.
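A minimal sketch of the corrected import pattern, for illustration only. The helper name `find_entry_points` and the group name `cube.benchmarks` are hypothetical and not taken from quick_check.py; the version branch is there only so the sketch also runs on older interpreters.

```python
import importlib
import importlib.metadata  # explicit: `import importlib` alone does not guarantee the submodule is loaded


def find_entry_points(group: str) -> list:
    """Return the entry points registered under `group` (possibly empty)."""
    eps = importlib.metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a collection that supports .select().
        return list(eps.select(group=group))
    # Older interpreters return a plain dict keyed by group name.
    return list(eps.get(group, []))


# "cube.benchmarks" is a hypothetical group name used only for illustration.
benchmarks = find_entry_points("cube.benchmarks")
```

With the explicit import in place, the entry-point lookup no longer depends on some other module having loaded `importlib.metadata` first.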
Companion issues for the cubes that didn't make this batch:
• The-AI-Alliance/cube-harness#479 (browsercomp)
• The-AI-Alliance/cube-harness#480 (windows-agent-arena)
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* drop arithmetic entry — debug/dev cube, not appropriate for public registry
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* docs: stop promising entry auto-merge until it exists + draft entry-auto-merge proposal
Two problems in registration UX. The first is fixed in this PR; the second
is designed for in a follow-up:
1. README and the PR template both claim "CI validates the entry and auto-merges
if it passes. No human review needed." That contradicts CI spec invariant #6
("Entries never auto-merge — a maintainer reviews every PR") and the
workflow itself (quick-check.yml labels `ready-for-review` and stops). The
docs were aspirational; reality requires a maintainer click. This commit
corrects the docs to describe what actually happens today.
2. The reason the spec disabled auto-merge — fear of a buggy/compromised
quick-check rubber-stamping anything — is the wrong threat model. The real
threats auto-merge exposes are semantic (id-squatting, brand impersonation,
author impersonation, description deception, legal misrepresentation), not
install-time. None of those are catchable by `pip install` + `Benchmark`
import; all of them are exactly what an LLM reviewer with structured output
is good at.
`openspec/changes/entry-auto-merge/` proposes adding `entry-review.yml`
(Claude action returning verdict PASS|CONCERN with per-check booleans) as
a CI gate, and enabling auto-merge when ownership + quick + entry-review
all pass. Phased: this PR is Phase 0 (doc truth); Phase 1 ships the
reviewer + auto-merge as a follow-up registry PR; Phase 2 moves slow-check
pre-merge with a fork-PR label-gate.
The proposal documents the threat model deltas explicitly so the security
trade-off has explicit sign-off rather than being implied by a workflow
comment.
Companion cube-standard work (separate PR, out of scope here):
• `cube registry add` rerun-overwrites-edits fix
• Post-submit CLI guidance update to match the new flow
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* feat: Phase-1 entry auto-merge with LLM semantic review
Implements the entry-auto-merge proposal's Phase 1. The submission flow now
runs four pre-merge gates and auto-merges on green:
ownership-check → quick-compliance → slow-compliance → entry-review
├─ PASS + path-isolated + same-repo → auto-merge
└─ everything else → ready-for-review (manual)
Pieces:
scripts/entry_review.py + scripts/entry_review_prompt.md
LLM reviewer that consults the entry, the package's PyPI metadata, the
linked repo's README, and the existing entries/ + known-authors.yaml.
Calls Claude with a forced tool call to return a structured verdict:
    verdict: PASS | CONCERN
    checks:
      description_matches_package, authors_consistent_with_git,
      no_id_squat_vs_existing, no_brand_impersonation,
      wrapper_license_plausible
    notes
Writes verdict.json + a Markdown PR comment. Pure stdlib + pyyaml +
anthropic; no responses-mock-style ceremony. 19 unit tests covering
context assembly, repo-URL parsing, verdict-comment rendering, and main()
file I/O (all mocked — no real API call from CI).
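The structured verdict described above could be validated and rendered roughly as follows. This is a sketch under assumptions: the class and function names (`Verdict`, `parse_verdict`, `render_comment`) and the comment layout are hypothetical, and only the field and check names come from the schema in this message. No API call is made.

```python
import json
from dataclasses import asdict, dataclass, field

# Check names as listed in the verdict schema.
CHECK_NAMES = (
    "description_matches_package",
    "authors_consistent_with_git",
    "no_id_squat_vs_existing",
    "no_brand_impersonation",
    "wrapper_license_plausible",
)


@dataclass
class Verdict:
    verdict: str  # "PASS" or "CONCERN"
    checks: dict = field(default_factory=dict)
    notes: str = ""


def parse_verdict(payload: dict) -> Verdict:
    """Validate a raw tool-call payload against the schema."""
    if payload.get("verdict") not in ("PASS", "CONCERN"):
        raise ValueError(f"unexpected verdict: {payload.get('verdict')!r}")
    checks = payload.get("checks", {})
    bad = [n for n in CHECK_NAMES if not isinstance(checks.get(n), bool)]
    if bad:
        raise ValueError(f"missing or non-boolean checks: {bad}")
    return Verdict(payload["verdict"], {n: checks[n] for n in CHECK_NAMES},
                   payload.get("notes", ""))


def render_comment(entry_id: str, v: Verdict) -> str:
    """Render a Markdown PR comment (layout is illustrative only)."""
    lines = [f"### Entry review for `{entry_id}`: {v.verdict}", ""]
    lines += [f"- [{'x' if ok else ' '}] {name}" for name, ok in v.checks.items()]
    if v.notes:
        lines += ["", v.notes]
    return "\n".join(lines)


payload = {"verdict": "PASS", "checks": {n: True for n in CHECK_NAMES}, "notes": ""}
v = parse_verdict(payload)
verdict_json = json.dumps(asdict(v), indent=2)  # what would be written to verdict.json
comment = render_comment("example-cube", v)
```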
.github/workflows/quick-check.yml (significant rewrite)
Renamed to "Entry Pipeline" (filename kept for branch-protection
compatibility). Added five new jobs: classify-paths (path-isolation gate
for auto-merge eligibility), slow-compliance (debug task on the runner
via provider=local), entry-review (the LLM gate), auto-merge (gh pr merge
--auto --squash), and request-review (fallback labeling + comment listing
the specific reasons auto-merge was blocked). Degrades gracefully when
ANTHROPIC_API_KEY is unset: verdict=UNKNOWN routes to ready-for-review,
same as today's behavior.
slow-check.yml is unchanged. Its post-merge push-to-main trigger remains
as the canonical cloud-VM stress run (writes stress-results/), with the
pre-merge slow-compliance acting as the fast local-only gate.
openspec/specs/ci/spec.md
Pipeline-overview diagram updated to four pre-merge gates. New sections
for slow-compliance (PR-time, local), entry-review (with the verdict
schema inline), auto-merge (with the five required conditions), and
request-human-review (fallback). Invariant #6 changes from "Entries
never auto-merge" to the iff-condition matching the workflow.
Invariant #7 added: prompt is checked in, reviewer runs in its own job
so a compromised LLM step alone cannot bypass ownership/schema/install/
slow-check.
openspec/specs/entry/spec.md
One added bullet under "Contracts for submitters" describing what the
LLM reviewer checks and how CONCERN routes.
README.md + .github/pull_request_template.md
Both now describe the four-gate flow with the auto-merge condition.
Replaces the Phase-0 "ready-for-review only" copy from the previous
commit. Fork PRs are explicitly noted as falling back to manual merge.
openspec/changes/entry-auto-merge/proposal.md
Phasing table updated — Phase 0 + Phase 1 marked shipped. Phase 1
becomes active once ANTHROPIC_API_KEY is set in repo secrets.
To activate (post-merge):
Set ANTHROPIC_API_KEY in repo secrets. Next entry PR will be reviewed +
auto-merged on green. Until the secret is set, the workflow runs
ownership + quick + slow gates exactly as today and routes everything
to ready-for-review.
187/187 tests pass locally (19 new + 168 pre-existing).
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* fix: soften pre-merge slow-compliance to informational signal
The first live run of Phase 1 exposed a design flaw: making slow-compliance
a hard gate excludes almost every real cube from auto-merge. Cubes like
SWE-bench, OSWorld, terminal-bench, WebArena need Docker/VM/large-disk
environments that don't fit on a GHA runner with provider=local.
Concretely, this PR's terminalbench2 entry crashes slow-check at PR-time
with ModuleNotFoundError: No module named cube.tools.terminal — a real
cube-side stale-import bug, but one that would only ever surface once the
PR actually ran on real infra anyway.
Fix: continue-on-error: true on the slow-compliance job.
- Job still runs; red check still appears on PR if it fails (useful
signal for cubes that DO support local-provider)
- Failure no longer propagates to success() upstream, so auto-merge
gate is now (ownership ∧ quick ∧ review=PASS ∧ path-isolated
∧ same-repo), not (ownership ∧ quick ∧ slow ∧ review=PASS ∧ ...)
- Post-merge slow-check.yml on cloud VMs remains the canonical
execution gate; weekly health-check catches drift
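The revised gate can be written out as a small predicate. This is a sketch, not the workflow's actual `if:` expression; the `GateResults` type and its field names are hypothetical. It models only the auto-merge condition stated above, with slow-compliance recorded but deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResults:
    ownership_ok: bool
    quick_ok: bool
    slow_ok: bool        # informational only: not part of the condition
    review_verdict: str  # "PASS", "CONCERN", or "UNKNOWN"
    path_isolated: bool
    same_repo: bool


def auto_merge_eligible(g: GateResults) -> bool:
    """ownership AND quick AND review=PASS AND path-isolated AND same-repo."""
    return (
        g.ownership_ok
        and g.quick_ok
        and g.review_verdict == "PASS"
        and g.path_isolated
        and g.same_repo
    )


# A cube whose pre-merge slow-compliance failed is still eligible.
heavy_cube = GateResults(True, True, False, "PASS", True, True)
# A fork PR is never eligible, even when every check is green.
fork_pr = GateResults(True, True, True, "PASS", True, False)
```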
Updated:
- openspec/specs/ci/spec.md: invariant #6 dropped the slow-compliance
conjunct; section labels itself informational; auto-merge condition
list revised
- README.md + .github/pull_request_template.md: gate tables now have
Hard-gate? column; pre-merge slow-compliance marked informational
with a paragraph explaining why
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* fix: address code-review findings on Phase 1
Six fixes from the multi-agent code review of PR #50:
S1 (security, NIT) — Path-isolation regex tightened from `^entries/` to
`^entries/[^/]*\.yaml$`. The old check accepted `entries/subdir/foo.yaml`
and `entries/foo.txt`; the new pattern enforces flat `entries/<id>.yaml`
only, matching the auto-merge eligibility contract.
(.github/workflows/quick-check.yml ~L94)
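The difference between the two patterns can be checked directly. The sketch below applies them with Python's `re` module purely for illustration (the workflow presumably uses a shell tool); the `path_isolated` helper is hypothetical.

```python
import re

OLD_PATTERN = re.compile(r"^entries/")
NEW_PATTERN = re.compile(r"^entries/[^/]*\.yaml$")


def path_isolated(changed_files, pattern=NEW_PATTERN) -> bool:
    """True when every changed path matches the pattern."""
    return all(pattern.match(path) for path in changed_files)


nested = "entries/subdir/foo.yaml"
wrong_ext = "entries/foo.txt"
flat = "entries/terminalbench2.yaml"

old_accepts_nested = bool(OLD_PATTERN.match(nested))     # accepted by the old check
new_accepts_nested = bool(NEW_PATTERN.match(nested))     # rejected by the new check
new_accepts_wrong_ext = bool(NEW_PATTERN.match(wrong_ext))
new_accepts_flat = bool(NEW_PATTERN.match(flat))
```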
S2 (security, BLOCKING) — Dropped `--auto` from `gh pr merge`. The flag
requires branch-protection rules to be configured on `main`; without them
it silently no-ops. The workflow's `if:` already enforces all gates
(ownership ∧ quick ∧ review=PASS ∧ path-isolated ∧ same-repo), so an
unconditional `--squash` is both correct and direct. Comment added
explaining the rationale.
(.github/workflows/quick-check.yml ~L351)
C1 (correctness, BLOCKING) — Fixed misleading copy in the request-review
fallback comment. Old: "ownership-check, quick-compliance, slow-compliance
passed." but slow-compliance is informational and may have failed. New:
"ownership-check + quick-compliance passed (slow-compliance is
informational)."
(.github/workflows/quick-check.yml ~L399)
C3 (correctness, BLOCKING) — Hardened entry_review.py against transient
LLM errors. Old: a network blip / API 5xx / schema-validation 400 would
propagate as a Python exception → workflow step exits non-zero →
downstream success() is false → BOTH auto-merge AND request-review skip,
leaving the PR stuck with a red check and no actionable feedback.
Fix: wrap call_claude in try/except at the boundary; on any error emit
verdict=UNKNOWN with the error type+message in notes; always exit 0.
Behaviorally identical to "API key not set" — request-review fires with
the appropriate fallback comment.
Also distinguished PASS/CONCERN/UNKNOWN in the workflow's per-entry
loop. Previously OVERALL collapsed CONCERN+UNKNOWN onto CONCERN, which
mislabeled "Claude couldn't run" as "Claude flagged a concern" in the
fallback comment. Now: ANY entry CONCERN → OVERALL=CONCERN; otherwise
ANY entry UNKNOWN → OVERALL=UNKNOWN; else PASS.
(scripts/entry_review.py ~L260; .github/workflows/quick-check.yml ~L301)
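Both halves of the C3 fix can be sketched in a few lines. The function names `review_entry_safely` and `overall_verdict` are hypothetical, and `call_reviewer` stands in for whatever wraps the Claude call; only the behavior (any error becomes UNKNOWN, CONCERN outranks UNKNOWN outranks PASS) is taken from the description above.

```python
def review_entry_safely(call_reviewer, entry: dict) -> dict:
    """Run the reviewer; convert any failure into an UNKNOWN verdict."""
    try:
        return call_reviewer(entry)
    except Exception as exc:  # boundary catch: never let the step crash
        return {
            "verdict": "UNKNOWN",
            "checks": {},
            "notes": f"{type(exc).__name__}: {exc}",
        }


def overall_verdict(verdicts) -> str:
    """Any CONCERN wins; otherwise any UNKNOWN; otherwise PASS."""
    verdicts = list(verdicts)
    if "CONCERN" in verdicts:
        return "CONCERN"
    if "UNKNOWN" in verdicts:
        return "UNKNOWN"
    return "PASS"


def flaky_reviewer(entry: dict) -> dict:
    # Simulates a transient network failure.
    raise ConnectionError("network blip")


result = review_entry_safely(flaky_reviewer, {"id": "example-cube"})
```

Because the wrapper always returns a verdict, the calling script can always exit 0 and leave routing to the workflow.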
C5 (test coverage) — Added 4 new tests for the regex hardening surface +
the LLM-error path:
- test_rejects_typosquat_host (github.com.evil.com → no match, no fetch)
- test_rejects_non_https_scheme (http://, git+http://, git+ssh://, file:// are all rejected)
- test_captures_owner_repo_only (pathological owner stays in URL, no shell exposure)
- test_llm_error_writes_unknown_verdict (ConnectionError → verdict=UNKNOWN, rc=0)
Tests: 19 → 23 entry_review tests; 187 → 191 suite total.
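The properties these tests target can be illustrated with a strict repo-URL parser. The regex and the `parse_repo_url` helper below are hypothetical; the actual pattern in entry_review.py is not shown in this message and may differ.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: https only, exact github.com host, and a
# conservative character set for owner and repo.
GITHUB_REPO_RE = re.compile(
    r"https://github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+?)(?:\.git)?/?"
)


def parse_repo_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a plain GitHub repo URL, else None."""
    m = GITHUB_REPO_RE.fullmatch(url)
    return (m.group(1), m.group(2)) if m else None


good = parse_repo_url("https://github.com/The-AI-Alliance/cube-harness")
typosquat = parse_repo_url("https://github.com.evil.com/owner/repo")
plain_http = parse_repo_url("http://github.com/owner/repo")
shell_chars = parse_repo_url("https://github.com/owner;rm -rf/repo")
```

Using `fullmatch` instead of `match` with `$` avoids accepting a URL that carries a trailing newline.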
St5+St6 (spec drift) — `openspec/changes/entry-auto-merge/proposal.md`
phasing table no longer claims auto-merge gates on `∧ slow ∧`; correctly
notes slow-compliance is informational. deltas.md invariant #6 wording
aligned with the spec.md version (mentions same-repo and continue-on-error
properties).
Not addressed (intentional):
C2 — Docs-only PRs (no entries changed) being unlabeled is not a real
bug. They aren't auto-merge candidates anyway; GitHub's normal review
flow applies.
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* fix: strip GHA multi-line-output backslash continuations in cross-job loops
The previous CI run with ANTHROPIC_API_KEY now configured revealed a
silent multi-entry bug: only terminalbench2 got reviewed (and earlier
slow-checked); swebench-live and swebench-verified were silently skipped.
Root cause: GitHub Actions encodes multi-line job-output values with
backslash-newline line continuations when forwarded across jobs via
needs.JOB.outputs.NAME. The tj-actions/changed-files step inside
quick-compliance is invoked with separator: newline, which works fine
when consumed in the SAME job (step-internal interpolation), but the
cross-job hop into entry-review and slow-compliance wraps each
intermediate line with a trailing backslash.
The loop `while IFS= read -r entry_file` uses -r, which preserves the
backslash. The existence check [ ! -f ] then succeeds for the literal
trailing-backslash path (no such file) and the iteration silently
continues. Only the LAST entry, which has no backslash-and-newline after
it, is read cleanly.
That's why three back-to-back live runs produced exactly one review (and
exactly one slow-check failure, which we attributed solely to
terminalbench2 having a stale cube.tools.terminal import; that diagnosis
stands, but it masked this loop bug).
Fix: entry_file=${entry_file%\\} to strip the trailing backslash before
the existence check. Applied to both jobs with inline comments to
prevent recurrence.
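The failure mode and the fix can be reproduced outside the workflow. The sketch below is a Python model of the shell loop, not the workflow code itself; `existing_entries` is a hypothetical helper, and the backslash-newline encoding of the job output is taken from the description above.

```python
import os
import tempfile


def existing_entries(raw_output: str, root: str, strip_backslash: bool) -> list:
    """Model of the per-entry loop: keep only paths that exist under root."""
    found = []
    for entry_file in raw_output.split("\n"):
        if strip_backslash and entry_file.endswith("\\"):
            entry_file = entry_file[:-1]  # mirrors entry_file=${entry_file%\\}
        if not os.path.isfile(os.path.join(root, entry_file)):
            continue  # mirrors the silent skip on [ ! -f ]
        found.append(entry_file)
    return found


names = [
    "entries/swebench-live.yaml",
    "entries/swebench-verified.yaml",
    "entries/terminalbench2.yaml",
]

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "entries"))
    for name in names:
        open(os.path.join(root, name), "w").close()
    # Every line except the last carries a trailing backslash.
    raw = "\\\n".join(names)
    before = existing_entries(raw, root, strip_backslash=False)
    after = existing_entries(raw, root, strip_backslash=True)
```

Without the strip, only the last entry survives the existence check; with it, all three do.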
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
---------
Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
1 parent 5a18b6d commit 6702367
14 files changed
Lines changed: 1466 additions & 55 deletions
File tree
- .github
- workflows
- entries
- openspec
- changes/entry-auto-merge
- specs
- ci
- entry
- scripts
- tests