The official catalog of CUBE-compliant benchmarks.
The CUBE Registry is a community-maintained index of benchmarks that implement the CUBE standard. Any CUBE-compliant evaluation platform or training harness can discover and run registered benchmarks without custom integration.
Each benchmark is a single YAML file in entries/. The registry does not host benchmark
code or data — it points to PyPI packages that do.
Your benchmark package must:
- Be published on PyPI
- Implement the CUBE Benchmark and Task interfaces
- Expose at least one debug task via cube/debug_tasks
Ready to wrap a benchmark into a CUBE? See the Authoring a CUBE guide. The easiest path is the /new-cube skill for coding agents, which interviews you, scaffolds the package, fills TODOs, validates, and produces a registry entry end-to-end:
/new-cube
Before submitting, self-audit with /review-cube ./path/to/cube — it installs your package, runs pytest + cube test, audits against cube-standard invariants, and reports blocking issues locally so you can fix them before CI does.
Automated (recommended): from your cube package directory, run:
cube registry add --submit
This generates the entry YAML from your pyproject.toml, forks this repo, commits the entry, and opens a PR via the gh CLI. Run cube registry add without --submit first if you want to edit the YAML locally before opening the PR.
Manual:
- Fork this repository
- Create entries/<your-benchmark-id>.yaml (see template below)
- Open a pull request
Either way, CI runs three hard gates (ownership, quick-compliance, LLM
semantic review) plus an informational slow-compliance signal. If the hard
gates pass and the PR diff is strictly under entries/<id>.yaml, the PR
auto-merges. If the LLM flags a CONCERN, the PR is labeled
ready-for-review and a maintainer finishes the merge.
Fork PRs and PRs that touch paths outside entries/ always fall back to
maintainer-merge. See
openspec/specs/ci/spec.md for the full pipeline
contract.
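The merge rules above can be summarized as a small decision function. This is a minimal sketch, not the registry's actual CI code: the function name, its parameters, and the returned labels other than ready-for-review are hypothetical, and the precedence between a CONCERN verdict and the fork fallback is an assumption.

```python
def merge_decision(
    ownership_ok: bool,
    quick_compliance_ok: bool,
    review_verdict: str,          # "PASS" or "CONCERN"
    changed_paths: list[str],
    entry_id: str,
    from_fork: bool = False,
) -> str:
    """Illustrative mapping from gate results to a merge outcome (hypothetical helper)."""
    # A failed hard gate shows red on the PR; the submitter has to push a fix.
    if not (ownership_ok and quick_compliance_ok):
        return "blocked"
    # An LLM CONCERN hands the PR to a maintainer.
    if review_verdict == "CONCERN":
        return "ready-for-review"
    # Auto-merge requires the diff to touch only entries/<id>.yaml.
    only_entry_file = set(changed_paths) == {f"entries/{entry_id}.yaml"}
    if from_fork or not only_entry_file:
        return "maintainer-merge"
    return "auto-merge"


print(merge_decision(True, True, "PASS", ["entries/my-bench.yaml"], "my-bench"))
```

The slow-compliance result is deliberately absent from the inputs, since it is informational and does not affect the merge.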
id: your-benchmark-id # must match filename, globally unique
name: Your Benchmark Name
version: "1.0.0" # must match PyPI version exactly
description: >
  One paragraph describing what your benchmark tests and why it matters.
package: your-pypi-package-name
authors:
  - github: your-github-handle
    name: Your Name
legal:
  wrapper_license: MIT # license of this cube wrapper code
  benchmark_license:
    reported: "CC-BY-4.0" # SPDX identifier, as you understand it
    source_url: "https://github.com/you/benchmark/blob/main/LICENSE"
  notices: [] # see spec for notice types
paper: "https://arxiv.org/abs/..." # optional
tags: [web, coding, os, gui, mobile, science, math, multi-agent]
getting_started_url: "https://..."
supported_infra: [aws] # providers to run compliance checks on
max_concurrent_tasks: 1
parallelization_mode: sequential # sequential | task-parallel | benchmark-parallel

Fields populated automatically by CI (do not fill):
status, resources, task_count, has_debug_task, has_debug_agent,
action_space, features, stress_results_url
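Before opening a PR, an entry can be sanity-checked locally against the rules stated in the template. The following is a minimal sketch, assuming the YAML has already been loaded into a dict; precheck_entry and its messages are hypothetical and do not replace the CI schema check.

```python
# Fields the registry CI fills in; submitters should leave them out.
CI_POPULATED_FIELDS = {
    "status", "resources", "task_count", "has_debug_task", "has_debug_agent",
    "action_space", "features", "stress_results_url",
}


def precheck_entry(entry: dict, filename: str) -> list[str]:
    """Return a list of problems found in an entry dict (hypothetical helper)."""
    problems = []
    # The id must match the file name, e.g. entries/my-bench.yaml -> my-bench.
    expected_id = filename.removesuffix(".yaml")
    if entry.get("id") != expected_id:
        problems.append(f"id should be {expected_id!r}")
    # The version is compared to the PyPI version, so keep it a quoted string.
    if not isinstance(entry.get("version"), str):
        problems.append("version should be a quoted string")
    # CI-populated fields must not be filled in by hand.
    filled = sorted(CI_POPULATED_FIELDS.intersection(entry))
    if filled:
        problems.append("remove CI-populated fields: " + ", ".join(filled))
    return problems


entry = {"id": "my-bench", "version": "1.0.0", "status": "ok"}
print(precheck_entry(entry, "my-bench.yaml"))
```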
To update an entry, open a PR modifying your existing YAML. CI verifies you are a registered
author (via OWNERS.yaml) and runs the same four gates as a new submission.
If all hard gates pass, the PR auto-merges.
Every submission goes through four pre-merge gates and one post-merge gate:
| Gate | When | What it checks | Hard gate? |
|---|---|---|---|
| ownership-check | On PR (~10s) | Submitter is in OWNERS.yaml for the entry (or it's new) | Yes |
| quick-compliance | On PR (~2 min) | Schema, PyPI install, API introspection (hardened Docker sandbox, no credentials) | Yes |
| slow-compliance | On PR (~5 min) | Debug task with provider=local on the GHA runner | No (informational) |
| entry-review | On PR (~1 min) | LLM semantic check; verdict PASS or CONCERN | Yes |
| slow-check (cloud) | Post-merge (async, ~5–30 min) | Full stress run on cloud VMs across supported_infra; writes stress-results/ | Post-merge canonical |
Pre-merge slow-compliance is informational: most real cubes need
Docker/VM/large-disk environments that don't fit on a GHA runner, so
hard-gating there would exclude almost every real benchmark from auto-merge.
Failure surfaces as a red check (useful signal for cubes that do support
local) but doesn't block. The post-merge slow-check on cloud VMs is the
canonical execution gate.
Hard-gate failures → check shows red on the PR; submitter pushes a fix.
Post-merge slow-check failing → opens a GitHub issue tagging the entry authors;
entry remains in the registry with status: degraded until fixed.
The registry also hosts a per-cube community results journal — one JSON file per evaluation run, browsable on each cube's page. Submissions are self-reported; scores are not independently verified.
One file per run at results/<cube-id>/<sanitized-evaluation-id>.json,
conforming to results-schema.json. The file captures
what was evaluated (benchmark + version + subset + n_tasks denominator) and
what happened (aggregate score + std err + mutually-exclusive outcome counts:
success / clean failure / max-steps / system error / missing). No per-task
data — keep raw trajectories on your own infra.
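Because the outcome counts are mutually exclusive, one natural consistency check is that they add up to the n_tasks denominator. The sketch below illustrates that idea only: the key names in OUTCOME_KEYS and the function itself are assumptions, and results-schema.json remains the authority on real field names and on what CI actually verifies.

```python
# Hypothetical key names for the five mutually-exclusive outcomes.
OUTCOME_KEYS = ("success", "clean_failure", "max_steps", "system_error", "missing")


def outcomes_consistent(n_tasks: int, outcomes: dict[str, int]) -> bool:
    """Check that every task is counted exactly once across the outcomes."""
    # Every outcome bucket must be present, with no unknown extras.
    if set(outcomes) != set(OUTCOME_KEYS):
        return False
    # Counts cannot be negative.
    if any(count < 0 for count in outcomes.values()):
        return False
    # Mutually exclusive buckets must sum to the denominator.
    return sum(outcomes.values()) == n_tasks


run = {"success": 40, "clean_failure": 50, "max_steps": 6, "system_error": 3, "missing": 1}
print(outcomes_consistent(100, run))
```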
The easiest path is from cube-harness:
uv run scripts/submit_to_journal.py <experiment_dir>
This builds the JSON, clones cube-registry, and opens a PR via gh.
Or manually:
- Fork cube-registry.
- Create results/<cube-id>/<your-evaluation-id>.json (slashes in the evaluation-id are replaced with __ in the filename).
- Open a PR. CI validates the schema, cross-references the registry, and checks outcome consistency. If everything passes and the PR strictly adds files under results/, the PR auto-merges.
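The filename rule from the manual steps can be expressed in a few lines. This is a sketch under the single rule stated above (slashes become a double underscore); results_path is a hypothetical helper, and the real submitter may apply further sanitization.

```python
def results_path(cube_id: str, evaluation_id: str) -> str:
    """Build the journal file path for one evaluation run (hypothetical helper)."""
    # Slashes would create nested directories, so they are replaced with "__".
    sanitized = evaluation_id.replace("/", "__")
    return f"results/{cube_id}/{sanitized}.json"


print(results_path("my-cube", "team/run-1"))  # results/my-cube/team__run-1.json
```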
⚠ Fork PRs auto-merge only when push permissions exist. GitHub's default GITHUB_TOKEN is read-only for pull_request workflows triggered from a fork, so the gh pr merge --auto step is a no-op there. Validation still runs and posts a green summary, but a maintainer will need to complete the merge by hand. To get true auto-merge, submit from a branch of this repo (cube registry add and the cube-harness submitter both support this); otherwise, wait for maintainer review.
The journal is append-only. Corrections are made by submitting a new
record with a supersedes field referencing the prior evaluation_id.
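Since records are never edited in place, a reader of the journal has to resolve corrections by following supersedes references. Here is a minimal sketch of that resolution, assuming each record is a dict with an evaluation_id and an optional supersedes field as described above; current_records is a hypothetical helper, not part of the registry tooling.

```python
def current_records(records: list[dict]) -> list[dict]:
    """Return the records that no later record has superseded."""
    # Collect every evaluation_id that some record claims to replace.
    superseded = {r["supersedes"] for r in records if r.get("supersedes")}
    # Keep only records whose id was never replaced; order is preserved.
    return [r for r in records if r["evaluation_id"] not in superseded]


journal = [
    {"evaluation_id": "run-1"},
    {"evaluation_id": "run-1-fixed", "supersedes": "run-1"},
    {"evaluation_id": "run-2"},
]
print([r["evaluation_id"] for r in current_records(journal)])
```

With the sample journal, run-1 is dropped because run-1-fixed supersedes it, while the original file stays in the repository untouched.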
License information in this registry is self-reported by cube developers and has not been
verified by the AI Alliance. Always consult the benchmark_license.source_url and the
original benchmark authors for authoritative terms.
By submitting an entry, contributors attest that license information is accurate to the best of their knowledge. See CONTRIBUTOR_AGREEMENT.md for full terms.
Full registry design: cube-standard/design/registry_specs.md