CUBE Registry

The official catalog of CUBE-compliant benchmarks.

What is this?

The CUBE Registry is a community-maintained index of benchmarks that implement the CUBE standard. Any CUBE-compliant evaluation platform or training harness can discover and run registered benchmarks without custom integration.

Each benchmark is a single YAML file in entries/. The registry does not host benchmark code or data — it points to PyPI packages that do.
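
As a rough illustration of how a consumer might enumerate the catalog, the sketch below lists benchmark ids from a local checkout. It relies on the rule from the entry template that an entry's id matches its filename; the function name list_entry_ids and the registry_root argument are hypothetical and not part of any CUBE tooling.

```python
from pathlib import Path


def list_entry_ids(registry_root: Path) -> list[str]:
    """Return one benchmark id per YAML file in entries/, sorted by name."""
    entries_dir = registry_root / "entries"
    # Each entry file is named <id>.yaml, so the file stem is the benchmark id.
    return sorted(path.stem for path in entries_dir.glob("*.yaml"))
```

Calling list_entry_ids(Path(".")) from the root of a checkout would return the ids of all registered benchmarks without parsing any YAML.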

Submitting a benchmark

Prerequisites

Your benchmark package must:

Be published on PyPI
Implement the CUBE Benchmark and Task interfaces
Expose at least one debug task via cube/debug_tasks

Ready to wrap a benchmark into a CUBE? See the Authoring a CUBE guide. The easiest path is the /new-cube skill for coding agents, which interviews you, scaffolds the package, fills TODOs, validates, and produces a registry entry end-to-end:

/new-cube

Before submitting, self-audit with /review-cube ./path/to/cube — it installs your package, runs pytest + cube test, audits against cube-standard invariants, and reports blocking issues locally so you can fix them before CI does.

Submission steps

Automated (recommended): from your cube package directory, run:

cube registry add --submit

This generates the entry YAML from your pyproject.toml, forks this repo, commits the entry, and opens a PR via the gh CLI. Run cube registry add without --submit first if you want to edit the YAML locally before opening the PR.

Manual:

Fork this repository
Create entries/<your-benchmark-id>.yaml (see template below)
Open a pull request

Either way, CI runs three hard gates (ownership-check, quick-compliance, and the entry-review LLM semantic review) plus an informational slow-compliance signal. If the hard gates pass and the PR diff touches only entries/<id>.yaml, the PR auto-merges. If the LLM review returns CONCERN, the PR is labeled ready-for-review and a maintainer finishes the merge.

Fork PRs and PRs that touch paths outside entries/ always fall back to maintainer-merge. See openspec/specs/ci/spec.md for the full pipeline contract.
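
The routing rules above can be sketched as a small decision function. This is an illustration only, not the CI implementation: the PullRequest type, its field names, the order in which the rules are checked, and the return strings other than the ready-for-review label are assumptions. The authoritative contract is openspec/specs/ci/spec.md.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PullRequest:
    """Hypothetical summary of a PR as seen by CI."""
    changed_paths: List[str]     # paths touched by the PR diff
    from_fork: bool              # True when the PR branch lives in a fork
    ownership_ok: bool           # ownership-check result
    quick_compliance_ok: bool    # quick-compliance result
    review_verdict: str          # entry-review verdict: "PASS" or "CONCERN"


def merge_route(pr: PullRequest, entry_id: str) -> str:
    """Decide how a submission PR gets merged (assumed rule ordering)."""
    if not (pr.ownership_ok and pr.quick_compliance_ok):
        return "blocked"             # a hard gate is red; submitter pushes a fix
    if pr.review_verdict == "CONCERN":
        return "ready-for-review"    # a maintainer finishes the merge
    only_entry = pr.changed_paths == [f"entries/{entry_id}.yaml"]
    if pr.from_fork or not only_entry:
        return "maintainer-merge"    # fork PRs and wider diffs never auto-merge
    return "auto-merge"
```

Note that slow-compliance does not appear in the function at all, which mirrors its informational role.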

Entry template

id: your-benchmark-id          # must match filename, globally unique
name: Your Benchmark Name
version: "1.0.0"               # must match PyPI version exactly
description: >
  One paragraph describing what your benchmark tests and why it matters.
package: your-pypi-package-name

authors:
  - github: your-github-handle
    name: Your Name

legal:
  wrapper_license: MIT          # license of this cube wrapper code
  benchmark_license:
    reported: "CC-BY-4.0"      # SPDX identifier, as you understand it
    source_url: "https://github.com/you/benchmark/blob/main/LICENSE"
  notices: []                   # see spec for notice types

paper: "https://arxiv.org/abs/..."   # optional
tags: [web, coding, os, gui, mobile, science, math, multi-agent]
getting_started_url: "https://..."

supported_infra: [aws]          # providers to run compliance checks on
max_concurrent_tasks: 1
parallelization_mode: sequential  # sequential | task-parallel | benchmark-parallel

Fields populated automatically by CI (do not fill): status, resources, task_count, has_debug_task, has_debug_agent, action_space, features, stress_results_url
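
Two of the template rules are easy to check locally before opening a PR: the id must match the filename, and the CI-populated fields must be left out. The sketch below assumes the entry has already been parsed into a dict by a YAML loader of your choice; check_entry is a hypothetical helper, not a command or API shipped with the registry.

```python
from pathlib import Path

# Fields the README says CI fills in; submitters should not set them.
CI_POPULATED_FIELDS = frozenset({
    "status", "resources", "task_count", "has_debug_task",
    "has_debug_agent", "action_space", "features", "stress_results_url",
})


def check_entry(entry: dict, filename: str) -> list[str]:
    """Return a list of human-readable problems; empty means both checks pass."""
    problems = []
    expected_id = Path(filename).stem  # entries/my-cube.yaml -> my-cube
    if entry.get("id") != expected_id:
        problems.append(
            f"id {entry.get('id')!r} does not match filename stem {expected_id!r}"
        )
    prefilled = sorted(CI_POPULATED_FIELDS.intersection(entry))
    if prefilled:
        problems.append("CI-populated fields must be left out: " + ", ".join(prefilled))
    return problems
```

This covers only a small part of what quick-compliance validates; the full schema lives in registry-schema.json.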

Updating your entry

Open a PR modifying your existing YAML. CI verifies you are a registered author (via OWNERS.yaml) and runs the same four pre-merge gates as a new submission. When all hard gates pass, the PR auto-merges.

Compliance checks

Every submission goes through four pre-merge gates and one post-merge gate:

| Gate | When | What | Hard gate? |
| --- | --- | --- | --- |
| ownership-check | On PR (~10s) | Submitter is in `OWNERS.yaml` for the entry (or it's new) | Yes |
| quick-compliance | On PR (~2 min) | Schema, PyPI install, API introspection (hardened Docker sandbox, no credentials) | Yes |
| slow-compliance | On PR (~5 min) | Debug task with `provider=local` on the GHA runner | No (informational) |
| entry-review | On PR (~1 min) | LLM semantic check; verdict `PASS` or `CONCERN` | Yes |
| slow-check (cloud) | Post-merge (async, ~5–30 min) | Full stress run on cloud VMs across `supported_infra`; writes `stress-results/` | Post-merge canonical |

Pre-merge slow-compliance is informational: most real cubes need Docker/VM/large-disk environments that don't fit on a GHA runner, so hard-gating there would exclude almost every real benchmark from auto-merge. Failure surfaces as a red check (useful signal for cubes that do support local) but doesn't block. The post-merge slow-check on cloud VMs is the canonical execution gate.

When a hard gate fails, the check shows red on the PR and the submitter pushes a fix. When the post-merge slow-check fails, a GitHub issue is opened tagging the entry authors, and the entry remains in the registry with status: degraded until fixed.
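
To make the hard versus informational split concrete, here is a minimal sketch that separates failing pre-merge checks into those that block a merge and those that only surface as a red signal. The gate names come from the table above; the function failing_checks, the shape of the results mapping, and the choice to treat a missing hard-gate result as a failure are assumptions.

```python
# Pre-merge gates, split by whether a failure blocks the merge.
HARD_GATES = ("ownership-check", "quick-compliance", "entry-review")
INFORMATIONAL_GATES = ("slow-compliance",)


def failing_checks(results: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split failing pre-merge checks into (blocking, informational).

    `results` maps a gate name to True (green) or False (red).
    A hard gate with no reported result is treated as failing (assumption).
    """
    blocking = [g for g in HARD_GATES if not results.get(g, False)]
    informational = [
        g for g in INFORMATIONAL_GATES if g in results and not results[g]
    ]
    return blocking, informational
```

A PR is mergeable in this model whenever the first list is empty, regardless of what the second list contains.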

Submitting a result

The registry also hosts a per-cube community results journal — one JSON file per evaluation run, browsable on each cube's page. Submissions are self-reported; scores are not independently verified.

Submission format

One file per run at results/<cube-id>/<sanitized-evaluation-id>.json, conforming to results-schema.json. The file captures what was evaluated (benchmark, version, subset, and the n_tasks denominator) and what happened (aggregate score, standard error, and mutually exclusive outcome counts: success, clean failure, max-steps, system error, missing). The file carries no per-task data; keep raw trajectories on your own infrastructure.
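
Because the outcome counts are mutually exclusive and n_tasks is the denominator, one plausible consistency check is that the counts are non-negative and sum to n_tasks. The sketch below assumes exactly that; the key names in OUTCOME_KEYS are hypothetical stand-ins, and the real field names and rules are defined by results-schema.json and the CI checks.

```python
# Hypothetical key names for the five outcome buckets; see results-schema.json
# for the actual field names.
OUTCOME_KEYS = ("success", "clean_failure", "max_steps", "system_error", "missing")


def outcome_problems(n_tasks: int, outcomes: dict[str, int]) -> list[str]:
    """Return consistency problems between outcome counts and n_tasks."""
    problems = []
    for key in OUTCOME_KEYS:
        if outcomes.get(key, 0) < 0:
            problems.append(f"{key} is negative")
    total = sum(outcomes.get(key, 0) for key in OUTCOME_KEYS)
    # Every task lands in exactly one bucket, so the buckets should add up
    # to the denominator (assumed rule).
    if total != n_tasks:
        problems.append(f"outcome counts sum to {total}, expected n_tasks={n_tasks}")
    return problems
```

Running a check like this before submitting can catch a miscounted denominator early, though CI remains the authority on what counts as consistent.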

How to submit

The easiest path is from cube-harness:

uv run scripts/submit_to_journal.py <experiment_dir>

This builds the JSON, clones cube-registry, and opens a PR via gh.

Or manually:

Fork cube-registry.
Create results/<cube-id>/<your-evaluation-id>.json (slashes in the evaluation-id are replaced with __ in the filename).
Open a PR. CI validates the schema, cross-references the registry, and checks outcome consistency. If everything passes and the PR only adds files under results/, the PR auto-merges.
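
For manual submissions, the filename rule from the steps above (slashes in the evaluation id become __) can be sketched as follows. The helper name result_path is hypothetical; if the submitter script applies further sanitization beyond slashes, that is not reflected here.

```python
from pathlib import PurePosixPath


def result_path(cube_id: str, evaluation_id: str) -> PurePosixPath:
    """Build the repo-relative path for a results file."""
    # Slashes would create nested directories, so they are replaced with "__".
    sanitized = evaluation_id.replace("/", "__")
    return PurePosixPath("results") / cube_id / f"{sanitized}.json"
```

For example, result_path("my-cube", "org/model/run-1") yields results/my-cube/org__model__run-1.json.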

⚠ Fork PRs auto-merge only when push permissions exist. GitHub's default GITHUB_TOKEN is read-only for pull_request workflows triggered from a fork, so the gh pr merge --auto step is a no-op there. Validation still runs and posts a green summary — a maintainer will need to complete the merge by hand. To get true auto-merge, submit from a branch of this repo (cube registry add and the cube-harness submitter both support this) or wait for the maintainer review.

The journal is append-only. Corrections are made by submitting a new record with a supersedes field referencing the prior evaluation_id.
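
A reader of the journal may want only the records that have not been corrected. The sketch below filters out any record whose evaluation_id is referenced by another record's supersedes field. It assumes records are already loaded as dicts and that supersedes holds a single evaluation_id; current_records is a hypothetical helper.

```python
def current_records(records: list[dict]) -> list[dict]:
    """Drop records that a later submission has superseded."""
    # Collect every evaluation_id that some record claims to replace.
    superseded = {r["supersedes"] for r in records if r.get("supersedes")}
    return [r for r in records if r["evaluation_id"] not in superseded]
```

Chains of corrections work naturally: if C supersedes B and B supersedes A, only C survives the filter, while A and B stay in the journal as history.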

Legal

License information in this registry is self-reported by cube developers and has not been verified by the AI Alliance. Always consult the benchmark_license.source_url and the original benchmark authors for authoritative terms.

By submitting an entry, contributors attest that license information is accurate to the best of their knowledge. See CONTRIBUTOR_AGREEMENT.md for full terms.

Specification

Full registry design: cube-standard/design/registry_specs.md
