feat: register arithmetic, swebench-live, swebench-verified, terminalbench2

recursix · recursix · commit 3d8ce961bbe8 · 2026-06-03T16:07:54.000-04:00
Adds entry YAMLs for four cubes that pass quick_check.py with a clean scaffold. All four live in The-AI-Alliance/cube-harness today and are installable via the standard dev_install_url pattern. They were discovered and validated by looping the registry's quick_check.py (--no-install) against every cube directory in cube-harness/cubes/ as a registration-UX probe.

Specifics:
• arithmetic (4 tasks): math benchmark, smoke-test fixture
• swebench-live (1895 tasks): continuously refreshed GitHub-issue resolution, contamination-resistant
• swebench-verified (500 tasks): Princeton + OpenAI human-validated subset
• terminalbench2 (89 tasks): Laude Institute terminal tasks (16 categories, pytest-validated)

Each entry passed quick_check.py --no-install locally with the cube package installed editable from cube-harness/cubes/<name>. The script's write-back populated task_count, has_debug_task, has_debug_agent, features, action_space, status, and resources without further hand-editing.

Author handles are the actual cube wrapper maintainers (per git log on each cubes/<name>/ subtree):
• arithmetic: NicolasAG, recursix
• swebench-live: NicolasAG, recursix, josancamon19
• swebench-verified: NicolasAG, recursix, josancamon19
• terminalbench2: recursix

Other cubes investigated in the same probe but NOT included here:
• browsercomp: module-load fails (BrowseCompBenchmarkConfig requires a scorer_model that the default doesn't pass). Filed cube-harness#479 against younik + NicolasAG.
• windows-agent-arena: task_metadata count (152) != declared num_tasks (154). Filed cube-harness#480 against kushasareen + amanjaiswal73892.
• webarena-verified: already registered; module-load now fails after a BgymToolConfig → ToolboxConfig rename that didn't propagate to the cube's configs.py. Will surface on the next periodic health-check.
• terminalbench-cube: empty directory in cube-harness (build artefacts only); the dev branch doesn't track any source. Apparent leftover from a tb → tb2 rename.
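
The windows-agent-arena mismatch above (152 metadata entries against 154 declared tasks) is the kind of inconsistency a probe can catch mechanically. A minimal sketch of such a check, assuming a hypothetical helper check_task_count and a plain dict standing in for task_metadata. This is an illustration only, not quick_check.py's actual code:

```python
from typing import Mapping, Optional


def check_task_count(task_metadata: Mapping, declared_num_tasks: int) -> Optional[str]:
    """Return an error message if the counts disagree, else None.

    Hypothetical helper: `task_metadata` is assumed to map task ids to
    per-task metadata, and `declared_num_tasks` is the count the cube declares.
    """
    actual = len(task_metadata)
    if actual != declared_num_tasks:
        return (
            f"task_metadata count ({actual}) != "
            f"declared num_tasks ({declared_num_tasks})"
        )
    return None


if __name__ == "__main__":
    # Reproduce the shape of the reported mismatch with placeholder task ids.
    fake_metadata = {f"task-{i}": {} for i in range(152)}
    print(check_task_count(fake_metadata, 154))
```

Returning a message rather than raising lets a loop over many cubes collect every failure in one pass.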
Drive-by fix in scripts/quick_check.py: the script calls importlib.metadata.entry_points(...) at lines 122 and 167 but only does import importlib at line 21. On Python ≥ 3.10, importlib.metadata is still a submodule that must be imported explicitly; a bare import importlib is not guaranteed to load it. Without the explicit import, the try/except at find_benchmark_class:121 catches an AttributeError ("module 'importlib' has no attribute 'metadata'") and silently degrades to the by-name lookup path. Added one line: import importlib.metadata. Pure correctness fix; the by-name fallback path remains as defense in depth.

Companion issues for the cubes that didn't make this batch:
• The-AI-Alliance/cube-harness#479 (browsercomp)
• The-AI-Alliance/cube-harness#480 (windows-agent-arena)

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
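
For illustration, the pattern behind the drive-by fix can be sketched as below. This assumes a hypothetical helper entry_point_names and a made-up group name "cube.benchmarks"; neither is taken from quick_check.py. The point is the explicit submodule import, plus reporting the lookup failure instead of swallowing it:

```python
import importlib
import importlib.metadata  # explicit: a bare `import importlib` may not load this submodule
from typing import List


def entry_point_names(group: str) -> List[str]:
    """List entry-point names registered under `group`.

    Uses the keyword-selection form of entry_points(), available on
    Python >= 3.10. Hypothetical helper, for illustration only.
    """
    try:
        selected = importlib.metadata.entry_points(group=group)
    except (AttributeError, TypeError) as exc:
        # Report the failure so a fallback path is never taken silently.
        print(f"entry-point lookup unavailable: {exc!r}")
        return []
    return sorted(ep.name for ep in selected)


if __name__ == "__main__":
    # "cube.benchmarks" is an assumed group name, not the registry's real one.
    print(entry_point_names("cube.benchmarks"))
```

Logging in the except branch keeps the fallback as defense in depth while making a missing import visible the first time it happens.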
diff --git a/entries/arithmetic.yaml b/entries/arithmetic.yaml
@@ -0,0 +1,35 @@
+id: arithmetic
+name: "Arithmetic"
+version: "0.1.0"
+description: >
+  Simple arithmetic benchmark used for smoke-testing the CUBE protocol —
+  the agent is given a math problem and must submit the correct numeric
+  answer. Intentionally lightweight and infrastructure-free; useful as
+  a first-cube template and as a sanity check that an agent's tool-use
+  loop works end-to-end before pointing it at a real benchmark.
+package: arithmetic-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/arithmetic-cube"
+
+authors:
+- github: NicolasAG
+  name: Nicolas Gontier
+- github: recursix
+  name: Alexandre Lacoste
+
+legal:
+  wrapper_license: MIT
+
+tags:
+- math
+- reasoning
+status: active
+resources: []
+task_count: 4
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
diff --git a/entries/swebench-live.yaml b/entries/swebench-live.yaml
@@ -0,0 +1,46 @@
+id: swebench-live
+name: "SWE-bench Live"
+version: "0.1.0"
+description: >
+  SWE-bench Live ported to the CUBE protocol — 1,895 continuously-updated,
+  contamination-resistant GitHub issue resolution tasks across many
+  open-source repositories. Each task pairs a real issue with its merged
+  fix; the agent receives the problem statement plus a git checkout at the
+  base commit and must produce a patch that makes the upstream
+  fail_to_pass tests pass without breaking pass_to_pass. The task pool
+  is refreshed continuously, making the benchmark useful for testing
+  contamination resistance.
+package: swebench-live-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-live-cube"
+
+authors:
+- github: NicolasAG
+  name: Nicolas Gontier
+- github: recursix
+  name: Alexandre Lacoste
+- github: josancamon19
+  name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/microsoft/SWE-bench-Live/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2505.23419"
+getting_started_url: "https://swe-bench-live.github.io/"
+tags:
+- coding
+- science
+status: active
+resources: []
+task_count: 1895
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
diff --git a/entries/swebench-verified.yaml b/entries/swebench-verified.yaml
@@ -0,0 +1,44 @@
+id: swebench-verified
+name: "SWE-bench Verified"
+version: "0.1.0"
+description: >
+  SWE-bench Verified ported to the CUBE protocol — 500 human-validated
+  GitHub issues with test-based resolution criteria. Princeton + OpenAI's
+  curated subset of the broader SWE-bench dataset where every task was
+  manually checked for an unambiguous problem statement and a reliable
+  test-based reward signal. The agent receives the problem statement +
+  a git checkout at the base commit and must produce a patch that makes
+  the upstream fail_to_pass tests pass without breaking pass_to_pass.
+package: swebench-verified-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-verified-cube"
+
+authors:
+- github: NicolasAG
+  name: Nicolas Gontier
+- github: recursix
+  name: Alexandre Lacoste
+- github: josancamon19
+  name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2310.06770"
+getting_started_url: "https://openai.com/index/introducing-swe-bench-verified/"
+tags:
+- coding
+status: active
+resources: []
+task_count: 500
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
diff --git a/entries/terminalbench2.yaml b/entries/terminalbench2.yaml
@@ -0,0 +1,41 @@
+id: terminalbench2
+name: "Terminal-Bench 2"
+version: "0.1.0"
+description: >
+  Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
+  CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
+  query, modernize) with pytest-based validation. Each task hands the
+  agent a Linux shell pre-loaded with a project, asks for a concrete
+  deliverable (a fixed bug, a passing test, a compiled binary, an
+  inferred answer), and verifies the result by running an upstream pytest
+  test suite the agent never sees. Tasks span 16 categories with
+  difficulty levels easy / medium / hard.
+package: terminalbench2-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"
+
+authors:
+- github: recursix
+  name: Alexandre Lacoste
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: Apache-2.0
+    source_url: "https://github.com/harbor-framework/terminal-bench-2"
+    verified_by_original_authors: false
+
+getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
+tags:
+- coding
+- os
+status: active
+resources: []
+task_count: 89
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
diff --git a/scripts/quick_check.py b/scripts/quick_check.py
@@ -19,6 +19,7 @@
 
 import argparse
 import importlib
+import importlib.metadata  # explicit so importlib.metadata.entry_points works on Python ≥ 3.10
 import inspect
 import json
 import re