Commit 3d8ce96

feat: register arithmetic, swebench-live, swebench-verified, terminalbench2
Adds entry YAMLs for four cubes that pass quick_check.py with a clean scaffold. All four live in The-AI-Alliance/cube-harness today and are installable via the standard dev_install_url pattern; they were discovered + validated by looping the registry's quick_check.py (--no-install) against every cube directory in cube-harness/cubes/ as a registration-UX probe.

Specifics:
• arithmetic: 4 tasks; math benchmark, smoke-test fixture
• swebench-live: 1895 tasks; continuously-refreshed GitHub-issue resolution, contamination-resistant
• swebench-verified: 500 tasks; Princeton + OpenAI human-validated subset
• terminalbench2: 89 tasks; Laude Institute terminal tasks (16 categories, pytest-validated)

Each entry passed quick_check.py --no-install locally with the cube package installed editable from cube-harness/cubes/<name>; the script's write-back populated task_count, has_debug_task, has_debug_agent, features, action_space, status, resources without further hand-editing.

Author handles are the actual cube wrapper maintainers (per git log on each cubes/<name>/ subtree):
• arithmetic: NicolasAG, recursix
• swebench-live: NicolasAG, recursix, josancamon19
• swebench-verified: NicolasAG, recursix, josancamon19
• terminalbench2: recursix

Other cubes investigated in the same probe but NOT included here:
• browsercomp: module-load fails (BrowseCompBenchmarkConfig requires scorer_model that the default doesn't pass). Filed cube-harness#479 against younik + NicolasAG.
• windows-agent-arena: task_metadata count (152) != declared num_tasks (154). Filed cube-harness#480 against kushasareen + amanjaiswal73892.
• webarena-verified: already registered; module-load now fails after a BgymToolConfig → ToolboxConfig rename that didn't propagate to the cube's configs.py. Will surface on next periodic health-check.
• terminalbench-cube: empty directory in cube-harness (build artefacts only); dev branch doesn't track any source. Apparent leftover from a tb → tb2 rename.

Drive-by fix in scripts/quick_check.py: the script calls importlib.metadata.entry_points(...) at lines 122 + 167 but only does import importlib at line 21. In Python ≥ 3.10 importlib.metadata is a submodule that needs an explicit import; without it the try/except at find_benchmark_class:121 catches an AttributeError ("module 'importlib' has no attribute 'metadata'") and silently degrades to the by-name lookup path. Added one line: import importlib.metadata. Pure correctness fix; the by-name fallback path remains as defense in depth.

Companion issues for the cubes that didn't make this batch:
• The-AI-Alliance/cube-harness#479 (browsercomp)
• The-AI-Alliance/cube-harness#480 (windows-agent-arena)

Signed-off-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
1 parent 5a18b6d commit 3d8ce96

5 files changed

Lines changed: 167 additions & 0 deletions

entries/arithmetic.yaml

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+id: arithmetic
+name: "Arithmetic"
+version: "0.1.0"
+description: >
+  Simple arithmetic benchmark used for smoke-testing the CUBE protocol —
+  the agent is given a math problem and must submit the correct numeric
+  answer. Intentionally lightweight and infrastructure-free; useful as
+  a first-cube template and as a sanity check that an agent's tool-use
+  loop works end-to-end before pointing it at a real benchmark.
+package: arithmetic-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/arithmetic-cube"
+
+authors:
+  - github: NicolasAG
+    name: Nicolas Gontier
+  - github: recursix
+    name: Alexandre Lacoste
+
+legal:
+  wrapper_license: MIT
+
+tags:
+  - math
+  - reasoning
+status: active
+resources: []
+task_count: 4
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
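The commit message lists the keys that quick_check.py's write-back populates. A minimal sketch of a presence check for those keys follows; the field names come from the commit message, while the helper `missing_write_back_fields` and the in-memory dict (standing in for the parsed YAML) are hypothetical and not part of quick_check.py.

```python
# Keys the commit message says the write-back fills in.
WRITE_BACK_FIELDS = (
    "task_count",
    "has_debug_task",
    "has_debug_agent",
    "features",
    "action_space",
    "status",
    "resources",
)


def missing_write_back_fields(entry: dict) -> list:
    """Return the write-back keys that are absent from a parsed entry (hypothetical helper)."""
    return [field for field in WRITE_BACK_FIELDS if field not in entry]


# Hand-copied subset of entries/arithmetic.yaml, used here instead of a YAML parser.
arithmetic = {
    "id": "arithmetic",
    "status": "active",
    "resources": [],
    "task_count": 4,
    "has_debug_task": True,
    "has_debug_agent": True,
    "action_space": [],
    "features": {
        "async": False,
        "streaming": False,
        "multi_agent": False,
        "multi_dim_reward": False,
    },
}

print(missing_write_back_fields(arithmetic))
```

An entry that has not yet been through the write-back would report every key in `WRITE_BACK_FIELDS` as missing.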

entries/swebench-live.yaml

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+id: swebench-live
+name: "SWE-bench Live"
+version: "0.1.0"
+description: >
+  SWE-bench Live ported to the CUBE protocol — 1,895 continuously-updated,
+  contamination-resistant GitHub issue resolution tasks across many
+  open-source repositories. Each task pairs a real issue with its merged
+  fix; the agent receives the problem statement plus a git checkout at the
+  base commit and must produce a patch that makes the upstream
+  fail_to_pass tests pass without breaking pass_to_pass. The task pool
+  is refreshed continuously, making the benchmark useful for testing
+  contamination resistance.
+package: swebench-live-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-live-cube"
+
+authors:
+  - github: NicolasAG
+    name: Nicolas Gontier
+  - github: recursix
+    name: Alexandre Lacoste
+  - github: josancamon19
+    name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/microsoft/SWE-bench-Live/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2505.23419"
+getting_started_url: "https://swe-bench-live.github.io/"
+tags:
+  - coding
+  - science
+status: active
+resources: []
+task_count: 1895
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false

entries/swebench-verified.yaml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+id: swebench-verified
+name: "SWE-bench Verified"
+version: "0.1.0"
+description: >
+  SWE-bench Verified ported to the CUBE protocol — 500 human-validated
+  GitHub issues with test-based resolution criteria. Princeton + OpenAI's
+  curated subset of the broader SWE-bench dataset where every task was
+  manually checked for an unambiguous problem statement and a reliable
+  test-based reward signal. The agent receives the problem statement +
+  a git checkout at the base commit and must produce a patch that makes
+  the upstream fail_to_pass tests pass without breaking pass_to_pass.
+package: swebench-verified-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-verified-cube"
+
+authors:
+  - github: NicolasAG
+    name: Nicolas Gontier
+  - github: recursix
+    name: Alexandre Lacoste
+  - github: josancamon19
+    name: Joan Cabezas
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: MIT
+    source_url: "https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE"
+    verified_by_original_authors: false
+
+paper: "https://arxiv.org/abs/2310.06770"
+getting_started_url: "https://openai.com/index/introducing-swe-bench-verified/"
+tags:
+  - coding
+status: active
+resources: []
+task_count: 500
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false

entries/terminalbench2.yaml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+id: terminalbench2
+name: "Terminal-Bench 2"
+version: "0.1.0"
+description: >
+  Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
+  CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
+  query, modernize) with pytest-based validation. Each task hands the
+  agent a Linux shell pre-loaded with a project, asks for a concrete
+  deliverable (a fixed bug, a passing test, a compiled binary, an
+  inferred answer), and verifies the result by running an upstream pytest
+  test suite the agent never sees. Tasks span 16 categories with
+  difficulty levels easy / medium / hard.
+package: terminalbench2-cube
+dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"
+
+authors:
+  - github: recursix
+    name: Alexandre Lacoste
+
+legal:
+  wrapper_license: MIT
+  benchmark_license:
+    reported: Apache-2.0
+    source_url: "https://github.com/harbor-framework/terminal-bench-2"
+    verified_by_original_authors: false
+
+getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
+tags:
+  - coding
+  - os
+status: active
+resources: []
+task_count: 89
+has_debug_task: true
+has_debug_agent: true
+action_space: []
+features:
+  async: false
+  streaming: false
+  multi_agent: false
+  multi_dim_reward: false
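All four entries in this commit use the same dev_install_url shape: the cube-harness repository plus a `subdirectory=cubes/<package>` fragment. A small sketch of that pattern follows; the function name `dev_install_url` is hypothetical, and the sketch only reproduces the URLs visible in these four files rather than any registry-side logic.

```python
REPO = "https://github.com/The-AI-Alliance/cube-harness"


def dev_install_url(package: str) -> str:
    """Build the pip-style VCS URL pointing at one cube's subdirectory (hypothetical helper)."""
    return f"git+{REPO}#subdirectory=cubes/{package}"


# Package names as they appear in the `package:` field of each entry.
packages = [
    "arithmetic-cube",
    "swebench-live-cube",
    "swebench-verified-cube",
    "terminalbench2-cube",
]

for package in packages:
    print(dev_install_url(package))
```

Deriving the URL from the `package:` value would let a reviewer spot an entry whose dev_install_url drifts from the shared pattern.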

scripts/quick_check.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 
 import argparse
 import importlib
+import importlib.metadata  # explicit so importlib.metadata.entry_points works on Python ≥ 3.10
 import inspect
 import json
 import re
