cube-harness

Open-source harness for building and evaluating AI agents using the CUBE Standard.

CUBE Standard defines the benchmark protocol. cube-harness is the evaluation runtime: it runs agents against any CUBE-compatible benchmark, records trajectories, and scales execution with Ray.

Note

cube-harness is in active development (alpha). Interfaces may change. We welcome early adopters and contributors who want to shape the framework, not just use it. See our Roadmap and Contributing Guide.

Have a benchmark to contribute? Fill out this short form — no commitment required. Want to go deeper? Apply to join the core team.

Quickstart

Installation

# Clone the repository
git clone https://github.com/The-AI-Alliance/cube-harness.git
cd cube-harness

# Install dependencies
make install

API Keys

Set your OpenAI API key:

export OPENAI_API_KEY=your-key-here

Any LiteLLM-supported provider works: change model_name in the recipe and export that provider's API key instead.

Run Tests

make test

Run Hello Example

The hello_miniwob recipe demonstrates running a ReAct agent on the MiniWob benchmark.

Start here — first 2 tasks, in-process (fast, no Ray required):

make debug          # → uv run recipes/hello_miniwob.py --limit 2

Full benchmark (parallel via Ray):

make hello          # → uv run recipes/hello_miniwob.py

Configuration

A recipe is a declarative config file: it imports canonical configs by name, tweaks a few attributes, builds one or more Experiment objects, and ends with run(...). Copy a recipe from recipes/ and edit it — recipes are documentation-by-example, not a CLI.

from cube_harness.agents.genny_configs import GENNY_CONFIGS  # "default", "swe"
from cube_harness.infra import INFRA_CONFIGS                  # ~/.cube/infra.py; "local" built in
from cube_harness.recipe import run
# Experiment must be imported as well; copy its import line from a recipe in recipes/.

agent = GENNY_CONFIGS["swe"]          # every lookup is a fresh deep copy
agent.budget.cost_limit = 2.0         # validated at the assignment site

exp = Experiment(
    name="x",
    agent_config=agent,
    benchmark_config=...,             # placeholder: supply a benchmark config here
    infra=INFRA_CONFIGS["local"],
)

if __name__ == "__main__":
    run(exp)                          # or run(exp_a, exp_b)
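
The two comments in the recipe above ("fresh deep copy" and "validated at the assignment site") can be illustrated with a small sketch. This is not cube-harness code: ConfigRegistry, AgentConfig, and Budget are hypothetical stand-ins built on stdlib dataclasses, and the validation rule (cost_limit must be positive) is an assumption made for the example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hypothetical budget config that checks cost_limit on every assignment."""
    cost_limit: float = 1.0

    def __setattr__(self, name, value):
        # Assumed rule for illustration: cost_limit must be a positive number.
        if name == "cost_limit" and (not isinstance(value, (int, float)) or value <= 0):
            raise ValueError(f"cost_limit must be a positive number, got {value!r}")
        super().__setattr__(name, value)


@dataclass
class AgentConfig:
    """Hypothetical agent config holding a nested budget."""
    model_name: str = "some-model"
    budget: Budget = field(default_factory=Budget)


class ConfigRegistry:
    """Name-to-config mapping that hands out a deep copy on every lookup."""

    def __init__(self, configs):
        self._configs = dict(configs)

    def __getitem__(self, name):
        return copy.deepcopy(self._configs[name])


CONFIGS = ConfigRegistry({"swe": AgentConfig()})

agent = CONFIGS["swe"]
agent.budget.cost_limit = 2.0      # accepted; only this copy changes

try:
    agent.budget.cost_limit = -1   # rejected on the line that assigns it
except ValueError as err:
    print(err)
```

Because each lookup returns a copy, tweaking one recipe's agent cannot leak into another experiment that reads the same canonical config.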

run() provides the only CLI. It is identical for every recipe and not extensible, with three flags: --limit N (first N tasks, in-process), --ray N (worker count), and --set dotted.path=value (ad-hoc override). For anything structural, clone the file. Config objects are typed Pydantic models, serialized with every experiment for reproducibility.
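
To make the --set dotted.path=value idea concrete, here is a minimal sketch of how such an override could be applied to a nested object. apply_override is a hypothetical helper, not the harness's implementation, and SimpleNamespace stands in for the real config models; how cube-harness actually parses values is not documented here.

```python
import ast
from types import SimpleNamespace


def apply_override(root, assignment):
    """Apply one 'dotted.path=value' override to a nested object, in place."""
    path, sep, raw = assignment.partition("=")
    if not sep or not path:
        raise ValueError(f"expected dotted.path=value, got {assignment!r}")

    # Walk down to the object that owns the final attribute.
    *parents, leaf = path.split(".")
    target = root
    for name in parents:
        target = getattr(target, name)
    if not hasattr(target, leaf):
        raise AttributeError(f"unknown config field: {path}")

    # Parse numbers, booleans, lists, etc.; otherwise keep the raw string.
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw
    setattr(target, leaf, value)
    return root


# Stand-in for an experiment with a nested agent config.
exp = SimpleNamespace(
    agent_config=SimpleNamespace(
        model_name="some-model",
        budget=SimpleNamespace(cost_limit=1.0),
    )
)

apply_override(exp, "agent_config.budget.cost_limit=2.0")
apply_override(exp, "agent_config.model_name=my-model")
```

Rejecting unknown fields up front means a typo in the path fails loudly rather than silently creating a new attribute.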

Infra configuration is machine-local: it lives in ~/.cube/infra.py as a dict[str, InfraConfig], is never committed, and takes credentials from environment variables. "local" works with zero setup. To use a cluster or cloud, copy recipes/infra_template.py to ~/.cube/infra.py and edit it. The template documents the process and shows LocalInfraConfig plus commented Toolkit/Azure examples.
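
For readers curious how a machine-local Python file can act as configuration, the sketch below loads a dict from such a file with the standard library. It is an assumption-laden illustration, not the harness's loader: load_infra_configs is hypothetical, the module-level name INFRA_CONFIGS inside the file is assumed, and plain dicts stand in for InfraConfig objects.

```python
import importlib.util
from pathlib import Path


def load_infra_configs(path, builtin=None):
    """Merge built-in infra entries with those defined in a user-owned .py file.

    Assumes the file defines a module-level dict named INFRA_CONFIGS
    (a hypothetical name chosen for this sketch).
    """
    configs = dict(builtin or {})
    path = Path(path).expanduser()
    if not path.exists():
        return configs  # zero setup: only the built-in entries are available

    # Import the file as an anonymous module without touching sys.path.
    spec = importlib.util.spec_from_file_location("user_infra", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    configs.update(getattr(module, "INFRA_CONFIGS", {}))
    return configs


# With no file present, only the built-in "local" entry is returned.
configs = load_infra_configs("~/.cube/does_not_exist.py", builtin={"local": {}})
```

Because the file is ordinary Python, it can read secrets with os.environ at import time, which keeps credentials out of anything that gets committed.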

See docs/configuration.md for the full philosophy, a comparison with Hydra/YAML/CLI approaches, and how to run sweeps.

Experiment Viewer

cube-harness includes a Gradio-based XRay UI for exploring experiment results, trajectories, and OpenTelemetry spans:

make xray
# or: uv run ch-xray

The viewer displays:

Trajectory list — all runs with task ID, steps, reward, and duration
Visual timeline — color-coded steps (blue=environment, green=agent) with duration-based widths
Screenshots — environment state at each step
Step details — observations, agent actions, and LLM reasoning
Debug data — raw JSON, LLM calls, and tool configurations

Architecture Overview

cube-harness is a universal evaluation platform for agentic benchmarks and an RL data generation framework built on top of the CUBE Standard.

Core Components

Agent — LLM-powered decision maker that receives observations and produces actions
Environment — Executes actions, provides observations and rewards (tool + task composition)
Tool — Modular action provider that exposes an action space, reusable across benchmarks
ActionSpace — Defines the set of possible actions a tool can execute
Task — Defines goals, validation logic, and action subsets
Trajectory — Stores interaction history (observations, actions, rewards)
Episode — Single agent-environment loop for one task; records a trajectory
Benchmark — Collection of tasks; produces env configs for episodes
Experiment — Coordinates execution of multiple episodes across a benchmark
ExpRunner — Execution runtime (sequential or parallel via Ray)
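
The relationship between Agent, Environment, Episode, and Trajectory can be sketched as a short loop. Every name and signature below (act, reset, step, run_episode, the toy CounterEnv and AlwaysIncrement) is an assumption made for illustration; the actual interfaces are defined by cube-harness and the CUBE Standard and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Step:
    observation: Any
    action: Any
    reward: float


@dataclass
class Trajectory:
    """Interaction history for one task."""
    task_id: str
    steps: list[Step] = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        return sum(step.reward for step in self.steps)


class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...


class Environment(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, float, bool]: ...


def run_episode(task_id: str, agent: Agent, env: Environment, max_steps: int = 10) -> Trajectory:
    """One agent-environment loop for a single task; returns the recorded trajectory."""
    trajectory = Trajectory(task_id)
    observation = env.reset()
    for _ in range(max_steps):
        action = agent.act(observation)
        next_observation, reward, done = env.step(action)
        trajectory.steps.append(Step(observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory


class CounterEnv:
    """Toy environment: the task is solved once the counter reaches the target."""

    def __init__(self, target: int = 3):
        self.target = target
        self.count = 0

    def reset(self) -> int:
        self.count = 0
        return self.count

    def step(self, action: str) -> tuple[int, float, bool]:
        if action == "inc":
            self.count += 1
        done = self.count >= self.target
        return self.count, (1.0 if done else 0.0), done


class AlwaysIncrement:
    """Toy agent that ignores the observation and always increments."""

    def act(self, observation: Any) -> str:
        return "inc"


trajectory = run_episode("count-to-3", AlwaysIncrement(), CounterEnv(target=3))
```

In this sketch a real agent would call an LLM inside act, and an Experiment would call run_episode once per task produced by a Benchmark.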

Design Goals

Benchmark Agnostic — Plug in any CUBE-standard benchmark (MiniWob, WebArena, OSWorld, …) via the Benchmark interface
Agent Agnostic — Support any agent architecture by implementing the Agent protocol
RL-Ready — Trajectory format designed for training data generation with full LLM call logging
Scalable — Ray integration for parallel episode execution across multiple workers
Observable — Structured trajectory output for analysis and debugging
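
The sequential-versus-parallel split behind the Scalable goal can be sketched without a cluster. In the example below, ThreadPoolExecutor is a deliberate stand-in for Ray so the snippet runs with the standard library only; run_one and run_experiment are hypothetical names, and the workers and limit parameters only loosely mirror the --ray N and --limit N flags.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one(task_id):
    """Hypothetical episode function: returns a small result record for one task."""
    return {"task_id": task_id, "steps": len(task_id)}


def run_experiment(task_ids, workers=0, limit=None):
    """Run episodes in-process when workers <= 0, otherwise on a worker pool."""
    selected = list(task_ids)[:limit]  # limit=None keeps every task
    if workers <= 0:
        return [run_one(task_id) for task_id in selected]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so both modes return the same list.
        return list(pool.map(run_one, selected))


tasks = ["click-button", "enter-text", "choose-date", "login-user"]
debug_results = run_experiment(tasks, limit=2)   # first 2 tasks, in-process
full_results = run_experiment(tasks, workers=4)  # all tasks, parallel
```

Keeping the in-process path identical in output to the parallel one is what makes a small debug run a trustworthy preview of the full benchmark.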

Development

make format    # Format code
make lint      # Lint and auto-fix
make help      # Show all commands
make test      # Run tests
make coverage  # Run tests with coverage report

Pre-commit hooks

Install once after cloning to get ruff lint/format, trailing-whitespace checks, and DCO sign-off enforcement on every commit:

pre-commit install --hook-type pre-commit --hook-type commit-msg --hook-type prepare-commit-msg

The prepare-commit-msg hook automatically appends Signed-off-by: Your Name <email> to every commit message (required by the DCO). You can also sign off manually with git commit -s.
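
As a rough picture of what the sign-off step does to a commit message, here is a simplified sketch. append_signoff is a hypothetical helper, not the hook shipped with the repository, and it ignores details such as existing trailer blocks that Git's own trailer handling takes care of.

```python
def append_signoff(message, name, email):
    """Return the commit message with a DCO Signed-off-by trailer, added at most once."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if trailer in message.splitlines():
        return message  # already signed off by this identity
    # Separate the trailer from the body with one blank line.
    return message.rstrip("\n") + "\n\n" + trailer + "\n"


signed = append_signoff("Fix reward parsing\n", "Ada Lovelace", "ada@example.com")
```

Applying the function twice leaves the message unchanged, which matters because a hook may run again when a commit is amended.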

Project Structure

cube-harness/
├── src/cube_harness/   # Source code for the framework
├── tests/              # Test suite
├── recipes/            # Example recipes and configurations
├── docs/               # Project documentation
└── Makefile            # Common task shortcuts

Getting Involved

All contributions are welcome — open an issue, submit a PR, or wrap a new benchmark. See CONTRIBUTING.md for the development guide, DCO requirements, and RFC process.

Want deeper involvement? Join the core team, shape the roadmap, and get credit for what you build. Apply here.

For general AI Alliance contribution guidelines, see the community repo.
